Abstract
Mismatch between databases remains a major challenge when performing emotion recognition on an unlabeled target corpus using labeled source data. While studies have shown that aligning the source and target data distributions to learn a common feature space can partially mitigate this mismatch, they neglect the distortion of emotion semantics across databases. This distortion is especially critical when regressing higher-level emotion attributes such as valence. In this work, we propose a maximum regression discrepancy (MRD) network, which enforces cross-corpus semantic consistency by adversarially learning a common acoustic feature space that minimizes the regression discrepancy on maximally distorted samples. We evaluate our framework on two large emotion corpora, the USC IEMOCAP and the MSP-IMPROV, for the task of cross-corpus valence regression from speech. Our MRD network yields significant improvements of 10% and 5% in concordance correlation coefficient (CCC) over source-only baselines, and it also outperforms two state-of-the-art domain adaptation techniques. Further analysis reveals that our model is more effective at reducing semantic distortion for low-valence samples than for high-valence samples.
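As a minimal sketch of the objectives involved, assuming the MRD training follows the maximum-classifier-discrepancy recipe adapted to regression (the feature extractor G and dual regressors R_1, R_2 below are illustrative notation, not taken from the paper), the adversarial step alternates between maximizing and minimizing the prediction discrepancy on target samples:

\max_{R_1, R_2}\; \mathbb{E}_{x_t \sim \mathcal{D}_T}\,\big|\, R_1(G(x_t)) - R_2(G(x_t)) \,\big|
\qquad
\min_{G}\; \mathbb{E}_{x_t \sim \mathcal{D}_T}\,\big|\, R_1(G(x_t)) - R_2(G(x_t)) \,\big| ,

with both regressors additionally constrained to keep a low regression loss on the labeled source corpus. The reported evaluation metric, the concordance correlation coefficient between predictions \hat{y} and labels y, is the standard

\mathrm{CCC} = \frac{2\rho\,\sigma_{\hat{y}}\,\sigma_{y}}{\sigma_{\hat{y}}^2 + \sigma_{y}^2 + (\mu_{\hat{y}} - \mu_{y})^2},

where \rho is the Pearson correlation and \mu, \sigma denote the mean and standard deviation of each variable.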