Behavior Computing
Other: Signal Modeling for Understanding
States and Traits
Enforcing Semantic Consistency for Cross Corpus Valence Regression from Speech Using Adversarial Discrepancy Learning
Abstract
Mismatch between databases remains a major challenge when performing emotion recognition on an unlabeled target corpus using labeled source data. While studies have shown that aligning the source and target data distributions to learn a common feature space can partially mitigate this issue, they neglect the distortion of emotion semantics across databases. This distortion is especially critical when regressing a higher-level emotion attribute such as valence. In this work, we propose a maximum regression discrepancy (MRD) network, which enforces cross corpus semantic consistency by learning a common acoustic feature space that minimizes the discrepancy on maximally distorted samples through adversarial training. We evaluate our framework on two large emotion corpora, the USC IEMOCAP and the MSP-IMPROV, on the task of cross corpus valence regression from speech. Our MRD demonstrates significant improvements of 10% and 5% in concordance correlation coefficient (CCC) over baseline source-only methods, and it also outperforms two state-of-the-art domain adaptation techniques. Further analysis reveals that our model is more effective at reducing semantic distortion on low-valence samples than on high-valence samples.
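The results above are reported in CCC. As a minimal sketch, the standard concordance correlation coefficient can be computed as below. The function name `ccc` and the use of population (divide-by-n) variances are assumptions made for illustration; the abstract does not state which variant the authors used.

```python
def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences.

    Standard definition: 2 * cov / (var_true + var_pred + (mean_true - mean_pred)^2).
    Assumption: population (divide-by-n) statistics are used.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        # Both sequences are constant and identical; CCC is undefined here.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom
```

Unlike Pearson correlation, CCC penalizes a shift in mean or scale: predictions that are perfectly correlated with the labels but offset by a constant score below 1. This matters in cross corpus regression, where predicted valence can drift away from the target corpus's label range.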
Figures
Adversarial training steps of our MRD. Step 1 learns two diverse valence regressors on the source data. Step 2 maximizes the discrepancy by changing the regressors to detect those highly-distorted target representations. Step 3 learns the encoder to minimize the discrepancy through adjusting the projected common space to reduce emotional semantic distortion. After MRD training, we finally regress the valence value of a target domain sample as the average of the two regressors.
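The three training steps in the caption can be sketched in PyTorch as follows. This is an illustrative reading of the caption, not the authors' implementation. The names `mrd_step`, `discrepancy`, and `predict` are hypothetical. The page does not specify the loss functions or update schedule, so three things here are assumptions: the mean absolute difference as the discrepancy measure, the mean squared error as the source loss, and updating the encoder together with the regressors in Step 1.

```python
import torch
from torch import nn


def discrepancy(p1, p2):
    # Assumed discrepancy measure: mean absolute difference of the two outputs.
    return (p1 - p2).abs().mean()


def mrd_step(enc, r1, r2, opt_enc, opt_reg, xs, ys, xt, n_enc_updates=1):
    """One adversarial iteration over a source batch (xs, ys) and a target batch xt."""
    mse = nn.functional.mse_loss

    # Step 1: fit both regressors (and, by assumption, the encoder) on source labels.
    opt_enc.zero_grad()
    opt_reg.zero_grad()
    zs = enc(xs)
    loss_src = mse(r1(zs), ys) + mse(r2(zs), ys)
    loss_src.backward()
    opt_enc.step()
    opt_reg.step()

    # Step 2: encoder frozen; update the regressors to maximize their
    # disagreement on target samples while staying accurate on the source.
    opt_reg.zero_grad()
    with torch.no_grad():
        zs, zt = enc(xs), enc(xt)
    loss_reg = mse(r1(zs), ys) + mse(r2(zs), ys) - discrepancy(r1(zt), r2(zt))
    loss_reg.backward()
    opt_reg.step()

    # Step 3: regressors not stepped; update the encoder to minimize the
    # disagreement, pulling target representations into the shared space.
    for _ in range(n_enc_updates):
        opt_enc.zero_grad()
        zt = enc(xt)
        loss_enc = discrepancy(r1(zt), r2(zt))
        loss_enc.backward()
        opt_enc.step()

    return loss_src.item(), loss_reg.item(), loss_enc.item()


@torch.no_grad()
def predict(enc, r1, r2, x):
    # Final valence estimate: average of the two regressors, as in the caption.
    z = enc(x)
    return 0.5 * (r1(z) + r2(z))
```

In use, `enc` would be any module mapping acoustic features to a shared representation, `r1` and `r2` two regressors with scalar output, `opt_enc` an optimizer over the encoder parameters, and `opt_reg` one over both regressors' parameters. Initializing the two regressors differently is what lets them disagree on target samples in Step 2.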
Keywords
valence | domain adaptation | adversarial learning | cross corpus | semantic consistency
Authors
Yun-Shao Lin Chun-Min Chang Chi-Chun Lee
Publication Date
2019/09/15
Conference
Interspeech 2019
DOI
10.21437/Interspeech.2019-2037
Publisher
ISCA