Enforcing Semantic Consistency for Cross Corpus Emotion Prediction using Adversarial Discrepancy Learning

States and Traits


Abstract

Mismatch between databases makes it challenging to perform emotion recognition on an unlabeled database collected under practical conditions using labeled source data. Alignment between the source and target is crucial for conventional neural networks; therefore, many studies have mapped the two domains into a common feature space. However, such work has neglected the distortion of emotion semantics across different conditions: a sample from the target may be annotated as highly emotional in the target but regarded as low in the source. In this work, we propose the maximum regression discrepancy (MRD) network, which enforces semantic consistency between the source and target by adjusting the acoustic feature encoder to minimize discrepancy on maximally distorted samples through adversarial training. We evaluate our framework in several experiments using three databases (USC IEMOCAP, MSP-Improv, and MSP-Podcast) for cross-corpus emotion prediction. Compared to the Source-only neural network (SoNN) and DANN, the MRD network demonstrates a significant improvement of between 5% and 10% in the concordance correlation coefficient (CCC) in cross-corpus prediction and between 3% and 10% for evaluation on MSP-Podcast. We also visualize the effect of MRD on the feature representation to show the efficacy of the MRD structure we designed.
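For readers unfamiliar with the evaluation metric, the following is a minimal sketch of the concordance correlation coefficient in its standard form (Lin's definition, using population variances). The abstract does not specify how the authors computed CCC, so the function name `concordance_cc` and the choice of population statistics are assumptions made for illustration only.

```python
def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences.

    Hypothetical helper (not from the paper); standard definition:
    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2)
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n

    # Population (divide-by-n) variance and covariance.
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    # The denominator is zero only when both sequences are constant and
    # share the same mean; that degenerate case is left to raise.
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


labels = [1.0, 2.0, 3.0]
print(concordance_cc(labels, [1.0, 2.0, 3.0]))  # identical predictions
print(concordance_cc(labels, [2.0, 3.0, 4.0]))  # perfectly correlated, but shifted
```

Unlike Pearson correlation, CCC penalizes a shift in mean or scale between predictions and labels, which is why it is a common choice for regression-style emotion attribute prediction.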

Figures

Adversarial discrepancy learning procedure of MRD network.

The t-SNE algorithm is employed to plot the feature representations transformed by the encoders of the MRD network, DANN, and SoNN for activation.

Keywords

speech emotion recognition ｜ generative adversarial network ｜ cross corpus learning ｜ semantic consistency ｜ domain adaptation

Authors