A Conditional Cycle Emotion GAN for Cross Corpus Speech Emotion Recognition
Abstract
Speech emotion recognition (SER) is important for enabling personalized services and multimedia applications in daily life. It has also become a prevalent research topic because of its potential to create a better user experience across many modern technologies. However, the highly contextualized scenarios and the expensive emotion labeling it requires cause a severe mismatch between emotional speech corpora that are already limited in scale, which hinders the wide adoption of SER. In this work, instead of conventionally learning a common feature space between corpora, we take a novel approach: we enhance the variability of the source (labeled) corpus in a way that is aware of the target (unlabeled) data, by generating synthetic source domain data with a conditional cycle emotion generative adversarial network (CCEmoGAN). Note that no labeled target samples are used at any point during training. We evaluate our framework on cross corpus emotion recognition tasks and obtain three-class valence recognition accuracies of 47.56% and 50.11%, and activation accuracies of 51.13% and 65.7%, when transferring from IEMOCAP to the CIT dataset and from IEMOCAP to the MSP-IMPROV dataset, respectively. The benefit of increasing target domain-aware variability in the source domain to improve emotion discriminability in cross corpus emotion recognition is further visualized in our augmented data space.
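As a rough illustration of the general idea behind a label-conditioned cycle GAN, the sketch below computes a cycle-consistency loss on toy features. It is not the authors' implementation: the abstract does not specify the network architecture or loss terms, so the linear "generators", the feature dimension, and every name here (`linear_generator`, `cycle_consistency_loss`, `g_s2t`, `g_t2s`) are assumptions made purely for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows."""
    return np.eye(num_classes)[labels]

def linear_generator(weights):
    """Return a toy generator: concat(features, condition) @ weights.

    Assumption: a real generator would be a neural network; a single
    linear map keeps this sketch dependency-free.
    """
    def generate(features, condition):
        return np.concatenate([features, condition], axis=1) @ weights
    return generate

def cycle_consistency_loss(x, condition, g_forward, g_backward):
    """Mean absolute error between x and its round-trip reconstruction."""
    translated = g_forward(x, condition)
    reconstructed = g_backward(translated, condition)
    return float(np.mean(np.abs(reconstructed - x)))

rng = np.random.default_rng(0)
feat_dim, num_classes, batch = 8, 3, 16  # hypothetical sizes

# Toy stand-ins for labeled source-corpus features (three emotion classes).
source_x = rng.normal(size=(batch, feat_dim))
source_y = rng.integers(0, num_classes, size=batch)
cond = one_hot(source_y, num_classes)

# Hypothetical source-to-target and target-to-source generators.
w_shape = (feat_dim + num_classes, feat_dim)
g_s2t = linear_generator(rng.normal(scale=0.1, size=w_shape))
g_t2s = linear_generator(rng.normal(scale=0.1, size=w_shape))

# Loss a training loop would minimize alongside adversarial terms.
loss = cycle_consistency_loss(source_x, cond, g_s2t, g_t2s)

# Synthetic samples inherit the source labels they were conditioned on.
synthetic = g_s2t(source_x, cond)
```

In a setup like this, the synthetic samples could be appended to the labeled source data before training the emotion classifier, which is the sense in which the augmentation needs no target labels.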
Figures
Architecture of cross corpus speech emotion recognition using our proposed conditional cycle emotion GAN data augmentation.
Visualization of three different types of synthetic data in the IEMOCAP-to-CIT setting. The top figure shows the low, mid, and high activation classes, respectively, and the bottom figure shows the same for the valence dimension. Blue, red, green, yellow, and purple denote samples from the original source corpus, type A, type B, type C, and the target corpus, respectively.
Keywords
speech emotion recognition | conditional cycle GAN | cross corpus | data augmentation | transfer learning
Authors
Bo-Hao Su, Chi-Chun Lee
Publication Date
2021/01/19
Conference
2021 IEEE Spoken Language Technology Workshop (SLT)
DOI
10.1109/slt48900.2021.9383512
Publisher
IEEE