A Conditional Cycle Emotion GAN for Cross Corpus Speech Emotion Recognition
Abstract
Speech emotion recognition (SER) is important in enabling personalized services and multimedia applications in our lives, and it has become a prevalent research topic given its potential to create better user experiences across many modern technologies. However, the highly contextualized scenarios and expensive emotion labeling involved cause a severe mismatch between the already limited-in-scale speech emotion corpora; this hinders the wide adoption of SER. In this work, instead of conventionally learning a common feature space between corpora, we take a novel approach: we enhance the variability of the source (labeled) corpus in a manner that is aware of the target (unlabeled) data by generating synthetic source-domain samples with a conditional cycle emotion generative adversarial network (CCEmoGAN). Note that no labeled target samples are used during the whole training process. We evaluate our framework on cross-corpus emotion recognition tasks and obtain three-class valence recognition accuracies of 47.56% and 50.11%, and activation accuracies of 51.13% and 65.7%, when transferring from the IEMOCAP to the CIT dataset and from the IEMOCAP to the MSP-IMPROV dataset, respectively. The benefit of increasing target-domain-aware variability in the source domain to improve emotion discriminability in cross-corpus emotion recognition is further visualized in our augmented data space.
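The core idea the abstract describes is a conditional cycle GAN that translates labeled source-corpus features toward the target corpus's distribution, with a cycle-consistency constraint to keep the emotion content intact. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the paper's implementation: the MLP architectures, the 88-dimensional feature size, the loss weight, and names such as CondGenerator and train_step are all assumptions made for illustration.

# Minimal sketch of conditional cycle-GAN data augmentation for cross-corpus SER.
# Assumptions (not from the paper): PyTorch, MLP generators/discriminators over
# fixed-length acoustic feature vectors, equal source/target batch sizes, and
# hypothetical names (src_x, src_y, tgt_x). CCEmoGAN's exact architecture and
# losses may differ; this only illustrates the conditional cycle-consistency idea.
import torch
import torch.nn as nn

FEAT_DIM, N_EMOTIONS, EMB_DIM = 88, 3, 16  # e.g. 3 classes for valence/activation

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

class CondGenerator(nn.Module):
    """Maps features from one corpus toward the other, conditioned on emotion."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_EMOTIONS, EMB_DIM)
        self.net = mlp(FEAT_DIM + EMB_DIM, FEAT_DIM)
    def forward(self, x, y):
        return self.net(torch.cat([x, self.emb(y)], dim=-1))

class CondDiscriminator(nn.Module):
    """Scores whether a (feature, emotion) pair looks like the target corpus."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_EMOTIONS, EMB_DIM)
        self.net = mlp(FEAT_DIM + EMB_DIM, 1)
    def forward(self, x, y):
        return self.net(torch.cat([x, self.emb(y)], dim=-1))

G_s2t, G_t2s, D_t = CondGenerator(), CondGenerator(), CondDiscriminator()
opt_g = torch.optim.Adam(list(G_s2t.parameters()) + list(G_t2s.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(D_t.parameters(), lr=2e-4)
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

def train_step(src_x, src_y, tgt_x):
    # Target labels are never observed; only the known source labels condition
    # the networks, so training uses no labeled target samples.
    fake_t = G_s2t(src_x, src_y)                     # target-aware synthetic data

    # Discriminator: real target features vs. synthetic target-style features.
    opt_d.zero_grad()
    d_loss = bce(D_t(tgt_x, src_y), torch.ones(len(tgt_x), 1)) + \
             bce(D_t(fake_t.detach(), src_y), torch.zeros(len(src_x), 1))
    d_loss.backward(); opt_d.step()

    # Generators: fool D and reconstruct the source (cycle consistency).
    opt_g.zero_grad()
    adv = bce(D_t(fake_t, src_y), torch.ones(len(src_x), 1))
    cyc = l1(G_t2s(fake_t, src_y), src_x)            # preserve emotion content
    (adv + 10.0 * cyc).backward(); opt_g.step()      # 10.0 is an assumed weight
    return fake_t.detach()                           # augmentation for the SER model

# Example: one step with random stand-in batches.
src_x = torch.randn(8, FEAT_DIM)
src_y = torch.randint(0, N_EMOTIONS, (8,))
tgt_x = torch.randn(8, FEAT_DIM)
augmented = train_step(src_x, src_y, tgt_x)

The design point the abstract emphasizes is visible in this sketch: only the source labels src_y ever enter training, the unlabeled target utterances contribute only their feature statistics through the discriminator, and the synthetic batches are what augment the downstream emotion classifier's training set.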
Figures
Architecture of cross corpus speech emotion recognition using our proposed conditional cycle emotion GAN data augmentation.
Visualization of the three types of synthetic data in the IEMOCAP-to-CIT setting. The top row visualizes the low, mid, and high activation classes, respectively; the bottom row shows the same for the valence dimension. Blue, red, green, yellow, and purple denote samples of the original source corpus, type A, type B, type C, and the target corpus, respectively.
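An embedding plot like the one this caption describes can be produced by projecting all five sample groups into two dimensions and coloring them per the legend. The snippet below is a hypothetical sketch using scikit-learn's t-SNE; the stand-in random arrays, group names, and the choice of t-SNE itself are assumptions, not the paper's visualization code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
groups = {  # stand-in feature matrices, one per legend entry in the caption
    "source": rng.normal(0.0, 1.0, (100, 88)),
    "type A": rng.normal(0.5, 1.0, (100, 88)),
    "type B": rng.normal(1.0, 1.0, (100, 88)),
    "type C": rng.normal(1.5, 1.0, (100, 88)),
    "target": rng.normal(2.0, 1.0, (100, 88)),
}
colors = ["blue", "red", "green", "yellow", "purple"]  # caption's color coding

# Embed all groups jointly so their relative positions are comparable.
X = np.vstack(list(groups.values()))
emb = TSNE(n_components=2, random_state=0).fit_transform(X)

start = 0
for (name, feats), c in zip(groups.items(), colors):
    end = start + len(feats)
    plt.scatter(emb[start:end, 0], emb[start:end, 1], s=8, c=c, label=name)
    start = end
plt.legend()
plt.title("t-SNE of original, synthetic (types A/B/C), and target samples")
plt.show()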
Keywords
speech emotion recognition | conditional cycle GAN | cross corpus | data augmentation | transfer learning
Authors
Publication Date
2021/01/19
Conference
2021 IEEE Spoken Language Technology Workshop (SLT)
DOI
10.1109/SLT48900.2021.9383512
Publisher
IEEE