Abstract
Speech emotion recognition (SER) is important for enabling personalized services and multimedia applications in daily life, and it has become a prevalent research topic for its potential to improve user experience across many modern technologies. However, the highly contextualized scenarios and the expensive emotion labeling required cause a severe mismatch between speech emotion corpora, which are already limited in scale; this hinders the wide adoption of SER. In this work, instead of conventionally learning a common feature space between corpora, we take a novel approach: we enhance the variability of the source (labeled) corpus in a target (unlabeled) data-aware manner by generating synthetic source domain data with a conditional cycle emotion generative adversarial network (CCEmoGAN). Note that no labeled target samples are used during the entire training process. We evaluate our framework on cross-corpus emotion recognition tasks and obtain three-class valence recognition accuracies of 47.56% and 50.11%, and activation recognition accuracies of 51.13% and 65.7%, when transferring from IEMOCAP to the CIT dataset and from IEMOCAP to the MSP-IMPROV dataset, respectively. The benefit of increasing target domain-aware variability in the source domain to improve emotion discriminability in cross-corpus emotion recognition is further visualized in our augmented data space.