Unsupervised Cross-Corpus Speech Emotion Recognition Using a Multi-Source Cycle-GAN
Abstract
Speech emotion recognition (SER) plays a crucial role in understanding user feelings when developing artificial intelligence services. However, the data mismatch and label distortion between the training (source) set and the testing (target) set significantly degrade the performance of SER systems. Additionally, most emotion-related speech datasets are highly contextualized and limited in size, and the high cost of manual annotation has led to active investigation of unsupervised cross-corpus SER techniques. In this paper, we propose a framework for unsupervised cross-corpus emotion recognition that uses multiple source corpora in a data augmentation manner. We introduce the Corpus-Aware Emotional CycleGAN (CAEmoCyGAN), which includes a corpus-aware attention mechanism to aggregate the source datasets when generating synthetic target samples. We use the widely adopted speech emotion corpora IEMOCAP and VAM as sources and MSP-Podcast as the target. By generating synthetic target-aware samples to augment the source datasets and by training directly on this augmented dataset, our proposed multi-source target-aware augmentation method outperforms other baseline models in activation and valence classification.
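The core idea of the corpus-aware attention mechanism, as described in the abstract, is to weight each source corpus by its relevance to the target domain before aggregation. The paper's actual architecture is not reproduced here; the following is a minimal sketch of attention-weighted aggregation over per-corpus feature vectors, where the function name `corpus_aware_aggregate` and the use of a plain softmax over relevance scores are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def corpus_aware_aggregate(source_feats, corpus_scores):
    """Aggregate per-corpus feature summaries with attention weights.

    source_feats:  list of (d,) arrays, one summary vector per source
                   corpus (e.g., IEMOCAP, VAM).
    corpus_scores: (n_corpora,) relevance scores of each source corpus
                   with respect to the target domain (hypothetical;
                   in practice these would be learned).
    Returns a (d,) aggregated representation.
    """
    weights = softmax(np.asarray(corpus_scores, dtype=float))
    feats = np.stack(source_feats)   # shape: (n_corpora, d)
    return weights @ feats           # attention-weighted sum

# Two toy source corpora with equal relevance scores:
agg = corpus_aware_aggregate(
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    [0.0, 0.0],
)
```

With equal scores the result is the plain mean of the source features; raising one corpus's score shifts the aggregate toward that corpus, which is the behavior the attention mechanism is meant to provide.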
Figures
Overview of the cross-corpus speech emotion recognition (SER) architecture using our proposed CAEmoCyGAN data augmentation.
Keywords
speech emotion recognition | data augmentation | cross corpus | unsupervised learning | multi-source attention
Authors
Chi-Chun Lee
Publication Date
2022/01/27
Journal
IEEE Transactions on Affective Computing
DOI
10.1109/TAFFC.2022.3146325
Publisher
IEEE