Abstract
Speech emotion recognition (SER) plays a crucial role in understanding user feelings when developing artificial intelligence services. However, data mismatch and label distortion between the training (source) set and the testing (target) set significantly degrade the performance of SER systems. Additionally, most emotion-related speech datasets are highly contextualized and limited in size, and the cost of manual annotation is often prohibitive, motivating active investigation of unsupervised cross-corpus SER techniques. In this paper, we propose a framework for unsupervised cross-corpus emotion recognition that uses multiple source corpora in a data-augmentation manner. We introduce the Corpus-Aware Emotional CycleGAN (CAEmoCyGAN), which includes a corpus-aware attention mechanism to aggregate the source datasets when generating synthetic target samples. We choose the widely used speech emotion corpora IEMOCAP and VAM as sources and MSP-Podcast as the target. By generating synthetic target-aware samples to augment the source datasets and training directly on this augmented data, our proposed multi-source target-aware augmentation method outperforms baseline models in activation and valence classification.