Abstract
Advances in speech emotion recognition technology have created tremendous potential for designing human-centered applications across a wide range of scenarios. However, because it is difficult to obtain a large-scale labeled emotion corpus for every application domain, most existing databases are collected within disparate and limited contexts. This contextualization often fails to capture the variability of emotional acoustic manifestations, since only a limited amount of labeled data can be collected for each particular context, which in turn creates a robustness issue across emotional scenarios. In this work, we propose to learn an enhanced acoustic code vector for an in-context emotion database by adversarially learning from a large out-of-context emotion corpus, yielding robust emotion recognition. We demonstrate that our framework achieves improved recognition accuracy using low-dimensional representations on two different databases, and that it maintains its modeling power even when given very limited in-context training samples.