Learning an Arousal-Valence Speech Front-End Network using Media Data In-the-Wild for Emotion Recognition
Abstract
Recent progress in speech emotion recognition (SER) technology has benefited from the use of deep learning techniques. However, expensive human annotation and the difficulty of collecting emotion databases make rapid deployment of SER across diverse application domains challenging. An initialization and fine-tuning strategy helps mitigate these technical challenges. In this work, we propose an initialization network geared toward SER applications, termed the Arousal-Valence Speech Front-End Network (AV-SpNET). It is learned on a large corpus of media data collected in-the-wild, jointly with proxy arousal-valence labels that are derived multimodally from audio and text information. The AV-SpNET can then simply be stacked with supervised layers for the target emotion corpus of interest. We evaluate the proposed AV-SpNET on SER tasks for two separate emotion corpora, the USC IEMOCAP and the NNIME databases. The AV-SpNET outperforms other initialization techniques and reaches the best overall performance while requiring only 75% of the in-domain annotated data. We also observe that, in general, using the AV-SpNET as the front-end network requires as little as 50% of the fine-tuning data to surpass a randomly initialized network fine-tuned on the complete training set.
Figures
A complete schematic of our initialization and fine-tuning framework for SER. The left shows our proposed network architecture of the arousal-valence speech front-end network (AV-SpNET), which is learned from the background DaAi media corpus, and the right shows the recognition network, which stacks AV-SpNET with fully connected dense layers to perform the final emotion recognition on the given target emotion databases.
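The stacking idea in the schematic can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the front-end is reduced to a single dense ReLU layer with random weights standing in for pretrained ones (the paper's keywords indicate a convolutional network), and the names `FrontEnd`, `Recognizer`, `fine_tune`, and `subsample` are hypothetical. The sketch only shows how a front-end might be stacked with a new dense layer and fine-tuned on a fraction of the target data.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


class FrontEnd:
    """Hypothetical stand-in for a pretrained front-end (one dense ReLU layer)."""

    def __init__(self, in_dim, feat_dim):
        # Assumption: random weights play the role of pretrained weights.
        self.W = rng.normal(scale=0.1, size=(in_dim, feat_dim))
        self.b = np.zeros(feat_dim)

    def __call__(self, x):
        return relu(x @ self.W + self.b)


class Recognizer:
    """Front-end stacked with a new dense softmax layer for the target corpus."""

    def __init__(self, front_end, n_classes):
        self.front_end = front_end
        self.n_classes = n_classes
        feat_dim = front_end.W.shape[1]
        self.W = rng.normal(scale=0.1, size=(feat_dim, n_classes))
        self.b = np.zeros(n_classes)

    def predict_proba(self, x):
        return softmax(self.front_end(x) @ self.W + self.b)

    def loss(self, x, y):
        p = self.predict_proba(x)
        return float(-np.log(p[np.arange(len(y)), y] + 1e-12).mean())


def fine_tune(model, x, y, lr=0.1, epochs=50):
    """Full-batch gradient descent on both the dense layer and the front-end."""
    fe = model.front_end
    onehot = np.eye(model.n_classes)[y]
    for _ in range(epochs):
        h_pre = x @ fe.W + fe.b
        h = relu(h_pre)
        p = softmax(h @ model.W + model.b)
        dz = (p - onehot) / len(x)        # gradient of mean cross-entropy
        dh = dz @ model.W.T
        dh[h_pre <= 0] = 0.0              # backpropagate through ReLU
        model.W -= lr * (h.T @ dz)
        model.b -= lr * dz.sum(axis=0)
        fe.W -= lr * (x.T @ dh)
        fe.b -= lr * dh.sum(axis=0)


def subsample(x, y, fraction):
    """Keep a random fraction of the labeled target data (e.g. 0.5 or 0.75)."""
    n = int(round(len(x) * fraction))
    idx = rng.permutation(len(x))[:n]
    return x[idx], y[idx]


# Toy data standing in for target-corpus features and emotion labels.
x_all = rng.normal(size=(40, 8))
y_all = (x_all[:, 0] > 0).astype(int)

x_sub, y_sub = subsample(x_all, y_all, 0.75)
model = Recognizer(FrontEnd(in_dim=8, feat_dim=16), n_classes=2)
loss_before = model.loss(x_sub, y_sub)
fine_tune(model, x_sub, y_sub)
loss_after = model.loss(x_sub, y_sub)
```

In this toy setup the front-end weights are updated together with the new layer during fine-tuning; whether and how the front-end layers are updated in the actual system is not stated on this page.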
Keywords
speech emotion recognition | media data in-the-wild | convolutional neural network | speech front-end network
Authors
Publication Date
2018/10/22
Conference
AVEC
AVEC'18: Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop
DOI
10.1145/3266302.3266306
Publisher
ACM