Learning an Arousal-Valence Speech Front-End Network using Media Data In-the-Wild for Emotion Recognition｜BIIC Lab - NTHU

Abstract

Recent progress in speech emotion recognition (SER) technology has benefited from the use of deep learning techniques. However, expensive human annotation and the difficulty of collecting emotion databases make it challenging to deploy SER rapidly across diverse application domains. An initialization - fine-tuning strategy helps mitigate these technical challenges. In this work, we propose an initialization network geared toward SER applications: a speech front-end network learned on a large collection of media data gathered in the wild, jointly with proxy arousal-valence labels that are multimodally derived from audio and text information. We term it the Arousal-Valence Speech Front-End Network (AV-SpNET). The AV-SpNET can then be stacked with supervised layers for the target emotion corpus of interest. We evaluate the proposed AV-SpNET on SER tasks for two separate emotion corpora, the USC IEMOCAP and the NNIME databases. The AV-SpNET outperforms other initialization techniques and reaches the best overall performance while requiring only 75% of the in-domain annotated data. We also observe that, in general, using the AV-SpNET as the front-end network requires as little as 50% of the fine-tuning data to surpass a method based on a randomly initialized network fine-tuned on the complete training set.
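
The reduced-data comparisons above (50% and 75% of the in-domain annotated data) imply drawing subsets of the target training set before fine-tuning. The following is a minimal sketch of one way to do that, assuming a per-class (stratified) random draw. The abstract does not say how the subsets were selected, so the function `subsample_by_label`, the toy corpus, and the emotion labels are all hypothetical.

```python
import random
from collections import defaultdict


def subsample_by_label(examples, fraction, seed=0):
    """Keep roughly `fraction` of the examples from each emotion class.

    `examples` is a list of (utterance_id, label) pairs. Stratifying by
    label is an assumption made for this sketch, not the paper's protocol.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Group the examples by their emotion label.
    by_label = defaultdict(list)
    for utterance_id, label in examples:
        by_label[label].append((utterance_id, label))

    rng = random.Random(seed)  # seeded so the subset is reproducible
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        keep = max(1, round(len(items) * fraction))  # at least one per class
        subset.extend(rng.sample(items, keep))
    return subset


# Hypothetical toy corpus: 4 emotion classes with 8 utterances each.
labels = ["angry", "happy", "neutral", "sad"] * 8
corpus = [(f"utt{i:03d}", label) for i, label in enumerate(labels)]

half = subsample_by_label(corpus, 0.5)
three_quarters = subsample_by_label(corpus, 0.75)
print(len(half), len(three_quarters))  # prints "16 24"
```

Fixing the seed keeps the subset identical across runs, so a pretrained initialization and a random initialization can be fine-tuned on the same reduced set.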

Figures

A complete schematic of our initialization - fine-tuning framework for SER. The left shows our proposed network architecture of the arousal-valence speech front-end network (AV-SpNET), which is learned from the background DaAi media corpus, and the right shows the recognition network, which stacks AV-SpNET with fully connected dense layers to perform the final emotion recognition on the given target emotion databases.
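
The stacking pattern in this schematic can be illustrated with a toy forward pass. This is only a sketch under strong assumptions: the front-end here is a stand-in linear projection with ReLU and mean pooling over frames, not the convolutional network from the paper, and its weights are sampled at random where pretrained weights would normally be loaded. The class names, dimensions, and input values are hypothetical, and fine-tuning (the training step) is not shown.

```python
import math
import random


class PretrainedFrontEnd:
    """Stand-in for a pretrained speech front-end (not the paper's CNN)."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Assumption: in practice these weights come from pretraining.
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                        for _ in range(out_dim)]

    def __call__(self, frames):
        # Project each frame, apply ReLU, then mean-pool over time so that
        # utterances of any length map to a fixed-size embedding.
        pooled = [0.0] * len(self.weights)
        for frame in frames:
            for j, row in enumerate(self.weights):
                activation = sum(w * x for w, x in zip(row, frame))
                pooled[j] += max(0.0, activation)
        return [value / len(frames) for value in pooled]


class DenseHead:
    """Dense layer with softmax over the target corpus's emotion classes."""

    def __init__(self, in_dim, num_classes, seed=1):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                        for _ in range(num_classes)]
        self.biases = [0.0] * num_classes

    def __call__(self, embedding):
        logits = [sum(w * x for w, x in zip(row, embedding)) + b
                  for row, b in zip(self.weights, self.biases)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        return [v / total for v in exps]


def recognize(front_end, head, frames):
    """Stack the front-end with the dense head for one utterance."""
    return head(front_end(frames))


front_end = PretrainedFrontEnd(in_dim=4, out_dim=8)
head = DenseHead(in_dim=8, num_classes=4)

# Hypothetical utterance: three frames of four acoustic features each.
utterance = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.5, 0.2], [0.3, 0.3, 0.1, 0.0]]
probabilities = recognize(front_end, head, utterance)
print(len(probabilities))  # one probability per emotion class
```

Keeping the front-end and the head as separate objects mirrors the two halves of the schematic: the same front-end can be reused across target corpora while only the head is swapped to match each corpus's label set.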

Keywords

speech emotion recognition ｜ media data in-the-wild ｜ convolutional neural network ｜ speech front-end network

Authors

Publication Date

2018/10/22

Conference

AVEC'18: Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop

DOI

10.1145/3266302.3266306

Publisher
