Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation
Abstract
Developing robust speech emotion recognition (SER) systems is challenging due to the small scale of existing emotional speech datasets. Moreover, previous works have mostly relied on handcrafted acoustic features to build SER models, which struggle to handle a wide range of acoustic variations. One way to alleviate this problem is to use speech representations learned by deep end-to-end models trained on large-scale speech databases. Specifically, in this paper, we leverage an end-to-end ASR model to extract ASR-based representations for speech emotion recognition. We further devise a factorized domain adaptation approach for the pre-trained ASR model to improve both the speech recognition rate and the emotion recognition accuracy on the target emotion corpus, and we also provide an analysis of the effectiveness of representations extracted from different ASR layers. Our experiments demonstrate the importance of ASR adaptation and layer depth for emotion recognition.
Figures
The proposed framework. We apply SVD-based model adaptation to the FC layers in the encoder of LAS.
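The caption mentions SVD-based adaptation of fully connected (FC) layers. As a rough illustration only, the sketch below shows one common form of that idea: factorize a trained weight matrix with a truncated SVD, insert a small square matrix between the two factors, and update only that matrix on the target corpus. This listing does not spell out the paper's exact formulation, so the class name `FactorizedLinear`, the helper `svd_factorize`, the rank choice, and the identity initialization are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def svd_factorize(weight, rank):
    """Split a dense weight matrix (out_dim x in_dim) into two low-rank factors.

    Returns a (out_dim x rank) and b (rank x in_dim) such that a @ b is the
    best rank-`rank` approximation of `weight`.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale each kept left singular vector
    b = vt[:rank, :]
    return a, b


class FactorizedLinear:
    """Hypothetical FC layer computing y = a @ k @ b @ x + bias.

    Only `k` (rank x rank) is meant to be updated during adaptation;
    `a`, `b`, and `bias` stay frozen. This is an assumed setup.
    """

    def __init__(self, weight, bias, rank):
        self.a, self.b = svd_factorize(weight, rank)
        self.bias = bias
        # Identity initialization: before adaptation the layer behaves
        # like the (rank-truncated) original layer.
        self.k = np.eye(rank)

    def forward(self, x):
        # x has shape (batch, in_dim); rows are input vectors.
        return x @ self.b.T @ self.k.T @ self.a.T + self.bias

    def num_adapt_params(self):
        return self.k.size


rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 6))
bias = rng.standard_normal(8)
x = rng.standard_normal((4, 6))

layer = FactorizedLinear(weight, bias, rank=6)
full_rank_matches = np.allclose(layer.forward(x), x @ weight.T + bias)
```

With the full rank kept and `k` left at the identity, the factorized layer reproduces the original layer, while the number of adaptable parameters drops from `out_dim * in_dim` to `rank * rank`. In practice a smaller rank would be chosen, trading some approximation error for fewer parameters to adapt.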
Keywords
speech emotion recognition | end-to-end ASR | acoustic representation | domain adaptation
Authors
Publication Date
2020/10/25
Conference
Interspeech 2020
DOI
10.21437/Interspeech.2020-2524
Publisher
ISCA