Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation
Abstract
Developing robust speech emotion recognition (SER) systems is challenging due to the small scale of existing emotional speech datasets. Moreover, previous works have mostly relied on handcrafted acoustic features to build SER models, which makes it difficult to handle a wide range of acoustic variations. One way to alleviate this problem is to use speech representations learned by deep end-to-end models trained on large-scale speech databases. Specifically, in this paper, we leverage an end-to-end ASR to extract ASR-based representations for speech emotion recognition. We further devise a factorized domain adaptation approach on the pre-trained ASR model to improve both the speech recognition rate and the emotion recognition accuracy on the target emotion corpus, and we also provide an analysis of the effectiveness of representations extracted from different ASR layers. Our experiments demonstrate the importance of ASR adaptation and layer depth for emotion recognition.
Figures
The proposed framework. We adopt SVD-based model adaptation to FC layers in the encoder of LAS.
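The figure caption refers to SVD-based adaptation of the fully connected layers in the LAS encoder. A minimal NumPy sketch of the general SVD-bottleneck idea (not the paper's exact implementation; all sizes and names here are illustrative assumptions) is to factorize a pre-trained weight matrix, freeze the singular-vector factors, and fine-tune only a small square matrix inserted between them:

```python
import numpy as np

# Hypothetical sketch of SVD-based adaptation of one FC layer.
# W stands in for a pre-trained weight matrix; sizes are illustrative,
# not taken from the paper.
rng = np.random.default_rng(0)
d_out, d_in, k = 512, 512, 64

W = rng.standard_normal((d_out, d_in))          # pre-trained weight (stand-in)
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep the top-k singular components; A and B stay frozen during adaptation.
A = U[:, :k] * s[:k]        # (d_out, k)
B = Vt[:k, :]               # (k, d_in)
S = np.eye(k)               # (k, k) adaptation matrix, trained on target data

x = rng.standard_normal(d_in)
y_adapted = A @ S @ B @ x   # forward pass through the adapted layer

# With S initialized to the identity, the adapted layer reproduces
# the rank-k approximation of the original layer exactly.
y_lowrank = (U[:, :k] * s[:k]) @ (Vt[:k, :] @ x)
assert np.allclose(y_adapted, y_lowrank)
```

Only the small k x k matrix S is updated on the target emotion corpus, which keeps the number of adapted parameters far below that of the full layer.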
Keywords
speech emotion recognition | end-to-end ASR | acoustic representation | domain adaptation
Authors
Yun-Shao Lin, Chi-Chun Lee
Publication Date
2020/10/25
Conference
Interspeech 2020
DOI
10.21437/Interspeech.2020-2524
Publisher
ISCA