Learning with Rater-Expanded Label Space to Improve Speech Emotion Recognition
Automatic sensing of emotional information in speech is important for numerous everyday applications. Conventional Speech Emotion Recognition (SER) models rely on the average or consensus of human annotations for training, but emotions and raters' interpretations are subjective in nature, leading to wide variation in perception. To address this, our proposed approach integrates rater subjectivity by forming Perception-Coherent Clusters (PCC) of raters, which are used to derive an expanded label space for learning to improve SER. We evaluate our method on the IEMOCAP and MSP-Podcast corpora, covering the fixed-rater and variable-rater scenarios, respectively. The proposed architecture, Rater Perception Coherency (RPC)-based SER, surpasses single-task models trained on consensus labels, achieving UAR improvements of 3.39% on IEMOCAP and 2.03% on MSP-Podcast. Further analysis provides comprehensive insights into the contributions of these perception-consistency clusters to SER learning.
This figure illustrates the model architecture for 4-category SER, which includes two parts: (a) the estimation of homogeneous clusters of raters based on inter-rater perception consistency, and (b) the multi-perception SER training that integrates rater-ambiguity conditioned learning jointly with consensus label learning using the clusters from part (a).
speech emotion recognition | multi-tasking | rater subjectivity | perception consistency clusters
Shreya G. Upadhyay, Woan-Shiuan Chien, Bo-Hao Su, Chi-Chun Lee
IEEE Transactions on Affective Computing