Exploiting Annotators' Typed Description of Emotion Perception to Maximize Utilization of Ratings for Speech Emotion Recognition
Abstract
Determining the ground truth for speech emotion recognition (SER) remains a critical issue in affective computing. Previous studies on emotion recognition often rely on consensus labels obtained by aggregating the classes selected by multiple annotators. Perceptual evaluations conducted to annotate emotional corpora commonly include the class “other,” which lets annotators describe the emotion in their own words. This practice provides valuable emotional information that most emotion recognition studies nevertheless ignore. This paper uses readily available natural language processing toolkits to mine the sentiment of these typed descriptions, enriching and maximizing the information obtained from the annotators. The polarity information is combined with the primary and secondary annotations provided by individual evaluators under a label distribution framework, creating a complete representation of the emotional content of the speech sentences. Finally, we train multitask learning SER models with existing learning methods (soft-label, multi-label, and distribution-label) to evaluate the proposed ground truth on the MSP-Podcast corpus.
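The idea of folding typed "other" descriptions into a label distribution can be illustrated with a minimal sketch. It is not the paper's implementation. The toy polarity lexicon, the class names, the `secondary_weight` value, and the helper functions `polarity_of` and `label_distribution` are all assumptions made for illustration, and a real pipeline would replace the lexicon lookup with a sentiment toolkit.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for an NLP sentiment toolkit (assumption).
TOY_POLARITY = {
    "annoyed": "negative",
    "bored": "negative",
    "excited": "positive",
    "amused": "positive",
}

# Hypothetical class set: four primary emotions plus three polarity buckets
# that absorb typed "other" descriptions (assumption, not the paper's label set).
CLASSES = [
    "angry", "sad", "happy", "neutral",
    "other_negative", "other_positive", "other_ambiguous",
]


def polarity_of(description):
    """Return 'negative', 'positive', or 'ambiguous' for a typed description."""
    words = description.lower().split()
    hits = {TOY_POLARITY[w] for w in words if w in TOY_POLARITY}
    # Exactly one polarity found: use it. None or conflicting: ambiguous.
    return hits.pop() if len(hits) == 1 else "ambiguous"


def label_distribution(annotations, secondary_weight=0.5):
    """Aggregate per-annotator votes into a normalized distribution over CLASSES.

    Each annotation is a dict with a 'primary' label, an optional list of
    'secondary' labels, and an optional 'typed' description used when the
    primary label is 'other'. The secondary weight is an illustrative choice.
    """
    counts = Counter()
    for ann in annotations:
        primary = ann["primary"]
        if primary == "other":
            primary = "other_" + polarity_of(ann.get("typed", ""))
        if primary in CLASSES:
            counts[primary] += 1.0
        for sec in ann.get("secondary", []):
            if sec in CLASSES:
                counts[sec] += secondary_weight
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no usable annotations")
    return {c: counts[c] / total for c in CLASSES}


example = [
    {"primary": "happy", "secondary": ["neutral"]},
    {"primary": "other", "typed": "Excited"},
    {"primary": "happy"},
]
dist = label_distribution(example)
```

In this sketch the typed description "Excited" contributes to `other_positive` instead of being discarded, so all three annotators shape the target. A distribution like `dist` could then serve as a soft training target, or be thresholded into a multi-label target.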
Figures
Negative emotion words
Positive emotion words
Ambiguous emotion words
Keywords
Emotion recognition | Distribution-label learning | Soft-label learning | Multi-label learning
Authors
Huang-Cheng Chou, Chi-Chun Lee
Publication Date
2022/05/07
Conference
IEEE ICASSP 2022
DOI
10.1109/ICASSP43922.2022.9746990
Publisher
IEEE