Exploiting Annotators' Typed Description of Emotion Perception to Maximize Utilization of Ratings for Speech Emotion Recognition｜BIIC Lab - NTHU

Download PDF IEEE Xplore

Abstract

Determining the ground truth for speech emotion recognition (SER) remains a critical issue in affective computing. Previous studies on emotion recognition often rely on consensus labels obtained by aggregating the classes selected by multiple annotators. Perceptual evaluations conducted to annotate emotional corpora commonly include the class “other,” which lets annotators describe the emotion in their own words. This practice provides valuable emotional information that is nevertheless ignored in most emotion recognition studies. This paper uses readily accessible natural language processing toolkits to mine the sentiment of these typed descriptions, enriching and maximizing the information obtained from the annotators. The polarity information is combined with the primary and secondary annotations provided by individual evaluators under a label distribution framework, creating a complete representation of the emotional content of the speech sentences. Finally, we train multitask learning SER models with existing learning methods (soft-label, multi-label, and distribution-label) to evaluate the performance of the novel ground truth on the MSP-Podcast corpus.
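
The idea of folding typed descriptions into a label distribution can be illustrated with a minimal sketch. It is not the paper's implementation: the class names, the tiny word lexicon (a stand-in for a real sentiment toolkit), the three polarity bins, and the `secondary_weight` parameter are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for an NLP sentiment toolkit.
TOY_POLARITY = {
    "annoyed": "negative",
    "bored": "negative",
    "amused": "positive",
    "excited": "positive",
}

# Assumed class set: a few primary emotions plus three polarity bins for "other".
CLASSES = [
    "angry", "sad", "happy", "neutral",
    "other_negative", "other_positive", "other_ambiguous",
]


def polarity_of(description):
    """Map a typed description to a polarity bin; ties and unknown words are ambiguous."""
    votes = Counter(
        TOY_POLARITY[word]
        for word in description.lower().split()
        if word in TOY_POLARITY
    )
    if votes["negative"] > votes["positive"]:
        return "negative"
    if votes["positive"] > votes["negative"]:
        return "positive"
    return "ambiguous"


def label_distribution(annotations, secondary_weight=0.5):
    """Build a normalized distribution over CLASSES from per-annotator ratings.

    Each annotation is a dict with a 'primary' class, an optional list of
    'secondary' classes, and an optional 'typed' description used when the
    primary class is "other". The secondary weight is an assumed value.
    """
    mass = Counter()
    for ann in annotations:
        primary = ann["primary"]
        if primary == "other":
            # Replace the uninformative "other" vote with a polarity bin.
            primary = "other_" + polarity_of(ann.get("typed", ""))
        if primary not in CLASSES:
            raise ValueError(f"unknown class: {primary}")
        mass[primary] += 1.0
        for sec in ann.get("secondary", []):
            # Secondary votes count less and never duplicate the primary vote.
            if sec in CLASSES and sec != primary:
                mass[sec] += secondary_weight
    total = sum(mass.values())
    if total == 0:
        raise ValueError("no annotations to aggregate")
    return [mass[c] / total for c in CLASSES]


ratings = [
    {"primary": "angry", "secondary": ["sad"]},
    {"primary": "other", "typed": "annoyed"},
    {"primary": "angry"},
]
distribution = label_distribution(ratings)
```

In this toy example the "other" vote is not discarded: it contributes probability mass to the negative-polarity bin, so the resulting vector reflects all three annotators rather than only the two who chose a predefined class.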

Figures

Negative emotion words

Positive emotion words

Ambiguous emotion words

Keywords

Emotion recognition ｜ Distribution-label learning ｜ Soft-label learning ｜ Multi-label learning

Authors