Abstract
Deciding the ground truth for speech emotion recognition (SER) remains a critical issue in affective computing. Previous studies on emotion recognition often rely on consensus labels obtained by aggregating the classes selected by multiple annotators. Perceptual evaluations conducted to annotate emotional corpora commonly include the class "other," giving annotators the opportunity to describe the emotion in their own words. This practice provides valuable emotional information that most emotion recognition studies nevertheless ignore. This paper uses easily accessible natural language processing toolkits to mine the sentiment of these typed descriptions, enriching and maximizing the information obtained from the annotators. The polarity information is combined with the primary and secondary annotations provided by individual evaluators under a label distribution framework, creating a more complete representation of the emotional content of the speech sentences. Finally, we train multitask learning SER models with existing learning methods (soft-label, multi-label, and distribution-label) to demonstrate the performance of the proposed ground truth on the MSP-Podcast corpus.
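To make the pipeline described above concrete, the following is a minimal sketch of the sentiment-mining and label-combination step. It assumes NLTK's VADER analyzer as one example of an easily accessible toolkit; the emotion class set, the primary/secondary/polarity weights, and the annotation record layout are illustrative assumptions, not the configuration used in the paper.

```python
# Sketch (not the paper's exact procedure): mine the polarity of the free-text
# "other" descriptions and fold it, together with each annotator's primary and
# secondary categorical choices, into a per-sentence label distribution.
from collections import defaultdict

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon required by VADER

# Illustrative emotion classes grouped by valence (assumption).
CLASSES = ["angry", "sad", "happy", "neutral", "surprise", "disgust", "fear", "contempt"]
POSITIVE = {"happy", "surprise"}
NEGATIVE = {"angry", "sad", "disgust", "fear", "contempt"}

# Illustrative weights for primary/secondary annotations and the polarity mass.
W_PRIMARY, W_SECONDARY, W_POLARITY = 1.0, 0.5, 0.5

sia = SentimentIntensityAnalyzer()


def annotation_to_distribution(annotations):
    """Build a normalized label distribution for one sentence.

    `annotations` is a list of dicts, one per annotator, e.g.
    {"primary": "other", "secondary": ["sad"], "other_text": "kind of defeated"}.
    This record layout is hypothetical and used only for illustration.
    """
    mass = defaultdict(float)
    for ann in annotations:
        # Primary and secondary categorical choices.
        if ann["primary"] in CLASSES:
            mass[ann["primary"]] += W_PRIMARY
        for sec in ann.get("secondary", []):
            if sec in CLASSES:
                mass[sec] += W_SECONDARY

        # Free-text description typed for the "other" class: mine its polarity
        # and spread the extra mass over same-valence classes.
        text = ann.get("other_text")
        if text:
            compound = sia.polarity_scores(text)["compound"]  # value in [-1, 1]
            if compound > 0.05:
                for c in POSITIVE:
                    mass[c] += W_POLARITY * compound / len(POSITIVE)
            elif compound < -0.05:
                for c in NEGATIVE:
                    mass[c] += W_POLARITY * (-compound) / len(NEGATIVE)
            else:
                mass["neutral"] += W_POLARITY

    total = sum(mass.values()) or 1.0
    return {c: mass.get(c, 0.0) / total for c in CLASSES}


if __name__ == "__main__":
    example = [
        {"primary": "sad", "secondary": ["angry"], "other_text": None},
        {"primary": "other", "secondary": [], "other_text": "sounds hopeless and defeated"},
    ]
    print(annotation_to_distribution(example))
```

The resulting per-sentence distribution can then serve as the target for soft-label, multi-label, or distribution-label training, as evaluated in the paper.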