Abstract
An individual’s emotion perception plays a key role in their decision-making and task performance. Previous speech emotion recognition research focuses mainly on recognizing the emotion label derived from the majority vote (hard label) of the speaker (i.e., producer) rather than on recognizing each rater’s emotion perception. In this work, we propose a framework that integrates different viewpoints of emotion perception from other co-raters (excluding the target rater) using soft- and hard-label learning to improve the recognition of the target rater’s emotion perception. Our methods achieve [3.97%, 1.48%] and [1.71%, 2.87%] improvements in average unweighted average recall (UAR) on the three-class (low, middle, high) [valence, activation (arousal)] emotion recognition task for four different raters on the IEMOCAP and the NNIME databases, respectively. Further analyses show that learning from the soft labels of co-raters provides the most robust accuracy even without access to the target rater’s labels. By adding only 50% of a target rater’s annotations, our framework mostly surpasses a model trained with 100% of that rater’s annotations.