Most speech emotion recognition studies focus on recognizing a fixed set of emotion classes. In real-world applications, however, the task definition may change when attention shifts to a previously unseen class. This cross-task setting has not been addressed before: lengthy data re-collection and model retraining are impractical, and traditional adaptation and transfer learning approaches do not apply. This study proposes an enroll-to-verify framework that avoids model retraining and rapidly performs prediction on a new task using only a handful of enrolled samples. Specifically, we pretrain a multiclass network with a negative angular margin prototypical loss and use it as an emotion encoder. We then enroll a few samples for each emotion class in the new task definition and perform recognition simply by comparing distances between the encoded embeddings. In experiments on the IEMOCAP dataset, given a four-class pretrained emotion encoder, we achieved a 71.9% unweighted average recall on the frustration (unseen) recognition task. On the MELD dataset, where the unseen class was surprise, fear, or disgust, the results revealed that enrolling only 20 samples without retraining was comparable to supervised training on the complete dataset. Further analyses demonstrate the working mechanism of the proposed enroll-to-verify approach.
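The enroll-then-verify step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a pretrained encoder has already mapped each utterance to an embedding vector, averages the enrolled embeddings per class into prototypes, and assigns a test embedding to the class whose prototype is closest in cosine similarity. All names and the toy 2-D embeddings here are hypothetical.

```python
import numpy as np

def make_prototypes(embeddings, labels):
    """Average the enrolled embeddings of each class into a single prototype."""
    classes = sorted(set(labels))
    return {c: np.mean([e for e, l in zip(embeddings, labels) if l == c], axis=0)
            for c in classes}

def verify(embedding, prototypes):
    """Return the class whose prototype is most cosine-similar to the embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(prototypes, key=lambda c: cos(embedding, prototypes[c]))

# Toy example: pretend the encoder produced these 2-D embeddings.
enrolled = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
            np.array([0.0, 1.0]), np.array([0.1, 0.9])]
labels = ["frustration", "frustration", "neutral", "neutral"]
protos = make_prototypes(enrolled, labels)
print(verify(np.array([0.8, 0.2]), protos))  # closest to the frustration prototype
```

Because only the prototypes change when the task definition changes, adding an unseen class requires enrolling a few samples rather than retraining the encoder.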