Abstract
Modeling multimodal behavior streams to automatically identify the emotion states of an individual has progressed extensively, especially with the advancement of deep learning algorithms. Because emotion is an abstract internal state that manifests with substantial individual differences in behavioral expressivity, developing personalized recognition frameworks is a critical next step toward improving an algorithm's modeling capacity. In this work, we propose to integrate the target speaker's personality embedding into the learning of a multimodal (speech and language) attention-based network architecture to improve recognition performance. Specifically, we propose a Personal Attribute-Aware Attention Network (PAaAN) that learns its multimodal attention weights jointly with the target speaker's retrievable acoustic embedding of personality. Our acoustic domain-adapted personality retrieval strategy mitigates the common lack of personality scores in currently available emotion databases, and the proposed PAaAN then learns its attention weights by jointly considering an individual target speaker's personality profile with his or her acoustic and lexical modalities. We achieve an unweighted accuracy of 70% on the IEMOCAP four-class multimodal emotion recognition task. Further analysis shows how integrating personality affects the variation of attention weights across the acoustic and lexical behavior modalities for each speaker in the IEMOCAP database.