Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network｜BIIC Lab - NTHU



Abstract

The integration of multimodal emotion sensing modules into human-centered technologies is growing rapidly. Despite recent advances in deep architectures that have improved recognition performance, the inability to handle individual differences in expressive cues remains a major hurdle for real-world applications. In this work, we propose a Speaker-aligned Graph Memory Network (SaGMN) that leverages speaker embeddings learned from a large speaker verification network to characterize such individual differences across speakers. Specifically, the gated memory block is jointly optimized with a speaker graph encoder, which aligns samples with similar vocal characteristics while effectively enlarging the discrimination across emotion classes. We evaluate our multimodal emotion recognition network on the CMU-MOSEI database and achieve state-of-the-art performance of 65.1% unweighted average recall (UAR) and 74.7% F1 score. Further visualization experiments demonstrate the effect of speaker space alignment with the use of graph memory blocks.

Figures

Our framework, SaGMN, has a multimodal backbone network with a speaker-aligned memory block. The similarity of speaker embeddings extracted from a pre-trained speaker recognition network is used to derive the adjacency matrix for the graph convolutional layer in the memory block. The resulting memory vector and the multimodal attended vectors are used for the final emotion recognition.
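The idea of deriving a graph from speaker-embedding similarity can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of cosine similarity, the `threshold` parameter, the symmetric normalization, the single ReLU layer, and all function names (`cosine_similarity`, `speaker_adjacency`, `gcn_layer`) are assumptions made for illustration, and the random vectors stand in for real speaker embeddings and memory features.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)

def speaker_adjacency(speaker_emb, threshold=0.0):
    """Adjacency matrix from pairwise speaker similarity (assumed scheme)."""
    n = len(speaker_emb)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                adj[i][j] = 1.0  # self-loop keeps each sample's own features
            else:
                sim = cosine_similarity(speaker_emb[i], speaker_emb[j])
                # Drop weak links; clamp at zero so node degrees stay positive.
                adj[i][j] = max(sim, 0.0) if sim > threshold else 0.0
    # Symmetric normalization D^{-1/2} A D^{-1/2}, common for graph convolutions.
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj_norm, features, weight):
    """One graph convolution step: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(adj_norm, features), weight)
    return [[max(h, 0.0) for h in row] for row in hidden]

# Toy data: 4 samples with 8-dim speaker embeddings and 6-dim features.
random.seed(0)
speaker_emb = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
features = [[random.gauss(0, 1) for _ in range(6)] for _ in range(4)]
weight = [[random.gauss(0, 1) for _ in range(3)] for _ in range(6)]

adj_norm = speaker_adjacency(speaker_emb)
memory = gcn_layer(adj_norm, features, weight)
print(len(memory), len(memory[0]))
```

In this sketch, samples from acoustically similar speakers receive larger edge weights, so the graph convolution mixes their features more strongly. How the resulting memory vectors are gated and combined with the attended multimodal vectors is described in the paper and is not reproduced here.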

Keywords

emotion recognition ｜ speaker embedding ｜ graph ｜ memory network

Authors

Publication Date

2020/10/25

Conference

Interspeech 2020

DOI

10.21437/Interspeech.2020-1688

Publisher

RESEARCH
