Generating fMRI-Enriched Acoustic Vectors using a Cross-Modality Adversarial Network for Emotion Recognition｜BIIC Lab - NTHU




Abstract

Automatic emotion recognition has long focused on modeling human expressive behavior. At the same time, neuroscientific evidence has shown that neural responses (i.e., blood oxygen level-dependent (BOLD) signals measured with functional magnetic resonance imaging (fMRI)) also vary as a function of the type of emotion perceived. While past research has indicated that fusing acoustic features and fMRI improves overall speech emotion recognition performance, obtaining fMRI data is not feasible in real-world applications. In this work, we propose a cross-modality adversarial network that jointly models the bi-directional generative relationship between the acoustic features of speech samples and the fMRI signals of human perceptual responses by leveraging a parallel dataset. We encode the acoustic descriptors of a speech sample using the learned cross-modality adversarial network to generate fMRI-enriched acoustic vectors, which are then used in the emotion classifier. The generated fMRI-enriched acoustic vectors are evaluated not only on the parallel dataset but also on an additional dataset without fMRI scanning. Our proposed framework significantly outperforms the use of acoustic features alone in a four-class emotion recognition task on both datasets, and the use of a cyclic loss in learning the bi-directional mapping is also shown to be crucial to achieving improved recognition rates.

Figures

Our cross-modality adversarial framework for emotion recognition can be split into two parts: (upper portion) the first part learns a cross-modality network, i.e., trains G : X → Y and F : Y → X simultaneously with L_cyc (cycle consistency loss) and L_GP (adversarial loss); (bottom portion) the second part derives fMRI-enriched acoustic vectors from the learned generator G in order to train the final speech emotion recognizer.
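To make the cycle consistency term in the caption concrete, the following is a minimal sketch rather than the authors' implementation. It assumes toy linear maps in place of the learned generators G and F, hypothetical feature dimensions (8 acoustic, 5 fMRI), and an L1 form of the cycle loss as commonly used in cycle-consistent adversarial training; the adversarial term L_GP and the discriminators are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration:
# X = acoustic feature space, Y = fMRI feature space.
DIM_X, DIM_Y = 8, 5

# Toy linear generators standing in for the learned networks (an assumption).
W_g = rng.normal(scale=0.1, size=(DIM_X, DIM_Y))  # G : X -> Y
W_f = rng.normal(scale=0.1, size=(DIM_Y, DIM_X))  # F : Y -> X


def G(x):
    """Map acoustic vectors to the fMRI space."""
    return x @ W_g


def F(y):
    """Map fMRI vectors back to the acoustic space."""
    return y @ W_f


def cycle_consistency_loss(x, y):
    """Assumed L1 cycle loss: x -> G -> F should return to x, y -> F -> G to y."""
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return forward + backward


# A small batch of paired (acoustic, fMRI) samples drawn at random.
x_batch = rng.normal(size=(4, DIM_X))
y_batch = rng.normal(size=(4, DIM_Y))
loss = cycle_consistency_loss(x_batch, y_batch)
print(f"cycle loss: {loss:.4f}")
```

In training, a term of this kind would be added to the adversarial losses for both directions, so that G and F are pushed toward being approximate inverses of each other on the paired data.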

Keywords

fMRI ｜ Acoustic Representation ｜ Cross-Modality Adversarial Network ｜ Speech Emotion Recognition

Authors