Generating fMRI-Enriched Acoustic Vectors using a Cross-Modality Adversarial Network for Emotion Recognition
Abstract
Automatic emotion recognition has long been developed by concentrating on modeling human expressive behavior. At the same time, neuroscientific evidence has shown that varied neural responses (i.e., blood oxygen level-dependent (BOLD) signals measured with functional magnetic resonance imaging (fMRI)) are also a function of the type of emotion perceived. While past research has indicated that fusing acoustic features and fMRI improves overall speech emotion recognition performance, obtaining fMRI data is not feasible in real-world applications. In this work, we propose a cross-modality adversarial network that jointly models the bi-directional generative relationship between the acoustic features of speech samples and the fMRI signals of human perceptual responses by leveraging a parallel dataset. We encode the acoustic descriptors of a speech sample with the learned cross-modality adversarial network to generate an fMRI-enriched acoustic vector that is then used in the emotion classifier. The generated fMRI-enriched acoustic vectors are evaluated not only on the parallel dataset but also on an additional dataset without fMRI scanning. Our proposed framework significantly outperforms using acoustic features alone in a four-class emotion recognition task on both datasets, and the use of the cyclic loss in learning the bi-directional mapping is also shown to be crucial in achieving improved recognition rates.
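For reference, the bi-directional mapping with a cyclic loss described above follows the standard CycleGAN-style formulation. The sketch below is a paraphrase in that notation rather than the paper's own equations; the symbol lambda_cyc for the cycle-consistency weight is an assumed name.

```latex
% Sketch of a CycleGAN-style objective for the cross-modality mapping
% (assumed notation): G : X -> Y maps acoustic features to fMRI responses,
% F : Y -> X maps back; lambda_cyc is an illustrative weighting term.
\begin{align}
  \mathcal{L}_{cyc}(G,F) &= \mathbb{E}_{x}\!\left[\lVert F(G(x)) - x \rVert_{1}\right]
                          + \mathbb{E}_{y}\!\left[\lVert G(F(y)) - y \rVert_{1}\right] \\
  \mathcal{L}_{total}    &= \mathcal{L}_{GP}(G, D_{Y}) + \mathcal{L}_{GP}(F, D_{X})
                          + \lambda_{cyc}\,\mathcal{L}_{cyc}(G,F)
\end{align}
```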
Figures
Our cross-modality adversarial framework for emotion recognition can be split into two parts: (upper portion) the first part learns the cross-modality network, i.e., trains G : X → Y and F : Y → X simultaneously with Lcyc (cycle consistency loss) and LGP (adversarial loss); (bottom portion) the second part derives fMRI-enriched acoustic vectors from the learned generator G in order to train the final speech emotion recognizer.
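As a concrete illustration of the two-stage pipeline in the figure, the following is a minimal PyTorch sketch of the described setup. The layer sizes, the mlp helper, the cycle weight, and the use of concatenated acoustic-plus-generated-fMRI features as the "fMRI-enriched" vector are all assumptions made for illustration, not the authors' released implementation; the gradient-penalty and critic-update steps are omitted for brevity.

```python
# Minimal sketch of the two-stage pipeline (assumed dimensions and modules).
# Stage 1: learn G: acoustic -> fMRI and F: fMRI -> acoustic with
#          adversarial + cycle-consistency losses.
# Stage 2: freeze G, form fMRI-enriched acoustic vectors, train the classifier.
import torch
import torch.nn as nn

ACOUSTIC_DIM, FMRI_DIM, NUM_EMOTIONS = 384, 200, 4  # assumed sizes

def mlp(din, dout):
    # Small fully connected block used for all generators/critics in this sketch
    return nn.Sequential(nn.Linear(din, 256), nn.ReLU(), nn.Linear(256, dout))

G, F = mlp(ACOUSTIC_DIM, FMRI_DIM), mlp(FMRI_DIM, ACOUSTIC_DIM)   # generators
D_X, D_Y = mlp(ACOUSTIC_DIM, 1), mlp(FMRI_DIM, 1)                 # critics

def cycle_loss(x, y):
    # L_cyc: reconstruct each modality after a round trip through both mappings
    return (F(G(x)) - x).abs().mean() + (G(F(y)) - y).abs().mean()

def critic_scores(x, y):
    # Adversarial terms (a WGAN-style stand-in for the L_GP term in the figure;
    # the gradient-penalty computation and critic updates are omitted here)
    return D_Y(G(x)).mean() + D_X(F(y)).mean()

# --- Stage 1: cross-modality training on the parallel acoustic/fMRI data ---
opt = torch.optim.Adam(list(G.parameters()) + list(F.parameters()), lr=1e-4)
x = torch.randn(8, ACOUSTIC_DIM)   # placeholder acoustic descriptors
y = torch.randn(8, FMRI_DIM)       # placeholder BOLD responses
for _ in range(10):
    loss = 10.0 * cycle_loss(x, y) - critic_scores(x, y)  # assumed cycle weight
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: fMRI-enriched acoustic vectors feed the emotion classifier ---
classifier = mlp(ACOUSTIC_DIM + FMRI_DIM, NUM_EMOTIONS)
with torch.no_grad():
    enriched = torch.cat([x, G(x)], dim=-1)   # acoustic features + generated fMRI
labels = torch.randint(0, NUM_EMOTIONS, (8,))
clf_loss = nn.CrossEntropyLoss()(classifier(enriched), labels)
```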
Keywords
fMRI | Acoustic Representation | Cross-Modality Adversarial Network | Speech Emotion Recognition
Authors
Jeng-Lin Li, Ya-Tse Wu, Chi-Chun Lee
Publication Date
2018/10/16
Conference
ICMI
ICMI '18: Proceedings of the 20th ACM International Conference on Multimodal Interaction
DOI
10.1145/3242969.3242992
Publisher
ACM