Modeling Perceivers' Neural-Responses Using Lobe-Dependent Convolutional Neural Network to Improve Speech Emotion Recognition
Abstract
Developing automatic emotion recognition by modeling expressive behaviors is becoming crucial in enabling the next-generation design of human-machine interfaces. With the availability of functional magnetic resonance imaging (fMRI), researchers have also conducted studies toward a quantitative understanding of the vocal emotion perception mechanism. In this work, our aim is two-fold: 1) investigating whether neural responses can be used to automatically decode the emotion labels of vocal stimuli, and 2) combining acoustic and fMRI features to improve speech emotion recognition accuracy. We introduce a novel framework of lobe-dependent convolutional neural network (LD-CNN) to provide better modeling of perceivers' neural responses to vocal emotion. Furthermore, by fusing LD-CNN with acoustic features, we demonstrate an overall accuracy of 63.17% in a four-class emotion recognition task (9.89% and 14.42% relative improvements over the acoustic-only and fMRI-only features). Our analysis further shows that the temporal lobe possesses the most information for decoding emotion labels; the fMRI and acoustic information are complementary to each other, where neural responses and acoustic features are better at discriminating along the valence and activation dimensions, respectively.
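The following is a minimal, hypothetical sketch of the lobe-dependent fusion idea described above: one small CNN per brain lobe operating on that lobe's voxels, whose learned representations are concatenated with an utterance-level acoustic feature vector (e.g., a Fisher-vector encoding) before a four-class emotion classifier. The lobe list, layer sizes, and input shapes are illustrative assumptions, not the architecture reported in the paper.

```python
# Hypothetical sketch of a lobe-dependent CNN (LD-CNN) fused with acoustic
# features; all shapes and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

LOBES = ["frontal", "temporal", "parietal", "occipital"]  # assumed lobe partition


class LobeBranch(nn.Module):
    """A small 3-D CNN applied to the voxels of a single lobe."""

    def __init__(self, out_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # pool each channel to a single value
            nn.Flatten(),
            nn.Linear(8, out_dim),
            nn.ReLU(),
        )

    def forward(self, x):  # x: (batch, 1, D, H, W) lobe volume
        return self.net(x)


class LDCNNFusion(nn.Module):
    """Fuse lobe-dependent fMRI embeddings with acoustic features."""

    def __init__(self, acoustic_dim: int = 128, n_classes: int = 4):
        super().__init__()
        self.branches = nn.ModuleDict({lobe: LobeBranch() for lobe in LOBES})
        fused_dim = 32 * len(LOBES) + acoustic_dim
        self.classifier = nn.Linear(fused_dim, n_classes)

    def forward(self, lobe_volumes: dict, acoustic_feats: torch.Tensor):
        # One embedding per lobe, concatenated with the acoustic feature vector.
        lobe_embs = [self.branches[lobe](lobe_volumes[lobe]) for lobe in LOBES]
        fused = torch.cat(lobe_embs + [acoustic_feats], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    # Toy usage: random tensors stand in for real fMRI volumes and
    # Fisher-vector acoustic features (placeholder shapes).
    model = LDCNNFusion()
    volumes = {lobe: torch.randn(2, 1, 16, 16, 16) for lobe in LOBES}
    audio = torch.randn(2, 128)
    logits = model(volumes, audio)  # (2, 4) class scores
    print(logits.shape)
```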
Figures
A schematic of multimodal emotion recognition from audio (Fisher-vector feature representation) and fMRI (lobe-dependent convolutional neural network-derived feature representation) data
Keywords
speech emotion recognition | convolutional neural network (CNN) | affective computing | fMRI
Authors
Ya-Tse Wu, Hsuan-Yu Chen, Chi-Chun Lee
Publication Date
2017/08/20
Conference
Interspeech 2017
DOI
10.21437/Interspeech.2017-562
Publisher
ISCA