Automatic emotion recognition based on modeling expressive behaviors is becoming crucial to the design of next-generation human-machine interfaces. In addition, with the availability of functional magnetic resonance imaging (fMRI), researchers have conducted studies aimed at a quantitative understanding of the mechanisms of vocal emotion perception. In this work, our aim is twofold: 1) to investigate whether neural responses can be used to automatically decode the emotion labels of vocal stimuli, and 2) to combine acoustic and fMRI features to improve speech emotion recognition accuracy. We introduce a novel lobe-dependent convolutional neural network (LD-CNN) framework to better model perceivers' neural responses to vocal emotion. Furthermore, by fusing LD-CNN with acoustic features, we achieve an overall accuracy of 63.17% in a four-class emotion recognition task (9.89% and 14.42% relative improvement over the acoustic-only and the fMRI-only features, respectively). Our analysis further shows that the temporal lobe carries the most information for decoding emotion labels, and that the fMRI and acoustic information are complementary: neural responses and acoustic features are better at discriminating along the valence and activation dimensions, respectively.