Understanding the neuro-perceptual mechanisms of vocal emotion perception continues to be an important research direction, not only for advancing scientific knowledge but also for inspiring more robust affective computing technologies. The large variability in the manifested fMRI signals across subjects has been shown to arise from individual differences, i.e., inter-subject variability. However, relatively few works have developed modeling techniques for the task of automatic neuro-perceptual decoding that handle such idiosyncrasies. In our work, we propose a novel deep voting fusion neural network architecture that learns an adjusted weight matrix applied at the fusion layer. The framework achieves an unweighted average recall of 53.10% in a four-class vocal emotion state decoding task, i.e., a relative improvement of 8.9% over a two-stage SVM decision-level fusion. Our framework demonstrates its effectiveness in handling individual differences. Further analysis is conducted to study the properties of the learned adjusted weight matrix as a function of emotion classification accuracy.
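The core idea of learning an adjusted weight matrix at the fusion layer can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the branch count, class count, function names, and uniform initialization are all assumptions; in the actual framework the weight matrix would be learned end-to-end by backpropagation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_vote_fusion(branch_probs, W):
    """Fuse per-branch class posteriors with a learnable weight matrix.

    branch_probs: (n_branches, n_classes) class probabilities from each branch.
    W: (n_branches, n_classes) adjusted weight matrix at the fusion layer
       (hypothetical shape; learned via gradient descent in practice).
    Returns a fused class posterior of shape (n_classes,).
    """
    fused = (W * branch_probs).sum(axis=0)  # element-wise weighting, sum over branches
    return softmax(fused)

# Example: 3 branches (e.g., per-subject or per-feature streams), 4 emotion classes.
rng = np.random.default_rng(0)
branch_probs = softmax(rng.normal(size=(3, 4)))
W = np.ones((3, 4)) / 3.0  # uniform initialization before training
fused = weighted_vote_fusion(branch_probs, W)
```

Compared with hard majority voting or a fixed-weight average, letting `W` vary per branch and per class gives the fusion layer the freedom to down-weight branches that are unreliable for particular emotion categories, which is one plausible way such an architecture can absorb inter-subject variability.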