Previous studies have shown that an individual's subjective emotional evaluation involves cognitive processing that often diverges from directly measured affective physiological signals, which introduces bias into emotion labels. In particular, bodily physiological signals have been shown to relate more closely to the intended emotion elicitation type of the stimulus (Int) and to correlate less with an individual's subjective emotional feelings (Sb). Hence, in this work, we propose incorporating the intended emotion elicitation status of the original stimuli (Int) as an explicit regularization to achieve a more robust physiology-based subjective emotion recognition system. Specifically, we propose a novel conditional tensor fusion network in which the stimulus's emotion type (Int) is learned first; this learned intended annotation then acts as an explicit conditional regularization for the final subjective emotion labeling. Our experiments indicate that this additional regularization improves overall emotion recognition on self-reported labels using physiology. We achieve an unweighted recall of 69.8% using ECG-EDA multimodal fusion, a relative improvement of 6.3% over the vanilla DNN method. Further feature analysis shows that several descriptors from ECG signals are indicative of the differences between these two emotion annotation schemes.
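To make the conditioning idea concrete, here is a minimal forward-pass sketch in NumPy. The abstract does not specify the layer design, so everything below is an assumption for illustration: ECG and EDA embeddings are fused by an outer product with an appended constant 1 (in the style of tensor fusion networks), an Int head predicts the intended emotion from the fused vector, and the Sb head is conditioned by concatenating that predicted Int distribution to the fused vector. The training objective is assumed to be the Sb cross-entropy plus a weighted Int cross-entropy, with a hypothetical weight `lam`. The class name `ConditionalFusionSketch` and all dimensions are hypothetical; weights are random and no training is performed.

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def tensor_fusion(h_ecg, h_eda):
    # Append a constant 1 to each modality so unimodal terms survive in the outer product.
    ones = np.ones((h_ecg.shape[0], 1))
    a = np.concatenate([h_ecg, ones], axis=1)  # (B, d_ecg + 1)
    b = np.concatenate([h_eda, ones], axis=1)  # (B, d_eda + 1)
    # Per-sample outer product, flattened to (B, (d_ecg + 1) * (d_eda + 1)).
    return np.einsum("bi,bj->bij", a, b).reshape(a.shape[0], -1)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true class.
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

class ConditionalFusionSketch:
    """Hypothetical two-head model: Int head first, Sb head conditioned on its output."""

    def __init__(self, d_ecg, d_eda, n_int, n_sb, seed=0):
        rng = np.random.default_rng(seed)
        d_fused = (d_ecg + 1) * (d_eda + 1)
        self.W_int = rng.normal(scale=0.1, size=(d_fused, n_int))
        self.W_sb = rng.normal(scale=0.1, size=(d_fused + n_int, n_sb))

    def forward(self, h_ecg, h_eda):
        z = tensor_fusion(h_ecg, h_eda)
        p_int = softmax(z @ self.W_int)  # predicted intended-emotion distribution
        # Condition the subjective head on the predicted Int distribution.
        p_sb = softmax(np.concatenate([z, p_int], axis=1) @ self.W_sb)
        return p_int, p_sb

    def loss(self, h_ecg, h_eda, y_int, y_sb, lam=0.5):
        # Assumed objective: Sb loss plus Int loss acting as an explicit regularizer.
        p_int, p_sb = self.forward(h_ecg, h_eda)
        return cross_entropy(p_sb, y_sb) + lam * cross_entropy(p_int, y_int)

# Usage on random stand-in embeddings (no real ECG/EDA data).
rng = np.random.default_rng(1)
model = ConditionalFusionSketch(d_ecg=4, d_eda=3, n_int=2, n_sb=2)
h_ecg = rng.normal(size=(8, 4))
h_eda = rng.normal(size=(8, 3))
y_int = rng.integers(0, 2, size=8)
y_sb = rng.integers(0, 2, size=8)
p_int, p_sb = model.forward(h_ecg, h_eda)
print(p_int.shape, p_sb.shape)  # (8, 2) (8, 2)
print(model.loss(h_ecg, h_eda, y_int, y_sb))
```

Concatenating the predicted Int distribution is only one way to condition the Sb head; gating or a second outer product with the Int probabilities would be equally plausible readings of "conditional regularization".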