Abstract
Bone et al. recently proposed an unsupervised, signal-derived vocal arousal score (VC-AS) based on the fusion of three intuitive acoustic features, i.e., pitch, intensity, and HF500, and showed that it robustly quantifies human perceptual ratings of arousal across multiple corpora. Because the system is readily applicable, this quantification scheme could foreseeably be used in multiple fields of behavioral science as an objective measure of affect. In this work, we investigate in detail the relationship of this signal-derived measure to both intended arousal expression (i.e., the production aspect) and perceived arousal rating (i.e., the perception aspect). On the perception side, our results on three databases (EMA, VAM, and IEMOCAP) indicate that VC-AS agrees with mean perception at least as well as an average individual rater does. Regarding production, we observe that intended arousal correlates more with VC-AS than with mean perception (EMA and IEMOCAP), and that VC-AS correlates more with intended arousal than with perceived arousal (EMA); these findings are surprising given that the framework is motivated by extensive affective perception studies, although there is physiological backing. Implications for the use of VC-AS in novel scientific study (e.g., to mitigate subjectivity) are further discussed.