Abstract
Arousal is essential to understanding human behavior and decision-making. In this work, we present a multimodal arousal rating framework that incorporates a minimal set of vocal and non-verbal behavior descriptors. The rating framework and fusion techniques are unsupervised in nature, ensuring that they are readily applicable and interpretable. Our proposed multimodal framework improves correlation with human judgment from 0.66 (vocal-only) to 0.68 (multimodal); analysis shows that a supervised fusion framework does not improve correlation. Lastly, an interesting piece of empirical evidence demonstrates that the signal-based quantification of arousal achieves higher agreement with each individual rater than the agreement among the raters themselves. This further supports the view that machine-based rating is a viable way of measuring humans' subjective internal states through objective observation of behavioral features.