Abstract
Human judgment has been shown to be thin-sliced in nature, i.e., accurate perception can often be achieved with only a short exposure to expressive behaviors. In this work, we develop a mutual information-based framework to select the most emotion-rich 20% of local multimodal behavior segments within each 3-minute affective dyadic interaction in the USC CreativeIT database. We obtain prediction accuracies of 0.597, 0.728, and 0.772 (measured by Spearman correlation) for an actor's global (session-level) emotion attributes (activation, dominance, and valence) using Fisher vector encoding and support vector regression built on these 20% of multimodal emotion-rich behavior segments. Our framework achieves significantly higher accuracy than using the interaction in its entirety or a variety of other data-selection baseline methods. Furthermore, our analysis indicates that the highest prediction accuracy can be obtained using only 20%-30% of the data within each session, providing additional evidence for the thin-slice nature of affect perception.
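As a minimal sketch of the regression and evaluation stage summarized above, the snippet below assumes session-level features (e.g., Fisher vector encodings of the selected emotion-rich segments) have already been computed; the feature dimensions, rating scale, and SVR parameters are illustrative assumptions, not the exact experimental setup.

```python
# Hedged sketch: leave-one-session-out SVR prediction of a global emotion
# attribute, scored with Spearman correlation (as reported in the abstract).
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(45, 128))    # hypothetical: one Fisher-vector-style feature per session
y = rng.uniform(1.0, 5.0, size=45)  # hypothetical: session-level activation ratings

preds = np.zeros_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    # Linear-kernel SVR trained on all other sessions; C is an assumed setting.
    model = SVR(kernel="linear", C=1.0)
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

# Prediction accuracy is measured as the Spearman correlation between
# predicted and annotated global emotion attributes.
rho, _ = spearmanr(preds, y)
print(f"Spearman correlation: {rho:.3f}")
```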