Behavior Computing
Spoken Dialogs
Multimodal Model
Deriving Dyad-Level Interaction Representation Using Interlocutors Structural and Expressive Multimodal Behavior Features
Abstract
The overall interaction atmosphere is often the result of a complex interplay between individual interlocutors' behavior expressions and the joint manifestation of dyadic interaction dynamics. Very limited work, if any, has computationally analyzed a human interaction at the dyad level. Hence, in this work, we propose to compute an extensive novel set of features representing multi-faceted aspects of a dyadic interaction. These features are grouped into two broad categories, expressive and structural behavior dynamics, each capturing information about within-speaker behavior manifestation, inter-speaker behavior dynamics, and durational and transitional statistics, providing holistic behavior quantification at the dyad level. We carry out an experiment on recognizing the targeted affective atmosphere using the proposed expressive and structural behavior dynamics features derived from the audio and video modalities. Our experiment shows that including both expressive and structural behavior dynamics is essential to achieving promising recognition accuracy across six different classes (72.5%), with structure-based features improving the recognition rates on the sad and surprise classes. Further analyses reveal important aspects of multimodal behavior dynamics within dyadic interactions that are related to the affective atmosphere of the scene.
Figures
A schematic of our complete multimodal structural and expressive features. The framework involves two steps: 1) preprocessing to assign each audio and video frame to one of three distinct states, and 2) computing structural and expressive features to capture aspects of individual speakers' behavioral manifestation, inter-speaker behavioral dynamics, and durational and transitional statistics.
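The two-step framework in the caption lends itself to a small illustration. Below is a minimal sketch, not the authors' implementation, of how dyad-level structural statistics could be derived once each interlocutor's audio/video stream has been preprocessed into per-frame labels over three discrete states; the specific statistics (run-length summaries, transition matrices, a joint-state histogram) and all function and variable names are illustrative assumptions.

```python
# Minimal sketch of dyad-level structural statistics from per-frame state labels.
# Assumes three discrete behavior states (0/1/2) per interlocutor; names are illustrative.
import itertools
import numpy as np

N_STATES = 3  # three distinct behavior states, per the framework description

def durational_stats(states):
    """Mean and std of contiguous run lengths (in frames), per state."""
    runs = {s: [] for s in range(N_STATES)}
    for state, group in itertools.groupby(states):
        runs[int(state)].append(sum(1 for _ in group))
    return {s: (float(np.mean(r)) if r else 0.0,
                float(np.std(r)) if r else 0.0)
            for s, r in runs.items()}

def transition_matrix(states):
    """Row-normalized counts of state-to-state transitions within one speaker."""
    counts = np.zeros((N_STATES, N_STATES))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

def joint_state_histogram(states_a, states_b):
    """Inter-speaker co-occurrence of states, normalized over frames."""
    hist = np.zeros((N_STATES, N_STATES))
    for a, b in zip(states_a, states_b):
        hist[a, b] += 1
    return hist / max(len(states_a), 1)

# Example with synthetic per-frame state sequences for two interlocutors.
rng = np.random.default_rng(0)
spk1 = rng.integers(0, N_STATES, size=500)
spk2 = rng.integers(0, N_STATES, size=500)
dyad_features = np.concatenate([
    np.array([v for pair in durational_stats(spk1).values() for v in pair]),
    np.array([v for pair in durational_stats(spk2).values() for v in pair]),
    transition_matrix(spk1).ravel(),
    transition_matrix(spk2).ravel(),
    joint_state_histogram(spk1, spk2).ravel(),
])
print(dyad_features.shape)  # one fixed-length dyad-level descriptor
```

In practice, such statistics would presumably be computed per modality (audio and video) and combined with the expressive behavior dynamics features before classification, as the abstract suggests.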
Keywords
affect recognition | face-to-face interaction | multimodal behaviors | dyad-level affect
Authors
Publication Date
2017/08/20
Conference
Interspeech 2017
DOI
10.21437/Interspeech.2017-569
Publisher
ISCA