Improving Induced Valence Recognition by Integrating Acoustic Sound Semantics in Movies｜BIIC Lab - NTHU

States and Traits

Multimodal Model

Audience

Improving Induced Valence Recognition by Integrating Acoustic Sound Semantics in Movies

Download PDF IEEE Xplore

Abstract

Every sound event that we receive and produce everyday carry certain emotional cues. Recently, developing computational methods to recognize induced emotion in movies using content-based modeling is gaining more attention. Most of the existing works treat this as a task of multimodal audiovisual modeling; while these approaches are promising, this type of holistic modeling underestimates the impact of various semantically meaningful events designed in movies. In specifics, acoustic sound semantics such as human sounds in movies can significantly direct the viewer’s attention to emotional content in movies. This work explores the use of cross-modal attention mechanism in modeling how the verbal and non-verbal human sound semantics affect induced valence jointly with conventional audio-visual content-based modeling. Our proposed method integrates both self and cross-modal attention into a feature-based transformer (Fea-TF CSMA) where it obtains a 49.74% accuracy on seven class valence classification on the COGNIMUSE movie dataset. Further analysis reveals insights about the effect of human verbal and non-verbal acoustic sound semantics on induced valence.

Figures

Our proposed multi-modal transformer model for induced valence recognition, which includes audio-visual features and acoustic sound semantics with self and cross-modal attention mechanism

Keywords

sound event detection ｜ induced emotion ｜ crossmodal attention ｜ transformer

Authors