RESEARCH

Improving Induced Valence Recognition by Integrating Acoustic Sound Semantics in Movies
Abstract
Every sound event that we receive and produce every day carries certain emotional cues. Recently, developing computational methods to recognize induced emotion in movies using content-based modeling has been gaining attention. Most existing works treat this as a multimodal audio-visual modeling task; while these approaches are promising, such holistic modeling underestimates the impact of the various semantically meaningful events designed into movies. In particular, acoustic sound semantics, such as human sounds, can significantly direct the viewer’s attention to emotional content in movies. This work explores the use of a cross-modal attention mechanism to model how verbal and non-verbal human sound semantics affect induced valence, jointly with conventional audio-visual content-based modeling. Our proposed method integrates both self- and cross-modal attention into a feature-based transformer (Fea-TF CSMA), which achieves 49.74% accuracy on seven-class valence classification on the COGNIMUSE movie dataset. Further analysis reveals insights about the effect of human verbal and non-verbal acoustic sound semantics on induced valence.
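As a rough illustration of the self- and cross-modal attention idea described above, the sketch below shows one way audio-visual features could attend to sound-semantic embeddings before a seven-class valence head. This is a minimal PyTorch-style sketch under assumed feature dimensions, layer counts, and fusion strategy; the class and argument names (e.g. FeaTFCSMA, av_dim, sem_dim) are illustrative and not taken from the paper.

```python
# Minimal sketch: self-attention within the audio-visual stream, followed by
# cross-modal attention onto sound-semantic features, then a 7-class valence head.
# All sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One transformer block where a query stream attends to a context stream."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, query_stream, context_stream):
        # Self-attention within the query stream (e.g. audio-visual features).
        x = self.norm1(query_stream + self.self_attn(query_stream, query_stream, query_stream)[0])
        # Cross-modal attention: the query stream attends to the context stream
        # (e.g. verbal / non-verbal human sound-semantic embeddings).
        x = self.norm2(x + self.cross_attn(x, context_stream, context_stream)[0])
        return self.norm3(x + self.ffn(x))

class FeaTFCSMA(nn.Module):
    """Hypothetical feature-based transformer with self/cross-modal attention."""
    def __init__(self, av_dim=512, sem_dim=128, dim=256, num_classes=7):
        super().__init__()
        self.av_proj = nn.Linear(av_dim, dim)     # project audio-visual features
        self.sem_proj = nn.Linear(sem_dim, dim)   # project sound-semantic features
        self.block = CrossModalBlock(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, av_feats, sem_feats):
        av = self.av_proj(av_feats)               # (batch, T_av, dim)
        sem = self.sem_proj(sem_feats)            # (batch, T_sem, dim)
        fused = self.block(av, sem)               # AV stream attends to sound semantics
        return self.classifier(fused.mean(dim=1)) # pooled logits over 7 valence classes

# Usage sketch with dummy tensors (batch of 2, 20 AV frames, 10 sound-event tokens):
model = FeaTFCSMA()
logits = model(torch.randn(2, 20, 512), torch.randn(2, 10, 128))
```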
Figures
Our proposed multi-modal transformer model for induced valence recognition, which combines audio-visual features and acoustic sound semantics with self- and cross-modal attention mechanisms
Keywords
sound event detection | induced emotion | crossmodal attention | transformer
Authors
Publication Date
2022/08/29
Conference
European Signal Processing Conference (EUSIPCO) 2022
DOI
10.1093/jos/17.4.335
Publisher
European Association for Signal Processing (EURASIP)