An Audio-Saliency Masking Transformer for Audio Emotion Classification in Movies
Abstract
The human process from perception to affective response is gated by a bottom-up saliency mechanism at the sensory level. Specifically, auditory saliency emphasizes the audio segments that must be attended to in order to cognitively appraise and experience emotion. In this work, inspired by this mechanism, we propose an end-to-end feature masking network for audio emotion recognition in movies. Our proposed Audio-Saliency Masking Transformer (ASMT) adjusts feature embeddings using two learnable masks: one cross-references an auditory saliency map, and the other is derived through self-reference. By jointly training the front-end mask gating and the back-end transformer emotion classifier, we achieve three-class unweighted average recalls (UARs) of 46.26%, 49.03%, 53.49%, and 53.51% on experienced arousal, experienced valence, intended arousal, and intended valence, respectively. We further analyze which acoustic feature categories our saliency mask attends to most.
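To make the gating mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract: frame-level acoustic embeddings are modulated by two sigmoid masks, one projected from a precomputed auditory saliency map and one derived from the features themselves, before a transformer encoder produces a three-class prediction. All module names, dimensions, and layer counts are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a two-mask gated transformer classifier, assuming frame-level
# acoustic features and a per-frame auditory saliency map as inputs.
# Names (MaskedGating, ASMTSketch) and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class MaskedGating(nn.Module):
    """Gates features with a learnable sigmoid mask from a reference signal."""

    def __init__(self, feat_dim: int, ref_dim: int):
        super().__init__()
        self.proj = nn.Linear(ref_dim, feat_dim)

    def forward(self, feats: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim); ref: (batch, time, ref_dim)
        mask = torch.sigmoid(self.proj(ref))  # per-frame gate in (0, 1)
        return feats * mask


class ASMTSketch(nn.Module):
    def __init__(self, feat_dim: int = 128, saliency_dim: int = 1,
                 n_classes: int = 3):
        super().__init__()
        # Mask 1: cross-references an external auditory saliency map.
        self.saliency_gate = MaskedGating(feat_dim, saliency_dim)
        # Mask 2: self-reference, computed from the features themselves.
        self.self_gate = MaskedGating(feat_dim, feat_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, feats: torch.Tensor, saliency: torch.Tensor):
        # feats: (batch, time, feat_dim); saliency: (batch, time, 1)
        x = self.saliency_gate(feats, saliency)  # front-end mask gating
        x = self.self_gate(x, x)
        x = self.encoder(x)                      # back-end transformer
        return self.head(x.mean(dim=1))          # pool over time, classify


# Example: 2 clips, 100 frames, 128-d features, plus a saliency map.
model = ASMTSketch()
logits = model(torch.randn(2, 100, 128), torch.rand(2, 100, 1))
print(logits.shape)  # torch.Size([2, 3])
```

Because both gates sit in front of the encoder and are trained jointly with the classification loss, the network can learn to attenuate non-salient frames end to end; how the saliency map itself is computed is a separate design choice not covered by this sketch.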
Figures
The proposed framework of ASMT.
Keywords
emotion recognition | auditory saliency | affective multimedia | transformer
Authors
Publication Date
2022/05/07
Conference
IEEE ICASSP 2022
DOI
10.1109/ICASSP43922.2022.9746403
Publisher
IEEE