Abstract
Human affective responses to perceived stimuli are gated by a bottom-up saliency mechanism at the sensory level. Specifically, auditory saliency emphasizes the audio segments that must be attended to in order to cognitively appraise and experience emotion. Inspired by this mechanism, we propose an end-to-end feature masking network for audio emotion recognition in movies. Our proposed AudioSaliency Masking Transformer (ASTM) adjusts feature embeddings using two learnable masks: one cross-refers to an auditory saliency map, while the other is derived through self-reference. By jointly training the front-end mask gating and the back-end transformer emotion classifier, we achieve three-class UARs of 46.26%, 49.03%, 53.49%, and 53.51% on experienced arousal, experienced valence, intended arousal, and intended valence, respectively. We further analyze which acoustic feature categories our saliency mask attends to most.
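The following is a minimal sketch, not the authors' implementation, of the dual-mask gating idea summarized above. It assumes PyTorch, a frame-level acoustic feature tensor of shape (batch, time, feat_dim), and a frame-level auditory saliency map of shape (batch, time); the module names, dimensions, and pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualMaskGatingSketch(nn.Module):
    """Illustrative dual-mask gating followed by a transformer classifier (hypothetical)."""

    def __init__(self, feat_dim: int, d_model: int = 128, num_classes: int = 3):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        # Cross-referential mask: gate values driven by the auditory saliency map.
        self.cross_mask = nn.Sequential(nn.Linear(1, d_model), nn.Sigmoid())
        # Self-referential mask: gate values driven by the feature embedding itself.
        self.self_mask = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)  # e.g., three-class arousal or valence

    def forward(self, feats: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        x = self.embed(feats)                             # (B, T, d_model)
        x = x * self.cross_mask(saliency.unsqueeze(-1))   # gate embeddings by saliency
        x = x * self.self_mask(x)                         # gate embeddings by self-reference
        return self.head(self.encoder(x).mean(dim=1))     # pooled per-clip logits
```

Because the masks and the classifier are part of one differentiable graph, training the classification loss end-to-end jointly optimizes the front-end gating and the back-end transformer, mirroring the joint training described in the abstract.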