Jointly Learning from Unimodal and Multimodal-Rated Labels in Audio-Visual Emotion Recognition｜BIIC Lab - NTHU

Abstract

Audio-visual emotion recognition (AVER) has been an important research area in human-computer interaction (HCI). Traditionally, audio-visual emotional datasets and the corresponding models derive their ground truths from annotations that raters provide after watching the audio-visual stimuli. This conventional method, however, neglects the nuances of human emotion perception, which varies depending on whether annotations are made under unimodal or multimodal stimuli. This study investigates whether AVER system performance can be improved by integrating annotations collected under different stimulus conditions, which reflect different perceptual evaluations. We propose a two-stage training method that trains models with labels elicited by audio-only, face-only, and audio-visual stimuli. Our approach assigns each set of labels to the layers of the model according to which modality is present there, so that annotation is modeled at both the unimodal and multimodal levels and the full scope of emotion perception is captured. We conduct experiments and evaluate the models on the CREMA-D emotion database. The proposed methods achieve the best performance in macro-F1 and weighted-F1 scores. Additionally, we measure model calibration, performance bias, and fairness metrics of the AVER systems with respect to age, gender, and race.
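The abstract reports results as macro-F1 and weighted-F1 scores. As a minimal sketch of how these two averages differ, the following pure-Python code computes both from per-class F1. The function names and the toy labels are assumptions made for illustration only; they are not taken from the paper's code or from CREMA-D results.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return a dict mapping each class label to its F1 score."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in labels) / len(y_true)


# Toy, imbalanced example (hypothetical labels, not real data).
y_true = ["anger", "anger", "anger", "anger", "happy", "sad"]
y_pred = ["anger", "anger", "anger", "anger", "happy", "anger"]

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this toy case the single "sad" sample is missed entirely, so macro-F1 (17/27) drops well below weighted-F1 (41/54). This is why reporting both is informative when emotion classes are imbalanced.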

The paper was also presented at ICASSP 2025 on April 11, 2025.

Publication Date

2025/01/15

Conference

ICASSP 2025

DOI

10.1109/OJSP.2025.3530274
