Phonetically-Anchored Domain Adaptation for Cross-Lingual Speech Emotion Recognition
Abstract
Cross-lingual speech emotion recognition (SER) modeling has become increasingly prevalent due to its wide range of applications. Previous studies have primarily focused on technical strategies for adapting features, domains, and labels across languages, often overlooking the underlying commonalities between the languages themselves. In this study, we address the language adaptation challenge in cross-lingual scenarios by incorporating vowel-phonetic constraints. Our approach is structured in two main parts. First, we investigate the vowel-phonetic commonalities associated with specific emotions across languages, focusing on common vowels that prove valuable for SER modeling. Second, we use these identified common vowels as anchors to facilitate cross-lingual SER. To demonstrate the effectiveness of our approach, we conduct case studies on American English and Taiwanese Mandarin with two naturalistic emotional speech corpora: the MSP-Podcast and BIIC-Podcast corpora. The approach leverages evidence that certain vowels, including monophthongs and diphthongs, exhibit emotion-specific commonality across languages, and can therefore serve as phonetic anchors to enhance unsupervised cross-lingual SER learning. The proposed model surpasses baseline performance, highlighting the importance of phonetic similarities for effective language adaptation in cross-lingual SER scenarios.
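The abstract does not include code, but the anchoring idea can be sketched. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: an emotion classifier is trained on labeled source-language speech only (the target language stays unlabeled, matching the unsupervised setting), while an auxiliary loss pulls together the pooled embeddings of shared anchor vowels in the two languages. The vowel set, loss weight, and helper names (ANCHOR_VOWELS, pool_anchor_vowels, anchor_alignment_loss) are all illustrative assumptions, and frame-level vowel labels are assumed to come from a forced aligner run beforehand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical set of vowels assumed to show emotion-specific commonality
# across American English and Taiwanese Mandarin.
ANCHOR_VOWELS = ["a", "i", "u"]


class EmotionEncoder(nn.Module):
    """Frame-level acoustic encoder with an utterance-level emotion head."""

    def __init__(self, feat_dim=80, hidden=256, n_emotions=4):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) -> frame embeddings (batch, frames, 2*hidden)
        frame_emb, _ = self.rnn(feats)
        utt_emb = frame_emb.mean(dim=1)          # simple mean pooling over frames
        return frame_emb, self.classifier(utt_emb)


def pool_anchor_vowels(frame_emb, vowel_ids):
    """Collect frame embeddings per anchor vowel.

    vowel_ids: (batch, frames) frame-level vowel labels (-1 = non-anchor frame),
    assumed to come from a forced aligner. Returns {vowel_id: (n, dim) tensor}.
    """
    pooled = {}
    for vid in range(len(ANCHOR_VOWELS)):
        mask = (vowel_ids == vid)
        if mask.any():
            pooled[vid] = frame_emb[mask]        # gather frames of this vowel
    return pooled


def anchor_alignment_loss(src_pooled, tgt_pooled):
    """Pull the per-vowel mean embeddings of the two languages together."""
    losses = []
    for vid, src_e in src_pooled.items():
        if vid in tgt_pooled:
            losses.append(F.mse_loss(src_e.mean(0), tgt_pooled[vid].mean(0)))
    return torch.stack(losses).mean() if losses else torch.tensor(0.0)


# One illustrative training step on random stand-in data.
model = EmotionEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

src_feats = torch.randn(8, 200, 80)              # labeled source-language batch
src_labels = torch.randint(0, 4, (8,))
tgt_feats = torch.randn(8, 200, 80)              # unlabeled target-language batch
src_vowels = torch.randint(-1, 3, (8, 200))      # stand-in aligner output
tgt_vowels = torch.randint(-1, 3, (8, 200))

src_emb, src_logits = model(src_feats)
tgt_emb, _ = model(tgt_feats)                    # no target emotion labels used
loss = F.cross_entropy(src_logits, src_labels) + \
    0.1 * anchor_alignment_loss(pool_anchor_vowels(src_emb, src_vowels),
                                pool_anchor_vowels(tgt_emb, tgt_vowels))
opt.zero_grad()
loss.backward()
opt.step()
```

The design choice being illustrated is that shared vowels give the two languages explicit correspondence points, so the adaptation objective can align matched phonetic regions of the embedding space rather than the two feature distributions as a whole.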
Keywords
Phonetics | Linguistics | Emotion Recognition | Adaptation Models | Acoustics | Affective Computing | Speech Emotion Recognition | Cross-Lingual | Transfer Learning
Authors
Publication Date
2025/01/11
DOI
10.1109/TAFFC.2025.3530105
Publisher
IEEE