Abstract
Cross-lingual speech emotion recognition (SER) has gained prominence due to its wide range of applications. Previous studies have primarily focused on technical strategies for adapting features, domains, and labels across languages, often overlooking the underlying commonalities among the languages themselves. In this study, we address the language adaptation challenge in cross-lingual scenarios by incorporating vowel-phonetic constraints. Our approach comprises two main parts. First, we investigate the vowel-phonetic commonalities associated with specific emotions across languages, focusing on common vowels that prove valuable for SER modeling. Second, we use these identified common vowels as anchors to facilitate cross-lingual SER. To demonstrate the effectiveness of our approach, we conduct case studies on American English and Taiwanese Mandarin using two naturalistic emotional speech corpora: MSP-Podcast and BIIC-Podcast. Our analyses show that certain vowels, both monophthongs and diphthongs, exhibit emotion-specific commonalities across languages and can serve as phonetic anchors that enhance unsupervised cross-lingual SER learning. The proposed model surpasses baseline performance, highlighting the importance of phonetic similarity for effective language adaptation in cross-lingual SER.
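To make the anchoring idea concrete, the sketch below shows one plausible, purely illustrative realization of vowel-based alignment: vowel segments shared by the source and target languages are pulled toward a common point in embedding space through an auxiliary loss. The encoder architecture, feature dimensions, and loss form are our own assumptions for exposition; the abstract does not specify the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical setup: frame-level acoustic features (e.g., 80-dim filterbanks)
# pooled into one vector per vowel segment. All names and dimensions here are
# illustrative assumptions, not the paper's reported architecture.

class VowelAnchorEncoder(nn.Module):
    """Maps pooled vowel-segment features into a shared embedding space."""
    def __init__(self, feat_dim: int = 80, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def anchor_alignment_loss(src_emb, tgt_emb, src_vowel_ids, tgt_vowel_ids):
    """Pull together the mean embeddings of each 'common vowel' class
    observed in both languages (an unsupervised alignment signal)."""
    losses = []
    for v in set(src_vowel_ids.tolist()) & set(tgt_vowel_ids.tolist()):
        src_mean = src_emb[src_vowel_ids == v].mean(dim=0)
        tgt_mean = tgt_emb[tgt_vowel_ids == v].mean(dim=0)
        losses.append(nn.functional.mse_loss(src_mean, tgt_mean))
    # Fall back to a zero loss if no vowel class is shared in the batch.
    return torch.stack(losses).mean() if losses else src_emb.sum() * 0.0

# Toy usage with random features standing in for pooled vowel segments.
encoder = VowelAnchorEncoder()
src_x, tgt_x = torch.randn(32, 80), torch.randn(32, 80)  # e.g., English / Mandarin
src_v = torch.randint(0, 5, (32,))  # ids of shared vowels, e.g., /a/, /i/, /u/
tgt_v = torch.randint(0, 5, (32,))
loss = anchor_alignment_loss(encoder(src_x), encoder(tgt_x), src_v, tgt_v)
loss.backward()
print(f"anchor alignment loss: {loss.item():.4f}")
```

In a full system, a loss of this kind would presumably be combined with an emotion-classification objective on the labeled source language, so that the shared vowel anchors regularize the embedding space while the classifier transfers to the unlabeled target language.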