Phonetic Anchor-Based Transfer Learning To Facilitate Unsupervised Cross-Lingual Speech Emotion Recognition
Abstract
Modeling cross-lingual Speech Emotion Recognition (SER) has become more prevalent because of its diverse applications. Existing studies have mostly focused on technical approaches that adapt features, domains, or labels across languages, without considering in detail the similarities between the languages. This study focuses on domain adaptation in cross-lingual scenarios using phonetic constraints. The work is framed in a twofold manner. First, we analyze emotion-specific phonetic commonality across languages by identifying common vowels that are useful for SER modeling. Second, we leverage these common vowels as an anchoring mechanism to facilitate cross-lingual SER. We consider American English and Taiwanese Mandarin as a case study to demonstrate the potential of our approach. This work uses two in-the-wild natural emotional speech corpora: MSP-Podcast (American English) and BIIC-Podcast (Taiwanese Mandarin). The proposed unsupervised cross-lingual SER model using these phonetic anchors outperforms the baselines, achieving an unweighted average recall (UAR) of 58.64%.
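The anchoring idea described above can be illustrated with a minimal sketch. The paper's exact loss is not given on this page, so the following assumes an InfoNCE-style contrastive objective: target-language embeddings are pulled toward source-language anchor embeddings (common-vowel segments) that share the same emotion label, and pushed away from anchors of other emotions. All function and variable names here are hypothetical, not from the paper.

```python
import numpy as np

def anchor_contrastive_loss(z, anchors, labels, anchor_labels, tau=0.1):
    """Hypothetical InfoNCE-style anchoring loss (illustrative only).

    z             : (N, d) target-language utterance embeddings
    anchors       : (M, d) source-language common-vowel anchor embeddings
    labels        : (N,)   emotion labels for z
    anchor_labels : (M,)   emotion labels for anchors
    tau           : temperature for the softmax over anchors
    """
    # Cosine similarity between every target embedding and every anchor.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    sim = z @ a.T / tau                     # (N, M) scaled similarities
    # Log-softmax over anchors for each target embedding.
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positive pairs: target and anchor share the same emotion label.
    pos = labels[:, None] == anchor_labels[None, :]
    # Average negative log-probability over all positive pairs.
    return -(logp * pos).sum() / pos.sum()
```

Under this sketch, the loss is small when same-emotion embeddings already align with their cross-lingual anchors and large when labels are mismatched, which is the behavior an anchoring mechanism needs.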
Figures
Proposed contrastive learning approach using an emotion-specific commonality-based anchoring mechanism for cross-lingual SER.
Keywords
speech emotion recognition | domain adaptation | cross-lingual | transfer learning
Conference
IEEE ICASSP
DOI
10.1109/ICASSP49357.2023.10095250
Publisher
IEEE