Phonetic Anchor-Based Transfer Learning To Facilitate Unsupervised Cross-Lingual Speech Emotion Recognition｜BIIC Lab - NTHU

Abstract

Cross-lingual Speech Emotion Recognition (SER) has attracted growing attention because of its diverse applications. Existing studies have mostly focused on technical approaches that adapt the feature, domain, or label across languages, without considering in detail the similarities between the languages. This study addresses domain adaptation in cross-lingual scenarios using phonetic constraints. The work has two parts. First, we analyze emotion-specific phonetic commonality across languages by identifying common vowels that are useful for SER modeling. Second, we leverage these common vowels as an anchoring mechanism to facilitate cross-lingual SER. We consider American English and Taiwanese Mandarin as a case study to demonstrate the potential of our approach, using two in-the-wild natural emotional speech corpora: MSP-Podcast (American English) and BIIC-Podcast (Taiwanese Mandarin). The proposed unsupervised cross-lingual SER model using these phonetic anchors outperforms the baselines, achieving an unweighted average recall (UAR) of 58.64%.
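For readers unfamiliar with the reported metric, UAR is the mean of the per-class recalls, so each emotion class counts equally regardless of how many samples it has. The sketch below is a minimal illustration of that definition only; the function name and the toy labels are assumptions made for this example and do not come from the paper or its corpora.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (UAR).

    Illustrative helper, not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall is computed per class, then averaged without class-size weights.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: an imbalanced set with 8 "neutral" and 2 "happy" samples.
y_true = ["neutral"] * 8 + ["happy"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "happy"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.2f}, accuracy = {accuracy:.2f}")
```

In this toy case the recall is 1.0 for "neutral" and 0.5 for "happy", so UAR is 0.75 while plain accuracy is 0.90. The gap shows why UAR is a common choice for imbalanced emotion data: a model cannot score well by favoring the majority class.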

Figures

Proposed contrastive learning approach using an emotion-specific, commonality-based anchoring mechanism for cross-lingual SER.

Keywords

speech emotion recognition ｜ domain adaptation ｜ cross-lingual ｜ transfer learning

Authors