Abstract
In generalized Speech Emotion Recognition (SER), traditional generalization techniques such as transfer learning and domain adaptation rely on access to at least some unlabeled target-domain data. With growing privacy concerns, however, building SER systems under zero-shot scenarios, where no target-domain data is available, poses a significant challenge: without access to target samples or features, conventional methods become impractical. To bridge this gap using whatever target information remains available, this work explores the potential of Large Language Models (LLMs), with their powerful generative capabilities, to generate target corpora from documented scenario settings and published research, enabling SER under zero-shot conditions. We assess the effectiveness of LLMs in SER tasks across both text and speech modalities under these challenging zero-shot conditions, using IEMOCAP and MSP-PODCAST as unseen target corpora. To ensure a fair comparison, we validate the performance of the synthetic data against real source data from MELD and MSP-IMPROV. Our experimental results show that, on average, the synthetic data not only matches but often surpasses the performance of real data in both text and speech modalities.