Abstract
The rapid growth of Speech Emotion Recognition (SER) has diverse global applications, from improving human-computer interaction to aiding mental health diagnostics. However, SER models may contain social bias toward gender, leading to unfair outcomes. This study analyzes gender bias in SER models trained with Self-Supervised Learning (SSL) at scale, exploring the factors that influence it. SSL-based SER models are chosen for their cutting-edge performance. Ours is the first work to investigate gender bias in SER from both the upstream model and data perspectives. Our findings reveal that females exhibit slightly higher overall SER performance than males. Modified CPC and XLS-R, two well-known SSL models, notably exhibit significant bias. Moreover, models trained with Mandarin datasets display a pronounced bias toward valence. Lastly, we find that gender-wise emotion distribution differences in the training data significantly affect gender bias, while the upstream model's representation has a limited impact.