Abstract
Speech emotion recognition (SER) adds a humane dimension to voice technologies and enhances user experiences. The ground-truth emotion annotations provided by human raters, together with attributes of the speakers themselves, give rise to a compounded fairness issue in SER. While prior work exists on fair SER, ours presents one of the first studies to address this unique joint speaker-rater (two-sided) bias, focusing on gender fairness. Our cross-reference evaluation demonstrates that a fair SER model that mitigates only one-sided bias introduces biases when examined from the other viewpoint. Furthermore, to maintain model stability when optimizing for these compounded speaker-rater constraints, we introduce a flexible control mechanism that dynamically balances the contribution of each viewpoint. Our analyses show the efficacy of our approach in achieving fair SER that satisfies the dual speaker-rater gender-neutrality criterion.
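The abstract describes the balancing mechanism only at a high level. As a rough illustration of the general idea, and not the paper's actual method, one could weight a speaker-side and a rater-side fairness penalty by their current magnitudes so that the more-violated viewpoint drives optimization harder. The names below (`two_sided_fairness_loss`, `speaker_gap`, `rater_gap`, `temperature`) are hypothetical and introduced purely for this sketch.

```python
# Hypothetical sketch of a dynamically balanced two-sided fairness objective.
# Variable and function names are illustrative, not taken from the paper.
import torch


def two_sided_fairness_loss(task_loss: torch.Tensor,
                            speaker_gap: torch.Tensor,
                            rater_gap: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """Combine the SER task loss with speaker- and rater-side fairness penalties.

    The two penalties (e.g., performance-gap terms between gender groups defined
    by speaker attributes vs. rater annotations) are weighted by a softmax over
    their current magnitudes, so the viewpoint that is currently more violated
    contributes more to the gradient.
    """
    gaps = torch.stack([speaker_gap, rater_gap])
    # Dynamic weights: a larger fairness gap gets a larger weight. Detached so
    # the weighting itself is treated as a constant during backpropagation.
    weights = torch.softmax(gaps.detach() / temperature, dim=0)
    fairness_term = (weights * gaps).sum()
    return task_loss + fairness_term


if __name__ == "__main__":
    # Toy values standing in for batch-level statistics.
    task_loss = torch.tensor(0.85)
    speaker_gap = torch.tensor(0.30)  # e.g., gap in recall across speaker gender
    rater_gap = torch.tensor(0.10)    # same statistic computed across rater gender
    print(two_sided_fairness_loss(task_loss, speaker_gap, rater_gap))
```

Under this sketch, the softmax temperature would control how aggressively the objective shifts toward the currently worse-off viewpoint; the paper's actual control mechanism may differ.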