Abstract
Speech emotion recognition (SER) helps achieve better human-to-machine interactions in voice technologies. Recent studies have pointed out critical fairness issues in SER. While there are efforts to build fair SER systems, most works focus on fairness between demographic groups and rely on these broad categorical attributes. In this paper, we instead focus on fairness among individual speakers, which is rarely discussed yet far more intuitively appealing when constructing a fair SER model. To reduce the reliance on knowing speaker IDs, we perform unsupervised clustering on utterance embeddings from a pre-trained speaker verification model, grouping utterances with similar characteristics into clusters that roughly correspond to the true speaker identities. Our evaluation demonstrates that, with these cluster IDs, we can construct a fairness-aware SER model at the individual speaker level without knowing speaker IDs upfront.
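The core pseudo-labeling step described above can be illustrated with a minimal sketch: cluster utterance embeddings and use the resulting cluster indices as proxy speaker IDs. This is an assumption-laden toy example, not the paper's implementation; the synthetic arrays stand in for embeddings produced by a pre-trained speaker verification model, and a simple k-means with farthest-point initialization stands in for whatever clustering algorithm the authors used.

```python
import numpy as np

def _init_centroids(embeddings, k, rng):
    # Farthest-point initialization: greedily pick spread-out seeds
    # so each well-separated group tends to receive one centroid.
    centroids = [embeddings[rng.integers(len(embeddings))]]
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(embeddings - c, axis=1) for c in centroids],
            axis=0,
        )
        centroids.append(embeddings[dists.argmax()])
    return np.array(centroids)

def kmeans_cluster(embeddings, k, n_iter=50, seed=0):
    """Assign each utterance embedding a cluster ID (pseudo speaker ID)."""
    rng = np.random.default_rng(seed)
    centroids = _init_centroids(embeddings, k, rng)
    labels = np.zeros(len(embeddings), dtype=int)
    for _ in range(n_iter):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned embeddings.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = embeddings[labels == j].mean(axis=0)
    return labels

# Toy embeddings standing in for speaker-verification embeddings:
# three well-separated groups of 10 utterances mimic three speakers.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(c, 0.05, size=(10, 8)) for c in (0.0, 1.0, 2.0)])
cluster_ids = kmeans_cluster(emb, k=3)
```

In a fairness-aware training loop, `cluster_ids` would then replace ground-truth speaker labels when computing per-speaker fairness objectives.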