Abstract
Noise-robust speech emotion recognition (SER) systems are important in real-world applications. Conventionally, noise robustness is achieved by training on a noise-augmented dataset. In this work, instead of pre-defining the noise SNRs used to augment the clean set, we propose an augment-while-train strategy that references a speech distortion metric. This strategy (MetricAug) constructs an augmented set at each training epoch by assessing how much different distortion levels degrade SER performance. That is, at each learning epoch we dynamically augment more of the noisy data that degrade SER performance the most. We evaluate our framework on two databases, MSP-Podcast and MELD. Our framework shows consistent robustness against varying noise levels and even unseen noise types. Further analysis reveals that choosing STOI as the noise distortion metric guides the construction of augmented sets better than PESQ or fwSNRseg.
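As a rough illustration of the per-epoch idea described above, the sketch below splits a fixed augmentation budget across distortion-level bins according to how much each bin degrades a validation score. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `allocate_augmentation`, the bin labels, the proportional allocation rule, and the fixed budget are all hypothetical, since the abstract states only that more data is augmented at the distortion levels that hurt SER performance the most.

```python
def allocate_augmentation(clean_score, bin_scores, budget):
    """Split `budget` noisy samples across distortion bins.

    clean_score: validation score on clean speech (higher is better).
    bin_scores:  dict mapping a distortion-bin label (e.g. a STOI range)
                 to the validation score measured on that bin.
    budget:      total number of noisy samples to generate this epoch.

    Assumption (not from the paper): allocation is proportional to the
    score drop relative to clean speech.
    """
    # Degradation per bin; bins that do not hurt performance get zero weight.
    drops = {b: max(clean_score - s, 0.0) for b, s in bin_scores.items()}
    total = sum(drops.values())

    if total == 0:
        # No bin degrades performance: fall back to a uniform split.
        base = budget // len(drops)
        counts = {b: base for b in drops}
    else:
        counts = {b: int(budget * d / total) for b, d in drops.items()}

    # Rounding down can leave a remainder; give it to the worst bin.
    worst = max(drops, key=drops.get)
    counts[worst] += budget - sum(counts.values())
    return counts


if __name__ == "__main__":
    # Hypothetical scores (in percent) for three STOI-based bins.
    scores = {"stoi_high": 58, "stoi_mid": 50, "stoi_low": 40}
    print(allocate_augmentation(60, scores, budget=100))
```

In a training loop, such a function would be called once per epoch after re-evaluating the current model on each distortion bin, so that the augmented set follows the model's current weaknesses rather than a fixed SNR schedule.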