MetricAug: A Distortion Metric-Lead Augmentation Strategy for Training Noise-Robust Speech Emotion Recognizer
Abstract
Noise-robust speech emotion recognition (SER) systems are important in real-world applications. Conventionally, noise robustness is achieved by training on a noise-augmented dataset. In this work, instead of pre-defining noise SNRs to augment the clean set, we propose an augment-while-train strategy that references a speech distortion metric. This strategy (MetricAug) constructs an augmented set at each training epoch by assessing how strongly different distortion levels degrade SER performance. That is, at each learning epoch we dynamically augment more of the noisy data that degrade SER performance the most. We evaluate our framework on two databases, MSP-Podcast and MELD. Our framework shows consistent robustness against varying noise levels and even unseen noise types. Further analysis reveals that choosing STOI as the noise distortion metric guides the construction of augmented sets better than PESQ or fwSNRseg.
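The epoch-wise allocation idea can be illustrated with a minimal sketch. The abstract does not state the exact allocation rule, so the proportional split below is an assumption, and the function name `allocate_augmentation`, the bin labels, and the `budget` parameter are hypothetical. Given the SER score on clean validation data and the scores measured on each distortion-level bin (for example, bins of STOI values), the sketch assigns more augmented samples to the bins that degrade performance the most.

```python
from typing import Dict


def allocate_augmentation(
    clean_score: float,
    noisy_scores: Dict[str, float],
    budget: int,
) -> Dict[str, int]:
    """Split an augmentation budget across distortion-level bins.

    Assumption (not from the paper): each bin receives a share of the
    budget proportional to how much it lowers the SER score relative
    to the clean score. Bins that do not hurt performance get nothing
    unless no bin hurts, in which case the budget is split evenly.
    """
    # Degradation per bin, clipped at zero so improvements are ignored.
    drops = {b: max(clean_score - s, 0.0) for b, s in noisy_scores.items()}
    total = sum(drops.values())

    if total == 0.0:
        # No measurable degradation: fall back to a uniform split.
        base, rem = divmod(budget, len(drops))
        return {b: base + (1 if i < rem else 0) for i, b in enumerate(drops)}

    # Proportional shares, floored to integers.
    raw = {b: budget * d / total for b, d in drops.items()}
    counts = {b: int(r) for b, r in raw.items()}

    # Hand out samples lost to flooring, largest fractional part first.
    leftover = budget - sum(counts.values())
    by_remainder = sorted(raw, key=lambda b: raw[b] - counts[b], reverse=True)
    for b in by_remainder[:leftover]:
        counts[b] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical validation scores (in percent) for three distortion bins.
    scores = {"mild": 55.0, "moderate": 50.0, "severe": 40.0}
    print(allocate_augmentation(60.0, scores, budget=700))
```

In a training loop, a function like this would be called once per epoch after re-evaluating the current model on each distortion bin, so the augmented set shifts as the model becomes robust to some distortion levels and remains weak on others.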
Figures
Illustration of MetricAug: An epoch-wise distortion metric-lead noise augmentation.
Keywords
speech emotion recognition | speech distortion metrics | noise robustness
Authors
Ya-Tse Wu Chi-Chun Lee
Publication Date
2023/08/22
Conference
Interspeech
Interspeech 2023
Publisher
ISCA