Abstract
Speech recordings in call centers are narrowband and contaminated by various types of noise. A bandwidth expansion (BWE) model is therefore important for narrowing the automatic speech recognition (ASR) performance gap between low- and high-sampling-rate speech data. To further address in-the-wild noise in call center settings, we propose an Embedding-Polished Wave-U-Net (EP-WUN) that includes an additional speech quality classifier to handle denoising and bandwidth expansion of 8 kHz audio simultaneously. On a well-known BWE dataset (the Valentini-Botinhao corpus), our framework improves speech quality metrics over the current state-of-the-art noise-robust BWE model while using 33% fewer parameters. It also achieves an 11.71% word error rate reduction when evaluated on a real-world interactive voice response system from E.SUN Bank.