Affect
ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy
Abstract
This work investigates how emotional speech and generative strategies affect ASR performance. We analyze speech synthesized by three emotional TTS models and find that substitution errors dominate and that emotional expressiveness varies across models. Based on these insights, we introduce two generative strategies for constructing fine-tuning subsets: one based on transcription correctness and the other on emotional salience. Results show consistent WER improvements on real emotional datasets without noticeable degradation on clean LibriSpeech utterances. The combined strategy achieves the strongest gains, particularly for expressive speech. These findings highlight the importance of targeted augmentation for building emotion-aware ASR systems.
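To make the error analysis and subset construction concrete, the following is a minimal sketch, not the authors' implementation. It splits word-level edit distance into substitutions, deletions, and insertions, then filters synthesized utterances by two criteria. The names `count_errors`, `SynthUtterance`, and `select_subset`, the `salience` score (assumed to come from some emotion classifier), and both thresholds are hypothetical. The abstract does not specify the actual selection rules or their direction.

```python
from dataclasses import dataclass
from typing import List, NamedTuple


class ErrorCounts(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    ref_len: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / number of reference words
        errors = self.substitutions + self.deletions + self.insertions
        return errors / max(self.ref_len, 1)


def count_errors(reference: str, hypothesis: str) -> ErrorCounts:
    """Word-level Levenshtein alignment with a per-type error breakdown."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no added cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            sub = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            dele = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            ins = (c + 1, s, d, n + 1)
            dp[i][j] = min(sub, dele, ins, key=lambda t: t[0])
    _, s, d, n = dp[-1][-1]
    return ErrorCounts(s, d, n, len(ref))


@dataclass
class SynthUtterance:
    text: str        # text given to the TTS model
    asr_output: str  # ASR transcript of the synthesized audio
    salience: float  # hypothetical emotion-classifier confidence in [0, 1]


def select_subset(utts: List[SynthUtterance],
                  max_wer: float = 0.0,
                  min_salience: float = 0.5) -> List[SynthUtterance]:
    """Keep utterances passing both (assumed) criteria; thresholds are illustrative."""
    return [u for u in utts
            if count_errors(u.text, u.asr_output).wer <= max_wer
            and u.salience >= min_salience]
```

For example, `count_errors("the cat sat on the mat", "the cat sit on mat")` yields one substitution and one deletion. Dropping either condition in `select_subset` gives a single-criterion variant corresponding to one strategy applied on its own.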
Figures
Workflow for synthesized emotional speech analysis
Keywords
Emotional speech | Automatic speech recognition | Text-to-speech synthesis | Data augmentation
Publication Date
2025/12/06
Conference
IEEE ASRU