RESEARCH

Behavior Computing
Other: Signal Modeling for Understanding States and Traits
A Dual-Complementary Acoustic Embedding Network Learned from Raw Waveform for Speech Emotion Recognition
Abstract
Speech emotion recognition (SER) technology has recently drawn broad attention and achieved remarkable recognition performance using deep learning techniques. However, the recognition performance obtained with end-to-end learning directly from the raw audio waveform still rarely exceeds that based on hand-crafted acoustic descriptors. Instead of relying solely on the raw waveform or on acoustic descriptors for SER, we propose an acoustic space augmentation network, termed the Dual-Complementary Acoustic Embedding Network (DCaEN), that combines knowledge-based features with a raw waveform embedding learned under a novel complementarity constraint. DCaEN integrates representations from the eGeMAPS acoustic feature set and the raw waveform by imposing a negative cosine distance loss that explicitly constrains the raw waveform embedding to differ from the eGeMAPS embedding. Our experimental results demonstrate improved emotion discriminative power on the IEMOCAP database, achieving 59.31% in four-class emotion recognition. Our analysis also shows that the learned raw waveform embedding of DCaEN converges to a near reverse mirroring of the original eGeMAPS space.
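The complementarity constraint in the abstract can be sketched as a simple loss on the cosine similarity between the two embeddings: minimizing the similarity pushes the raw waveform embedding away from the eGeMAPS embedding, toward the "reverse mirroring" the analysis reports. The function names and the plain-NumPy setting below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors, in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def complementary_loss(raw_emb, egemaps_emb):
    # Hypothetical sketch of the complementarity constraint:
    # minimizing this loss drives the raw-waveform embedding to be
    # dissimilar from (ideally anti-correlated with) the eGeMAPS
    # embedding, while a separate emotion classification loss keeps
    # both embeddings discriminative.
    return cosine_similarity(raw_emb, egemaps_emb)
```

As a sanity check, identical embeddings give the maximal loss of 1.0, while opposite embeddings give the minimal loss of -1.0, which is the "reverse mirroring" configuration.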
Figures
Illustration of our framework, the Dual-Complementary Acoustic Embedding Model. The model comprises two stages: first, embeddings are learned by a Feature Network that takes eGeMAPS features as input; second, an end-to-end architecture learns a complementary embedding from the raw waveform under a cosine similarity constraint. Finally, these representations are concatenated to perform the final SER.
Keywords
speech emotion recognition | raw waveform | end-to-end learning | acoustic space augmentation
Authors
Jeng-Lin Li, Chun-Min Chang, Chi-Chun Lee
Publication Date
2019/09/03
Conference
2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)
DOI
10.1109/acii.2019.8925496
Publisher
IEEE