Attentive Convolutional Recurrent Neural Network Using Phoneme-Level Acoustic Representation for Rare Sound Event Detection
Abstract
A well-trained acoustic sound event detection (SED) system captures the patterns of a sound to accurately detect events of interest in an auditory scene, enabling applications across multimedia, smart living, and even health monitoring. Due to the scarcity and the weakly labelled nature of sound event data, it is often challenging to train an accurate and robust acoustic event detection model directly, especially for rare occurrences. In this paper, we propose an architecture that takes advantage of integrating automatic speech recognition (ASR) network representations as an additional input when training a sound event detector. We use a convolutional bi-directional recurrent neural network (CBRNN), which includes both spectral and temporal attention, as the SED classifier, and we combine the ASR feature representations during end-to-end CBRNN training. Our experiments on the TUT 2017 rare sound event detection dataset show that including ASR features improves the overall discriminative performance of the end-to-end sound event detection system; the average F-score and error rate of our proposed framework are 97% and 0.05, respectively.
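To make the fusion idea concrete, the sketch below shows one plausible shape of such a model in PyTorch: a convolutional front end over the spectrogram, frame-level ASR representations concatenated onto the convolutional features before a bi-directional GRU, and a simple additive temporal attention for pooling. All layer sizes, the attention form, and the feature dimensions here are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of a CBRNN-style sound event detector that fuses
# ASR-network representations with spectral features. Dimensions and the
# attention mechanism are assumptions for illustration only.
import torch
import torch.nn as nn

class AttentiveCBRNN(nn.Module):
    def __init__(self, n_mels=40, asr_dim=128, hidden=64, n_classes=1):
        super().__init__()
        # Convolutional front end over the (time, frequency) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),  # pool over frequency only; keep time resolution
        )
        conv_out = 16 * (n_mels // 2)
        # Bi-directional GRU over the concatenated sound + ASR features.
        self.rnn = nn.GRU(conv_out + asr_dim, hidden,
                          batch_first=True, bidirectional=True)
        # Additive temporal attention over the recurrent outputs (assumed form).
        self.att = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec, asr_feat):
        # spec: (batch, time, n_mels); asr_feat: (batch, time, asr_dim)
        x = self.conv(spec.unsqueeze(1))             # (B, C, T, F')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = torch.cat([x, asr_feat], dim=-1)         # fuse ASR representations
        h, _ = self.rnn(x)                           # (B, T, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)        # temporal attention weights
        frame_logits = self.out(h)                   # per-frame event logits
        clip_logit = (w * frame_logits).sum(dim=1)   # attention-pooled clip score
        return frame_logits, clip_logit

model = AttentiveCBRNN()
spec = torch.randn(2, 50, 40)    # two clips, 50 frames, 40 mel bins
asr = torch.randn(2, 50, 128)    # matching frame-level ASR features
frame_logits, clip_logit = model(spec, asr)
```

Concatenating the ASR features at the recurrent input, rather than at the output, lets the GRU learn joint temporal dynamics over both feature streams, which matches the paper's description of combining the representations inside end-to-end training.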
Figures
An illustration of the proposed architecture: the proposed gradual network, which concatenates the ASR network representations with the sound feature representations in the path of end-to-end SED training.
Keywords
sound event detection | convolution recurrent neural network | attention | automatic speech recognition
Authors
Publication Date
2020/10/25
Conference
Interspeech 2020
DOI
10.21437/Interspeech.2020-2585
Publisher
ISCA