Using Attention Networks and Adversarial Augmentation for Styrian Dialect Continuous Sleepiness and Baby Sound Recognition
Abstract
In this study, we present extensive attention-based networks with data augmentation methods for the INTERSPEECH 2019 ComParE Challenge, specifically three Sub-challenges: Styrian Dialect Recognition, Continuous Sleepiness Regression, and Baby Sound Classification. In the Styrian Dialect Sub-challenge, dialects are classified into Northern Styrian (NorthernS), Urban Styrian (UrbanS), and Eastern Styrian (EasternS); our proposed model achieves an unweighted average recall (UAR) of 49.5% on the test set, which is 2.5% higher than the baseline. The Continuous Sleepiness Sub-challenge is defined as a regression task with scores ranging from 1 (extremely alert) to 9 (very sleepy); our proposed architecture achieves a Spearman correlation of 0.369 on the test set, surpassing the baseline model by 0.026. In the Baby Sound Sub-challenge, infant sounds are classified into canonical babbling, non-canonical babbling, crying, laughing, and junk/other; our proposed augmentation framework achieves a UAR of 62.39% on the test set, outperforming the baseline by about 3.7%. Overall, our analyses demonstrate that fusing attention network models with a conventional support vector machine improves test-set robustness, and that the recognition rates of these paralinguistic attributes generally improve when performing data augmentation.
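The abstract describes fusing attention-network posteriors with a conventional SVM in a fusion stage. The sketch below is a minimal illustration of that idea, not the authors' architecture: the layer sizes, kernel width, and fusion weight are assumptions chosen for readability.

```python
# Hypothetical sketch (not the paper's exact model): an attention-pooling
# classifier over frame-level acoustic features, whose posteriors are
# late-fused with SVM posteriors in a simple weighted average.
import torch
import torch.nn as nn

class AttentionPoolingClassifier(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, n_classes: int):
        super().__init__()
        # 1-D convolution over time as a stand-in frame encoder (CNN+ATT)
        self.encoder = nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2)
        self.attn = nn.Linear(hidden_dim, 1)        # per-frame attention score
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                           # x: (batch, time, feat_dim)
        h = torch.relu(self.encoder(x.transpose(1, 2))).transpose(1, 2)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over time
        utt = (w * h).sum(dim=1)                    # weighted utterance embedding
        return self.classifier(utt)                 # class logits

def late_fusion(nn_probs, svm_probs, alpha=0.5):
    """Score-level fusion: weighted average of network and SVM posteriors."""
    return alpha * nn_probs + (1.0 - alpha) * svm_probs
```

With posteriors from both models on the same utterances, the fusion stage reduces to `late_fusion(torch.softmax(model(x), dim=-1), svm_probs)`, with the weight `alpha` tuned on the development set.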
Figures
Styrian Dialect: an SVM and a CNN+ATT network are trained on augmented data and combined in the fusion stage. Continuous Sleepiness: semi-supervised learning (SSL) is applied to the training data; predictions of SVR, CNN+ATT, and BLSTM are combined in the fusion stage. Baby Sound: the original training data is used to pretrain an AAE model; encoded laughing-data vectors, selected Gaussian samples, and the pretrained decoder are used to generate augmented samples (Synthetic and Conditional Synthetic features).
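The caption describes synthesizing "laughing" features by combining encoded laughing vectors, Gaussian samples, and a pretrained AAE decoder. The following is a hedged sketch of that decoding step only; the decoder layers, latent dimensionality, and noise scale are illustrative placeholders, not values from the paper.

```python
# Hypothetical sketch of the AAE-based augmentation step: decode Gaussian
# draws taken near encoded laughing vectors to obtain synthetic features.
import torch
import torch.nn as nn

latent_dim, feat_dim = 32, 130   # illustrative dimensions, not from the paper

decoder = nn.Sequential(         # stands in for the pretrained AAE decoder
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, feat_dim),
)

def synthesize(encoded_laughing: torch.Tensor, n_samples: int,
               noise_scale: float = 0.1) -> torch.Tensor:
    """Sample Gaussians around encoded laughing vectors and decode them."""
    idx = torch.randint(0, encoded_laughing.size(0), (n_samples,))
    z = encoded_laughing[idx] + noise_scale * torch.randn(n_samples, latent_dim)
    with torch.no_grad():
        return decoder(z)        # synthetic feature vectors for augmentation
```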
Keywords
attention networks | augmentation | adversarial learning | computational paralinguistics
Authors
Bo-Hao Su, Meng-Han Lin, Chi-Chun Lee
Publication Date
2019/09/15
Conference
Interspeech 2019
DOI
10.21437/Interspeech.2019-2110
Publisher
ISCA