Using Attention Networks and Adversarial Augmentation for Styrian Dialect Continuous Sleepiness and Baby Sound Recognition
Abstract
In this study, we present extensive attention-based networks with data augmentation methods for the INTERSPEECH 2019 ComParE Challenge, specifically three Sub-challenges: Styrian Dialect Recognition, Continuous Sleepiness Regression, and Baby Sound Classification. In the Styrian Dialect Sub-challenge, dialects are classified into Northern Styrian (NorthernS), Urban Styrian (UrbanS), and Eastern Styrian (EasternS); our proposed model achieves an unweighted average recall (UAR) of 49.5% on the test set, which is 2.5% higher than the baseline. The Continuous Sleepiness Sub-challenge is defined as a regression task with scores ranging from 1 (extremely alert) to 9 (very sleepy); our proposed architecture achieves a Spearman correlation of 0.369 on the test set, surpassing the baseline model by 0.026. In the Baby Sound Sub-challenge, infant sounds are classified into canonical babbling, non-canonical babbling, crying, laughing, and junk/other; our proposed augmentation framework achieves a UAR of 62.39% on the test set, outperforming the baseline by about 3.7%. Overall, our analyses demonstrate that fusing attention network models with a conventional support vector machine improves test-set robustness, and that the recognition rates of these paralinguistic attributes generally improve when performing data augmentation.
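The abstract describes fusing attention-network posteriors with a conventional SVM in a fusion stage. The sketch below is a minimal illustration of that idea, not the authors' architecture: the layer sizes, kernel width, and fusion weight are assumptions chosen for readability.

```python
# Hypothetical sketch (not the paper's exact model): an attention-pooling
# classifier over frame-level acoustic features, whose posteriors are
# late-fused with SVM posteriors in a simple weighted average.
import torch
import torch.nn as nn

class AttentionPoolingClassifier(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, n_classes: int):
        super().__init__()
        # 1-D convolution over time as a stand-in frame encoder (CNN+ATT)
        self.encoder = nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2)
        self.attn = nn.Linear(hidden_dim, 1)        # per-frame attention score
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                           # x: (batch, time, feat_dim)
        h = torch.relu(self.encoder(x.transpose(1, 2))).transpose(1, 2)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over time
        utt = (w * h).sum(dim=1)                    # weighted utterance embedding
        return self.classifier(utt)                 # class logits

def late_fusion(nn_probs, svm_probs, alpha=0.5):
    """Score-level fusion: weighted average of network and SVM posteriors."""
    return alpha * nn_probs + (1.0 - alpha) * svm_probs
```

With posteriors from both models on the same utterances, the fusion stage reduces to `late_fusion(torch.softmax(model(x), dim=-1), svm_probs)`, with the weight `alpha` tuned on the development set.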
Figures
Styrian Dialect: an SVM and a CNN+ATT network are trained on augmented data and combined in the fusion stage. Continuous Sleepiness: semi-supervised learning (SSL) is applied to the training data; predictions of SVR, CNN+ATT, and BLSTM are combined in the fusion stage. Baby Sound: the original training data is used to pretrain an AAE model; encoded laughing-data vectors, selected Gaussian samples, and the pretrained decoder are used to generate augmented samples (Synthetic and Conditional Synthetic features).
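The caption describes synthesizing "laughing" features by combining encoded laughing vectors, Gaussian samples, and a pretrained AAE decoder. The following is a hedged sketch of that decoding step only; the decoder layers, latent dimensionality, and noise scale are illustrative placeholders, not values from the paper.

```python
# Hypothetical sketch of the AAE-based augmentation step: decode Gaussian
# draws taken near encoded laughing vectors to obtain synthetic features.
import torch
import torch.nn as nn

latent_dim, feat_dim = 32, 130   # illustrative dimensions, not from the paper

decoder = nn.Sequential(         # stands in for the pretrained AAE decoder
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, feat_dim),
)

def synthesize(encoded_laughing: torch.Tensor, n_samples: int,
               noise_scale: float = 0.1) -> torch.Tensor:
    """Sample Gaussians around encoded laughing vectors and decode them."""
    idx = torch.randint(0, encoded_laughing.size(0), (n_samples,))
    z = encoded_laughing[idx] + noise_scale * torch.randn(n_samples, latent_dim)
    with torch.no_grad():
        return decoder(z)        # synthetic feature vectors for augmentation
```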
Keywords
attention networks | augmentation | adversarial learning | computational paralinguistics
Authors
Bo-Hao Su, Meng-Han Lin, Chi-Chun Lee
Publication Date
2019/09/15
Conference
Interspeech 2019
DOI
10.21437/Interspeech.2019-2110
Publisher
ISCA