Encoding Individual Acoustic Features Using Dyad-Augmented Deep Variational Representations for Dialog-level Emotion Recognition
Abstract
Face-to-face dyadic spoken dialog is a fundamental unit of human interaction. Despite ample empirical evidence of behavioral dependency between interlocutors in dyadic interactions, few technical works leverage these interaction dynamics to advance emotion recognition in face-to-face settings. In this work, we propose a framework for encoding an individual’s acoustic features with dyad-augmented deep networks. The dyad-augmented deep networks include a general variational deep Gaussian mixture embedding network and a dyad-specific fine-tuned network. Our framework uses the augmented dyad-specific feature space to incorporate the unique behavior patterns that emerge when two people interact. We perform dialog-level emotion regression tasks on both the CreativeIT and the NNIME databases. We obtain affect regression accuracy of 0.544 and 0.387 for activation and valence on the CreativeIT database (relative improvements of 4.41% and 4.03% over features without the dyad-specific representation), and 0.700 and 0.604 (4.48% and 4.14% relative improvement) for activation and valence on the NNIME database.
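The relative improvements quoted in the abstract can be related to the reported scores with a small calculation. The sketch below assumes the usual definition, (new − baseline) / baseline; the abstract does not spell out the formula, and the function names are hypothetical. The baselines it prints are back-computed from rounded figures, so they are only approximations and not values reported by the authors.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline` (assumed definition)."""
    return (new - baseline) / baseline


def implied_baseline(new, rel_improvement):
    """Invert the definition above: the baseline that yields `new`."""
    return new / (1.0 + rel_improvement)


# (score, relative improvement) pairs as quoted in the abstract.
reported = {
    ("CreativeIT", "activation"): (0.544, 0.0441),
    ("CreativeIT", "valence"): (0.387, 0.0403),
    ("NNIME", "activation"): (0.700, 0.0448),
    ("NNIME", "valence"): (0.604, 0.0414),
}

for (database, dimension), (score, rel) in reported.items():
    baseline = implied_baseline(score, rel)
    # Approximate only: both inputs are rounded in the abstract.
    print(f"{database} {dimension}: implied baseline ~ {baseline:.3f}")
```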
Figures
This is the overall framework for an individual’s dialog-level emotion recognition. We first extract low-level descriptors (LLDs). The LLDs are then encoded using two networks: a general VaDE and a dyad-specific VaDE. The general representation acts as a behavior representation learned from the entire database, while the dyad-specific representation embeds dyadic interaction dynamics.
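The data flow in the figure (LLDs encoded by a general network and a dyad-specific network, then combined) can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the encoders are random two-layer networks standing in for trained VaDE encoders, the 45-dimensional LLDs and layer sizes are made up, the dyad-specific update is a random perturbation rather than a learned one, and mean pooling over frames is a placeholder for whatever dialog-level encoding the paper uses. Only the overall shape of the idea (a shared frozen part, a dyad-adapted part, and feature augmentation by concatenation) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_encoder(in_dim, hid_dim, out_dim, rng):
    """Random two-layer weights; a stand-in for a trained VaDE encoder."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hid_dim)),
        "W2": rng.normal(scale=0.1, size=(hid_dim, out_dim)),
    }


def encode(params, lld):
    """Map frame-level LLDs (frames x in_dim) to latents (frames x out_dim)."""
    hidden = np.tanh(lld @ params["W1"])
    return hidden @ params["W2"]


def fine_tune_dyad(general, delta_w2):
    """Reuse the general encoder's first layer unchanged (frozen) and
    adjust only the last layer. `delta_w2` stands in for the update
    that fine-tuning on one dyad's data would produce."""
    return {"W1": general["W1"], "W2": general["W2"] + delta_w2}


def dialog_representation(general, dyad, lld):
    """Mean-pool each encoding over frames, then concatenate the
    general and dyad-specific parts into one augmented feature vector."""
    z_general = encode(general, lld).mean(axis=0)
    z_dyad = encode(dyad, lld).mean(axis=0)
    return np.concatenate([z_general, z_dyad])


general = make_encoder(in_dim=45, hid_dim=16, out_dim=8, rng=rng)
dyad = fine_tune_dyad(general, delta_w2=rng.normal(scale=0.01, size=(16, 8)))

lld = rng.normal(size=(200, 45))  # 200 frames of hypothetical 45-dim LLDs
features = dialog_representation(general, dyad, lld)
print(features.shape)
```

With these toy sizes, each speaker's dialog is summarized by a 16-dimensional vector (8 general plus 8 dyad-specific dimensions), which would then be passed to a regressor for activation or valence.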
Keywords
variational deep embedding | dyadic interaction | emotion recognition | feature augmentation | frozen fine-tuning
Authors
Jeng-Lin Li, Chi-Chun Lee
Publication Date
2018/09/02
Conference
Interspeech 2018
DOI
10.21437/Interspeech.2018-1455
Publisher
ISCA