Behavior Computing
Other: Signal Modeling for Understanding
States and Traits
A multimodal approach for automatic assessment of school principals' oral presentation during pre-service training program
Abstract
Automatic recognition of subjective ratings from behavioral data, collected with audio-video recording devices, has been at the forefront of many interdisciplinary research efforts between behavioral science and engineering, with the aim of providing objective decision-making tools. In the field of education, pre-service training programs for school principals have become more critical because of the increasingly complex and demanding nature of the job. In this work, we collaborate with researchers from the National Academy for Educational Research to develop a system for assessing pre-service principals’ oral presentation skills. Our recognition framework incorporates multimodal behavioral data, i.e., audio and video information. With proper handling of label normalization and binarization, we achieve an unweighted average recall of (0.63, 0.70, 0.67) or (0.67, 0.68, 0.67), depending on the labeling scheme (original or rank-normalized), when differentiating between high and low performing scores. The three oral presentation rating dimensions used in this work are Dim1: content + structure + word, Dim2: prosody, Dim3: total score.
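The abstract mentions rank normalization, binarization, and unweighted average recall (UAR) without giving their details. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that rank normalization maps raw scores to average ranks scaled into (0, 1] and that binarization is a median split. The function names `rank_normalize`, `binarize`, and `unweighted_average_recall` are hypothetical.

```python
from statistics import median


def rank_normalize(scores):
    """Map raw scores to (0, 1] by average rank; tied scores share a rank.

    Assumption: the source does not specify the exact normalization.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of scores tied with position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


def binarize(scores):
    """Label 1 (high) if strictly above the median, else 0 (low).

    Assumption: a median split; the source only says 'binarization'.
    """
    m = median(scores)
    return [1 if s > m else 0 for s in scores]


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

UAR averages the recall of the high and low classes, so a classifier that always predicts the majority class scores 0.5 on a two-class task regardless of class balance.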
Figures
Our experimental setup: the raw recording is first manually segmented into utterances, and each utterance is passed through the audio and video feature extractors. The video-only system is trained on individual utterances, while the audio-only system is trained on the entire speech by computing second-stage statistical functionals. The classifier is a support vector machine, and multimodal fusion is performed by training a logistic regression on the decision scores of each modality.
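Two steps of the setup in the caption can be illustrated with a small sketch: collapsing utterance-level features into one session-level vector through statistical functionals, and fusing per-modality decision scores with logistic regression. This is an illustration under assumptions, not the authors' code. The set of functionals (mean, standard deviation, min, max) is assumed, the per-modality SVMs are not trained here, and the function names `functionals`, `train_fusion`, and `fuse` are hypothetical.

```python
import math
from statistics import mean, pstdev


def functionals(utterance_features):
    """Collapse per-utterance feature vectors into one session-level vector.

    Assumption: mean, std, min, and max per feature dimension; the source
    does not list which functionals were used.
    """
    dims = list(zip(*utterance_features))  # transpose: one tuple per feature
    out = []
    for values in dims:
        out.extend([mean(values), pstdev(values), min(values), max(values)])
    return out


def train_fusion(scores, labels, lr=0.1, epochs=2000):
    """Fit logistic regression on (audio_score, video_score) pairs.

    Uses plain batch gradient descent on the logistic loss and returns
    the weights [w_audio, w_video] and the bias.
    """
    w, b = [0.0, 0.0], 0.0
    n = len(scores)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for (a, v), y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(w[0] * a + w[1] * v + b)))
            err = p - y
            gw[0] += err * a
            gw[1] += err * v
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


def fuse(w, b, audio_score, video_score):
    """Return 1 (high) if the fused logit is positive, else 0 (low)."""
    z = w[0] * audio_score + w[1] * video_score + b
    return 1 if z > 0 else 0
```

In use, each pair passed to `train_fusion` would hold the decision scores of the audio-only and video-only classifiers for one presentation, with the binarized rating as the label. Fusing at the score level keeps the two modality-specific classifiers independent of each other.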
Keywords
behavioral signal processing (BSP) | oral presentation | multimodal signal processing | education research
Authors
Chi-Chun Lee
Publication Date
2015/09/06
Conference
Interspeech 2015
DOI
10.21437/Interspeech.2015-545
Publisher
ISCA