Abstract
Speech emotion recognition (SER) is being actively deployed in a growing number of real-world applications, where users often interact closely with the service. However, most existing models are vulnerable to malicious attackers and cannot robustly defend against adversarial attacks. The resulting performance degradation can lead to poor user experiences and dissatisfaction. To improve the robustness of SER models against such attacks, we propose a self-supervised augmentation defense (SSAD) model that uses a single purification module as a general defense against adversarial attacks (i.e., one that requires no prior knowledge of the attack type), instead of training a custom-made defense model for each type of attack. In this work, we evaluate our defense approach on an emotion recognition task using the well-known IEMOCAP corpus and examine model performance under multiple adversarial attacks. Our proposed SSAD model achieves an average UAR of 43.53% and 34.99% under the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks, respectively, across substantially different intensity settings. Furthermore, SSAD yields a 7.29% increase in protection efficacy and a 3.98% increase in recovery rate.
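For reference, the two attacks evaluated above follow their standard formulations: FGSM perturbs an input $x$ with label $y$ in a single step along the sign of the loss gradient, while PGD applies this step iteratively with projection back onto the allowed perturbation ball. The symbols below (loss $J$, model parameters $\theta$, step size $\alpha$, budget $\epsilon$) are not defined in the abstract and are used here only to illustrate the threat model:

FGSM: $x^{\mathrm{adv}} = x + \epsilon \cdot \operatorname{sign}\!\left(\nabla_{x} J(\theta, x, y)\right)$

PGD: $x^{(t+1)} = \Pi_{\|x' - x\|_{\infty} \le \epsilon}\!\left( x^{(t)} + \alpha \cdot \operatorname{sign}\!\left(\nabla_{x} J(\theta, x^{(t)}, y)\right) \right)$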