Evaluating the Robustness of a Multimodal Emotion Detection Model

The authors of this study aimed to evaluate the effectiveness of a transformer-based multimodal audio-text classifier for emotion recognition. This type of deep neural network has been shown to be effective at classifying emotions in human interactions by jointly encoding multiple input modalities, such as speech audio and its transcribed text.
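To make the architecture concrete, here is a minimal PyTorch sketch of a late-fusion audio-text transformer classifier. The layer sizes, mean pooling, and concatenation-based fusion are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MultimodalEmotionClassifier(nn.Module):
    """Illustrative late-fusion audio-text classifier (assumed design)."""

    def __init__(self, audio_dim=128, text_dim=300, d_model=256, num_classes=4):
        super().__init__()
        # Project each modality's frame/token features to a shared width.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # A separate transformer encoder per modality.
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Fuse by concatenating the pooled embeddings, then classify.
        self.classifier = nn.Linear(2 * d_model, num_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, audio_len, audio_dim)
        # text_feats:  (batch, text_len, text_dim)
        a = self.audio_encoder(self.audio_proj(audio_feats)).mean(dim=1)
        t = self.text_encoder(self.text_proj(text_feats)).mean(dim=1)
        return self.classifier(torch.cat([a, t], dim=-1))

# Example: a batch of 8 utterances, 200 audio frames, 30 text tokens.
model = MultimodalEmotionClassifier()
logits = model(torch.randn(8, 200, 128), torch.randn(8, 30, 300))  # shape (8, 4)
```

Because each modality has its own encoder and fusion happens late, either branch can still contribute a useful representation when the other branch's input is perturbed.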

To assess the classifier's robustness, the researchers designed perturbation attacks that specifically targeted information deemed important for emotion recognition. The attacks were applied to the input data at inference time, which let the researchers measure the resulting drop in the classifier's accuracy.
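The paper's specific attacks are not reproduced here, but the sketch below shows the general evaluation pattern: perturb one modality at inference time and measure accuracy on the attacked inputs. Zeroing random token embeddings and adding Gaussian noise to audio frames are stand-ins for the paper's targeted attacks, which select information by its importance for emotion recognition.

```python
import torch

def perturb_text(text_feats, drop_frac=0.3):
    # Zero a random fraction of token embeddings; a stand-in for masking
    # emotionally salient words (the paper's selection criterion may differ).
    keep = (torch.rand(text_feats.shape[:2], device=text_feats.device) > drop_frac)
    return text_feats * keep.unsqueeze(-1).float()

def perturb_audio(audio_feats, noise_scale=0.1):
    # Additive Gaussian noise on frame-level audio features.
    return audio_feats + noise_scale * torch.randn_like(audio_feats)

@torch.no_grad()
def accuracy(model, audio, text, labels, attack=None):
    # attack is None, "text", or "audio"; returns classification accuracy.
    if attack == "text":
        text = perturb_text(text)
    elif attack == "audio":
        audio = perturb_audio(audio)
    preds = model(audio, text).argmax(dim=-1)
    return (preds == labels).float().mean().item()
```

Running the same function with and without an attack gives the clean and attacked accuracies whose difference quantifies the attack's impact.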

The results of the study indicated that the multimodal classifier was more resilient to the perturbation attacks than equivalent unimodal classifiers trained on a single input modality. This suggests that the two modalities are encoded in a way that lets the classifier fall back on the intact modality even when the other one is perturbed.
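One simple way to quantify this comparison is the relative accuracy drop under attack, computed for the multimodal model and for each unimodal baseline; this metric is an assumption for illustration, not necessarily the one used in the paper.

```python
def relative_drop(clean_acc, attacked_acc):
    # Fraction of clean accuracy lost under attack; lower means more robust.
    return (clean_acc - attacked_acc) / clean_acc
```

Under the study's finding, the multimodal model would show a smaller relative drop than the corresponding unimodal model subjected to the same attack.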

Overall, this study provides insight into the robustness of deep neural network classifiers for emotion recognition and highlights the potential benefits of using multiple input modalities for this task.