End-to-End Speech Emotion Recognition Using Multimodal Data Fusion
Abstract
This paper presents an end-to-end framework for speech emotion recognition (SER) that integrates multimodal data fusion. We propose an approach that combines acoustic, linguistic, and visual features to improve both the accuracy and the robustness of emotion recognition. Deep neural networks perform modality-specific feature extraction and cross-modal fusion, and a unified classification framework produces the final emotion prediction. Experiments on benchmark datasets demonstrate that our method outperforms traditional unimodal SER systems.
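As a concrete illustration of the fusion-and-classification pipeline summarized above, the following PyTorch sketch shows one plausible realization using simple concatenation-based fusion of per-modality features. The class name MultimodalSERClassifier, the feature dimensions, and the four-class output are illustrative assumptions, not the paper's exact configuration.

    # A minimal sketch of concatenation-based multimodal fusion for SER.
    # Encoder architectures, feature dimensions, and the emotion-class
    # count are illustrative assumptions, not the paper's configuration.
    import torch
    import torch.nn as nn

    class MultimodalSERClassifier(nn.Module):
        def __init__(self, acoustic_dim=128, linguistic_dim=768,
                     visual_dim=512, hidden_dim=256, num_emotions=4):
            super().__init__()
            # Modality-specific projections stand in for the paper's
            # deep feature extractors (e.g., CNN/transformer encoders).
            self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
            self.linguistic_proj = nn.Linear(linguistic_dim, hidden_dim)
            self.visual_proj = nn.Linear(visual_dim, hidden_dim)
            # Unified classifier over the fused (concatenated) features.
            self.classifier = nn.Sequential(
                nn.Linear(3 * hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.3),
                nn.Linear(hidden_dim, num_emotions),
            )

        def forward(self, acoustic, linguistic, visual):
            # Project each modality, then fuse by concatenation.
            fused = torch.cat([
                torch.relu(self.acoustic_proj(acoustic)),
                torch.relu(self.linguistic_proj(linguistic)),
                torch.relu(self.visual_proj(visual)),
            ], dim=-1)
            return self.classifier(fused)  # emotion logits

    # Usage with dummy utterance-level features (batch of 8):
    model = MultimodalSERClassifier()
    logits = model(torch.randn(8, 128), torch.randn(8, 768),
                   torch.randn(8, 512))
    print(logits.shape)  # torch.Size([8, 4])

Concatenation is the simplest fusion strategy; attention-based or gated fusion would slot into the same pipeline by replacing the torch.cat step.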