Exploring the portrayal of emotions in complex narratives: the Stanford Emotional Narratives Dataset (SEND)
Time-series modeling, and the collection of high-quality time-series datasets, are central challenges in affective computing, the field concerned with recognizing and tracking human emotions over time. One dataset that poses a demanding test for contemporary time-series emotion recognition models is the Stanford Emotional Narratives Dataset (SEND).
The original SEND, introduced in 20XX, consists of multimodal videos of self-paced, unscripted emotional narratives, annotated continuously for emotional valence. The recently introduced SENDv2 extends the original with more recordings and more diverse narratives, providing an even more demanding benchmark for modern models.
Current approaches to time-series emotion recognition increasingly rely on multimodal deep learning models that draw temporal information from speech, facial expressions, physiological signals, and text. Techniques such as cross-modal gated attention mechanisms, multimodal federated learning, and spectral-temporal feature extraction from EEG data have been proposed to capture emotional dynamics effectively in continuous data streams.
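To make the last of these concrete, the snippet below sketches one common way to derive spectral-temporal features from a single EEG channel: sliding a window over the signal and computing average band power per canonical frequency band with SciPy's Welch estimator. The sampling rate, window length, and band edges are illustrative assumptions, not settings from any cited work.

```python
# Illustrative sketch: spectral-temporal features from one EEG channel via
# sliding-window band power. All parameters are hypothetical.
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def band_powers(window, fs):
    """Average power in each canonical EEG band for one window."""
    freqs, psd = welch(window, fs=fs, nperseg=min(len(window), 256))
    return [psd[(freqs >= lo) & (freqs < hi)].mean() for lo, hi in BANDS.values()]

def spectral_temporal_features(signal, fs=128, win_sec=2.0, hop_sec=0.5):
    """Slide a window over the signal and stack per-window band powers,
    yielding a (num_windows, num_bands) time series of spectral features."""
    win, hop = int(win_sec * fs), int(hop_sec * fs)
    starts = range(0, len(signal) - win + 1, hop)
    return np.array([band_powers(signal[s:s + win], fs) for s in starts])

# Example: 60 s of synthetic single-channel EEG sampled at 128 Hz.
features = spectral_temporal_features(np.random.randn(60 * 128))
print(features.shape)  # (num_windows, 5)
```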
Several new modeling approaches have been demonstrated on SENDv2, including a Transformer-based model and a multimodal Temporal Convolutional Network (TCN). The Transformer-based model set a new state of the art on the benchmark, while the multimodal TCN also performed strongly, underscoring the potential of convolutional architectures in affective computing.
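As a rough illustration of the second approach, the following PyTorch sketch shows a minimal multimodal TCN that projects per-timestep text, audio, and facial features to a shared size, fuses them by concatenation, and applies dilated causal convolutions before regressing valence at every timestep. The layer sizes, feature dimensions, and overall layout are assumptions for illustration, not the architecture evaluated on SENDv2.

```python
# Minimal sketch of a multimodal TCN for per-timestep valence regression.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalTCNBlock(nn.Module):
    """Dilated 1-D convolution with left padding so each step only sees the past."""
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, time)
        out = self.conv(F.pad(x, (self.pad, 0)))     # pad only on the left
        return torch.relu(out) + x                   # residual connection

class MultimodalTCN(nn.Module):
    def __init__(self, dims, hidden=64, levels=4):
        super().__init__()
        # Project each modality to a common channel size, then fuse by concatenation.
        self.proj = nn.ModuleDict({m: nn.Conv1d(d, hidden, 1) for m, d in dims.items()})
        fused_dim = hidden * len(dims)
        self.tcn = nn.Sequential(*[CausalTCNBlock(fused_dim, 2 ** i) for i in range(levels)])
        self.head = nn.Linear(fused_dim, 1)           # valence prediction per timestep

    def forward(self, inputs):                        # inputs[m]: (batch, time, dims[m])
        streams = [self.proj[m](x.transpose(1, 2)) for m, x in inputs.items()]
        fused = self.tcn(torch.cat(streams, dim=1))   # (batch, fused_dim, time)
        return self.head(fused.transpose(1, 2)).squeeze(-1)  # (batch, time)

# Example with hypothetical per-timestep feature sizes for text, audio, and face.
dims = {"text": 300, "audio": 88, "face": 35}
model = MultimodalTCN(dims)
batch = {m: torch.randn(2, 120, d) for m, d in dims.items()}
print(model(batch).shape)  # torch.Size([2, 120])
```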
Multimodal Federated Learning (FedMultiEmo) integrates visual cues processed by Convolutional Neural Networks (CNNs) with physiological signals classified by Random Forests in a privacy-preserving, real-time framework. This approach achieved up to 87% accuracy with multimodal fusion in automotive emotion recognition settings, demonstrating effective temporal and cross-modal integration.
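The privacy-preserving element of such federated setups comes down to clients training locally and sharing only model parameters, which a server then averages. The sketch below illustrates that federated-averaging loop with a toy logistic-regression client standing in for FedMultiEmo's CNN and Random Forest components; all names, shapes, and hyperparameters are hypothetical.

```python
# Sketch of federated averaging: clients keep their data, the server averages weights.
import numpy as np

def local_update(weights, features, labels, lr=0.1, epochs=5):
    """Toy local training step (logistic regression) on data that never leaves the client."""
    w = weights.copy()
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(features @ w)))
        grad = features.T @ (probs - labels) / len(labels)
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes):
    """Server-side aggregation: weight each client's parameters by its data size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three clients hold private (features, labels); only model weights reach the server.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 8)), rng.integers(0, 2, size=50)) for _ in range(3)]
global_w = np.zeros(8)
for _ in range(10):                                  # communication rounds
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = federated_average(updates, [len(y) for _, y in clients])
print(global_w.round(3))
```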
A cross-modal gated attention mechanism has been developed to fuse visual, textual, and acoustic modalities, improving recognition performance by focusing on the most informative temporal cues across modalities. For speech, methods that make better use of the temporal dynamics of the audio signal have improved emotion inference, underscoring the importance of temporal feature modeling for accurate recognition.
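A minimal sketch of the gating idea, assuming per-timestep feature vectors of equal dimension for the visual, textual, and acoustic streams: each modality is scaled by a learned sigmoid gate conditioned on all modalities before the streams are summed. This is a generic illustration, not the specific mechanism from the cited work.

```python
# Sketch of gated cross-modal fusion over time-aligned modality features.
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, dim, num_modalities=3):
        super().__init__()
        # One gate network per modality, conditioned on the concatenated features.
        self.gates = nn.ModuleList(
            [nn.Linear(dim * num_modalities, dim) for _ in range(num_modalities)]
        )

    def forward(self, modalities):          # list of (batch, time, dim) tensors
        joint = torch.cat(modalities, dim=-1)
        fused = 0.0
        for gate, x in zip(self.gates, modalities):
            g = torch.sigmoid(gate(joint))  # (batch, time, dim) gate in [0, 1]
            fused = fused + g * x           # suppress or pass each modality per timestep
        return fused                        # (batch, time, dim) fused features

# Example with visual, textual, and acoustic streams of 64-dim features.
fusion = GatedCrossModalFusion(dim=64)
streams = [torch.randn(2, 120, 64) for _ in range(3)]
print(fusion(streams).shape)  # torch.Size([2, 120, 64])
```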
Although none of the cited works explicitly report model results against human benchmarks on SEND, the best-performing multimodal emotion recognition systems generally approach human-level performance or exceed 80% accuracy on complex emotion datasets, a trend expected to continue as methods refine temporal modeling and cross-modal fusion.
The SENDv2 dataset includes over 100 hours of multimodal video, annotated for emotional valence over time. Its increased size and diversity should support more accurate and robust time-series emotion recognition models, and the study underscores the broader need for larger, more diverse datasets in affective computing.
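For continuously annotated valence of this kind, agreement between a model's predicted trajectory and the human annotation is often measured with the concordance correlation coefficient (CCC), which penalizes both shifts in mean and differences in scale. A minimal NumPy implementation, with a synthetic example rather than real SEND annotations:

```python
# Sketch of the concordance correlation coefficient (CCC) for valence trajectories.
import numpy as np

def ccc(pred, target):
    """CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2).
    Ranges from -1 to 1; 1 means perfect agreement in trend, scale, and offset."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mx, my = pred.mean(), target.mean()
    vx, vy = pred.var(), target.var()
    cov = ((pred - mx) * (target - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Example: a prediction that tracks the annotation but is attenuated and offset.
t = np.linspace(0, 10, 200)
annotation = np.sin(t)
prediction = 0.9 * np.sin(t) + 0.1
print(round(ccc(prediction, annotation), 3))
```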
The study also discusses SENDv2's potential for future research in time-series affective computing. The dataset is publicly available, allowing researchers to build on it directly, and future rounds of ongoing evaluations such as SemEval-2025 Task 11 may provide more direct results on SEND, offering insight into how the field continues to advance.
Research in health, wellness, and mental health stands to benefit from these advances in time-series modeling and affective computing. By providing a wealth of temporal data annotated for emotional valence, the expanded SENDv2 offers a demanding test for contemporary models, and progress on it could translate into more accurate and robust time-series emotion recognition systems with direct relevance to mental health research.