Introduction
In today’s era of information explosion, sentiment analysis (Sivakumar et al., 2022; Wu et al., 2024), a technique that automatically recognizes and extracts sentiment information from text or images, has received widespread attention and application. Traditional sentiment analysis typically concentrates on identifying the overall emotional tendency of a text, but in real-world applications users’ emotional expressions often span several dimensions, with different sentiments attached to different aspects. Aspect-based sentiment analysis (Meng et al., 2023; Wang & Li, 2023) has emerged in response: in addition to the overall sentiment, it resolves sentiment tendency at the level of specific aspects and thus provides more precise and valuable insights. Aspect-level sentiment analysis refers to the identification and categorization of sentiment tendencies toward specific aspects in texts such as user comments or social media posts (Jiang et al., 2011).
With the proliferation of social media and multimedia content, text analysis alone can no longer fully capture users’ emotional expressions. Sentiment analysis methods that analyze images and text jointly can offer a more precise and holistic understanding of a user’s emotional state. Advances in computer vision (Batch et al., 2023; Li et al., 2024; Yadav & Raj, 2021) and natural language processing (Shivahare et al., 2022) make it possible to extract visual features from images and semantic features from text and to fuse the two, which in turn makes combined image-text aspect-level sentiment analysis (ITASA) feasible. Deep learning models, particularly convolutional neural networks (CNNs; Bhuvaneshwari et al., 2022; Joloudari et al., 2023) and recurrent neural networks (RNNs; Alroobaea, 2022; Topbaş et al., 2021), have demonstrated strong capabilities in image and text processing and have become powerful tools for image and text sentiment analysis. Despite these advantages, CNNs and RNNs have drawbacks. CNNs capture local features well but have limited ability to capture long-range dependencies and sequential information in text, which may lead to poor results on complex textual sentiment. RNNs, especially traditional RNNs, suffer from vanishing and exploding gradients because gradients are propagated step by step through the sequence, which impairs training on long sequences.
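The gradient problem of traditional RNNs can be illustrated with a minimal sketch. It is not part of any model discussed here: it assumes a hypothetical scalar recurrence h_t = tanh(w * h_{t-1}) (or h_t = w * h_{t-1} when the activation is dropped), and the function name and parameters are chosen purely for illustration. By the chain rule, the gradient of the last state with respect to the first is a product of one factor per time step, so it shrinks or grows roughly geometrically with sequence length.

```python
import math


def gradient_through_time(w: float, steps: int, h0: float = 0.5,
                          linear: bool = False) -> float:
    """Return |d h_T / d h_0| for a toy scalar recurrence.

    Assumed recurrence (illustrative only):
        h_t = tanh(w * h_{t-1})   if linear is False
        h_t = w * h_{t-1}         if linear is True
    """
    h = h0
    grad = 1.0
    for _ in range(steps):
        pre = w * h
        if linear:
            h = pre
            grad *= w                    # d(w * h)/dh = w
        else:
            h = math.tanh(pre)
            grad *= w * (1.0 - h * h)    # d tanh(w * h)/dh = w * (1 - tanh^2)
    return abs(grad)


if __name__ == "__main__":
    # |w| < 1: every factor is below 1, so the gradient vanishes over 50 steps.
    print("vanishing:", gradient_through_time(0.5, 50))
    # |w| > 1 without a saturating activation: the gradient explodes.
    print("exploding:", gradient_through_time(1.5, 50, linear=True))
```

In this toy setting, a recurrent weight of 0.5 drives the gradient toward zero over 50 steps, while a weight of 1.5 in the linear case makes it grow by many orders of magnitude. Real RNNs use weight matrices rather than scalars, but the same repeated multiplication underlies the training difficulty on long sequences.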