Abstract
While automated text analysis has become widespread and image analysis is gaining interest, multimodal analysis that combines text and image information remains rare. Yet many data sources, such as social media posts, are intrinsically multimodal. This study compares three practical workflows for clustering text-image pairs: (1) label-level combination, which clusters text and image separately and combines the resulting labels; (2) vector-level combination, which clusters concatenated embeddings extracted from each modality; and (3) joint embedding, which clusters unified representations from multimodal embedding models such as CLIP. We also introduce a set of reusable evaluation tools to help researchers compare, validate, and benchmark multimodal clustering workflows: the Adjusted Mutual Information (AMI) to assess text-image alignment, the S_Dbw index to evaluate the number of clusters, and within-cluster consistency to validate interpretability. We validate the methods on a Chinese protest dataset from social media with 336,921 text-image pairs, and test robustness and scope conditions using a smaller U.S. news dataset on gun violence with 1,297 news headlines. We find that when text and image provide distinct, non-overlapping information, the second and third methods outperform the first. This study serves as a bridge between the text-as-data and image-as-data communities, as well as computational social science.
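To make the three workflows concrete, the sketch below illustrates them in Python under simplifying assumptions: `text_emb`, `image_emb`, and `joint_emb` are placeholder arrays standing in for precomputed text, image, and multimodal (e.g., CLIP) embeddings, and the cluster count `k` is arbitrary. It is not the paper's exact pipeline, only a minimal illustration of the three combination strategies and the AMI-based alignment check.

```python
# Minimal sketch of the three clustering workflows; embeddings and k are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n, k = 1000, 10
text_emb = rng.normal(size=(n, 384))    # stand-in for text embeddings (e.g., sentence encoder)
image_emb = rng.normal(size=(n, 512))   # stand-in for image embeddings (e.g., CLIP image tower)
joint_emb = rng.normal(size=(n, 512))   # stand-in for unified multimodal embeddings

# (1) Label-level combination: cluster each modality separately, then combine/compare the labels.
text_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalize(text_emb))
image_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalize(image_emb))
# AMI quantifies how well the text-based and image-based clusterings align.
print("text-image AMI:", adjusted_mutual_info_score(text_labels, image_labels))

# (2) Vector-level combination: concatenate per-modality embeddings, then cluster the joint vectors.
combined = np.hstack([normalize(text_emb), normalize(image_emb)])
combined_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(combined)

# (3) Joint embedding: cluster unified multimodal representations directly.
joint_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalize(joint_emb))

# The S_Dbw index (for selecting k) and within-cluster consistency checks are omitted here;
# they can be computed from the fitted labels with separate implementations.
```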
Type
Publication
Sociological Methodology