Abstract
While automated text analysis has become widespread and image analysis is gaining interest, multimodal analysis that combines text and image information remains rare. Yet many data sources, such as social media posts, are intrinsically multimodal. This study compares three practical workflows for clustering text-image pairs: (1) label-level combination, which clusters text and image separately and combines the resulting labels; (2) vector-level combination, which clusters concatenated embeddings extracted from each modality; and (3) joint embedding, which clusters unified representations from multimodal embedding models such as CLIP. We also introduce a set of reusable evaluation tools to help researchers compare, validate, and benchmark multimodal clustering workflows: the Adjusted Mutual Information (AMI) to assess text-image alignment, the S_Dbw index to evaluate the number of clusters, and within-cluster consistency to validate interpretability. We validate the methods on a Chinese protest dataset from social media with 336,921 text-image pairs, and test robustness and scope conditions using a smaller U.S. news dataset on gun violence with 1,297 news headlines. We find that when text and image provide distinct, non-overlapping information, the second and third methods outperform the first. This study serves as a bridge between the text-as-data and image-as-data communities, as well as computational social science.
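To make the three workflows concrete, the sketch below illustrates them in Python under simplifying assumptions: `text_emb`, `image_emb`, and `joint_emb` are placeholder arrays standing in for precomputed text, image, and multimodal (e.g., CLIP) embeddings, and the cluster count `k` is arbitrary. It is not the paper's exact pipeline, only a minimal illustration of the three combination strategies and the AMI-based alignment check.

```python
# Minimal sketch of the three clustering workflows; embeddings and k are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n, k = 1000, 10
text_emb = rng.normal(size=(n, 384))    # stand-in for text embeddings (e.g., sentence encoder)
image_emb = rng.normal(size=(n, 512))   # stand-in for image embeddings (e.g., CLIP image tower)
joint_emb = rng.normal(size=(n, 512))   # stand-in for unified multimodal embeddings

# (1) Label-level combination: cluster each modality separately, then combine/compare the labels.
text_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalize(text_emb))
image_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalize(image_emb))
# AMI quantifies how well the text-based and image-based clusterings align.
print("text-image AMI:", adjusted_mutual_info_score(text_labels, image_labels))

# (2) Vector-level combination: concatenate per-modality embeddings, then cluster the joint vectors.
combined = np.hstack([normalize(text_emb), normalize(image_emb)])
combined_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(combined)

# (3) Joint embedding: cluster unified multimodal representations directly.
joint_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalize(joint_emb))

# The S_Dbw index (for selecting k) and within-cluster consistency checks are omitted here;
# they can be computed from the fitted labels with separate implementations.
```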
Type
Publication
Sociological Methodology