Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets, either videos or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results by leveraging the power of both model-based mental inference and language models.
MMToM-QA systematically evaluates the cognitive ability to understand people's minds on both multimodal data and different kinds of unimodal data. MMToM-QA consists of 600 questions, categorized into seven types that evaluate belief inference and goal inference in rich and diverse situations. Each of the three belief inference types has 100 questions, totaling 300 belief questions; each of the four goal inference types has 75 questions, totaling 300 goal questions.
The MMToM-QA benchmark and usage instructions are available in the GitHub repository. A text-only version is also available for download from Hugging Face.
Each question is paired with a clip of the full activity in a video (as RGB-D frames), as well as a text description of the scene and the actions taken by the person in that clip.
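For concreteness, the sketch below shows one plausible way to represent a single MMToM-QA example in Python. The class and field names (`MMToMQAExample`, `rgbd_frame_paths`, and so on) are hypothetical and introduced only for illustration; the released data format and loading code are documented in the GitHub repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MMToMQAExample:
    """Hypothetical container for one MMToM-QA example (field names are illustrative)."""
    question: str                    # a belief- or goal-inference question
    choices: List[str]               # candidate answers to choose between
    rgbd_frame_paths: List[str]      # RGB-D frames of the paired video clip
    scene_description: str           # text description of the household scene
    action_descriptions: List[str]   # text descriptions of the person's actions in the clip
    answer: str                      # the correct choice


example = MMToMQAExample(
    question="Which statement is more likely to be true about the person's goal?",
    choices=[
        "The person has been trying to get a plate.",
        "The person has been trying to get a wine glass.",
    ],
    rgbd_frame_paths=["frames/0001.png", "frames/0002.png"],
    scene_description="The kitchen contains a fridge, a cabinet, and a coffee table.",
    action_descriptions=["walks towards the kitchen", "opens the cabinet"],
    answer="The person has been trying to get a plate.",
)
```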
Figure: (a) Types of data provided; (b) the procedural generation process.
We propose Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM), a novel method for engineering multimodal Theory of Mind. BIP-ALM extracts unified symbolic representations from multimodal data and utilizes language models for scalable Bayesian inverse planning.
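To convey the core idea, here is a heavily simplified sketch of Bayesian inverse planning in which a language model scores per-step action likelihoods over symbolic text descriptions of states, goals, and beliefs. The prompt format, the uniform prior, the choice of "gpt2" as the scoring model, and the helper names are all assumptions for illustration, not the actual BIP-ALM implementation.

```python
import math
from itertools import product

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small open-weight causal LM works for this sketch; "gpt2" is an arbitrary choice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def lm_log_prob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`.

    Assumes tokenizing `prompt` yields a prefix of tokenizing `prompt + continuation`,
    which holds for simple whitespace-separated text.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    targets = full_ids[0, 1:]
    idx = torch.arange(prompt_len - 1, targets.shape[0])     # continuation positions
    return log_probs[idx, targets[idx]].sum().item()


def posterior_over_hypotheses(states, actions, goals, beliefs):
    """Simplified Bayesian inverse planning over (goal, belief) hypotheses:

        P(g, b | s_1:T, a_1:T) ∝ P(g, b) * prod_t P(a_t | s_t, g, b),

    with the per-step action likelihood estimated by a language model over
    symbolic text descriptions.
    """
    log_post = {}
    for goal, belief in product(goals, beliefs):
        log_p = 0.0  # log of a uniform prior (constant, so it cancels after normalization)
        for state, action in zip(states, actions):
            prompt = f"goal: {goal}\nbelief: {belief}\nstate: {state}\naction:"
            log_p += lm_log_prob(prompt, f" {action}")
        log_post[(goal, belief)] = log_p
    # Normalize in log space for numerical stability.
    z = max(log_post.values())
    log_total = z + math.log(sum(math.exp(v - z) for v in log_post.values()))
    return {h: math.exp(v - log_total) for h, v in log_post.items()}
```

The full method additionally extracts the symbolic representations automatically from video and text and tracks how the hypothesized beliefs change over time; this sketch only illustrates the inverse-planning step.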
We evaluate large language models (e.g., GPT-4, LLaMA 2) on the text-only version of MMToM-QA (a minimal evaluation sketch is shown below).
We assess large multimodal models (e.g., GPT-4V, InstructBLIP, LLaVA) on both multimodal and video-only versions of MMToM-QA.
Furthermore, we measure human performance on these tasks.
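For the text-only evaluation mentioned above, a minimal likelihood-comparison loop might look like the following. It reuses the hypothetical `MMToMQAExample` fields and the assumed `lm_log_prob` helper from the earlier sketches; API-only models such as GPT-4 would instead be prompted to generate an answer that is then parsed.

```python
from typing import Iterable


def evaluate_text_only(examples: Iterable[MMToMQAExample]) -> float:
    """Accuracy of a language model on text-only MMToM-QA questions, scored by
    comparing the log-probability of each candidate answer. Relies on the
    hypothetical example fields and the lm_log_prob helper defined above."""
    examples = list(examples)
    correct = 0
    for ex in examples:
        prompt = (
            f"{ex.scene_description}\n"
            f"Actions: {'; '.join(ex.action_descriptions)}\n"
            f"Question: {ex.question}\nAnswer:"
        )
        scores = [lm_log_prob(prompt, " " + choice) for choice in ex.choices]
        prediction = ex.choices[scores.index(max(scores))]
        correct += int(prediction == ex.answer)
    return correct / len(examples)
```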
A summary of our findings: large language models and large multimodal models still lack robust ToM capacity, whereas BIP-ALM shows promising results.
BIP-ALM effectively reasons about a person's mental state and tracks the changes in the mental state over time, benefiting from (1) the modality-invariance of symbolic representations, (2) the robustness of inverse planning, and (3) the scalability and zero-shot flexibility of language models.
@article{jin2024mmtom,
  title={{MMToM-QA}: Multimodal Theory of Mind Question Answering},
  author={Jin, Chuanyang and Wu, Yutong and Cao, Jing and Xiang, Jiannan and Kuo, Yen-Ling and Hu, Zhiting and Ullman, Tomer and Torralba, Antonio and Tenenbaum, Joshua B and Shu, Tianmin},
  journal={arXiv preprint arXiv:2401.08743},
  year={2024}
}