MMToM-QA: Multimodal Theory of Mind Question Answering

Chuanyang Jin1,2
Yutong Wu3
Jing Cao2
Jiannan Xiang4
Yen-Ling Kuo2,5
Zhiting Hu4
Tomer Ullman3
Antonio Torralba2
Joshua Tenenbaum2
Tianmin Shu6
3 Harvard


  • Theory of Mind (ToM), the ability to understand people's minds, is an essential ingredient for developing machines with human-level social intelligence. Existing ToM benchmarks use unimodal datasets, either video or text. We introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark.
  • MMToM-QA comprehensively evaluates machine Theory of Mind both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment.
  • To engineer multimodal Theory of Mind capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning.

MMToM-QA Benchmark

MMToM-QA is the first multimodal benchmark to evaluate machine Theory of Mind (ToM), the ability to understand people's minds. It consists of 600 questions, each with two choices. Each question is paired with a clip of the full activity video (as RGB-D frames), together with a text description of the scene and of the actions the person takes in that clip. The questions fall into seven types that evaluate belief inference and goal inference in rich and diverse situations: three belief inference types with 100 questions each (300 belief questions) and four goal inference types with 75 questions each (300 goal questions). The questions are paired with 134 videos of a person looking for everyday objects in household environments.
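
To make the composition concrete, the sketch below models one benchmark item as a Python dataclass and checks the question counts stated above. The field names and the type labels (`belief_1`, `goal_1`, and so on) are illustrative assumptions, not the released data format; the benchmark instructions describe the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


# Hypothetical record layout: every field name here is an assumption,
# not the schema of the released benchmark files.
@dataclass
class Question:
    question_type: str          # one of the seven types (placeholder labels below)
    video_clip: str             # identifier of the paired RGB-D clip
    text_description: str       # scene and action description for the clip
    choices: Tuple[str, str]    # every question has exactly two choices
    answer: int                 # index (0 or 1) of the correct choice


# Counts from the benchmark description:
# three belief types x 100 questions, four goal types x 75 questions.
QUESTIONS_PER_TYPE: Dict[str, int] = {
    **{f"belief_{i}": 100 for i in range(1, 4)},
    **{f"goal_{i}": 75 for i in range(1, 5)},
}

belief_total = sum(n for t, n in QUESTIONS_PER_TYPE.items() if t.startswith("belief"))
goal_total = sum(n for t, n in QUESTIONS_PER_TYPE.items() if t.startswith("goal"))
total = belief_total + goal_total
```

Here `belief_total` and `goal_total` each come to 300, and `total` matches the 600 questions in the benchmark.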

The instructions for using the MMToM-QA benchmark are available here.

Question types in MMToM-QA, with examples:

Example videos:

Types of data provided in MMToM-QA:

Procedural generation:

Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM)

We propose Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM), a novel method to engineer multimodal Theory of Mind. This method:
  • Extracts symbolic representations from video (via the Visual Perception Module) and text (using GPT-4);
  • Aligns and fuses these representations to form a unified representation of the event and the physical scene;
  • Conducts inverse inference about the agent's goal and belief using finetuned language models.
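
The inverse inference step can be illustrated with a minimal Python sketch of Bayesian inverse planning over a small set of hypotheses. This is a generic illustration, not the authors' implementation: `toy_likelihood` is a hand-written stand-in for the finetuned language model that BIP-ALM uses to score actions, each hypothesis is held fixed over time for simplicity, and the fridge/cabinet scenario and all probabilities are invented for the example.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

# A hypothesis is a (goal, believed location of the goal object) pair.
Hypothesis = Tuple[str, str]


def posterior(
    hypotheses: List[Hypothesis],
    actions: List[str],
    action_likelihood: Callable[[str, Hypothesis], float],
    prior: Optional[Dict[Hypothesis, float]] = None,
) -> Dict[Hypothesis, float]:
    """Return P(h | actions), proportional to P(h) * prod_t P(a_t | h).

    Scores are accumulated in log space and normalized with log-sum-exp.
    """
    log_scores = {}
    for h in hypotheses:
        # Uniform prior unless one is supplied.
        p0 = prior[h] if prior is not None else 1.0 / len(hypotheses)
        log_scores[h] = math.log(p0) + sum(
            math.log(action_likelihood(a, h)) for a in actions
        )
    m = max(log_scores.values())
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {h: math.exp(s - m) / z for h, s in log_scores.items()}


def toy_likelihood(action: str, hypothesis: Hypothesis) -> float:
    """Assumed stand-in for a learned policy: actions directed at the
    believed location are likely (0.8), all others unlikely (0.2)."""
    _goal, believed_location = hypothesis
    return 0.8 if believed_location in action else 0.2


hypotheses = [("apple", "fridge"), ("apple", "cabinet")]
actions = ["walk to fridge", "open fridge"]
result = posterior(hypotheses, actions, toy_likelihood)
```

With these toy numbers, the hypothesis that the person believes the apple is in the fridge receives most of the posterior mass, because both observed actions are far more likely under that belief.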

Quantitative Results

We evaluate large language models (e.g., GPT-4, LLaMA 2) on the text-only version of MMToM-QA. We assess large multimodal models (e.g., GPT-4V, InstructBLIP, LLaVA) on both multimodal and video-only versions of MMToM-QA. Furthermore, we measure human performance in these tasks.
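
Because every question has two choices, one straightforward way to score a model is accuracy, with random guessing expected to reach 50%. The sketch below computes accuracy per question type; the record keys (`type`, `answer`, `prediction`) and the sample records are assumptions for illustration, not the benchmark's released evaluation script or real results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

CHANCE_ACCURACY = 0.5  # expected accuracy of random guessing on two-choice questions


def accuracy_by_type(records: Iterable[Mapping]) -> Dict[str, float]:
    """Fraction of correct predictions, grouped by question type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        correct[r["type"]] += int(r["prediction"] == r["answer"])
    return {t: correct[t] / total[t] for t in total}


# Made-up records, only to show the expected input shape.
records = [
    {"type": "belief", "answer": 0, "prediction": 0},
    {"type": "belief", "answer": 1, "prediction": 0},
    {"type": "goal", "answer": 1, "prediction": 1},
    {"type": "goal", "answer": 0, "prediction": 0},
]
scores = accuracy_by_type(records)
above_chance = {t: s > CHANCE_ACCURACY for t, s in scores.items()}
```

Comparing each per-type score against `CHANCE_ACCURACY` makes it easy to see where a model does no better than guessing.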

A summary of our findings:
  • The human experiment validates our benchmark design and provides a natural performance yardstick.
  • Current large language models and large multimodal models lack robust ToM capacity.
  • BIP-ALM shows promising results and generalizes effectively.

Qualitative Results & Discussion

  • GPT-4 shows strong performance when people have true beliefs, yet it makes systematic errors when people have false beliefs or update their beliefs, and it has poor judgment on goals.
  • BIP-ALM effectively reasons about a person's mental state and tracks the changes in the mental state over time.
  • BIP-ALM benefits from (1) the modality-invariance of symbolic representations, (2) the robustness of inverse planning, and (3) the scalability and zero-shot flexibility of language models.

How BIP-ALM evaluates the likelihood of different hypotheses via the action likelihood estimation:

How different modalities contribute to the mental state reasoning:


Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua Tenenbaum, Tianmin Shu
MMToM-QA: Multimodal Theory of Mind Question Answering
arXiv Preprint
A short version presented at NeurIPS'23 FMDM Workshop