Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets, either video or text. Human ToM, in contrast, is more than video or text understanding: people can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, in contrast, shows promising results by leveraging the power of both model-based mental inference and language models.

MMToM-QA Benchmark

MMToM-QA systematically evaluates the cognitive ability to understand people's minds both on multimodal data and on different kinds of unimodal data. MMToM-QA consists of 600 questions, categorized into seven types that evaluate belief inference and goal inference in rich and diverse situations. Each of the three belief inference types has 100 questions (300 belief questions in total); each of the four goal inference types has 75 questions (300 goal questions in total).

The instructions for using the MMToM-QA benchmark are available here.

Question types in MMToM-QA with examples.

Each question is paired with a clip of the full activity in a video (as RGB-D frames), as well as a text description of the scene and the actions taken by the person in that clip.


(a) Types of data provided


(b) The procedural generation process

Example Videos

Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM)

We propose Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM), a novel method to engineer multimodal Theory of Mind. This method:

  • Extracts, aligns, and fuses symbolic representations from video and text
  • Conducts inverse inference about the agent's goal and belief using language models
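The inverse-inference step can be sketched as standard Bayesian inverse planning: score each candidate (goal, belief) hypothesis by the likelihood of the observed actions and normalize into a posterior. This is a minimal illustration, not the paper's implementation; the injected `action_logprob` function and the toy likelihood below are hypothetical stand-ins for the language-model likelihood estimator BIP-ALM uses.

```python
import math

def bayesian_inverse_planning(hypotheses, actions, action_logprob, prior=None):
    """Posterior over hypotheses given observed actions.

    `action_logprob(action, hypothesis)` is assumed to return
    log P(action | hypothesis); in BIP-ALM this estimate would come
    from a language model, but here it is an injected function.
    """
    if prior is None:
        prior = {h: 1.0 / len(hypotheses) for h in hypotheses}
    # Unnormalized log posterior: log prior + sum of action log-likelihoods.
    log_post = {
        h: math.log(prior[h]) + sum(action_logprob(a, h) for a in actions)
        for h in hypotheses
    }
    # Normalize with log-sum-exp for numerical stability.
    m = max(log_post.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
    return {h: math.exp(v - log_z) for h, v in log_post.items()}

# Toy likelihood (hand-written, not a language model): an agent is
# likelier to take actions that mention its goal object.
def toy_logprob(action, hypothesis):
    return math.log(0.8) if action.endswith(hypothesis) else math.log(0.2)

posterior = bayesian_inverse_planning(
    hypotheses=["fridge", "cabinet"],
    actions=["walk to fridge", "open fridge"],
    action_logprob=toy_logprob,
)
```

With both observed actions pointing at the fridge, the posterior concentrates on the "fridge" hypothesis.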


Quantitative Results

We evaluate large language models (e.g., GPT-4, LLaMA 2) on the text-only version of MMToM-QA. We assess large multimodal models (e.g., GPT-4V, InstructBLIP, LLaVA) on both multimodal and video-only versions of MMToM-QA. Furthermore, we measure human performance in these tasks.

A summary of our findings:

  • The human experiment verifies the benchmark design and provides a natural performance yardstick
  • Large multimodal models and LLMs perform no better than random guessing across all question types
    • As an exception, GPT-4V excels at true-belief questions, yet it makes systematic mistakes when people hold false beliefs or update their beliefs, and it judges goals poorly
  • BIP-ALM shows promising results and generalizes effectively


Qualitative Results & Discussion

BIP-ALM effectively reasons about a person's mental state and tracks the changes in the mental state over time, benefiting from (1) the modality-invariance of symbolic representations, (2) the robustness of inverse planning, and (3) the scalability and zero-shot flexibility of language models.

  • How BIP-ALM evaluates the likelihood of different hypotheses via action likelihood estimation.
  • How different modalities contribute to mental state reasoning.
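As a minimal sketch of action likelihood estimation: given symbolic representations of the state, a hypothesized goal, and a hypothesized belief, one can score an observed action by the log-probability a language model assigns to its tokens. The `token_logprob(prompt, continuation)` interface and the stub scorer below are hypothetical placeholders, not the paper's actual prompts or model.

```python
import math

def lm_action_loglikelihood(state, goal, belief, action, token_logprob):
    """Estimate log P(action | state, goal, belief) by scoring the
    action text as a continuation of a symbolic prompt.

    `token_logprob(prompt, continuation)` stands in for a real
    language-model scoring call (hypothetical interface).
    """
    prompt = (f"state: {state}\n"
              f"goal: {goal}\n"
              f"belief: {belief}\n"
              f"action:")
    return token_logprob(prompt, " " + action)

# Stub scorer for illustration: pretends the model prefers actions
# whose target object appears in the goal line of the prompt.
def stub_scorer(prompt, continuation):
    goal_line = next(l for l in prompt.splitlines() if l.startswith("goal:"))
    target = continuation.strip().split()[-1]
    return math.log(0.9) if target in goal_line else math.log(0.1)

ll_fridge = lm_action_loglikelihood(
    "agent is in the kitchen", "fridge", "the apple is in the fridge",
    "walk to fridge", stub_scorer)
ll_cabinet = lm_action_loglikelihood(
    "agent is in the kitchen", "cabinet", "the apple is in the fridge",
    "walk to fridge", stub_scorer)
```

Comparing the two scores ranks the "fridge" goal above the "cabinet" goal for the observed action, which is the signal the inverse-planning step consumes.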


        @article{jin2024mmtomqa,
          title={MMToM-QA: Multimodal Theory of Mind Question Answering},
          author={Jin, Chuanyang and Wu, Yutong and Cao, Jing and Xiang, Jiannan and Kuo, Yen-Ling and Hu, Zhiting and Ullman, Tomer and Torralba, Antonio and Tenenbaum, Joshua B and Shu, Tianmin},
          journal={arXiv preprint arXiv:2401.08743},
          year={2024}
        }