MMToM-QA: Multimodal Theory of Mind Question Answering

ACL 2024
Outstanding Paper Award
NYU · Harvard · MIT · UCSD · UVA · JHU

MMToM-QA systematically evaluates the cognitive ability to understand people's minds (Theory of Mind), both on multimodal data and on each kind of unimodal data alone. Questions span seven categories, covering belief inference and goal inference in rich and diverse situations.

For details, see our project page and paper. Submission instructions live in the GitHub repo.

| # | Method | Reference | Belief (%) | Goal (%) | All (%) |
|---|--------|-----------|-----------:|---------:|--------:|
| – | Human | | 97.5 | 88.5 | 93.0 |
| 1 | AutoToM + Model Spec. (w/ GPT-4o) (SOTA) | Zhang et al., '25 | 94.0 | 65.7 | 79.8 |
| 2 | BIP-ALM (w/ LLaMA 2) | Jin et al., '24 | 80.3 | 73.3 | 76.7 |
| 3 | AutoToM (w/ GPT-4o) | Zhang et al., '25 | 88.7 | 62.3 | 75.5 |
| 4 | o3-mini | | 88.7 | 40.7 | 64.7 |
| 5 | Gemini 2.0 Flash Thinking | | 73.3 | 34.7 | 54.0 |
| 6 | SimToM (w/ GPT-4o) | Wilf et al., '23 | 75.7 | 26.3 | 51.0 |
| 7 | Gemini 2.0 Pro | | 57.0 | 44.7 | 50.8 |
| 8 | Gemini 2.0 Flash | | 62.7 | 33.3 | 48.0 |
| 9 | InstructBLIP | | 48.7 | 44.7 | 46.7 |
| 10 | GPT-4o | | 55.7 | 32.3 | 44.0 |
| 11 | Llama 3.1 70B | | 51.3 | 36.3 | 43.8 |
| 12 | LLaVA | | 43.0 | 44.0 | 43.5 |
| 13 | Video-LLaMA 2 | | 42.0 | 38.3 | 40.2 |
| 14 | GPT-4V | | 55.3 | 34.7 | 40.0 |

* indicates that the method was tested on a 200-sample subset of MMToM-QA, as reported by Kim et al., '25.
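
The Belief, Goal, and All columns are accuracies over the benchmark's two question families. The sketch below shows one straightforward way to compute these columns from a file of model predictions; the JSONL format and field names (`question_type`, `answer`, `prediction`) are illustrative assumptions, not MMToM-QA's actual schema.

```python
import json
from collections import defaultdict

def score_predictions(path: str) -> dict:
    """Compute Belief, Goal, and All accuracies (in %) from a JSONL file.

    Assumed (illustrative) record format, one JSON object per line:
      {"question_type": "belief" | "goal", "answer": "a", "prediction": "a"}
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            qtype = record["question_type"]  # "belief" or "goal" (assumed field)
            total[qtype] += 1
            if record["prediction"] == record["answer"]:
                correct[qtype] += 1

    belief = 100.0 * correct["belief"] / total["belief"]
    goal = 100.0 * correct["goal"] / total["goal"]
    # With balanced belief/goal question counts, the overall accuracy
    # equals the unweighted mean of the two per-type accuracies.
    overall = 100.0 * (correct["belief"] + correct["goal"]) / (
        total["belief"] + total["goal"]
    )
    return {"Belief": round(belief, 1), "Goal": round(goal, 1), "All": round(overall, 1)}

# Example usage (hypothetical predictions file):
print(score_predictions("predictions.jsonl"))
```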