MMToM-QA: Multimodal Theory of Mind Question Answering

NYU, Harvard, MIT, UCSD, UVA, JHU
ACL 2024
Outstanding Paper Award

MMToM-QA systematically evaluates Theory of Mind, the cognitive ability to understand people's minds, on multimodal data as well as on each unimodal data type (text only and video only). The questions are categorized into seven types, covering belief inference and goal inference in rich and diverse situations.

For a detailed explanation of the evaluation metric and analysis of the results, please refer to our blog and paper. The instructions for using or submitting to the MMToM-QA benchmark are available here.
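As a rough illustration of how leaderboard numbers like those below are typically computed, here is a minimal scoring sketch in Python. It assumes each question is multiple-choice with a single correct answer and is labeled as either a belief or a goal question; the field names (`id`, `category`, `answer`) and the pooling of the two categories into an "All" score are assumptions for illustration, not the official submission format, which is described in the instructions linked above.

```python
from collections import defaultdict


def score_predictions(examples, predictions):
    """Compute per-category and pooled accuracy for two-category ToM questions.

    examples: list of dicts with hypothetical keys "id", "category"
        ("belief" or "goal"), and "answer" (the correct option).
    predictions: dict mapping a question id to the predicted option.
    Returns accuracies in percent for each category plus "all"
    (pooled over every question).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["category"]] += 1
        if predictions.get(ex["id"]) == ex["answer"]:
            correct[ex["category"]] += 1
    acc = {cat: 100.0 * correct[cat] / total[cat] for cat in total}
    acc["all"] = 100.0 * sum(correct.values()) / sum(total.values())
    return acc


# Example with made-up data: two belief questions and one goal question.
examples = [
    {"id": "q1", "category": "belief", "answer": "a"},
    {"id": "q2", "category": "belief", "answer": "b"},
    {"id": "q3", "category": "goal", "answer": "a"},
]
predictions = {"q1": "a", "q2": "a", "q3": "a"}
print(score_predictions(examples, predictions))
# {'belief': 50.0, 'goal': 100.0, 'all': 66.66...}
```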

Multimodal

| Method | Belief | Goal | All |
| --- | --- | --- | --- |
| Human | 97.5 | 88.5 | 93.0 |
| AutoToM + Model Spec. (w/ GPT-4o) (Zhang et al., '25) | 94.0 | 65.7 | 79.8 |
| BIP-ALM (w/ LLaMA 2) (Jin et al., '24) | 80.3 | 73.3 | 76.7 |
| AutoToM (w/ GPT-4o) (Zhang et al., '25) | 88.7 | 62.3 | 75.5 |
| o3-mini | 88.7 | 40.7 | 64.7 |
| Gemini 2.0 Flash Thinking | 73.3 | 34.7 | 54.0 |
| SimToM (w/ GPT-4o) (Wilf et al., '23) | 75.7 | 26.3 | 51.0 |
| Gemini 2.0 Pro | 57.0 | 44.7 | 50.8 |
| Gemini 2.0 Flash | 62.7 | 33.3 | 48.0 |
| InstructBLIP | 48.7 | 44.7 | 46.7 |
| GPT-4o | 55.7 | 32.3 | 44.0 |
| Llama 3.1 70B | 51.3 | 36.3 | 43.8 |
| LLaVA | 43.0 | 44.0 | 43.5 |
| Video-LLaMA 2 | 42.0 | 38.3 | 40.2 |
| GPT-4V | 55.3 | 34.7 | 40.0 |

Text Only

| Method | Belief | Goal | All |
| --- | --- | --- | --- |
| Human | 91.0 | 74.0 | 82.5 |
| o1* | 95.1 | 59.2 | 76.5 |
| o3-mini* | 97.1 | 44.9 | 71.5 |
| BIP-ALM (w/ LLaMA 2) (Jin et al., '24) | 82.3 | 58.7 | 70.5 |
| Thought-tracing + CoT (w/ GPT-4o)* (Kim et al., '25) | 93.1 | 42.9 | 69.0 |
| SymbolicToM (w/ GPT-4) (Sclar et al., '23) | 78.3 | 47.7 | 63.0 |
| Thought-tracing (w/ GPT-4o)* (Kim et al., '25) | 78.4 | 37.8 | 60.0 |
| GPT-4o* | 74.5 | 39.8 | 56.5 |
| SimToM (w/ GPT-4o) (Wilf et al., '23) | 64.3 | 40.7 | 52.5 |
| GPT-J | 49.0 | 52.3 | 50.7 |
| LLaMA 2 | 56.3 | 44.3 | 50.3 |
| Gemini 1.5 Pro* | 70.6 | 28.6 | 50.0 |
| DeepSeek R1* | 73.5 | 24.5 | 49.0 |
| GPT-4 | 62.0 | 34.0 | 48.0 |
| Llama 3.3 70B* | 50.0 | 42.9 | 47.0 |
| Qwen 2.5 72B* | 51.0 | 32.7 | 41.5 |
| GPT-3.5 | 43.7 | 33.0 | 38.3 |

Video Only

| Method | Belief | Goal | All |
| --- | --- | --- | --- |
| Human | 73.3 | 64.6 | 68.9 |
| BIP-ALM (w/ LLaMA 2) (Jin et al., '24) | 64.0 | 58.3 | 61.2 |
| InstructBLIP | 49.3 | 52.3 | 50.8 |
| Video-LLaMA 2 | 41.0 | 51.0 | 46.0 |
| GPT-4V | 45.7 | 46.3 | 46.0 |
| LLaVA | 39.0 | 45.3 | 42.2 |

* indicates that the method was tested on a 200-sample subset of MMToM-QA, as reported by Kim et al., '25.