MMToM-QA systematically evaluates the cognitive ability to understand people's minds (Theory of Mind) on multimodal data as well as on each unimodal data type alone. The questions are grouped into seven types, covering belief inference and goal inference in rich and diverse situations.
For a detailed explanation of the evaluation metric and analysis of the results, please refer to our blog and paper. The instructions for using or submitting to the MMToM-QA benchmark are available here.
Modality | Method | Belief | Goal | All
---|---|---|---|---
Multimodal | Human | 97.5 | 88.5 | 93.0
Multimodal | AutoToM + Model Spec. (w/ GPT-4o), Zhang et al., '25 | 94.0 | 65.7 | 79.8
Multimodal | BIP-ALM (w/ LLaMA 2), Jin et al., '24 | 80.3 | 73.3 | 76.7
Multimodal | AutoToM (w/ GPT-4o), Zhang et al., '25 | 88.7 | 62.3 | 75.5
Multimodal | o3-mini | 88.7 | 40.7 | 64.7
Multimodal | Gemini 2.0 Flash Thinking | 73.3 | 34.7 | 54.0
Multimodal | SimToM (w/ GPT-4o), Wilf et al., '23 | 75.7 | 26.3 | 51.0
Multimodal | Gemini 2.0 Pro | 57.0 | 44.7 | 50.8
Multimodal | Gemini 2.0 Flash | 62.7 | 33.3 | 48.0
Multimodal | InstructBLIP | 48.7 | 44.7 | 46.7
Multimodal | GPT-4o | 55.7 | 32.3 | 44.0
Multimodal | Llama 3.1 70B | 51.3 | 36.3 | 43.8
Multimodal | LLaVA | 43.0 | 44.0 | 43.5
Multimodal | Video-LLaMA 2 | 42.0 | 38.3 | 40.2
Multimodal | GPT-4V | 55.3 | 34.7 | 40.0
Text Only | Human | 91.0 | 74.0 | 82.5
Text Only | o1* | 95.1 | 59.2 | 76.5
Text Only | o3-mini* | 97.1 | 44.9 | 71.5
Text Only | BIP-ALM (w/ LLaMA 2), Jin et al., '24 | 82.3 | 58.7 | 70.5
Text Only | Thought-tracing + CoT (w/ GPT-4o)*, Kim et al., '25 | 93.1 | 42.9 | 69.0
Text Only | SymbolicToM (w/ GPT-4), Sclar et al., '23 | 78.3 | 47.7 | 63.0
Text Only | Thought-tracing (w/ GPT-4o)*, Kim et al., '25 | 78.4 | 37.8 | 60.0
Text Only | GPT-4o* | 74.5 | 39.8 | 56.5
Text Only | SimToM (w/ GPT-4o), Wilf et al., '23 | 64.3 | 40.7 | 52.5
Text Only | GPT-J | 49.0 | 52.3 | 50.7
Text Only | LLaMA 2 | 56.3 | 44.3 | 50.3
Text Only | Gemini 1.5 Pro* | 70.6 | 28.6 | 50.0
Text Only | DeepSeek R1* | 73.5 | 24.5 | 49.0
Text Only | GPT-4 | 62.0 | 34.0 | 48.0
Text Only | Llama 3.3 70B* | 50.0 | 42.9 | 47.0
Text Only | Qwen 2.5 72B* | 51.0 | 32.7 | 41.5
Text Only | GPT-3.5 | 43.7 | 33.0 | 38.3
Video Only | Human | 73.3 | 64.6 | 68.9
Video Only | BIP-ALM (w/ LLaMA 2), Jin et al., '24 | 64.0 | 58.3 | 61.2
Video Only | InstructBLIP | 49.3 | 52.3 | 50.8
Video Only | Video-LLaMA 2 | 41.0 | 51.0 | 46.0
Video Only | GPT-4V | 45.7 | 46.3 | 46.0
Video Only | LLaVA | 39.0 | 45.3 | 42.2
* indicates that the method was tested on a 200-sample subset of MMToM-QA, as reported by Kim et al., '25.
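As a rough illustration of how a leaderboard like this can be scored, the sketch below computes per-category accuracy from labeled predictions. This is a hypothetical helper, not the official MMToM-QA evaluation script; in particular, it reports the overall score as the unweighted mean of the belief and goal accuracies, which is consistent with most rows above but may differ from the benchmark's exact weighting by question counts.

```python
# Hypothetical scoring helper in the style of MMToM-QA results
# (not the official evaluation script).
from collections import defaultdict

def score(results):
    """results: list of (category, correct_answer, predicted_answer)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, truth, pred in results:
        totals[category] += 1
        hits[category] += int(pred == truth)
    # Accuracy per category, as a percentage.
    acc = {c: 100.0 * hits[c] / totals[c] for c in totals}
    # Overall score: unweighted mean of the per-category accuracies.
    acc["all"] = sum(acc.values()) / len(acc)
    return acc

# Example: two belief questions (one correct) and two goal questions
# (both correct).
example = [
    ("belief", "a", "a"),
    ("belief", "b", "a"),
    ("goal", "c", "c"),
    ("goal", "d", "d"),
]
print(score(example))  # belief 50.0, goal 100.0, all 75.0
```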