MMToM-QA systematically evaluates the cognitive ability to understand people's minds, both on multimodal data and on each unimodal data type alone. Questions span seven categories, covering belief inference and goal inference in rich, diverse situations.
For details, see our project page and paper. Submission instructions live in the GitHub repo.
| # | Method | Belief (%) | Goal (%) | All (%) |
|---|---|---|---|---|
| — | Human | 97.5 | 88.5 | 93.0 |
| 1 | AutoToM + Model Spec. (w/ GPT-4o) (SOTA) Zhang et al., '25 | 94.0 | 65.7 | 79.8 |
| 2 | BIP-ALM (w/ LLaMA 2) Jin et al., '24 | 80.3 | 73.3 | 76.7 |
| 3 | AutoToM (w/ GPT-4o) Zhang et al., '25 | 88.7 | 62.3 | 75.5 |
| 4 | o3-mini | 88.7 | 40.7 | 64.7 |
| 5 | Gemini 2.0 Flash Thinking | 73.3 | 34.7 | 54.0 |
| 6 | SimToM (w/ GPT-4o) Wilf et al., '23 | 75.7 | 26.3 | 51.0 |
| 7 | Gemini 2.0 Pro | 57.0 | 44.7 | 50.8 |
| 8 | Gemini 2.0 Flash | 62.7 | 33.3 | 48.0 |
| 9 | InstructBLIP | 48.7 | 44.7 | 46.7 |
| 10 | GPT-4o | 55.7 | 32.3 | 44.0 |
| 11 | Llama 3.1 70B | 51.3 | 36.3 | 43.8 |
| 12 | LLaVA | 43.0 | 44.0 | 43.5 |
| 13 | Video-LLaMA 2 | 42.0 | 38.3 | 40.2 |
| 14 | GPT-4V | 55.3 | 34.7 | 40.0 |
* indicates that the method was tested on a 200-sample subset of MMToM-QA, as reported by Kim et al., '25.
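The All column appears to be the unweighted mean of the Belief and Goal accuracies (an assumption inferred from the rows, e.g. Human: (97.5 + 88.5) / 2 = 93.0); a minimal sketch for checking a row, with the example scores taken from the table above:

```python
def overall_score(belief: float, goal: float) -> float:
    """Overall accuracy as the unweighted mean of the Belief and Goal
    accuracies (an assumption; it matches most rows of the table)."""
    return (belief + goal) / 2

# Example rows from the leaderboard: (Belief, Goal) accuracies in %.
rows = {
    "Human": (97.5, 88.5),
    "AutoToM + Model Spec. (w/ GPT-4o)": (94.0, 65.7),
    "o3-mini": (88.7, 40.7),
}

for method, (belief, goal) in rows.items():
    print(f"{method}: {overall_score(belief, goal):.1f}")
```

Note that an unweighted mean implies the benchmark has equal numbers of belief and goal questions; if the split is uneven, the All column would instead be a question-weighted average.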