MMToM-QA systematically evaluates the cognitive ability to understand people's minds (Theory of Mind) on multimodal data as well as on each unimodal data type alone. The questions are grouped into seven types, covering belief inference and goal inference in rich and diverse situations.
For a detailed explanation of the evaluation metric and analysis of the results, please refer to our blog and paper. The instructions for using or submitting to the MMToM-QA benchmark are available here.
Modality | Method | Belief | Goal | All
---|---|---|---|---
Multimodal | Human | 97.5 | 88.5 | 93.0
Multimodal | AutoToM + Model Spec. (w/ GPT-4o), Zhang et al., '25 | 94.0 | 65.7 | 79.8
Multimodal | BIP-ALM (w/ LLaMA 2), Jin et al., '24 | 80.3 | 73.3 | 76.7
Multimodal | AutoToM (w/ GPT-4o), Zhang et al., '25 | 88.7 | 62.3 | 75.5
Multimodal | o3-mini | 88.7 | 40.7 | 64.7
Multimodal | Gemini 2.0 Flash Thinking | 73.3 | 34.7 | 54.0
Multimodal | SimToM (w/ GPT-4o), Wilf et al., '23 | 75.7 | 26.3 | 51.0
Multimodal | Gemini 2.0 Pro | 57.0 | 44.7 | 50.8
Multimodal | Gemini 2.0 Flash | 62.7 | 33.3 | 48.0
Multimodal | InstructBLIP | 48.7 | 44.7 | 46.7
Multimodal | GPT-4o | 55.7 | 32.3 | 44.0
Multimodal | Llama 3.1 70B | 51.3 | 36.3 | 43.8
Multimodal | LLaVA | 43.0 | 44.0 | 43.5
Multimodal | Video-LLaMA 2 | 42.0 | 38.3 | 40.2
Multimodal | GPT-4V | 55.3 | 34.7 | 40.0
Text Only | Human | 91.0 | 74.0 | 82.5
Text Only | o1* | 95.1 | 59.2 | 76.5
Text Only | o3-mini* | 97.1 | 44.9 | 71.5
Text Only | BIP-ALM (w/ LLaMA 2), Jin et al., '24 | 82.3 | 58.7 | 70.5
Text Only | Thought-tracing + CoT (w/ GPT-4o)*, Kim et al., '25 | 93.1 | 42.9 | 69.0
Text Only | SymbolicToM (w/ GPT-4), Sclar et al., '23 | 78.3 | 47.7 | 63.0
Text Only | Thought-tracing (w/ GPT-4o)*, Kim et al., '25 | 78.4 | 37.8 | 60.0
Text Only | GPT-4o* | 74.5 | 39.8 | 56.5
Text Only | SimToM (w/ GPT-4o), Wilf et al., '23 | 64.3 | 40.7 | 52.5
Text Only | GPT-J | 49.0 | 52.3 | 50.7
Text Only | LLaMA 2 | 56.3 | 44.3 | 50.3
Text Only | Gemini 1.5 Pro* | 70.6 | 28.6 | 50.0
Text Only | DeepSeek R1* | 73.5 | 24.5 | 49.0
Text Only | GPT-4 | 62.0 | 34.0 | 48.0
Text Only | Llama 3.3 70B* | 50.0 | 42.9 | 47.0
Text Only | Qwen 2.5 72B* | 51.0 | 32.7 | 41.5
Text Only | GPT-3.5 | 43.7 | 33.0 | 38.3
Video Only | Human | 73.3 | 64.6 | 68.9
Video Only | BIP-ALM (w/ LLaMA 2), Jin et al., '24 | 64.0 | 58.3 | 61.2
Video Only | InstructBLIP | 49.3 | 52.3 | 50.8
Video Only | Video-LLaMA 2 | 41.0 | 51.0 | 46.0
Video Only | GPT-4V | 45.7 | 46.3 | 46.0
Video Only | LLaVA | 39.0 | 45.3 | 42.2
* indicates that the method was tested on a 200-sample subset of MMToM-QA, as reported by Kim et al., '25.
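As a rough illustration of how a leaderboard like this can be scored, the sketch below computes per-category accuracy from labeled predictions. This is a hypothetical helper, not the official MMToM-QA evaluation script; in particular, it reports the overall score as the unweighted mean of the belief and goal accuracies, which is consistent with most rows above but may differ from the benchmark's exact weighting by question counts.

```python
# Hypothetical scoring helper in the style of MMToM-QA results
# (not the official evaluation script).
from collections import defaultdict

def score(results):
    """results: list of (category, correct_answer, predicted_answer)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, truth, pred in results:
        totals[category] += 1
        hits[category] += int(pred == truth)
    # Accuracy per category, as a percentage.
    acc = {c: 100.0 * hits[c] / totals[c] for c in totals}
    # Overall score: unweighted mean of the per-category accuracies.
    acc["all"] = sum(acc.values()) / len(acc)
    return acc

# Example: two belief questions (one correct) and two goal questions
# (both correct).
example = [
    ("belief", "a", "a"),
    ("belief", "b", "a"),
    ("goal", "c", "c"),
    ("goal", "d", "d"),
]
print(score(example))  # belief 50.0, goal 100.0, all 75.0
```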