MMToM-QA systematically evaluates the cognitive ability to understand people's minds on both multimodal data and different kinds of unimodal data. Its questions fall into seven types that evaluate belief inference and goal inference in rich and diverse situations.

For a detailed explanation of the evaluation metric and analysis of the results, please refer to our blog and paper. The instructions for using or submitting to the MMToM-QA benchmark are available here.

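| Modality | Method | Belief | Goal | All |
|---|---|---|---|---|
| Multimodal | Human | 97.5 | 88.5 | 93.0 |
| Multimodal | InstructBLIP | 48.7 | 44.7 | 46.7 |
| Multimodal | Video-LLaMA 2 | 42.0 | 38.3 | 40.2 |
| Multimodal | LLaVA | 43.0 | 44.0 | 43.5 |
| Multimodal | GPT-4V | 55.3 | 34.7 | 40.0 |
| Multimodal | BIP-ALM w/ GPT-J | 81.7 | 69.0 | 75.3 |
| Multimodal | BIP-ALM w/ LLaMA 2 | 80.3 | 73.3 | 76.7 |
| Text Only | Human | 91.0 | 74.0 | 82.5 |
| Text Only | GPT-4 | 62.0 | 34.0 | 48.0 |
| Text Only | GPT-3.5 | 43.7 | 33.0 | 38.3 |
| Text Only | GPT-J | 49.0 | 52.3 | 50.7 |
| Text Only | LLaMA 2 | 56.3 | 44.3 | 50.3 |
| Text Only | SimToM w/ GPT-4 | 64.3 | 40.7 | 52.5 |
| Text Only | SymbolicToM w/ GPT-4 | 78.3 | 47.7 | 63.0 |
| Text Only | BIP-ALM w/ GPT-J | 81.7 | 61.7 | 71.7 |
| Text Only | BIP-ALM w/ LLaMA 2 | 82.3 | 58.7 | 70.5 |
| Video Only | Human | 73.3 | 64.6 | 68.9 |
| Video Only | InstructBLIP | 49.3 | 52.3 | 50.8 |
| Video Only | Video-LLaMA 2 | 41.0 | 51.0 | 46.0 |
| Video Only | LLaVA | 39.0 | 45.3 | 42.2 |
| Video Only | GPT-4V | 45.7 | 46.3 | 46.0 |
| Video Only | BIP-ALM w/ GPT-J | 64.0 | 55.3 | 59.7 |
| Video Only | BIP-ALM w/ LLaMA 2 | 64.0 | 58.3 | 61.2 |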
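
As a rough illustration of how Belief, Goal, and All scores like those above could be produced, here is a minimal sketch. It assumes the scores are percentage accuracies over multiple-choice questions, and that each evaluated question is available as a `(category, predicted, answer)` tuple. That record format and the function name `accuracy_by_category` are hypothetical and are not taken from the benchmark's own tooling.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return percentage accuracy per category plus an overall "all" entry.

    `records` is an iterable of (category, predicted, answer) tuples.
    This format is an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, predicted, answer in records:
        # Count every question once for its category and once overall.
        totals[category] += 1
        totals["all"] += 1
        if predicted == answer:
            hits[category] += 1
            hits["all"] += 1
    # Convert counts to percentages, rounded to one decimal place.
    return {c: round(100.0 * hits[c] / totals[c], 1) for c in totals}


# Toy data: two belief questions and two goal questions.
example_records = [
    ("belief", "a", "a"),
    ("belief", "b", "a"),
    ("goal", "a", "a"),
    ("goal", "b", "b"),
]
example_scores = accuracy_by_category(example_records)
print(example_scores)
```

With this toy input, the overall score is computed over all questions pooled together rather than by averaging the two category scores, so the two can differ when the categories have different sizes.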