MMToM-QA systematically evaluates the cognitive ability to understand people's minds on both multimodal data and different kinds of unimodal data. The questions fall into seven types that evaluate belief inference and goal inference in rich and diverse situations.
For a detailed explanation of the evaluation metric and analysis of the results, please refer to our blog and paper. The instructions for using or submitting to the MMToM-QA benchmark are available here.
Modality | Method | Belief | Goal | All
---|---|---|---|---
Multimodal | Human | 97.5 | 88.5 | 93
Multimodal | InstructBLIP | 48.7 | 44.7 | 46.7
Multimodal | Video-LLaMA 2 | 42.0 | 38.3 | 40.2
Multimodal | LLaVA | 43.0 | 44.0 | 43.5
Multimodal | GPT-4V | 55.3 | 34.7 | 40.0
Multimodal | BIP-ALM w/ GPT-J | 81.7 | 69.0 | 75.3
Multimodal | BIP-ALM w/ LLaMA 2 | 80.3 | 73.3 | 76.7
Text Only | Human | 91.0 | 74.0 | 82.5
Text Only | GPT-4 | 62.0 | 34.0 | 48.0
Text Only | GPT-3.5 | 43.7 | 33.0 | 38.3
Text Only | GPT-J | 49.0 | 52.3 | 50.7
Text Only | LLaMA 2 | 56.3 | 44.3 | 50.3
Text Only | SimToM w/ GPT-4 | 64.3 | 40.7 | 52.5
Text Only | SymbolicToM w/ GPT-4 | 78.3 | 47.7 | 63.0
Text Only | BIP-ALM w/ GPT-J | 81.7 | 61.7 | 71.7
Text Only | BIP-ALM w/ LLaMA 2 | 82.3 | 58.7 | 70.5
Video Only | Human | 73.3 | 64.6 | 68.9
Video Only | InstructBLIP | 49.3 | 52.3 | 50.8
Video Only | Video-LLaMA 2 | 41.0 | 51.0 | 46.0
Video Only | LLaVA | 39.0 | 45.3 | 42.2
Video Only | GPT-4V | 45.7 | 46.3 | 46.0
Video Only | BIP-ALM w/ GPT-J | 64.0 | 55.3 | 59.7
Video Only | BIP-ALM w/ LLaMA 2 | 64.0 | 58.3 | 61.2
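
As a rough illustration of how per-category and overall scores like those in the table could be tallied, the sketch below assumes each score is the percentage of questions answered correctly. That assumption, the `(category, is_correct)` record layout, and the function name `accuracy_by_category` are all hypothetical choices made for this example; this is not the benchmark's official evaluation script, so please consult the paper and submission instructions for the actual metric.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return percent-correct per category, plus an overall "all" score.

    `records` is an iterable of (category, is_correct) pairs. This layout
    is a hypothetical one for illustration, not the benchmark's file format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)

    # Percentage of correct answers within each category.
    scores = {c: 100.0 * correct[c] / total[c] for c in total}

    # Overall score is pooled over all questions, not a mean of categories.
    n = sum(total.values())
    scores["all"] = 100.0 * sum(correct.values()) / n if n else 0.0
    return scores


# Toy input: two belief questions (one correct), two goal questions (both correct).
example = [("belief", True), ("belief", False), ("goal", True), ("goal", True)]
print(accuracy_by_category(example))
```

Pooling over all questions matches the plain average of the category scores only when every category contains the same number of questions.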