MMToM-QA is the first multimodal benchmark for evaluating machine Theory of Mind (ToM), the ability to understand other people's minds. MMToM-QA consists of 600 two-choice questions. Each question is paired with a video clip of a person's full activity (as RGB-D frames), as well as a text description of the scene and the actions the person takes in that clip. The questions fall into seven types that evaluate belief inference and goal inference in rich and diverse situations: each of the three belief types has 100 questions (300 belief questions in total), and each of the four goal types has 75 questions (300 goal questions in total). The questions are paired with 134 videos of a person looking for everyday objects in household environments.
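The composition above can be sketched as a simple data record. This is an illustrative sketch only; the field names are hypothetical and do not reflect the benchmark's actual file schema.

```python
from dataclasses import dataclass

# Hypothetical representation of one MMToM-QA item; field names are
# illustrative, not the benchmark's actual schema.
@dataclass
class MMToMQuestion:
    question_type: str        # one of 7 types (3 belief, 4 goal)
    scene_description: str    # text description of the scene
    actions_description: str  # text description of the person's actions
    rgbd_frames: list         # paths to the clip's RGB-D frames
    question: str
    choices: tuple            # exactly two answer options
    answer: str

# The benchmark's composition: 3 belief types x 100 + 4 goal types x 75.
belief_questions = 3 * 100
goal_questions = 4 * 75
total_questions = belief_questions + goal_questions
print(total_questions)  # 600
```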
The instructions for using the MMToM-QA benchmark are available here.
We propose Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM), a novel method for engineering multimodal Theory of Mind.
We evaluate large language models (e.g., GPT-4, LLaMA 2) on the text-only version of MMToM-QA.
We assess large multimodal models (e.g., GPT-4V, InstructBLIP, LLaVA) on both multimodal and video-only versions of MMToM-QA.
Furthermore, we measure human performance on these tasks.
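Since every question offers exactly two choices, model and human performance is naturally reported as accuracy against a 50% chance baseline. A minimal sketch of that scoring, with made-up predictions (the `predictions`/`answers` lists here are hypothetical, not benchmark data):

```python
def accuracy(predictions, answers):
    """Fraction of two-choice questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# With two choices per question, random guessing is expected to score 0.5,
# so any reported accuracy should be read against that chance baseline.
predictions = ["a", "b", "b", "a"]
answers = ["a", "b", "a", "a"]
print(accuracy(predictions, answers))  # 0.75
```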
A summary of our findings is presented in the paper.
Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua Tenenbaum, Tianmin Shu.
MMToM-QA: Multimodal Theory of Mind Question Answering.
arXiv Preprint. A short version was presented at the NeurIPS'23 FMDM Workshop.