MMToM-QA: Multimodal Theory of Mind Question Answering

Introduction

We propose Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM), a novel method to engineer multimodal Theory of Mind.

Bayesian Inverse Planning Accelerated by Language Models (BIP-ALM)

BIP-ALM proceeds in three steps:
  • Extracts symbolic representations from video (via the Visual Perception Module) and text (using GPT-4);
  • Aligns and fuses these representations to form a unified representation of the event and the physical scene;
  • Conducts inverse inference about the agent's goal and belief using finetuned language models.
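
The inverse-inference step can be illustrated with a toy example. The sketch below is not the BIP-ALM implementation: it assumes a small, hand-written set of goal and belief hypotheses, a uniform prior, and a hypothetical `action_likelihood` function that stands in for the finetuned language model. All names, objects, and probabilities are invented for illustration.

```python
import math
from itertools import product

# Hypothetical hypothesis space (assumption, not taken from the source).
GOALS = ["apple", "cup"]
BELIEFS = {
    # Each belief maps an object to where the agent thinks it is.
    "apple_in_fridge": {"apple": "fridge", "cup": "table"},
    "apple_in_cabinet": {"apple": "cabinet", "cup": "table"},
}


def action_likelihood(action, goal, belief):
    """Toy stand-in for a language model scoring P(action | goal, belief).

    Assumed rule: the agent most likely walks to where it believes
    its goal object is located.
    """
    believed_location = belief[goal]
    return 0.8 if action == f"walk_to_{believed_location}" else 0.2


def infer(actions, goals, beliefs):
    """Return a normalized posterior over (goal, belief) hypotheses.

    Uses a uniform prior, so the posterior is proportional to the
    product of per-action likelihoods under each hypothesis.
    """
    scores = {}
    for goal, (name, belief) in product(goals, beliefs.items()):
        log_p = sum(math.log(action_likelihood(a, goal, belief)) for a in actions)
        scores[(goal, name)] = math.exp(log_p)
    total = sum(scores.values())
    return {hypothesis: score / total for hypothesis, score in scores.items()}


observed_actions = ["walk_to_fridge"]
posterior = infer(observed_actions, GOALS, BELIEFS)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

Under these assumptions, an agent observed walking to the fridge is best explained by the hypothesis that it wants the apple and believes the apple is in the fridge. In this sketch, replacing the toy rule with a learned model would change only `action_likelihood`; the enumeration and normalization stay the same.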