Publications

(* denotes equal contribution; † denotes project lead)

The Era of Real-World Human Interaction: RL from User Conversations

Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston

arXiv Preprint / 🔍 Invited Talk at Google DeepMind, Meta TBD Lab / ⭐️ Paper of the Week by Hugging Face, DAIR.AI, TuringPost

We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. We introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations, leveraging organic replies and long-term history as learning signals. Trained on WildChat, RLHI outperforms RLHF in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks.

paper / tweet / slides

SPICE: Self-Play In Corpus Environments Improves Reasoning

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston

arXiv Preprint / 📰 Featured in VentureBeat

SPICE is a reinforcement learning framework where a single model improves itself by playing two roles: a Challenger that creates tasks based on corpora, and a Reasoner that solves them. By grounding this self-play in corpora, SPICE addresses hallucination and lack of diversity issues, significantly outperforming standard (ungrounded) self-play across reasoning benchmarks.

paper / tweet / VentureBeat news

A Benchmark of Expert-Level Academic Questions to Assess AI Capabilities

Center for AI Safety, Scale AI, HLE Contributors Consortium

Nature 649 (8099), 1139-1146

Humanity's Last Exam (HLE) is a multi-modal benchmark designed to be the final closed-ended academic benchmark with broad subject coverage. I contributed several challenging mathematics and cognitive science questions.

paper / project / benchmark

AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

Zhining Zhang*, Chuanyang Jin*†, Mung Yao Jia*, Shunchi Zhang*, Tianmin Shu

NeurIPS 2025 (Spotlight)

AutoToM is an automated agent modeling method for scalable, robust, and interpretable mental inference. It achieves SOTA on five benchmarks, produces human-like confidence estimates, and supports embodied decision-making.

paper / project / code

SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap

NeurIPS D&B 2025

SoMi-ToM is a Minecraft-based benchmark designed to evaluate Vision-Language Models' Theory of Mind capabilities through complex, multi-agent social interactions. Experiments reveal that current models significantly underperform compared to humans.

paper / code / benchmark

Do VLMs have internal World Models? Towards an Atomic Evaluation

Qiyue Gao*, Xinyu Pi*, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, Stone Tao, Mengyang Liu, Jiaxi Yang, Chao-Jung Lai, Chuanyang Jin, Jiannan Xiang, Benhao Huang, Zeming Chen, David Danks, Hao Su, Tianmin Shu, Ziqiao Ma, Lianhui Qin, Zhiting Hu

ACL 2025 Findings / ⭐️ Hugging Face Daily Papers Top-3

We introduce WM-ABench, a large-scale benchmark to evaluate whether Vision-Language Models possess internal world models by assessing their perception (visual, spatial, temporal, quantitative, and motion) and prediction (mechanistic simulation, transitive inference, compositional inference).

paper / project / benchmark

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Qiushi Sun*, Kanzhi Cheng*, Zichen Ding*, Chuanyang Jin*, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu

ACL 2025 / ⭐️ Hugging Face Daily Papers Top-1

OS-Genesis is a data pipeline that synthesizes GUI agent trajectories without manual annotation. It lets agents actively explore web and mobile environments through stepwise interactions, then derives meaningful low- and high-level task instructions from the observed interactions and state changes.

paper / project / code / model / data / slides

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

Haojun Shi*, Suyu Ye*, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu

AAAI 2025 (Oral) / ⭐️ Featured as a CVPR 2026 Challenge

MuMA-ToM evaluates Theory of Mind reasoning in embodied multi-agent interactions, revealing that current multimodal LLMs significantly lag behind human performance. To bridge this gap, we propose LIMP, a method that combines language models with inverse multi-agent planning to achieve superior results.

paper / project / code / benchmark / leaderboard / slides

MMToM-QA: Multimodal Theory of Mind Question Answering

Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua Tenenbaum, Tianmin Shu

ACL 2024 (Outstanding Paper Award) / 🔍 Invited Talk at University of Washington / 📰 Featured in Futurity, Synced, ...

Can machines understand people's minds from multimodal inputs? We introduce a comprehensive benchmark, MMToM-QA, and highlight key limitations in current multimodal LLMs. We then propose a novel method that combines the flexibility of LLMs with the robustness of Bayesian inverse planning, achieving promising results.

paper / project / code / benchmark / leaderboard / slides / Futurity news / Synced news / JHU news

How Far Are We From AGI?

Tao Feng*, Chuanyang Jin*, Jingyu Liu*, Kunlun Zhu*, Haoqin Tu, Zirui Cheng, Guanyu Lin, Jiaxuan You

TMLR 2024 / ICLR 2024 AGI Workshop (Oral)

We explore the trajectory toward Artificial General Intelligence by comprehensively analyzing the requisite capabilities across internal, interface, and system dimensions. We also outline a roadmap for responsibly achieving AGI.

paper / project

Neural Amortized Inference for Nested Multi-agent Reasoning

Kunal Jha, Tuan Anh Le, Chuanyang Jin, Yen-Ling Kuo, Joshua Tenenbaum, Tianmin Shu

AAAI 2024 / AAAI 2024 Summer Symposium (Oral)

Multi-agent interactions often rely on higher-order social inference, i.e., understanding how others infer oneself. We introduce a neural amortized inference method to accelerate computationally expensive nested multi-agent reasoning within the I-POMDP framework, significantly reducing computational costs while maintaining high accuracy.

paper / project / code / slides

Beyond the Binary: Capturing Diverse Preferences With Reward Regularization

Vishakh Padmakumar*, Chuanyang Jin*, Hannah Rose Kirk*, He He

NeurIPS 2024 Workshop on Socially Responsible Language Modelling Research

Standard binary feedback in RLHF fails to capture the diversity of user opinions in subjective tasks. To address this, we propose a training method that incorporates estimated user disagreement, leading to reward models that better align with aggregate human preferences.

paper / code

The Cultural Psychology of Large Language Models

Chuanyang Jin*, Songyang Zhang*, Tianmin Shu, Zhihan Cui

Technical Report

We apply cultural psychology scales to ChatGPT to assess its cognitive processing style and value judgments. We find that the model exhibits Eastern-style holistic processing traits while displaying mixed alignment in its cultural values.

paper

Dynamics of RNA Localization to Nuclear Speckles are Connected to Splicing Efficiency

Jinjun Wu*, Yu Xiao*, Yunzheng Liu*, Li Wen, Chuanyang Jin, Shun Liu, Sneha Paul, Chuan He, Oded Regev, Jingyi Fei

Science Advances 10 (42), eadp7727

We demonstrate that RNA localization dynamics to nuclear speckles are tied to gene expression by influencing splicing efficiency. Specifically, nuclear speckles coordinate both co- and post-transcriptional splicing regulation by facilitating the removal of inefficiently excised introns in transcripts enriched within them.

paper

OpenCompass: A Universal Evaluation Platform for Foundation Models

OpenCompass Team

Open-source Project / ⭐️ 6K GitHub Stars

OpenCompass is an LLM evaluation platform that supports a wide range of models across 100+ datasets.

code

Fast-DiT: Fast Diffusion Models with Transformers

Chuanyang Jin, Saining Xie

Open-source Project / ⭐️ 900 GitHub Stars

Fast-DiT improves the efficiency of Diffusion Transformers (DiTs) by incorporating features such as gradient checkpointing, mixed-precision training, and feature pre-extraction. It delivers a 95% speed increase and a 60% reduction in memory usage, and has been integrated into the official DiT implementation.

code
