Publications

(* denotes equal contribution; † denotes project lead)
The Era of Real-World Human Interaction: RL from User Conversations
Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston
arXiv Preprint / 🔍 Invited Talk at Google and Meta TBD Lab / ⭐️ Paper of the Week by Hugging Face, DAIR.AI, and TuringPost
We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. We introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. RLHI outperforms RLHF in user-level evaluations, enabling personalized, contextual, and continually improving AI assistants.
SPICE: Self-Play In Corpus Environments Improves Reasoning
Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston
arXiv Preprint / 📰 Featured in VentureBeat
SPICE is a reinforcement learning framework in which a single model improves itself by playing two roles: a Challenger that creates tasks grounded in corpora, and a Reasoner that solves them. This corpus grounding mitigates the hallucination and lack-of-diversity issues of ungrounded self-play, and SPICE significantly outperforms standard (ungrounded) self-play across reasoning benchmarks.
AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling
Zhining Zhang*, Chuanyang Jin*†, Mung Yao Jia*, Shunchi Zhang*, Tianmin Shu
NeurIPS 2025 (Spotlight)
AutoToM is an automated agent modeling method for scalable, robust, and interpretable mental inference. It achieves state-of-the-art results on five benchmarks, produces human-like confidence estimates, and supports embodied decision-making.
SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions
Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap
NeurIPS D&B 2025
SoMi-ToM is a Minecraft-based benchmark designed to evaluate Vision-Language Models' Theory of Mind capabilities through complex, multi-agent social interactions. Experiments reveal that current models significantly underperform compared to humans.
Do VLMs Have Internal World Models? Towards an Atomic Evaluation
Qiyue Gao*, Xinyu Pi*, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, Stone Tao, Mengyang Liu, Jiaxi Yang, Chao-Jung Lai, Chuanyang Jin, Jiannan Xiang, Benhao Huang, Zeming Chen, David Danks, Hao Su, Tianmin Shu, Ziqiao Ma, Lianhui Qin, Zhiting Hu
ACL 2025 Findings / ⭐️ Hugging Face Daily Papers Top-3
We introduce WM-ABench, a large-scale benchmark that evaluates whether Vision-Language Models possess internal world models by assessing their perception (visual, spatial, temporal, quantitative, and motion) and prediction (mechanistic simulation, transitive inference, and compositional inference).
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Qiushi Sun*, Kanzhi Cheng*, Zichen Ding*, Chuanyang Jin*, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu
ACL 2025 / ⭐️ Hugging Face Daily Papers Top-1
OS-Genesis is a pipeline that synthesizes GUI agent trajectories without manual annotation. It enables agents to actively explore web and mobile environments through stepwise interactions, then derives meaningful low- and high-level task instructions from the observed interactions and state changes.
Humanity’s Last Exam
Long Phan et al., with 1000+ contributors.
Technical Report / 📰 Featured in The New York Times, Reuters, ...
Humanity's Last Exam (HLE) is a multi-modal benchmark designed to be the final closed-ended academic benchmark with broad subject coverage. I contributed several challenging mathematics and cognitive science questions.
MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
Haojun Shi*, Suyu Ye*, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu
AAAI 2025 (Oral)
MuMA-ToM evaluates Theory of Mind reasoning in embodied multi-agent interactions, revealing that current multimodal LLMs significantly lag behind human performance. To bridge this gap, we propose LIMP, a method that combines language models with inverse multi-agent planning to achieve superior results.
MMToM-QA: Multimodal Theory of Mind Question Answering
Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua Tenenbaum, Tianmin Shu
ACL 2024 (Outstanding Paper Award) / 🔍 Invited Talk at University of Washington
Can machines understand people's minds from multimodal inputs? We introduce a comprehensive benchmark, MMToM-QA, and highlight key limitations in current multimodal LLMs. We then propose a novel method that combines the flexibility of LLMs with the robustness of Bayesian inverse planning, achieving promising results.
How Far Are We From AGI?
Tao Feng*, Chuanyang Jin*, Jingyu Liu*, Kunlun Zhu*, Haoqin Tu, Zirui Cheng, Guanyu Lin, Jiaxuan You
TMLR 2024 / ICLR 2024 AGI Workshop (Oral)
We explore the trajectory toward Artificial General Intelligence by comprehensively analyzing the requisite capabilities across internal, interface, and system dimensions. We also outline a roadmap for responsibly achieving AGI.
Neural Amortized Inference for Nested Multi-agent Reasoning
Kunal Jha, Tuan Anh Le, Chuanyang Jin, Yen-Ling Kuo, Joshua Tenenbaum, Tianmin Shu
AAAI 2024 / AAAI 2024 Summer Symposium (Oral)
Multi-agent interactions often rely on higher-order social inference, i.e., understanding how others make inferences about oneself. We introduce a neural amortized inference method that accelerates nested multi-agent reasoning within the I-POMDP framework, significantly reducing computational cost while maintaining high accuracy.
Beyond the Binary: Capturing Diverse Preferences With Reward Regularization
Vishakh Padmakumar*, Chuanyang Jin*, Hannah Rose Kirk*, He He
NeurIPS 2024 Workshop on Socially Responsible Language Modelling Research
Standard binary feedback in RLHF fails to capture the diversity of user opinions in subjective tasks. To address this, we propose a training method that incorporates estimated user disagreement, leading to reward models that better align with aggregate human preferences.
The Cultural Psychology of Large Language Models
Chuanyang Jin*, Songyang Zhang*, Tianmin Shu, Zhihan Cui
Technical Report
We apply cultural psychology scales to ChatGPT to assess its cognitive processing style and value judgments. We find that the model exhibits Eastern-style holistic processing traits while displaying mixed alignment in its cultural values.
Dynamics of RNA Localization to Nuclear Speckles are Connected to Splicing Efficiency
Jinjun Wu*, Yu Xiao*, Yunzheng Liu*, Li Wen, Chuanyang Jin, Shun Liu, Sneha Paul, Chuan He, Oded Regev, Jingyi Fei
Science Advances 10 (42), eadp7727
We demonstrate that RNA localization dynamics to nuclear speckles are tied to gene expression by influencing splicing efficiency. Specifically, nuclear speckles coordinate both co- and post-transcriptional splicing regulation by facilitating the removal of inefficiently excised introns in transcripts enriched within them.
OpenCompass: A Universal Evaluation Platform for Foundation Models
OpenCompass Team
Open-source Project / ⭐️ 6K GitHub Stars
OpenCompass is an LLM evaluation platform that supports a wide range of models on 100+ datasets.
Fast-DiT: Fast Diffusion Models with Transformers
Chuanyang Jin, Saining Xie
Open-source Project / ⭐️ 900 GitHub Stars
Fast-DiT improves the efficiency of Diffusion Transformers (DiTs) by incorporating features such as gradient checkpointing, mixed-precision training, and feature pre-extraction. It delivers a 95% speed increase and a 60% reduction in memory usage, and has been integrated into the official DiT implementation.
