Computation and Language
☆ SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Recent advances in reinforcement learning have shown that language models can
develop sophisticated reasoning through training on tasks with verifiable
rewards, but these approaches depend on human-curated problem-answer pairs and
domain-specific reward engineering. We introduce SPIRAL, a self-play framework
where models learn by playing multi-turn, zero-sum games against continuously
improving versions of themselves, eliminating the need for human supervision.
Through self-play, SPIRAL generates an infinite curriculum of progressively
challenging problems as models must constantly adapt to stronger opponents. To
enable this self-play training at scale, we implement a fully online,
multi-turn, multi-agent reinforcement learning system for LLMs and propose
role-conditioned advantage estimation (RAE) to stabilize multi-agent training.
Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that
transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6%
improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000
expert game trajectories. Analysis reveals that this transfer occurs through
three cognitive patterns: systematic decomposition, expected value calculation,
and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple
Negotiation) further enhances performance as each game develops distinct
reasoning strengths. Applying SPIRAL to a strong reasoning model
(DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These
results demonstrate that zero-sum games naturally develop transferable
reasoning capabilities, highlighting a promising direction for autonomous
reasoning development.
comment: Work in Progress
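The abstract does not spell out how role-conditioned advantage estimation (RAE) is computed; one plausible minimal form keeps a separate running return baseline per game role and centers each episode's return against it. The sketch below illustrates that idea under stated assumptions; the class name and EMA update are illustrative, not the paper's implementation.

```python
# Hedged sketch of role-conditioned advantage estimation (RAE): one running
# baseline per role so each player's advantages are centered for its own role.
# The EMA formulation is an assumption, not SPIRAL's exact method.
from collections import defaultdict

class RoleConditionedAdvantage:
    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baselines = defaultdict(float)  # one running return baseline per role

    def update(self, role: str, episode_return: float) -> float:
        """Return the advantage for this episode, then update the role's baseline."""
        advantage = episode_return - self.baselines[role]
        self.baselines[role] = (
            self.decay * self.baselines[role] + (1 - self.decay) * episode_return
        )
        return advantage

rae = RoleConditionedAdvantage()
print(rae.update("player_0", +1.0))  # first player wins this self-play game
print(rae.update("player_1", -1.0))  # zero-sum: the other role loses
```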
☆ Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models
Identifying parallel passages in biblical Hebrew is foundational in biblical
scholarship for uncovering intertextual relationships. Traditional methods rely
on manual comparison, which is labor-intensive and prone to human error. This
study evaluates the potential of pre-trained transformer-based language models,
including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in
the Hebrew Bible. Focusing on known parallels between the books of Samuel/Kings
and Chronicles, I assessed each model's capability to generate word embeddings
that distinguish parallel from non-parallel passages. Using cosine similarity
and Wasserstein Distance measures, I found that E5 and AlephBERT show
significant promise, with E5 excelling in parallel detection and AlephBERT
demonstrating stronger non-parallel differentiation. These findings indicate
that pre-trained models can enhance the efficiency and accuracy of detecting
intertextual parallels in ancient texts, suggesting broader applications for
ancient language studies.
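As a concrete, hedged illustration of the embedding-similarity step described above, the sketch below scores a candidate verse pair with a pre-trained multilingual encoder and cosine similarity; the model name, query prefix, threshold, and placeholder verses are assumptions rather than the study's exact setup.

```python
# Illustrative sketch (not the paper's exact pipeline) of scoring a candidate
# parallel with a pre-trained multilingual encoder such as E5.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

verse_a = "..."  # a passage from Samuel/Kings (placeholder)
verse_b = "..."  # the candidate parallel from Chronicles (placeholder)

# E5 models expect a "query:" prefix; embeddings are unit-normalized here.
emb = model.encode([f"query: {verse_a}", f"query: {verse_b}"], normalize_embeddings=True)
cosine_sim = float(np.dot(emb[0], emb[1]))

print(f"cosine similarity: {cosine_sim:.3f}")
# Simple decision rule: pairs above a tuned threshold are flagged as parallels.
is_parallel = cosine_sim > 0.85  # threshold is an assumption, tune on known pairs
```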
☆ On the Predictive Power of Representation Dispersion in Language Models
We show that a language model's ability to predict text is tightly linked to
the breadth of its embedding space: models that spread their contextual
representations more widely tend to achieve lower perplexity. Concretely, we
find that representation dispersion - the average pairwise cosine distance
among hidden vectors - strongly and negatively correlates with perplexity
across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia,
news, scientific abstracts). Beyond illustrating this link, we show how
dispersion can be leveraged for a range of practical tasks without requiring
labeled data. First, measuring dispersion on unlabeled text allows us to
predict downstream accuracy in new domains, offering a data-efficient tool for
model selection. Next, we find that identifying layers with higher dispersion
pinpoints the best representations for retrieval-based methods such as kNN-LM,
bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple
push-away objective into training, which increases dispersion in both
single-domain and cross-domain scenarios and directly improves perplexity in
each.
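The dispersion statistic itself is simple to compute; the sketch below shows one direct implementation of average pairwise cosine distance over a matrix of hidden vectors, using random data in place of actual contextual representations.

```python
# Minimal sketch of representation dispersion: the average pairwise cosine
# distance among hidden-state vectors taken from one layer.
import numpy as np

def representation_dispersion(hidden: np.ndarray) -> float:
    """hidden: (n_tokens, d) matrix of hidden vectors from one layer."""
    normed = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    cos_sim = normed @ normed.T                 # pairwise cosine similarities
    n = hidden.shape[0]
    off_diag = cos_sim[~np.eye(n, dtype=bool)]  # drop self-similarities
    return float(np.mean(1.0 - off_diag))       # cosine distance = 1 - similarity

rng = np.random.default_rng(0)
vectors = rng.normal(size=(128, 768))           # stand-in for hidden states
print(f"dispersion: {representation_dispersion(vectors):.3f}")
```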
☆ MotionGPT3: Human Motion as a Second Modality
Though recent advances in multimodal models have demonstrated strong
capabilities and opportunities in unified understanding and generation, the
development of unified motion-language models remains underexplored. To enable
such models with high-fidelity human motion, two core challenges must be
addressed. The first is the reconstruction gap between the continuous motion
modality and the discrete representations used by autoregressive models; the
second is the degradation of language intelligence during unified training.
Inspired by the mixture of experts, we propose MotionGPT3, a bimodal
motion-language model that treats human motion as a second modality, decoupling
motion modeling via separate model parameters and enabling both effective
cross-modal interaction and efficient multimodal scaling training. To preserve
language intelligence, the text branch retains the original structure and
parameters of the pretrained language model, while a new motion branch is
integrated via a shared attention mechanism, enabling bidirectional information
flow between two modalities. We first employ a motion Variational Autoencoder
(VAE) to encode raw human motion into latent representations. Based on this
continuous latent space, the motion branch predicts motion latents directly
from intermediate hidden states using a diffusion head, bypassing discrete
tokenization. Extensive experiments show that our approach achieves competitive
performance on both motion understanding and generation tasks while preserving
strong language capabilities, establishing a unified bimodal motion diffusion
framework that operates in an autoregressive manner.
comment: 21 pages, 8 figures
☆ STACK: Adversarial Attacks on LLM Safeguard Pipelines
Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Frontier AI developers are relying on layers of safeguards to protect against
catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus
model using one such defense pipeline, and other frontier developers including
Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the
security of such pipelines is unclear, with limited prior work evaluating or
attacking these pipelines. We address this gap by developing and red-teaming an
open-source defense pipeline. First, we find that a novel few-shot-prompted
input and output classifier outperforms the state-of-the-art open-weight safeguard
model ShieldGemma across three attacks and two datasets, reducing the attack
success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second,
we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on
ClearHarm in a black-box attack against the few-shot-prompted classifier
pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33%
ASR, providing initial evidence that it is feasible to design attacks with no
access to the target pipeline. We conclude by suggesting specific mitigations
that developers could use to thwart staged attacks.
☆ Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models
We introduce logit-gap steering, a fast jailbreak framework that casts the
refusal-affirmation gap of RLHF-aligned language models as a single pass over
the vocabulary. A forward-computable score blends gap reduction with
lightweight proxies for KL penalty and reward shift, allowing a "sort-sum-stop"
sweep to complete in under a second and return a short suffix--two orders of
magnitude fewer model calls than beam or gradient attacks. The same suffix
generalises to unseen prompts and scales from 0.5B to 70B checkpoints,
lifting one-shot attack success from baseline levels to 80-100% while
preserving topical coherence. Beyond efficiency, these suffixes expose
sentence-boundary reward cliffs and other alignment artefacts, offering a
lightweight probe into how safety tuning reshapes internal representations.
☆ Ella: Embodied Social Agents with Lifelong Memory
We introduce Ella, an embodied social agent capable of lifelong learning
within a community in a 3D open world, where agents accumulate experiences and
acquire knowledge through everyday visual observations and social interactions.
At the core of Ella's capabilities is a structured, long-term multimodal memory
system that stores, updates, and retrieves information effectively. It consists
of a name-centric semantic memory for organizing acquired knowledge and a
spatiotemporal episodic memory for capturing multimodal experiences. By
integrating this lifelong memory system with foundation models, Ella retrieves
relevant information for decision-making, plans daily activities, builds social
relationships, and evolves autonomously while coexisting with other intelligent
beings in the open world. We conduct capability-oriented evaluations in a
dynamic 3D open world where 15 agents engage in social activities for days and
are assessed with a suite of unseen controlled evaluations. Experimental
results show that Ella can influence, lead, and cooperate with other agents
well to achieve goals, showcasing its ability to learn effectively through
observation and social interaction. Our findings highlight the transformative
potential of combining structured memory systems with foundation models for
advancing embodied intelligence. More videos can be found at
https://umass-embodied-agi.github.io/Ella/.
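As a hedged illustration of the two memory stores named in the abstract, the sketch below pairs a name-centric semantic memory with a spatiotemporal episodic memory; the field names and simple string-matching retrieval are assumptions for illustration, not Ella's actual system.

```python
# Hedged sketch: name-centric semantic memory (facts keyed by entity name) plus
# spatiotemporal episodic memory (events keyed by time and place).
from dataclasses import dataclass, field

@dataclass
class Episode:
    time: float          # simulation timestamp
    place: str           # where the observation happened
    description: str     # multimodal content summarized as text here

@dataclass
class LifelongMemory:
    semantic: dict[str, list[str]] = field(default_factory=dict)  # name -> facts
    episodic: list[Episode] = field(default_factory=list)

    def remember_fact(self, name: str, fact: str) -> None:
        self.semantic.setdefault(name, []).append(fact)

    def remember_event(self, time: float, place: str, description: str) -> None:
        self.episodic.append(Episode(time, place, description))

    def recall_about(self, name: str) -> list[str]:
        """Combine known facts with episodes that mention the entity."""
        facts = list(self.semantic.get(name, []))
        episodes = [e.description for e in self.episodic if name in e.description]
        return facts + episodes

memory = LifelongMemory()
memory.remember_fact("Alice", "runs the bakery on Main Street")
memory.remember_event(10.5, "Main Street", "Talked with Alice about the festival")
print(memory.recall_about("Alice"))
```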
☆ EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations ACL 2025
Recent advances in large language models and vision-language models have led
to growing interest in explainable evaluation metrics for image captioning.
However, these metrics generate explanations without standardized criteria, and
the overall quality of the generated explanations remains unverified. In this
paper, we propose EXPERT, a reference-free evaluation metric that provides
structured explanations based on three fundamental criteria: fluency,
relevance, and descriptiveness. By constructing large-scale datasets of
high-quality structured explanations, we develop a two-stage evaluation
template to effectively supervise a vision-language model for both scoring and
explanation generation. EXPERT achieves state-of-the-art results on benchmark
datasets while providing significantly higher-quality explanations than
existing metrics, as validated through comprehensive human evaluation. Our code
and datasets are available at https://github.com/hjkim811/EXPERT.
comment: Accepted at ACL 2025 Findings
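A hedged sketch of what a two-stage scoring-then-explanation template over the three criteria might look like is given below; the prompts, scale, and stubbed vision-language call are assumptions, not EXPERT's released templates.

```python
# Illustrative two-stage evaluation: first score the caption, then request a
# per-criterion explanation. The `vlm` stub stands in for a supervised
# vision-language model; prompts and the 0-100 scale are assumptions.
CRITERIA = ["fluency", "relevance", "descriptiveness"]

SCORING_PROMPT = (
    "Given the image and the caption below, rate the caption from 0 to 100.\n"
    "Consider {criteria}.\nCaption: {caption}\nScore:"
)
EXPLANATION_PROMPT = (
    "The caption received a score of {score}. For each criterion "
    "({criteria}), briefly explain the judgment.\nCaption: {caption}\nExplanation:"
)

def vlm(prompt: str, image_path: str) -> str:
    """Placeholder for a call to the supervised vision-language model."""
    return "85" if prompt.startswith("Given") else "Fluent and relevant, but omits background details."

def evaluate_caption(image_path: str, caption: str) -> dict:
    score = vlm(SCORING_PROMPT.format(criteria=", ".join(CRITERIA), caption=caption), image_path)
    explanation = vlm(
        EXPLANATION_PROMPT.format(score=score, criteria=", ".join(CRITERIA), caption=caption),
        image_path,
    )
    return {"score": score, "explanation": explanation}

print(evaluate_caption("example.jpg", "A dog runs across a grassy field."))
```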
☆ Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
The progress of Large Language Models (LLMs) like ChatGPT raises the question
of how they can be integrated into education. One hope is that they can support
mathematics learning, including word-problem solving. Since LLMs can handle
textual input with ease, they appear well-suited for solving mathematical word
problems. Yet their real competence, whether they can make sense of the
real-world context, and the implications for classrooms remain unclear. We
conducted a scoping review from a mathematics-education perspective, including
three parts: a technical overview, a systematic review of word problems used in
research, and a state-of-the-art empirical evaluation of LLMs on mathematical
word problems. First, in the technical overview, we contrast the
conceptualization of word problems and their solution processes between LLMs
and students. In computer-science research this is typically labeled
mathematical reasoning, a term that does not align with usage in mathematics
education. Second, our literature review of 213 studies shows that the most
popular word-problem corpora are dominated by s-problems, which do not require
consideration of the realities of their real-world context. Finally, our
evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems
shows that most recent LLMs solve these s-problems with near-perfect accuracy,
including a perfect score on 20 problems from PISA. LLMs still showed
weaknesses in tackling problems where the real-world context is problematic or
nonsensical. In sum, we argue based on all three aspects that LLMs have
mastered a superficial solution process but do not make sense of word problems,
which potentially limits their value as instructional tools in mathematics
classrooms.
☆ Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning ACL 2025
Congenital heart disease (CHD) presents complex, lifelong challenges often
underrepresented in traditional clinical metrics. While unstructured narratives
offer rich insights into patient and caregiver experiences, manual thematic
analysis (TA) remains labor-intensive and unscalable. We propose a fully
automated large language model (LLM) pipeline that performs end-to-end TA on
clinical narratives, which eliminates the need for manual coding or full
transcript review. Our system employs a novel multi-agent framework, where
specialized LLM agents assume roles to enhance theme quality and alignment with
human analysis. To further improve thematic relevance, we optionally integrate
reinforcement learning from human feedback (RLHF). This supports scalable,
patient-centered analysis of large qualitative datasets and allows LLMs to be
fine-tuned for specific clinical contexts.
comment: Presented at ACL 2025 SRW
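A minimal, hedged sketch of a multi-agent thematic-analysis loop of the kind described above follows; the agent roles, prompts, and stubbed LLM call are illustrative assumptions rather than the paper's actual pipeline.

```python
# Illustrative multi-agent thematic analysis: a coder agent proposes codes per
# narrative, a synthesizer groups codes into themes, and a reviewer checks them.
def llm(role: str, prompt: str) -> str:
    """Placeholder for a role-conditioned LLM call."""
    return f"[{role}] output for: {prompt[:40]}..."

def thematic_analysis(narratives: list[str]) -> str:
    codes = [llm("coder", f"Assign qualitative codes to this narrative: {n}") for n in narratives]
    themes = llm("synthesizer", "Group these codes into candidate themes: " + "; ".join(codes))
    reviewed = llm("reviewer", f"Check the themes for overlap and clinical relevance: {themes}")
    return reviewed

narratives = [
    "Caregiver describes anxiety before each follow-up echocardiogram.",
    "Patient reports difficulty explaining activity restrictions at school.",
]
print(thematic_analysis(narratives))
```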
☆ Machine Understanding of Scientific Language
Scientific information expresses human understanding of nature. This
knowledge is largely disseminated in different forms of text, including
scientific papers, news articles, and discourse among people on social media.
While important for accelerating our pursuit of knowledge, not all scientific
text is faithful to the underlying science. As the volume of this text has
burgeoned online in recent years, it has become a problem of societal
importance to be able to identify the faithfulness of a given piece of
scientific text automatically. This thesis is concerned with the cultivation of
datasets, methods, and tools for machine understanding of scientific language,
in order to analyze and understand science communication at scale. To arrive at
this, I present several contributions in three areas of natural language
processing and machine learning: automatic fact checking, learning with limited
data, and scientific text processing. These contributions include new methods
and resources for identifying check-worthy claims, adversarial claim
generation, multi-source domain adaptation, learning from crowd-sourced labels,
cite-worthiness detection, zero-shot scientific fact checking, detecting
exaggerated scientific claims, and modeling degrees of information change in
science communication. Critically, I demonstrate how the research outputs of
this thesis are useful for effectively learning from limited amounts of
scientific text in order to identify misinformative scientific statements and
generate new insights into the science communication process
comment: PhD Thesis, 210 pages
☆ TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation
Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
Conducting supervised fine-tuning and preference fine-tuning on large
language models (LLMs) requires high-quality datasets to improve their ability
to follow instructions and align with human preferences and values. However,
constructing such datasets is resource-intensive, and most available datasets
for supervised and preference fine-tuning are in English. To address these
challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP)
framework, which
facilitates automated and scalable construction of preference datasets across
various languages. TaP is grounded in a structured taxonomy that allows
fine-grained control over dataset composition, thereby ensuring both diversity
and comprehensive coverage. We employ TaP-generated datasets to perform
supervised and preference fine-tuning on various LLMs. Experimental results
demonstrate that LLMs trained on TaP-generated datasets outperform those
trained on existing open-source datasets. Remarkably, LLMs trained on
TaP-generated datasets surpass the performance of those trained on an
open-source dataset that is 180 times larger.
comment: 33 pages, 15 tables, 11 figures
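As a hedged illustration of taxonomy-guided generation, the sketch below enumerates the leaves of a small topic taxonomy to control dataset composition and requests an instruction plus chosen/rejected responses for each; the taxonomy, prompts, and generator stub are assumptions, not TaP's released pipeline.

```python
# Illustrative taxonomy-guided preference data generation: walk taxonomy leaves
# to control coverage, then synthesize an instruction and a preference pair.
import itertools

taxonomy = {
    "reasoning": ["arithmetic word problems", "logical deduction"],
    "safety": ["refusing harmful requests", "privacy-preserving answers"],
    "multilinguality": ["translation", "cross-lingual QA"],
}

def generate(prompt: str) -> str:
    """Placeholder for a call to whatever LLM produces the synthetic data."""
    return f"<model output for: {prompt[:40]}...>"

def build_preference_examples(per_leaf: int = 2) -> list[dict]:
    examples = []
    for category, leaves in taxonomy.items():
        for leaf, i in itertools.product(leaves, range(per_leaf)):
            instruction = generate(f"Write instruction #{i} about {leaf} ({category}).")
            chosen = generate(f"Give a high-quality answer to: {instruction}")
            rejected = generate(f"Give a plausible but flawed answer to: {instruction}")
            examples.append({"instruction": instruction, "chosen": chosen, "rejected": rejected})
    return examples

print(len(build_preference_examples()))  # 6 leaves x 2 = 12 preference pairs
```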
☆ LLM Agents Are the Antidote to Walled Gardens
While the Internet's core infrastructure was designed to be open and
universal, today's application layer is dominated by closed, proprietary
platforms. Open and interoperable APIs require significant investment, and
market leaders have little incentive to enable data exchange that could erode
their user lock-in. We argue that LLM-based agents fundamentally disrupt this
status quo. Agents can automatically translate between data formats and
interact with interfaces designed for humans: this makes interoperability
dramatically cheaper and effectively unavoidable. We name this shift universal
interoperability: the ability for any two digital services to exchange data
seamlessly using AI-mediated adapters. Universal interoperability undermines
monopolistic behaviours and promotes data portability. However, it can also
lead to new security risks and technical debt. Our position is that the ML
community should embrace this development while building the appropriate
frameworks to mitigate the downsides. By acting now, we can harness AI to
restore user freedom and competitive markets without sacrificing security.
♻ ☆ Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing SIGIR 2025
Retrieval Augmented Generation (RAG) has shown strong capability in enhancing
language models' knowledge and reducing AI generative hallucinations, driving
its widespread use. However, complex tasks requiring multi-round retrieval
remain challenging, and early attempts tend to be overly optimistic without a
good sense of self-skepticism. Current multi-round RAG systems may continue
searching even when enough information has already been retrieved, or they may
provide incorrect answers without having sufficient information or knowledge.
Existing solutions either require large amounts of expensive human-labeled
process supervision data or lead to subpar performance. This paper aims to
address these limitations by introducing a new framework, SIM-RAG, to
explicitly enhance RAG systems' self-awareness and multi-round retrieval
capabilities. To train SIM-RAG, we first let a RAG system self-practice
multi-round retrieval, augmenting existing question-answer pairs with
intermediate inner monologue reasoning steps to generate synthetic training
data. For each pair, the system may explore multiple retrieval paths, which are
labeled as successful if they reach the correct answer and unsuccessful
otherwise. Using this data, we train a lightweight information sufficiency
Critic. At inference time, the Critic evaluates whether the RAG system has
retrieved sufficient information at each round, guiding retrieval decisions and
improving system-level self-awareness through in-context reinforcement
learning. Experiments across multiple prominent RAG benchmarks show that
SIM-RAG is an effective multi-round RAG solution. Furthermore, this framework
is system-efficient, adding a lightweight component to RAG without requiring
modifications to existing LLMs or search engines, and data-efficient,
eliminating the need for costly human-annotated mid-step retrieval process
supervision data.
comment: Proceedings of the 48th International ACM SIGIR 2025
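A minimal sketch of the inference-time loop described above, in which a sufficiency Critic decides after each round whether to stop retrieving, is shown below; the retriever, critic, and answerer stubs are placeholders, not SIM-RAG's components.

```python
# Hedged sketch of Critic-guided multi-round retrieval: retrieve, ask the
# Critic whether the evidence is sufficient, and stop or continue accordingly.
def multi_round_rag(question: str, max_rounds: int = 5) -> str:
    evidence: list[str] = []
    for _ in range(max_rounds):
        query = question if not evidence else f"{question} given {evidence[-1]}"
        evidence.extend(retrieve(query))
        if critic_is_sufficient(question, evidence):  # the trained Critic's call
            break                                      # stop: enough information
    return answer(question, evidence)

def retrieve(query: str) -> list[str]:
    return [f"document about: {query[:30]}"]           # stand-in retriever

def critic_is_sufficient(question: str, evidence: list[str]) -> bool:
    return len(evidence) >= 2                          # stand-in sufficiency check

def answer(question: str, evidence: list[str]) -> str:
    return f"answer to '{question}' using {len(evidence)} documents"

print(multi_round_rag("Who proposed retrieval-augmented generation?"))
```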
♻ ☆ SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs? ACL'25
Recent advancements in LLMs unlearning have shown remarkable success in
removing unwanted data-model influences while preserving the model's utility
for legitimate knowledge. Despite these strides, sparse Mixture-of-Experts
(MoE) LLMs--a key subset of the LLM family--have remained unexplored in the
context of unlearning. As MoE LLMs are celebrated for their exceptional
performance, we ask: How can unlearning be performed effectively and efficiently
on MoE LLMs? Our pilot study shows that the dynamic routing nature of MoE LLMs
introduces unique challenges, leading to excessive forgetting, uncontrolled
knowledge erasure and substantial utility drops when existing unlearning
methods are applied. To address this, we propose a novel Selected-Expert
Unlearning Framework (SEUF). Through expert attribution, unlearning is
concentrated on the most actively engaged experts for the specified knowledge.
Concurrently, an anchor loss is applied to the router to stabilize the active
state of this targeted expert, ensuring focused and controlled unlearning. SEUF
is compatible with various standard unlearning algorithms. Extensive
experiments demonstrate that SEUF improves forget quality by up to 5% and
model utility by 35% on MoE LLMs across various benchmarks and LLM
architectures (compared to standard unlearning algorithms), while only
unlearning 0.06% of the model parameters.
comment: Accepted to ACL'25
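As a hedged illustration of the expert-attribution step, the sketch below averages router gating probabilities over forget-set tokens and selects the most engaged experts as unlearning targets; the shapes and top-k rule are assumptions, and the anchor loss and unlearning update themselves are not shown.

```python
# Illustrative expert attribution for MoE unlearning: rank experts by their
# average routing probability over tokens from the forget set.
import numpy as np

def attribute_experts(router_probs: np.ndarray, top_k: int = 1) -> list[int]:
    """router_probs: (n_forget_tokens, n_experts) gating distribution per token."""
    mean_engagement = router_probs.mean(axis=0)              # average load per expert
    return [int(i) for i in np.argsort(mean_engagement)[::-1][:top_k]]

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 8))                          # stand-in router logits
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
targets = attribute_experts(probs, top_k=1)
print(f"unlearning is restricted to expert(s): {targets}")
```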
♻ ☆ KMI: A Dataset of Korean Motivational Interviewing Dialogues for Psychotherapy NAACL 2025
The increasing demand for mental health services has led to the rise of
AI-driven mental health chatbots, though challenges related to privacy, data
collection, and expertise persist. Motivational Interviewing (MI) is gaining
attention as a theoretical basis for boosting expertise in the development of
these chatbots. However, existing datasets show limitations for training
chatbots, leading to a substantial demand for publicly available resources in
the field of MI and psychotherapy. These challenges are even more pronounced in
non-English languages, where they receive less attention. In this paper, we
propose a novel framework that simulates MI sessions enriched with the
expertise of professional therapists. We train an MI forecaster model that
mimics the behavioral choices of professional therapists and employ Large
Language Models (LLMs) to generate utterances through prompt engineering. Then,
we present KMI, the first synthetic dataset theoretically grounded in MI,
containing 1,000 high-quality Korean Motivational Interviewing dialogues.
Through an extensive expert evaluation of the generated dataset and the
dialogue model trained on it, we demonstrate the quality, expertise, and
practicality of KMI. We also introduce novel metrics derived from MI theory in
order to evaluate dialogues from the perspective of MI.
comment: Accepted at NAACL 2025 Main Conference
♻ ☆ Position: Machine Learning Conferences Should Establish a "Refutations and Critiques" Track
Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Koustuv Sinha, Francesco Orabona, Sanmi Koyejo, David Donoho
Science progresses by iteratively advancing and correcting humanity's
understanding of the world. In machine learning (ML) research, rapid
advancements have led to an explosion of publications, but have also led to
misleading, incorrect, flawed or perhaps even fraudulent studies being accepted
and sometimes highlighted at ML conferences due to the fallibility of peer
review. While such mistakes are understandable, ML conferences do not offer
robust processes to help the field systematically correct when such errors are
made. This position paper argues that ML conferences should establish a
dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would provide
a high-profile, reputable platform to support vital research that critically
challenges prior research, thereby fostering a dynamic self-correcting research
ecosystem. We discuss key considerations including track design, review
principles, potential pitfalls, and provide an illustrative example submission
concerning a recent ICLR 2025 Oral. We conclude that ML conferences should
create official, reputable mechanisms to help ML research self-correct.
♻ ☆ Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation CCS 2025
Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to
generate grounded responses by leveraging external knowledge databases without
altering model parameters. Although the absence of weight tuning prevents
leakage via model parameters, it introduces the risk of inference adversaries
exploiting retrieved documents in the model's context. Existing methods for
membership inference and data extraction often rely on jailbreaking or
carefully crafted unnatural queries, which can be easily detected or thwarted
with query rewriting techniques common in RAG systems. In this work, we present
Interrogation Attack (IA), a membership inference technique targeting documents
in the RAG datastore. By crafting natural-text queries that are answerable only
with the target document's presence, our approach demonstrates successful
inference with just 30 queries while remaining stealthy; straightforward
detectors identify adversarial prompts from existing methods up to ~76x more
frequently than those generated by our attack. We observe a 2x improvement in
TPR@1%FPR over prior inference attacks across diverse RAG configurations, all
while costing less than $0.02 per document inference.
comment: This is the full version (27 pages) of the paper 'Riddle Me This!
Stealthy Membership Inference for Retrieval-Augmented Generation' published
at CCS 2025
♻ ☆ LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries ACL 2025
Zekun Wu, Seonglae Cho, Umar Mohammed, Cristian Munoz, Kleyton Costa, Xin Guan, Theo King, Ze Wang, Emre Kazim, Adriano Koshiyama
Open-source AI libraries are foundational to modern AI systems, yet they
present significant, underexamined risks spanning security, licensing,
maintenance, supply chain integrity, and regulatory compliance. We introduce
LibVulnWatch, a system that leverages recent advances in large language models
and agentic workflows to perform deep, evidence-based evaluations of these
libraries. Built on a graph-based orchestration of specialized agents, the
framework extracts, verifies, and quantifies risk using information from
repositories, documentation, and vulnerability databases. LibVulnWatch produces
reproducible, governance-aligned scores across five critical domains,
publishing results to a public leaderboard for ongoing ecosystem monitoring.
Applied to 20 widely used libraries, including ML frameworks, LLM inference
engines, and agent orchestration tools, our approach covers up to 88% of
OpenSSF Scorecard checks while surfacing up to 19 additional risks per library,
such as critical RCE vulnerabilities, missing SBOMs, and regulatory gaps. By
integrating advanced language technologies with the practical demands of
software risk assessment, this work demonstrates a scalable, transparent
mechanism for continuous supply chain evaluation and informed library
selection.
comment: ACL 2025 Student Research Workshop and ICML 2025 TAIG Workshop
♻ ☆ TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou
This paper investigates Reinforcement Learning (RL) on data without explicit
labels for reasoning tasks in Large Language Models (LLMs). The core challenge
of the problem is reward estimation during inference while not having access to
ground-truth information. While this setting appears elusive, we find that
common practices in Test-Time Scaling (TTS), such as majority voting, yield
surprisingly effective rewards suitable for driving RL training. In this work,
we introduce Test-Time Reinforcement Learning (TTRL), a novel method for
training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs
by utilizing the priors in the pre-trained models. Our experiments demonstrate
that TTRL consistently improves performance across a variety of tasks and
models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by
approximately 211% on the AIME 2024 with only unlabeled test data. Furthermore,
although TTRL is supervised only by the maj@n metric, it consistently surpasses
the maj@n upper bound of the initial model and approaches the performance of
models trained directly on test data with
ground-truth labels. Our experimental findings validate the general
effectiveness of TTRL across various tasks and highlight TTRL's potential for
broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
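A minimal sketch of the majority-voting reward described above follows: sample several answers per unlabeled question, take the most common answer as a pseudo-label, and reward rollouts that agree with it; sampling is stubbed out and the exact reward shaping in TTRL may differ.

```python
# Hedged sketch of a majority-vote pseudo-label reward for RL on unlabeled data.
from collections import Counter

def majority_vote_rewards(sampled_answers: list[str]) -> tuple[str, list[float]]:
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if ans == pseudo_label else 0.0 for ans in sampled_answers]
    return pseudo_label, rewards

answers = ["42", "42", "41", "42", "7"]      # stand-in for n sampled rollouts
label, rewards = majority_vote_rewards(answers)
print(label, rewards)                        # '42' [1.0, 1.0, 0.0, 1.0, 0.0]
```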