Computation and Language
☆ StepWiser: Stepwise Generative Judges for Wiser Reasoning
Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
As models increasingly leverage multi-step reasoning strategies to solve
complex problems, supervising the logical validity of these intermediate steps
has become a critical research challenge. Process reward models address this by
providing step-by-step feedback, but current approaches have two major
drawbacks: they typically function as classifiers without providing
explanations, and their reliance on supervised fine-tuning with static datasets
limits generalization. Inspired by recent advances, we reframe stepwise reward
modeling from a classification task to a reasoning task itself. We thus propose
a generative judge that reasons about the policy model's reasoning steps (i.e.,
meta-reasons), outputting thinking tokens before delivering a final verdict.
Our model, StepWiser, is trained by reinforcement learning using relative
outcomes of rollouts. We show that it (i) achieves better judgment accuracy on
intermediate steps than existing methods; (ii) can be used to improve the
policy model at training time; and (iii) improves inference-time search.
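A minimal sketch of one way relative rollout outcomes can be turned into stepwise labels: estimate how often rollouts succeed before versus after a candidate step is included. The `policy.continue_from` and `problem.gold_answer` names are hypothetical placeholders, not the paper's API; StepWiser itself trains a generative judge with RL on this kind of signal rather than thresholding it directly.

```python
def rollout_success_rate(policy, problem, prefix_steps, n=8):
    """Estimate how often the policy reaches a correct final answer
    when continuing from a given prefix of reasoning steps."""
    wins = 0
    for _ in range(n):
        answer = policy.continue_from(problem, prefix_steps)  # hypothetical call
        wins += int(answer == problem.gold_answer)            # hypothetical attribute
    return wins / n

def relative_step_label(policy, problem, steps, i, margin=0.05):
    """Label step i by comparing rollout success with and without it."""
    before = rollout_success_rate(policy, problem, steps[:i])
    after = rollout_success_rate(policy, problem, steps[:i + 1])
    return "good" if after >= before - margin else "bad"
```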
☆ Generative Interfaces for Language Models
Large language models (LLMs) are increasingly seen as assistants, copilots,
and consultants, capable of supporting a wide range of tasks through natural
conversation. However, most systems remain constrained by a linear
request-response format that often makes interactions inefficient in
multi-turn, information-dense, and exploratory tasks. To address these
limitations, we propose Generative Interfaces for Language Models, a paradigm
in which LLMs respond to user queries by proactively generating user interfaces
(UIs) that enable more adaptive and interactive engagement. Our framework
leverages structured interface-specific representations and iterative
refinements to translate user queries into task-specific UIs. For systematic
evaluation, we introduce a multidimensional assessment framework that compares
generative interfaces with traditional chat-based ones across diverse tasks,
interaction patterns, and query types, capturing functional, interactive, and
emotional aspects of user experience. Results show that generative interfaces
consistently outperform conversational ones, with humans preferring them in
over 70% of cases. These findings clarify when and why users favor generative
interfaces, paving the way for future advancements in human-AI interaction.
comment: Preprint
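To make the idea of "structured interface-specific representations" concrete, here is an illustrative sketch of the kind of intermediate UI spec an LLM might emit before rendering; the schema, field names, and component types are invented for illustration and are not the paper's actual representation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UIComponent:
    kind: str                       # e.g. "table", "slider", "dropdown" (illustrative)
    label: str
    options: List[str] = field(default_factory=list)

@dataclass
class GeneratedInterface:
    title: str
    components: List[UIComponent]

# For "compare these three laptops", a generative interface might emit a
# comparison table plus filter controls instead of a long chat reply.
spec = GeneratedInterface(
    title="Laptop comparison",
    components=[
        UIComponent(kind="table", label="Specs side by side"),
        UIComponent(kind="slider", label="Max budget"),
        UIComponent(kind="dropdown", label="Primary use",
                    options=["gaming", "travel", "office"]),
    ],
)
```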
☆ Evaluating the Evaluators: Are readability metrics good measures of readability?
Plain Language Summarization (PLS) aims to distill complex documents into
accessible summaries for non-expert audiences. In this paper, we conduct a
thorough survey of PLS literature, and identify that the current standard
practice for readability evaluation is to use traditional readability metrics,
such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in
other fields, these metrics have not been compared to human readability
judgments in PLS. We evaluate 8 readability metrics and show that most
correlate poorly with human judgments, including the most popular metric, FKGL.
We then show that Language Models (LMs) are better judges of readability, with
the best-performing model achieving a Pearson correlation of 0.56 with human
judgments. Extending our analysis to PLS datasets, which contain summaries
aimed at non-expert audiences, we find that LMs better capture deeper measures
of readability, such as required background knowledge, and lead to different
conclusions than the traditional metrics. Based on these findings, we offer
recommendations for best practices in the evaluation of plain language
summaries. We release our analysis code and survey data.
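A small sketch of the kind of comparison the paper describes: score summaries with FKGL and correlate against human readability ratings. The summaries and ratings below are placeholder data; `textstat` and `scipy` are assumed available.

```python
from scipy.stats import pearsonr
import textstat

summaries = [
    "The heart pumps blood around the body.",
    "Doctors use this test to check how well your kidneys work.",
    "Myocardial contractility modulates systemic hemodynamic perfusion.",
    "Pharmacokinetic heterogeneity complicates dose titration in renal impairment.",
]
human_readability = [4.8, 4.5, 1.9, 1.5]  # placeholder 1-5 ratings (higher = easier)

# Traditional metric: Flesch-Kincaid Grade Level (lower grade = easier text).
fkgl = [textstat.flesch_kincaid_grade(s) for s in summaries]

# Negate FKGL so both scores point in the "easier" direction, then correlate.
r, p = pearsonr([-g for g in fkgl], human_readability)
print(f"FKGL vs. human judgments: Pearson r = {r:.2f} (p = {p:.3f})")
```

The same correlation, computed against an LM judge's scores instead of FKGL, is the comparison behind the reported 0.56 Pearson correlation.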
☆ VibeVoice Technical Report
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
This report presents VibeVoice, a novel model designed to synthesize
long-form speech with multiple speakers by employing next-token diffusion,
which is a unified method for modeling continuous data by autoregressively
generating latent vectors via diffusion. To enable this, we introduce a novel
continuous speech tokenizer that, when compared to the popular Encodec model,
improves data compression by 80 times while maintaining comparable performance.
The tokenizer effectively preserves audio fidelity while significantly boosting
computational efficiency for processing long sequences. Thus, VibeVoice can
synthesize long-form speech of up to 90 minutes (within a 64K context window)
with a maximum of 4 speakers, capturing the authentic conversational
``vibe'' and surpassing open-source and proprietary dialogue models.
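A heavily hedged skeleton of the next-token diffusion idea described above (an autoregressive backbone conditioning a diffusion head that denoises the next continuous latent frame). `lm` and `diffusion_head` are placeholder modules with hypothetical interfaces, not VibeVoice's actual architecture.

```python
import torch

def generate_speech_latents(lm, diffusion_head, text_tokens, n_frames):
    """Autoregressively generate continuous latent frames via diffusion."""
    latents = []
    for _ in range(n_frames):
        # Hidden state summarising the text plus latents generated so far.
        h = lm(text_tokens, latents)                        # hypothetical call
        z = torch.randn(diffusion_head.latent_dim)          # start from noise
        for t in reversed(range(diffusion_head.num_steps)):
            z = diffusion_head.denoise(z, t, condition=h)   # hypothetical call
        latents.append(z)
    return torch.stack(latents)  # a separate decoder would turn these into audio
```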
☆ Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
Scientific problem solving poses unique challenges for LLMs, requiring both
deep domain knowledge and the ability to apply such knowledge through complex
reasoning. While automated scientific reasoners hold great promise for
assisting human scientists, there is currently no widely adopted holistic
benchmark for evaluating scientific reasoning, and few approaches
systematically disentangle the distinct roles of knowledge and reasoning in
these tasks. To address these gaps, we introduce SciReas, a diverse suite of
existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a
selective subset that requires more complex reasoning. Our holistic evaluation
surfaces insights about scientific reasoning performance that remain hidden
when relying on individual benchmarks alone. We then propose KRUX, a probing
framework for studying the distinct roles of reasoning and knowledge in
scientific tasks. Combining the two, we conduct an in-depth analysis that
yields several key findings: (1) Retrieving task-relevant knowledge from model
parameters is a critical bottleneck for LLMs in scientific reasoning; (2)
Reasoning models consistently benefit from external knowledge added in-context
on top of the reasoning enhancement; (3) Enhancing verbalized reasoning
improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct
a lightweight analysis, comparing our science-focused data composition with
concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline
for scientific reasoning.
comment: 28 pages, 16 figures
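A minimal sketch of the kind of knowledge-vs-reasoning probe the KRUX framework implies: answer each question closed-book and again with task-relevant knowledge placed in-context, then compare accuracy. `ask_llm` is a generic placeholder for any completion call; the prompt format is illustrative, not the paper's.

```python
def knowledge_probe(ask_llm, questions, knowledge_snippets):
    """questions: dicts with 'text' and 'gold'; knowledge_snippets: list of fact lists."""
    scores = {"closed_book": 0, "knowledge_in_context": 0}
    for q, facts in zip(questions, knowledge_snippets):
        base = ask_llm(f"Question: {q['text']}\nAnswer:")
        augmented = ask_llm(
            "Relevant facts:\n- " + "\n- ".join(facts)
            + f"\n\nQuestion: {q['text']}\nAnswer:"
        )
        scores["closed_book"] += int(q["gold"].lower() in base.lower())
        scores["knowledge_in_context"] += int(q["gold"].lower() in augmented.lower())
    n = len(questions)
    return {k: v / n for k, v in scores.items()}
```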
☆ The Ramon Llull's Thinking Machine for Automated Ideation
Xinran Zhao, Boyuan Zheng, Chenglei Si, Haofei Yu, Ken Liu, Runlong Zhou, Ruochen Li, Tong Chen, Xiang Li, Yiming Zhang, Tongshuang Wu
This paper revisits Ramon Llull's Ars combinatoria - a medieval framework for
generating knowledge through symbolic recombination - as a conceptual
foundation for building a modern Llull's thinking machine for research
ideation. Our approach defines three compositional axes: Theme (e.g.,
efficiency, adaptivity), Domain (e.g., question answering, machine
translation), and Method (e.g., adversarial training, linear attention). These
elements represent high-level abstractions common in scientific work -
motivations, problem settings, and technical approaches - and serve as building
blocks for LLM-driven exploration. We mine elements from human experts or
conference papers and show that prompting LLMs with curated combinations
produces research ideas that are diverse, relevant, and grounded in current
literature. This modern thinking machine offers a lightweight, interpretable
tool for augmenting scientific creativity and suggests a path toward
collaborative ideation between humans and AI.
comment: 21 pages, 3 figures
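The combinatorial core is simple enough to sketch directly: enumerate Theme x Domain x Method triples and format each as an ideation prompt for an LLM. The element lists below are the examples given in the abstract; the prompt wording is illustrative.

```python
from itertools import product

themes = ["efficiency", "adaptivity"]
domains = ["question answering", "machine translation"]
methods = ["adversarial training", "linear attention"]

# Every Theme x Domain x Method combination becomes a seed prompt
# that an LLM can expand into a grounded research idea.
for theme, domain, method in product(themes, domains, methods):
    prompt = (
        f"Propose a research idea that improves {theme} in {domain} "
        f"by building on {method}. Ground the idea in recent literature."
    )
    print(prompt)
```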
☆ Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs EMNLP 2025
Large vision-language models (LVLMs) demonstrate strong visual question
answering (VQA) capabilities but are shown to hallucinate. A reliable model
should perceive its knowledge boundaries: knowing what it knows and what it does
not. This paper investigates LVLMs' perception of their knowledge boundaries by
evaluating three types of confidence signals: probabilistic confidence, answer
consistency-based confidence, and verbalized confidence. Experiments on three
LVLMs across three VQA datasets show that, although LVLMs possess a reasonable
perception level, there is substantial room for improvement. Among the three
confidences, probabilistic and consistency-based signals are more reliable
indicators, while verbalized confidence often leads to overconfidence. To
enhance LVLMs' perception, we adapt several established confidence calibration
methods from Large Language Models (LLMs) and propose three effective methods.
Additionally, we compare LVLMs with their LLM counterparts, finding that
jointly processing visual and textual inputs decreases question-answering
performance but also lowers confidence, resulting in an improved perception
level compared to LLMs.
comment: EMNLP 2025 Findings
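A hedged sketch of two of the three confidence signals named in the abstract: probabilistic confidence from token log-probabilities, and answer-consistency confidence from agreement across samples. `sample_answer` is a placeholder VQA call; the exact definitions used in the paper may differ.

```python
import math
from collections import Counter

def probabilistic_confidence(token_logprobs):
    """Mean token probability of the generated answer (one common choice)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def consistency_confidence(sample_answer, question, image, k=10):
    """Fraction of sampled answers that agree with the majority answer.
    sample_answer(question, image) -> str is a placeholder VQA call."""
    answers = [sample_answer(question, image) for _ in range(k)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / k
```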
☆ Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic
Quantitative Discourse Analysis has seen growing adoption with the rise of
Large Language Models and computational tools. However, reliance on black box
software such as MAXQDA and NVivo risks undermining methodological transparency
and alignment with research goals. This paper presents a hybrid, transparent
framework for QDA that combines lexical and semantic methods to enable
triangulation, reproducibility, and interpretability. Drawing from a case study
in historical political discourse, we demonstrate how custom Python pipelines
using NLTK, spaCy, and Sentence Transformers allow fine-grained control over
preprocessing, lemmatisation, and embedding generation. We further detail our
iterative BERTopic modelling process, incorporating UMAP dimensionality
reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction, optimised
through parameter tuning and multiple runs to enhance topic coherence and
coverage. By juxtaposing precise lexical searches with context-aware semantic
clustering, we argue for a multi-layered approach that mitigates the
limitations of either method in isolation. Our workflow underscores the
importance of code-level transparency, researcher agency, and methodological
triangulation in computational discourse studies. Code and supplementary
materials are available via GitHub.
comment: 5 pages conference paper, 4 tables
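Since the abstract names the exact components (Sentence Transformers embeddings, UMAP, HDBSCAN, c-TF-IDF), a minimal BERTopic pipeline can be sketched; the parameter values below are illustrative defaults, not the tuned settings from the paper.

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

def build_topic_model(docs):
    """docs: preprocessed, lemmatised text segments (list of str)."""
    topic_model = BERTopic(
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
        hdbscan_model=HDBSCAN(min_cluster_size=10, min_samples=5, prediction_data=True),
        vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(1, 2)),
    )
    topics, probs = topic_model.fit_transform(docs)
    return topic_model, topics, probs
```

Multiple runs with varied UMAP/HDBSCAN parameters, compared on topic coherence and coverage, correspond to the iterative tuning the paper describes.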
☆ Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index
This research presents a Retrieval-Augmented Generation (RAG) framework for
art provenance studies, focusing on the Getty Provenance Index. Provenance
research establishes the ownership history of artworks, which is essential for
verifying authenticity, supporting restitution and legal claims, and
understanding the cultural and historical context of art objects. The process
is complicated by fragmented, multilingual archival data that hinders efficient
retrieval. Current search portals require precise metadata, limiting
exploratory searches. Our method enables natural-language and multilingual
searches through semantic retrieval and contextual summarization, reducing
dependence on metadata structures. We assess RAG's capability to retrieve and
summarize auction records using a 10,000-record sample from the Getty
Provenance Index - German Sales. The results show this approach provides a
scalable solution for navigating art market archives, offering a practical tool
for historians and cultural heritage professionals conducting historically
sensitive research.
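A minimal sketch of the retrieval half of such a pipeline: multilingual semantic search over provenance records with sentence-transformers. The model name is one reasonable choice, not necessarily the one used in the study, and the retrieved records would then be passed with the question to an LLM for contextual summarisation.

```python
from sentence_transformers import SentenceTransformer, util

def search_provenance(records, query, top_k=5):
    """records: list of dicts with a 'text' field (e.g. auction entries).
    Retrieval is semantic, so the query can be plain natural language
    in any language the encoder covers."""
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    corpus_emb = encoder.encode([r["text"] for r in records], convert_to_tensor=True)
    query_emb = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    return [records[h["corpus_id"]] for h in hits]
```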
☆ It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs EMNLP 2025
Extremely low-resource languages, especially those written in rare scripts,
as shown in Figure 1, remain largely unsupported by large language models
(LLMs). This is due in part to compounding factors such as the lack of training
data. This paper delivers the first comprehensive analysis of whether LLMs can
acquire such languages purely via in-context learning (ICL), with or without
auxiliary alignment signals, and how these methods compare to
parameter-efficient fine-tuning (PEFT). We systematically evaluate 20
under-represented languages across three state-of-the-art multilingual LLMs.
Our findings highlight the limitations of PEFT when both the language and its
script are extremely under-represented in the LLM. In contrast, zero-shot ICL with
language alignment is impressively effective on extremely low-resource
languages, while few-shot ICL or PEFT is more beneficial for languages
relatively better represented by LLMs. For LLM practitioners working on
extremely low-resource languages, we summarise guidelines grounded in our
results for adapting LLMs to such languages, e.g., avoiding fine-tuning
a multilingual model on languages with unseen scripts.
comment: Accepted by EMNLP 2025
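One plausible reading of "zero-shot ICL with language alignment" is supplying a small bilingual word list in the prompt instead of fine-tuning; the sketch below builds such a prompt. The format and wording are illustrative assumptions, not the paper's exact template.

```python
def alignment_prompt(word_pairs, source_sentence, target_language):
    """word_pairs: small bilingual lexicon [(english, low_resource), ...].
    Returns a zero-shot prompt that exposes the LLM to word-level alignments."""
    lexicon = "\n".join(f"{en} = {lr}" for en, lr in word_pairs)
    return (
        f"The following word pairs align English with {target_language}:\n"
        f"{lexicon}\n\n"
        f"Using these alignments, translate into {target_language}:\n"
        f"{source_sentence}\n"
        f"Translation:"
    )
```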
☆ "Where does it hurt?" -- Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues ECAI 2025
Tom Röhr, Soumyadeep Roy, Fares Al Mohamad, Jens-Michalis Papaioannou, Wolfgang Nejdl, Felix Gers, Alexander Löser
In a doctor-patient dialogue, the primary objective of physicians is to
diagnose patients and propose a treatment plan. Medical doctors guide these
conversations through targeted questioning to efficiently gather the
information required to provide the best possible outcomes for patients. To the
best of our knowledge, this is the first work that studies physician intent
trajectories in doctor-patient dialogues. We use the `Ambient Clinical
Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with
medical professionals to develop a fine-grained taxonomy of physician intents
based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We
then conduct a large-scale annotation effort to label over 5000 doctor-patient
turns with the help of a large number of medical experts recruited using
Prolific, a popular crowd-sourcing platform. This large labeled dataset is an
important resource contribution that we use for benchmarking the
state-of-the-art generative and encoder models for medical intent
classification tasks. Our findings show that our models understand the general
structure of medical dialogues with high accuracy, but often fail to identify
transitions between SOAP categories. We also report for the first time common
trajectories in medical dialogue structures that provide valuable insights for
designing `differential diagnosis' systems. Finally, we extensively study the
impact of intent filtering for medical dialogue summarization and observe a
significant boost in performance. We make the codes and data, including
annotation guidelines, publicly available at
https://github.com/DATEXIS/medical-intent-classification.
comment: Accepted at ECAI 2025
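A small sketch of how common intent trajectories can be surfaced from turn-level annotations: count transitions between consecutive labels. The labels are the SOAP categories named in the abstract; the counting scheme itself is generic, not the paper's exact analysis.

```python
from collections import Counter

def transition_counts(dialogues):
    """dialogues: list of dialogues, each a list of per-turn intent labels.
    Returns bigram transition frequencies, the raw material for trajectory analysis."""
    counts = Counter()
    for turns in dialogues:
        for prev, curr in zip(turns, turns[1:]):
            counts[(prev, curr)] += 1
    return counts

example = [["Subjective", "Subjective", "Objective", "Assessment", "Plan"]]
print(transition_counts(example).most_common(3))
```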
♻ ☆ From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification CIKM 2025
In conversational AI systems, a critical challenge in training effective
multi-turn intent classification models lies in the generation of large-scale,
domain-specific, multilingual dialogue datasets. In this paper, we introduce
Chain-of-Intent, a novel framework that integrates Hidden Markov Models (HMMs)
with Large Language Models (LLMs) to generate intent-driven, context-aware
dialogues through self-play. Our method first extracts domain-specific intent
transition patterns from real-world e-commerce chat logs, which guide the
modeling of turn-level dynamics and intent sequences. LLMs are then employed to
parameterize the emission probabilities of HMMs, enabling the generation of
natural, coherent utterances aligned with predicted intents and dialogue
context. We further propose MINT-CL, a multi-task contrastive learning
framework for multi-turn intent classification, which improves performance
while reducing dependence on large-scale annotated datasets. Empirical results
demonstrate that our approach outperforms competitive baselines in both
dialogue generation quality and classification accuracy, particularly in
multilingual settings. To facilitate future research, we release MINT-E, a
comprehensive, multilingual, intent-aware multi-turn dialogue corpus derived
from the e-commerce domain. The source code and dataset are available at
https://github.com/junhua/chain-of-intent.
comment: Accepted to Proceedings of CIKM 2025
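A hedged sketch of the HMM half of Chain-of-Intent: sample a turn-level intent sequence from mined transition probabilities; each intent would then be verbalised by an LLM acting as the emission model, conditioned on the dialogue so far. The intents and probabilities below are invented placeholders (in the paper they are mined from real e-commerce chat logs).

```python
import random

TRANSITIONS = {  # placeholder intent-transition probabilities
    "<start>": {"track_order": 0.6, "product_question": 0.4},
    "track_order": {"delivery_delay": 0.5, "thanks": 0.5},
    "product_question": {"price_negotiation": 0.3, "thanks": 0.7},
    "delivery_delay": {"refund_request": 0.4, "thanks": 0.6},
    "price_negotiation": {"thanks": 1.0},
    "refund_request": {"thanks": 1.0},
    "thanks": {},
}

def sample_intent_chain(max_turns=6):
    """Walk the transition matrix to get a turn-level intent sequence."""
    state, chain = "<start>", []
    for _ in range(max_turns):
        nxt = TRANSITIONS.get(state, {})
        if not nxt:
            break
        state = random.choices(list(nxt), weights=list(nxt.values()))[0]
        chain.append(state)
    return chain

print(sample_intent_chain())
```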
♻ ☆ Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications
Large Language Models (LLMs) have significantly advanced natural language
processing, demonstrating strong capabilities in tasks such as text generation,
summarization, and reasoning. Recently, their potential for automating precise
text editing tasks across specialized domains, such as programming code, LaTeX,
and structured database languages, has gained attention. However, current
state-of-the-art LLMs still struggle with executing precise, instruction-driven
edits, particularly when structural accuracy and strict adherence to domain
conventions are required. To address these challenges, we introduce
InstrEditBench, an automated benchmark dataset comprising over 30,000
structured editing tasks spanning diverse domains, including Wikipedia
articles, LaTeX documents, source code, and database languages. Using this
benchmark, we develop FineEdit, a specialized editing model explicitly trained
for accurate, context-aware text modifications. Experimental evaluations
demonstrate that FineEdit outperforms state-of-the-art models, achieving
improvements of approximately 10\% over Gemini models on single-turn edits, up
to 30\% over Llama-3.2-3B, and exceeding Mistral-7B-OpenOrca performance by
over 40\% on direct editing tasks. FineEdit also effectively generalizes to
realistic multi-turn editing scenarios, highlighting its practical
applicability. To facilitate further research and reproducibility, we release
FineEdit at https://github.com/StuRinDQB/FineEdit and
https://huggingface.co/datasets/YimingZeng/FineEdit_bench.
♻ ☆ mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Large Vision-Language Models (LVLMs) have made remarkable strides in
multimodal tasks such as visual question answering, visual grounding, and
complex reasoning. However, they remain limited by static training data,
susceptibility to hallucinations, and inability to verify claims against
up-to-date, external evidence, compromising their performance in dynamic
real-world applications. Retrieval-Augmented Generation (RAG) offers a
practical solution to mitigate these challenges by allowing the LVLMs to access
large-scale knowledge databases via retrieval mechanisms, thereby grounding
model outputs in factual, contextually relevant information. In this paper, we
conduct the first systematic dissection of the multimodal RAG pipeline for
LVLMs, explicitly investigating (1) the retrieval phase: modality
configurations and retrieval strategies; (2) the re-ranking stage: strategies
to mitigate positional biases and improve the relevance of retrieved evidence;
and (3) the generation phase: how to best integrate retrieved candidates into
the final generation process. Finally, we
extend to explore a unified agentic framework that integrates re-ranking and
generation through self-reflection, enabling LVLMs to select relevant evidence
and suppress irrelevant context dynamically. Our full-stack exploration of RAG
for LVLMs yields substantial insights, resulting in an average performance
boost of 5% without any fine-tuning.
comment: 16 pages
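A hedged skeleton of the three-stage pipeline the abstract dissects: retrieve, re-rank, then generate with self-reflection. Every callable here is a placeholder for whatever retriever, re-ranker, and LVLM are plugged in; it is a sketch of the design space, not a specific implementation from the paper.

```python
def multimodal_rag(query, image, retriever, reranker, lvlm, k=20, keep=5):
    # (1) Retrieval: candidates may come from text, image, or mixed indices.
    candidates = retriever(query, image, top_k=k)

    # (2) Re-ranking: score each candidate against the query to counter
    # positional bias and drop weakly relevant evidence.
    ranked = sorted(candidates, key=lambda c: reranker(query, c), reverse=True)[:keep]

    # (3) Generation with self-reflection: the LVLM may discard evidence it
    # judges irrelevant before producing the final answer.
    evidence = [c for c in ranked if lvlm.judge_relevant(query, c)]  # hypothetical
    return lvlm.answer(query, image, evidence)                       # hypothetical
```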
♻ ☆ TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use EMNLP 2025
Junjie Ye, Yilong Wu, Sixian Li, Yuming Yang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, Zhengyin Du
Large language models (LLMs) achieve remarkable advancements by leveraging
tools to interact with environments, a critical step toward generalized AI.
However, the standard supervised fine-tuning (SFT) approach, which relies on
large-scale datasets, often overlooks task-specific characteristics in tool
use, leading to performance bottlenecks. To address this issue, we analyze
three existing LLMs and uncover key insights: training data can inadvertently
impede tool-use behavior, token importance is distributed unevenly, and errors
in tool calls fall into a small set of categories. Building on these findings,
we propose TL-Training, a task-feature-based framework that mitigates
the effects of suboptimal training data, dynamically adjusts token weights to
prioritize key tokens during SFT, and incorporates a robust reward mechanism
tailored to error categories, optimized through proximal policy optimization.
We validate TL-Training by training CodeLLaMA-2-7B and evaluating it on four
open-source test sets. Our results demonstrate that the LLM trained by our
method matches or surpasses both open- and closed-source LLMs in tool-use
performance using only 1,217 training data points. Additionally, our method
enhances robustness in noisy environments and improves general task
performance, offering a scalable and efficient paradigm for tool-use training
in LLMs. Code and data are available at
https://github.com/Junjie-Ye/TL-Training.
comment: Accepted by EMNLP 2025
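A minimal sketch of the token-weighting idea (up-weighting key tokens in the SFT loss), assuming per-token weights are already given; how TL-Training actually derives those weights dynamically is not reproduced here.

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, token_weights, ignore_index=-100):
    """Cross-entropy where each target token carries its own weight, so tokens
    deemed important for tool calls dominate the gradient.
    logits: (batch, seq, vocab); labels, token_weights: (batch, seq)."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels,
        ignore_index=ignore_index, reduction="none",
    )                                                   # (batch, seq)
    mask = (labels != ignore_index).float()
    per_token = per_token * token_weights * mask
    return per_token.sum() / mask.sum().clamp(min=1.0)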
♻ ☆ ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
While the biases of language models in production are extensively documented,
the biases of their guardrails have been neglected. This paper studies how
contextual information about the user influences the likelihood of an LLM to
refuse to execute a request. By generating user biographies that offer
ideological and demographic information, we find a number of biases in
guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas
are more likely to trigger a refusal guardrail when requesting censored or
illegal information. Guardrails are also sycophantic, refusing to comply with
requests for a political position the user is likely to disagree with. We find
that certain identity groups and seemingly innocuous information, e.g., sports
fandom, can elicit changes in guardrail sensitivity similar to direct
statements of political ideology. For each demographic category and even for
American football team fandom, we find that ChatGPT appears to infer a likely
political ideology and modify guardrail behavior accordingly.
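A hedged sketch of the measurement behind such findings: send the same request under different persona biographies and compare refusal rates. The refusal markers and `chat` interface are simplified placeholders, not the paper's protocol.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def refusal_rate(chat, persona_bio, request, n=20):
    """chat(system, user) -> str is a placeholder chat-completion call."""
    refusals = 0
    for _ in range(n):
        reply = chat(system=persona_bio, user=request).lower()
        refusals += any(marker in reply for marker in REFUSAL_MARKERS)
    return refusals / n

# Comparing refusal_rate(...) across biographies that differ only in, say,
# age, gender, or stated football fandom surfaces guardrail-sensitivity gaps.
```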
♻ ☆ A Survey on Data Selection for LLM Instruction Tuning
Instruction tuning is a vital step in training large language models (LLMs),
so how to enhance its effectiveness has received increasing attention. Existing
work indicates that the quality of the dataset is more crucial than its
quantity during instruction tuning of LLMs. Therefore, many recent studies
focus on methods for selecting high-quality subsets from instruction datasets,
aiming to reduce training costs and enhance the instruction-following
capabilities of LLMs. This paper presents a comprehensive survey on data
selection for LLM instruction tuning. First, we introduce the widely used
instruction datasets. We then propose a new taxonomy of data selection methods,
provide a detailed introduction to recent advances, and elaborate on the
evaluation strategies and results of these methods. Finally, we emphasize the
open challenges and present new frontiers of this task.
comment: Published in JAIR (Vol. 83, Article 32, 2025)
♻ ☆ An Ontology-Driven Graph RAG for Legal Norms: A Hierarchical, Temporal, and Deterministic Approach
Retrieval-Augmented Generation (RAG) systems in the legal domain face a
critical challenge: standard, flat-text retrieval is blind to the hierarchical,
diachronic, and causal structure of law, leading to anachronistic and
unreliable answers. This paper introduces an ontology-driven Graph RAG
framework designed to overcome these limitations. We ground our knowledge graph
in a formal, LRMoo-inspired model that distinguishes abstract legal Works from
their versioned Expressions. We model temporal states as efficient aggregations
that reuse the versioned expressions (CTVs) of unchanged components, and we
reify legislative events as first-class Action nodes to make causality explicit
and queryable. This structured backbone enables a unified, planner-guided query
strategy that applies explicit policies to deterministically resolve complex
requests for (i) point-in-time retrieval, (ii) hierarchical impact analysis,
and (iii) auditable provenance reconstruction. Through a case study on the
Brazilian Constitution, we demonstrate how this approach provides a verifiable,
temporally-correct substrate for LLMs, enabling higher-order analytical
capabilities while drastically reducing the risk of factual errors. The result
is a practical framework for building more trustworthy and explainable legal AI
systems.
comment: This is a major revision that significantly expands and deepens the
original manuscript. While the core ontological model remains the same, this
version provides a substantially more rigorous and detailed account of how
the framework is applied in practice, particularly within a
Retrieval-Augmented Generation (RAG) context
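To make the Work/Expression distinction and point-in-time retrieval concrete, here is a toy graph sketch using networkx; the node identifiers, edge labels, and dates are illustrative inventions, not the paper's LRMoo-based ontology or the Brazilian Constitution data.

```python
from datetime import date
import networkx as nx

g = nx.DiGraph()
# Abstract Work vs. dated Expressions (versions); labels are illustrative.
g.add_node("work:art-5", kind="Work")
g.add_node("expr:art-5@1988", kind="Expression", valid_from=date(1988, 10, 5))
g.add_node("expr:art-5@2004", kind="Expression", valid_from=date(2004, 12, 31))
g.add_edge("work:art-5", "expr:art-5@1988", rel="realised_by")
g.add_edge("work:art-5", "expr:art-5@2004", rel="realised_by")
# A legislative event reified as a first-class Action node.
g.add_node("action:amendment-45", kind="Action", enacted=date(2004, 12, 30))
g.add_edge("action:amendment-45", "expr:art-5@2004", rel="produces")

def version_as_of(graph, work, when):
    """Point-in-time retrieval: latest Expression of a Work valid at `when`."""
    exprs = [v for _, v in graph.out_edges(work)
             if graph.nodes[v].get("kind") == "Expression"
             and graph.nodes[v]["valid_from"] <= when]
    return max(exprs, key=lambda v: graph.nodes[v]["valid_from"], default=None)

print(version_as_of(g, "work:art-5", date(2000, 1, 1)))  # -> expr:art-5@1988
```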
♻ ☆ Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis
Large Language Models (LLMs), already shown to excel at various unstructured
text comprehension tasks, have also, remarkably, been shown to tackle table
(structured) comprehension tasks without specific training. Building on earlier
studies of LLMs for tabular tasks, we probe how in-context learning (ICL),
model scale, instruction tuning, and domain bias affect Tabular QA (TQA)
robustness by testing LLMs, under diverse augmentations and perturbations,
across diverse domains: Wikipedia-based WTQ, financial TAT-QA, and scientific
SCITAB. Although instruction tuning and larger, newer LLMs deliver stronger,
more robust TQA performance, data contamination and reliability issues,
especially on WTQ, remain unresolved. Through an in-depth attention analysis,
we reveal a strong correlation between perturbation-induced shifts in attention
dispersion and drops in performance, with sensitivity peaking in the model's
middle layers. We highlight the need for improved interpretable methodologies
to develop more reliable LLMs for table comprehension. Based on these findings,
we argue for the development of structure-aware self-attention mechanisms and
domain-adaptive processing techniques to improve the transparency,
generalization, and real-world reliability of LLMs on tabular data.
comment: Accepted TMLR 2025
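A hedged sketch of one way to quantify attention dispersion per layer: the mean entropy of attention distributions from a Hugging Face causal LM. Comparing this measure for an original versus a perturbed table prompt yields the kind of "dispersion shift" signal the abstract correlates with performance drops; the exact dispersion definition in the paper may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def attention_entropy_per_layer(model_name, text):
    """Mean entropy of attention distributions in each layer for one input."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq)
    entropies = []
    for layer_attn in attentions:
        probs = layer_attn.clamp_min(1e-12)      # avoid log(0) on masked positions
        entropies.append(-(probs * probs.log()).sum(-1).mean().item())
    return entropies
```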
♻ ☆ Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models EMNLP 2025
In-context learning (ICL) performance is highly sensitive to prompt design,
yet the impact of class label options (e.g. lexicon or order) in zero-shot
classification remains underexplored. This study proposes LOADS (Label set
Optimization via Activation Distribution kurtosiS), a post-hoc method for
selecting optimal label sets in zero-shot ICL with large language models
(LLMs). LOADS is built upon the observations in our empirical analysis, the
first to systematically examine how label option design (i.e., lexical choice,
order, and elaboration) impacts classification performance. This analysis shows
that the lexical choice of the labels in the prompt (such as agree vs. support
in stance classification) plays an important role in both model performance and
the model's sensitivity to the label order. A further investigation demonstrates
that optimal label words tend to activate fewer outlier neurons in LLMs'
feed-forward networks. LOADS then leverages kurtosis to measure the neuron
activation distribution for label selection, requiring only a single forward
pass without gradient propagation or labelled data. The LOADS-selected label
words consistently demonstrate effectiveness for zero-shot ICL across
classification tasks, datasets, models, and languages, achieving a maximum
performance gain from 0.54 to 0.76 over the conventional approach of using
the original dataset label words.
comment: Accepted by EMNLP 2025
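A hedged sketch of the kurtosis criterion: run a single forward pass with a candidate label word in the prompt, collect feed-forward activations, and compute their kurtosis; LOADS prefers label wordings whose activations show lower kurtosis (fewer outlier neurons). Hooking modules whose name ends in "mlp" assumes a GPT-style layer layout, and the scoring below is a simplification of the method, not its exact procedure.

```python
import torch
from scipy.stats import kurtosis
from transformers import AutoModelForCausalLM, AutoTokenizer

def label_kurtosis(model_name, prompt_with_label):
    """Mean kurtosis of feed-forward activations from one forward pass."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    acts = []
    hooks = [m.register_forward_hook(lambda _m, _i, out: acts.append(out.detach()))
             for name, m in model.named_modules() if name.endswith("mlp")]
    with torch.no_grad():
        model(**tok(prompt_with_label, return_tensors="pt"))
    for h in hooks:
        h.remove()
    return float(sum(kurtosis(a.flatten().numpy()) for a in acts) / len(acts))

# Lower kurtosis suggests fewer outlier neurons; no gradients or labels needed.
```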