Conference on Neural Information Processing Systems
San Diego | Dec 2–7, 2025

College of Computing faculty are contributors to more than a quarter of Georgia Tech’s 85 published papers at NeurIPS 2025 in San Diego. As artificial intelligence applications become more common across large sections of society, NeurIPS is focusing on how AI can become more reliable, useful, and responsible in the real world. Among the College of Computing’s work are seven “spotlight” papers, representing top research contributions based on peer review. These papers focus primarily on applications, datasets & benchmarks, probabilistic methods, reinforcement learning, and social and economic aspects of machine learning.
Georgia Tech’s external partners collaborating with computing researchers include:
- CISPA
- Carnegie Mellon University
- Emory University
- ETH Zurich
- Fortiss GmbH
- Google DeepMind
- Harvard
- IBM Research
- Meta
- NVIDIA Research
- OpenAI
- Penn State
- Toyota Motor Europe
- TUM & Helmholtz AI
- University of Michigan
- University of Oxford
Research contributions in the program highlight both core machine learning research and fast-growing areas like foundation models, robotics, and AI for science. Overall, NeurIPS 2025 emphasizes trustworthy AI, cross-disciplinary collaboration, and building technology that can work safely and fairly in many different settings.

More than 150 Georgia Tech researchers at NeurIPS 2025 are part of the main technical papers program. They are collaborating with 358 external partner authors from across 130+ organizations worldwide.
Tech’s 85 papers represent expertise from across the institute, including business, computing, engineering, and the sciences. More than half are led by Tech researchers as first authors.
Explore details of all the papers below ⬇️. They are organized by primary area and then decision type (oral, spotlight, and poster papers).
RESEARCH PAPERS
AI/ML Datasets & Benchmarks for health and life sciences
(e.g. climate, health, life sciences, physics, social sciences)
Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations
Suhas BN, Andrew Sherrill, Rosa I. Arriaga, Christopher Wiese, Saeed Abdullah
Abstract
The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse 20.8%). Clinical experts validated the dataset’s therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.
RoFt-Mol: Benchmarking Robust Fine-tuning with Molecular Graph Foundation Models
Shikun Liu, Deyu Zou, Nima Shoghi, Victor Fung, Kai Liu, Pan Li
Abstract
In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Molecular graph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severe data scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including both regression and classification tasks. To better understand and improve fine-tuning techniques under these conditions, we classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This extensive evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, RoFt-Mol. This approach combines the strengths of simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types while maintaining the ease of use inherent in post-hoc weight interpolation.
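The "post-hoc weight interpolation" mentioned in the abstract can be pictured with a minimal sketch: blend the pre-trained and fine-tuned weights after training. The function below is a generic illustration of that idea (hypothetical names and usage, not the RoFt-Mol implementation).

```python
# Minimal sketch of post-hoc weight interpolation between a pre-trained and a
# fine-tuned checkpoint (generic illustration, not the RoFt-Mol code).
import torch

def interpolate_weights(pretrained_state, finetuned_state, alpha=0.5):
    """Return (1 - alpha) * pretrained + alpha * finetuned, key by key."""
    return {
        name: (1.0 - alpha) * pretrained_state[name] + alpha * finetuned_state[name]
        for name in finetuned_state
    }

# Hypothetical usage:
# model.load_state_dict(interpolate_weights(torch.load("pretrained.pt"),
#                                           torch.load("finetuned.pt"), alpha=0.7))
```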
Applications
(e.g., vision, language, speech and audio, Creative AI)
AceRAG: Advancing Reasoning-Intensive Retrieval-Augmented Generation via LLM Self-Play
Ran Xu, Yuchen Zhuang, Zihan Dong, Ruiyu Wang, Yue Yu, Joyce Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, Carl Yang
Abstract
Retrieval-augmented generation (RAG) systems often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceRAG, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceRAG couples supervised fine-tuning on a diverse mixture of retrieval, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceRAG outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level reasoning tasks, AceRAG-32B matches the performance of the giant DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceRAG often surpasses existing RAG models with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks.
Language Models can Self-Improve at State-Value Estimation for Better Search
Ethan Mendes, Alan Ritter
Abstract
Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead (STL), a self-supervised method that leverages state-transition dynamics to improve a value model capable of effectively guiding language model-controlled search without any labeled data. We find that moderately sized (8 billion parameters) open-weight value models improved with STL can match the performance of using a gpt-4o value model. Furthermore, we find that specialized value models learned with STL can be deployed with computationally lightweight search algorithms, achieving performance that matches that of more expensive tree search methods, while reducing costs by an order of magnitude.
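As a rough illustration of how a value model can be improved from state-transition dynamics alone, a lookahead-style bootstrap might look like the sketch below. The helper functions are hypothetical placeholders, and this is our simplified reading of the idea, not the authors' STL procedure.

```python
# Hypothetical sketch: build self-supervised value targets by looking one step
# ahead with a transition function, then regress the value model toward them.
def lookahead_targets(states, propose_actions, transition, value_model):
    """propose_actions(s) -> candidate actions; transition(s, a) -> next state;
    value_model(s) -> scalar estimate. No labeled rewards are used."""
    targets = []
    for s in states:
        best_next = max(value_model(transition(s, a)) for a in propose_actions(s))
        targets.append(best_next)  # regression target for value_model(s)
    return targets
```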
Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals
Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Venkatesh, Kevin Feigelis, Klemen Kotar, Khai Loong Aw, Jiajun Wu, Daniel Yamins
Abstract
Estimating motion primitives from video (e.g. optical flow and occlusion) is a critically-important computer vision problem with many downstream applications, including in controllable video generation and robotics. Current solutions are primarily supervised on synthetic data or require tuning of situation-specific heuristics, which inherently limits these models’ capabilities in real-world contexts. A natural solution to transcend these limitations would be to deploy large-scale self-supervised video models, which can be scalably trained on unrestricted real-world video datasets. However, despite recent progress, motion-primitive extraction from large pretrained video models remains relatively underexplored. In this work, we describe Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pretrained video prediction model. Opt-CWM uses “counterfactual probes” to extract motion information from a base video model in a zero-shot fashion. The key problem we solve is optimal probe generation, using a combination of an efficient parameterization of the space of counterfactual probes, together with a novel generic sparse-prediction principle for learning the probe-generation parameters in a self-supervised fashion. Opt-CWM achieves state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Hyungjoo Chae, Seonghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo
Abstract
Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM), called Web-Shepherd, which can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10$\times$ lower cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
ZeroS: Zero Sum Linear Attention for Efficient Transformers
Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang
Abstract
Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
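One way to read the abstract's "zero-sum residual" construction (our paraphrase, with notation introduced here for illustration) is as follows.

```latex
% Standard attention at step t forms a convex combination of values:
\[
  o_t = \sum_{i \le t} w_{t,i}\, v_i, \qquad \sum_{i \le t} w_{t,i} = 1, \quad w_{t,i} \ge 0 .
\]
% Splitting each weight into the uniform term plus a zero-sum residual,
\[
  w_{t,i} = \tfrac{1}{t} + r_{t,i}, \qquad \sum_{i \le t} r_{t,i} = 0,
\]
% the abstract describes dropping the constant $1/t$ term and reweighting the
% residuals $r_{t,i}$, which may be negative, so a single layer can subtract as
% well as add information, and the weights no longer dilute toward uniform as
% the context length $t$ grows.
```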
E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization
Wenpu Li, Bangyan Liao, Yi Zhou, Qi Xu, Pian Wan, Peidong Liu
Abstract
The estimation of optical flow and 6-DoF ego-motion—two fundamental tasks in 3-D vision—has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately an ill-posed challenge, especially in the absence of supervision via ground truth. Existing works mitigate this ill-posedness by either enforcing the smoothness of the flow field via an explicit variational regularizer or leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former notably introduces bias in results and computational overhead, while the latter—which parametrizes the optical flow in terms of the scene depth and the camera motion—often converges to suboptimal local minima. To address these issues, we propose an unsupervised pipeline that jointly optimizes egomotion and flow via implicit spatial-temporal and geometric regularization. First, by modeling the camera’s egomotion as a continuous spline and optical flow as an implicit neural representation, our method inherently embeds spatial-temporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies egomotion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility in general 6-DoF motion scenarios, achieving state-of-the-art performance among unsupervised methods and competitive results even with supervised approaches. Code will be released upon acceptance.
FLAME: Fast Long-context Adaptive Memory for Event-based Vision
Biswadeep Chakraborty, Saibal Mukhopadhyay
Abstract
We propose Fast Long-range Adaptive Memory for Event (FLAME), a novel scalable architecture that combines neuro-inspired feature extraction with robust structured sequence modeling to efficiently process asynchronous and sparse event camera data. As a departure from conventional input encoding methods, FLAME presents the Event Attention Layer, a novel feature extractor that leverages neuromorphic dynamics (Leaky Integrate-and-Fire (LIF)) to directly capture multi-timescale features from event streams. The feature extractor is integrated with a structured state-space model with a novel Event-Aware HiPPO (EA-HiPPO) mechanism that dynamically adapts memory retention based on inter-event intervals to understand relationships across varying temporal scales and event sequences. A Normal Plus Low Rank (NPLR) decomposition reduces the computational complexity of the state update from $\mathcal{O}(N^2)$ to $\mathcal{O}(Nr)$, where $N$ represents the dimension of the core state vector and $r$ is the rank of a low-rank component (with $r \ll N$). FLAME demonstrates state-of-the-art accuracy for event-by-event processing on complex event camera datasets.
Probabilistic Reasoning with LLMs for Privacy Risk Estimation
Jonathan Zheng, Alan Ritter, Sauvik Das, Wei “Coco” Xu
Abstract
Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the $k$-privacy value of a text—the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final $k$-value. Our experiments show that this method successfully estimates the $k$-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high variance predictions are 37.47% less accurate on average.
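For intuition on the k-value being estimated, a stripped-down version with (hypothetically) independent attributes looks like the toy sketch below; BRANCH itself factorizes a joint distribution with a Bayesian network rather than assuming independence.

```python
# Toy illustration of a k-value estimate: expected number of people in a
# population who match all disclosed attributes, assuming independence for
# simplicity (BRANCH uses a Bayesian-network factorization instead).
import math

def estimate_k(population_size, attribute_probabilities):
    """k ≈ N * P(attr_1) * P(attr_2) * ... (independence assumption)."""
    return population_size * math.prod(attribute_probabilities)

# Made-up example: a 500,000-person city, an occupation held by 1% of people,
# and a hobby shared by 2% of people.
print(estimate_k(500_000, [0.01, 0.02]))  # -> 100.0
```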
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
Zhongwei Wan, Alex Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan
Abstract
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle significantly with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful, instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks—including MathVista, MathVision, Mathverse, and MMMU-Pro—using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation
Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, ‘YZ’ Yezhou Yang, Varun Jampani
Abstract
Recent advances in text-to-video (T2V) generation have enabled high-fidelity video synthesis from natural language prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics (SCINE), a structured evaluation framework that formalizes filmmaking principles into four disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analysis, both coarse and fine-grained, reveals that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations that outperforms existing zero-shot baselines. SCINE is the first approach to formalize professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic control and supporting them with structured evaluation pipelines and detailed analyses to guide future research.
Toward Human Deictic Gesture Target Estimation
Xu Cao, Pranav Virupaksha, Sangmin Lee, Bolin Lai, Wenqi Jia, Jintai Chen, James Rehg
Abstract
Humans have a remarkable ability to use co-speech deictic gestures, such as pointing and showing, to enrich verbal communication and support social interaction. These gestures are so fundamental that infants begin to use them even before they acquire spoken language, which highlights their central role in human communication. Understanding the intended targets of another individual’s deictic gestures enables inference of their intentions, comprehension of their current actions, and prediction of upcoming behaviors. Despite its significance, gesture target estimation remains an underexplored task within the computer vision community. In this paper, we introduce GestureTarget, a novel task designed specifically for comprehensive evaluation of social deictic gesture semantic target estimation. To address this task, we propose TransGesture, a set of Transformer-based gesture target prediction models. Given an input image and the spatial location of a person, our models predict the intended target of their gesture within the scene. Critically, our gaze-aware joint cross attention fusion model demonstrates how incorporating gaze-following cues significantly improves gesture target mask prediction IoU by 6% and gesture existence prediction accuracy by 10%. Our results underscore the complexity and importance of integrating gaze cues into deictic gesture intention understanding, advocating for increased research attention to this emerging area. All data and code will be made publicly available upon acceptance.
Ultra-high Resolution Watermarking Framework Resistant to Extreme Cropping and Scaling
Nan Sun, LuYu Yuan, Han Fang, Yuxing Lu, Hefei Ling, Sijing Xie, Chengxin Zhao
Abstract
Recent developments in DNN-based image watermarking techniques have achieved impressive results in protecting digital content. However, most existing methods are constrained to low-resolution images as they need to encode the entire image, leading to prohibitive memory and computational costs when applied to high-resolution images. Moreover, they lack robustness to distortions prevalent in large-image transmission, such as extreme scaling and random cropping. To address these issues, we propose a novel watermarking method based on implicit neural representations (INRs). Leveraging the properties of INRs, our method employs a resolution-independent coordinate sampling mechanism to generate watermarks pixel-wise, achieving ultra-high resolution watermark generation with fixed and limited memory and computational resources. This design ensures strong robustness in watermark extraction, even under extreme cropping and scaling distortions. Additionally, we introduce a hierarchical multi-scale coordinate embedding and a low-rank watermark injection strategy to ensure high-quality watermark generation and robust decoding. Experimental results demonstrate that our method significantly outperforms existing schemes in terms of both robustness and computational efficiency while preserving high image quality. Our approach achieves an accuracy greater than 98% in watermark extraction with only 0.4% of the image area in 2K images. These results highlight the effectiveness of our method, making it a promising solution for large-scale and high-resolution image watermarking applications.
Datasets & Benchmarks for applications in language modeling and vision language modeling
CLIMB: Clustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan (Celine) Lin, Jan Kautz, Pavlo Molchanov
Abstract
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. This strategy enables effective domain adaptation without relying solely on curated data. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture.
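Schematically, the iterative bootstrapping loop described in the abstract (propose cluster mixtures, screen them with a learned predictor, verify the most promising with a small proxy model) could be organized as below. All function names are placeholders introduced for illustration, not the CLIMB API.

```python
# Schematic mixture-search loop (placeholder callables, not the CLIMB code).
def search_mixture(clusters, propose, predict_score, train_and_eval_proxy,
                   n_rounds=3, candidates_per_round=8):
    """propose(clusters, history) -> candidate mixture weights;
    predict_score(mixture, history) -> cheap predicted score;
    train_and_eval_proxy(mixture) -> score from a short proxy-model run."""
    history, best = [], None
    for _ in range(n_rounds):
        proposals = [propose(clusters, history) for _ in range(candidates_per_round)]
        # Cheap screening first, then expensive proxy training for the top few.
        ranked = sorted(proposals, key=lambda m: predict_score(m, history), reverse=True)
        for mixture in ranked[: max(1, candidates_per_round // 4)]:
            score = train_and_eval_proxy(mixture)
            history.append((mixture, score))
            if best is None or score > best[1]:
                best = (mixture, score)
    return best  # (mixture_weights, proxy_score)
```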
SWE-smith: Scaling Data for Software Engineering Agents
John Yang, Kilian Lieret, Carlos Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang
Abstract
Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, ChangHao Li, Ian Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai
Abstract
We introduce MLE-Dojo, a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo’s flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.
Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Yao Xiuqi, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Abstract
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Div Garg, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani
Abstract
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.
Deep Learning
(e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective
Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu
Abstract
The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse—a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.
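As a toy illustration of diversity-preserving selection (a crude stand-in for an entropy criterion, not the paper's method), one could greedily keep synthetic samples whose cluster is currently least represented, which keeps the cluster histogram, and hence its entropy, as high as possible under a fixed budget.

```python
# Toy diversity-preserving selection of synthetic samples: greedily keep the
# candidate whose cluster is least represented so far (illustrative only).
from collections import Counter

def select_diverse(samples, cluster_of, budget):
    """samples: list of items; cluster_of(item) -> cluster id; budget: int."""
    kept, counts, pool = [], Counter(), list(samples)
    while pool and len(kept) < budget:
        x = min(pool, key=lambda s: counts[cluster_of(s)])  # rarest cluster first
        pool.remove(x)
        kept.append(x)
        counts[cluster_of(x)] += 1
    return kept
```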
AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, XinQiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
Abstract
While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation—a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the “plug-in” direction of a USB or the “handle” direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrate the effectiveness and generalization of SoFar, e.g., a zero-shot success rate of 48.7% on OpenDOR and a 58.3% success rate in the SIMPLER WidowX setting.
Variational Learning Finds Flatter Solutions at the Edge of Stability
Avrajit Ghosh, Bai Cong, Rio Yokota, Saiprasad Ravishankar, Rongrong Wang, Molei Tao, Mohammad Emtiyaz Khan, Thomas Möllenhoff
Abstract
The performance of Variational Learning (VL) for deep neural networks has consistently been improving over the years and is now on par with the standard optimizers. Part of its empirical success can be explained by theories such as PAC-Bayes bounds, minimum description length and marginal likelihood, but there are few tools to unravel the implicit regularization in play. Here, we use the Edge of Stability (EoS) to understand the implicit regularization of VL. EoS has previously been used to show that gradient descent can find flat solutions and we extend this result to VL to show that it can find even flatter solutions. This is obtained by simply controlling the posterior covariance and the number of Monte Carlo samples from the posterior. These results are derived in a similar fashion as the standard EoS literature for deep learning by first deriving a result for a quadratic problem and then extending it to general loss functions. We empirically validate these findings on a wide variety of large networks, such as ResNet and ViT, to find that the theoretical results closely match the empirical ones. Ours is the first work to use EoS for VL and show its effectiveness for deep learning.
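For background, the edge-of-stability phenomenon for plain gradient descent (the known result the abstract extends) says that training tends to drive the loss sharpness up to roughly the stability threshold set by the step size:

```latex
\[
  \lambda_{\max}\!\big(\nabla^2 L(\theta_t)\big) \;\approx\; \frac{2}{\eta},
\]
% where $\eta$ is the learning rate and $\lambda_{\max}$ is the largest Hessian
% eigenvalue (the sharpness); a lower equilibrium sharpness means a flatter
% solution. The paper argues variational learning equilibrates at an even lower
% sharpness, controlled by the posterior covariance and the number of Monte
% Carlo samples; the precise thresholds are derived in the paper, not here.
```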
Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs
Hao Kang, Qingru Zhang, Han Cai, Weiyuan Xu, Tushar Krishna, Yilun Du, Tsachy Weissman
Abstract
Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency–quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that the optimal latency–quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading. These results underscore the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents.
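A minimal sketch of latency-aware configuration selection, the kind of decision rule such an adaptive framework has to make, is shown below; the configurations, numbers, and rule are illustrative placeholders, not FPX itself.

```python
# Hypothetical latency-aware selection: pick the highest-quality configuration
# whose measured latency fits the per-decision deadline (illustrative numbers).
CONFIGS = [
    # (name, estimated_quality, measured_latency_seconds)
    ("32b-fp16", 0.90, 2.40),
    ("8b-fp16", 0.80, 0.70),
    ("8b-int4", 0.75, 0.35),
    ("1.5b-int4", 0.60, 0.12),
]

def pick_config(deadline_s, configs=CONFIGS):
    feasible = [c for c in configs if c[2] <= deadline_s]
    if not feasible:                               # nothing fits the budget:
        return min(configs, key=lambda c: c[2])    # fall back to the fastest
    return max(feasible, key=lambda c: c[1])       # best quality within budget

print(pick_config(0.5))  # -> ('8b-int4', 0.75, 0.35)
```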
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang, Niao He, Hanjun Dai, Bo Dai
Abstract
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.
Ask a Strong LLM Judge when Your Reward Model is Uncertain
Zhenghao Xu, Qin Lu, Qingru Zhang, Liang Qiu, Ilgee Hong, Changlong Yu, Wenlin Yao, Yao Liu, Haoming Jiang, Lihong Li, Hyokun Yun, Tuo Zhao
Abstract
Reward models (RMs) play a pivotal role in reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
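In spirit, the routing rule is simple: treat the reward-model score gap as a preference probability and escalate to the LLM judge only when that probability is close to a coin flip. The sketch below uses a Bradley-Terry style link and a threshold picked arbitrarily; it is an illustration, not the paper's estimator.

```python
# Illustrative uncertainty-based routing between a fast reward model (RM) and
# a strong LLM judge (threshold and helper names are placeholders).
import math

def preference_prob(reward_a, reward_b):
    """P(response a preferred over b) under a Bradley-Terry / logistic model."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def judge_pair(reward_a, reward_b, llm_judge, uncertainty_band=0.1):
    p = preference_prob(reward_a, reward_b)
    if abs(p - 0.5) < uncertainty_band:   # RM is uncertain: call the LLM judge
        return llm_judge()
    return p > 0.5                        # RM is confident: use it directly
```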
Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models
Sophia Simeng Han, Stephen Xia, Grant Zhang, Howard Dai, Chen Liu, Lichang Chen, Hoang H Nguyen, Hongyuan Mei, Jiayuan Mao, R. Thomas McCoy
Abstract
Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) self-correcting solutions based on gold solutions; (3) producing step-by-step sketches of solutions; and (4) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.
Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang
Abstract
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the hidden representations of an MLLM’s LLM. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next (text) token prediction. Moreover, through extensive probing, we observe improved visual representation quality due to embedding optimization, underscoring the effectiveness of our probing setup. We demonstrate that our VisPer-LM outperforms the single and multi-encoder baselines, proving our approach’s superiority over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms
Yinuo Ren, Haoxuan Chen, Yuchen Zhu, Wei Guo, Yongxin Chen, Grant Rotskoff, Molei Tao, Lexing Ying
Abstract
Discrete diffusion models have emerged as a powerful generative modeling framework for discrete data with successful applications spanning from text generation to image synthesis. However, their deployment faces challenges due to the high dimensionality of the state space, necessitating the development of efficient inference algorithms. Current inference approaches mainly fall into two categories: exact simulation and approximate methods such as $\tau$-leaping. While exact methods suffer from unpredictable inference time and redundant function evaluations, $\tau$-leaping is limited by its first-order accuracy. In this work, we advance the latter category by tailoring the first extension of high-order numerical inference schemes to discrete diffusion models, enabling larger step sizes while reducing error. We rigorously analyze the proposed schemes and establish the second-order accuracy of the $\theta$-trapezoidal method in KL divergence. Empirical evaluations on GPT-2 level text and ImageNet-level image generation tasks demonstrate that our method achieves superior sample quality compared to existing approaches under equivalent computational constraints.
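For reference, the first-order baseline the abstract mentions, τ-leaping, freezes the jump rates of the reverse-time Markov chain over each step. Written as an Euler-style approximation of the transition kernel (our simplified notation, not the paper's exact scheme), it reads:

```latex
% First-order ($\tau$-leaping / Euler) step for a continuous-time Markov chain
% with rate matrix $Q_t$, in simplified notation:
\[
  p(x_{t+\tau} = y \mid x_t) \;\approx\; \delta_{x_t}(y) + \tau\, Q_t(y \mid x_t),
\]
% which is accurate to first order in $\tau$. Higher-order schemes such as the
% $\theta$-trapezoidal rule analyzed in the paper combine rate evaluations at
% both ends of the step to reach second-order accuracy.
```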
Fast-SLM: Towards Latency-Optimal Hybrid Small Language Models
Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van Keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan (Celine) Lin, Pavlo Molchanov
Abstract
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work in SLM design mainly focuses on reducing the number of parameters to deliver parameter-optimal SLMs, parameter efficiency may not necessarily translate into proportional real-device speed-ups. This work aims to provide a systematic exploration and roadmap for latency-optimal SLMs. Our goal is to identify critical determinants of SLMs’ real-device latency and provide generalizable principles and methodologies for SLM design and training when real-device latency becomes the primary consideration. Specifically, we first analyze two central architecture design factors: depth-width ratios and the involved operators. We find that although deep-thin models generally lead to better accuracy under the same parameter budget, they may not lie on the frontier of the accuracy-latency trade-off. To identify the latency-optimal depth-width ratio, we augment previous scaling laws by relating model loss to both model depth and width, thus enabling determination of the sweet spot depth-width ratio when combined with device-specific profiling. Additionally, we explore emerging efficient attention alternatives to understand their potential as candidate building operators. Using the identified promising operators, we build an evolutionary search framework to automatically pinpoint optimal latency combinations of these operators into hybrid SLMs to push the accuracy-latency frontier. In addition to architectural improvements, we further analyze and enhance SLM training by enabling more effective weight updates and improving cache initialization, which are generalizable add-on components for future SLMs. Combining these contributions, we introduce a new family of hybrid SLMs, called Fast-SLM, which significantly advances the accuracy–latency trade-off frontier of state-of-the-art SLMs, e.g., achieving 2.57% higher accuracy, 1.46$\times$ speed-up, and over 10$\times$ cache reduction compared to Llama3.2-3B.
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie
Abstract
This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images (< 256px), FlexVAR can: (1) Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images. (2) Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion. (3) Adapt to various autoregressive steps, allowing for faster inference with fewer steps or enhancing image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256 × 256 benchmark. Moreover, when zero-shot transferring the image generation process to 13 steps, the performance further improves to 2.08 FID, outperforming state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512 × 512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, which is a fully supervised model trained at 512 × 512 resolution.
Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models
Haoyu Wang, Peihao Wang, Mufei Li, Shikun Liu, Siqi Miao, Zhangyang “Atlas” Wang, Pan Li
Abstract
Modern large language models (LLMs) are inherently auto-regressive, requiring input to be serialized into flat sequences regardless of their structural dependencies. This serialization hinders the model’s ability to leverage structural inductive biases, especially in tasks such as retrieval-augmented generation (RAG) and reasoning on data with native graph structures, where inter-segment dependencies are crucial. We introduce Graph-KV with the potential to overcome this limitation. Graph-KV leverages the KV-cache of text segments as condensed representations and governs their interaction through structural inductive biases. In this framework, “target” segments selectively attend only to the KV-caches of their designated “source” segments, rather than all preceding segments in a serialized sequence. This approach induces a graph-structured block mask, sparsifying attention and enabling a message-passing-like step within the LLM. Furthermore, strategically allocated positional encodings for source and target segments reduce positional bias and context window consumption. We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) Arxiv-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings.
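To picture the structural bias, here is a toy construction of the graph-structured block attention mask the abstract alludes to: each target segment may read only the KV-cache blocks of its designated source segments, plus its own. The helper below is illustrative, not the Graph-KV code.

```python
# Toy graph-structured block attention mask: True means "may attend".
import torch

def block_mask(segment_lengths, edges):
    """segment_lengths: tokens per segment; edges: (source_seg, target_seg) pairs."""
    total = sum(segment_lengths)
    starts = [sum(segment_lengths[:i]) for i in range(len(segment_lengths))]
    span = lambda i: slice(starts[i], starts[i] + segment_lengths[i])
    allowed = torch.zeros(total, total, dtype=torch.bool)
    for i in range(len(segment_lengths)):   # every segment attends within itself
        allowed[span(i), span(i)] = True
    for src, tgt in edges:                  # targets attend to their sources only
        allowed[span(tgt), span(src)] = True
    return allowed

# Example: segment 2 conditions on segments 0 and 1, which do not see each other.
print(block_mask([2, 2, 3], edges=[(0, 2), (1, 2)]).int())
```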
Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs
ChangHao Li, Yuchen Zhuang, Rushi Qiang, Haotian Sun, Hanjun Dai, Chao Zhang, Bo Dai
Abstract
Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation, which requires additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshka Pilot (M-Pilot), a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with M-Pilot serving as a policy to provide intermediate guidance through prompts for driving the black-box LLM. M-Pilot is trained to pivot the outputs of the black-box LLM toward alignment with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on diverse tasks demonstrate that our method effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks.
Model Provenance Testing for Large Language Models
Ivica Nikolic, Teodora Baluta, Prateek Saxena
Abstract
Large language models are increasingly customized through fine-tuning and other adaptations, creating challenges in enforcing licensing terms and managing downstream impacts such as protecting intellectual property or identifying vulnerabilities. We address this challenge by developing a framework for testing model provenance. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs that can be detected through statistical analysis. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models. On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90-95% precision and 80-90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.
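A bare-bones version of the idea, comparing a candidate/parent pair's output agreement against a baseline of unrelated pairs and reporting an empirical p-value, could look like the sketch below (our simplification of the statistical test, not the authors' implementation).

```python
# Sketch of black-box provenance testing via output agreement (simplified).
def agreement(model_a, model_b, prompts):
    """Fraction of prompts on which two (callable) models give the same output."""
    return sum(model_a(p) == model_b(p) for p in prompts) / len(prompts)

def provenance_pvalue(candidate, claimed_parent, unrelated_pairs, prompts):
    observed = agreement(candidate, claimed_parent, prompts)
    baseline = [agreement(a, b, prompts) for a, b in unrelated_pairs]
    # Empirical p-value: how often unrelated models look at least this similar.
    exceed = sum(b >= observed for b in baseline)
    return (1 + exceed) / (1 + len(baseline))
```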
Momentum Multi-Marginal Schrödinger Bridge Matching
Panagiotis Theodoropoulos, Augustinos Saravanos, Evangelos Theodorou, Guan-Horng Liu
Abstract
Understanding complex systems by inferring trajectories from sparse sample snapshots is a fundamental challenge in a wide range of domains, e.g., single-cell biology, meteorology, and economics. Despite advancements in Bridge and Flow matching frameworks, current methodologies rely on pairwise interpolation between adjacent snapshots. This hinders their ability to capture long-range temporal dependencies and potentially affects the coherence of the inferred trajectories. To address these issues, we introduce Momentum Multi-Marginal Schrödinger Bridge Matching (3MSBM), a novel matching framework that learns smooth measure-valued splines for stochastic systems that satisfy multiple positional constraints. This is achieved by lifting the dynamics to phase space and generalizing stochastic bridges to be conditioned on several points, forming a multi-marginal conditional stochastic optimal control problem. The underlying dynamics are then learned by minimizing a variational objective, having fixed the path induced by the multi-marginal conditional bridge. As a matching approach, 3MSBM learns transport maps that preserve intermediate marginals throughout training, significantly improving convergence and scalability. Extensive experimentation in a series of real-world applications validates the superior performance of 3MSBM compared to existing methods in capturing complex dynamics with temporal dependencies, opening new avenues for training matching frameworks in multi-marginal settings.
PDPO: Parametric Density Path Optimization
Sebastian Gutierrez Hernandez, Peng Chen, Hao-Min Zhou
Abstract
We introduce Parametric Density Path Optimization (PDPO), a novel method for computing action-minimizing paths between probability densities. The core idea is to represent the target probability path as the pushforward of a reference density through a parametric map, transforming the original infinite-dimensional optimization over densities to a finite-dimensional one over the parameters of the map. We derive a static formulation of the dynamic problem of action minimization and propose cubic spline interpolation of the path in parameter space to solve the static problem. Theoretically, we establish an error bound of the action under proper assumptions on the regularity of the parameter path. Empirically, we find that using 3–5 control points of the spline interpolation suffices to accurately resolve both multimodal and high-dimensional problems. We demonstrate that PDPO can flexibly accommodate a wide range of potential terms, including those modeling obstacles, mean-field interactions, stochastic control, and higher-order dynamics. Our method outperforms existing state-of-the-art approaches in benchmark tasks, demonstrating superior computational efficiency and solution quality.
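The parametric-pushforward-plus-spline idea can be sketched in a few lines. The sketch below assumes a Gaussian reference density and a simple affine map whose parameters follow a cubic spline through three control points; the control points, map family, and finite-difference action estimate are illustrative and not the paper's formulation.

```python
# Minimal sketch of the PDPO idea under simplifying assumptions: the density path is
# the pushforward of a fixed Gaussian reference through an affine map T(z) = mu + sigma*z,
# and the parameter path theta(t) is a cubic spline through a few control points.
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)
z = rng.standard_normal((2000, 2))              # reference samples, fixed along the path

# Hypothetical control points for theta = (mu_x, mu_y, log_sigma) at t = 0, 0.5, 1.
t_ctrl = np.array([0.0, 0.5, 1.0])
theta_ctrl = np.array([[-2.0, 0.0, 0.0],
                       [0.0, 1.5, -0.5],
                       [2.0, 0.0, 0.0]])
theta = CubicSpline(t_ctrl, theta_ctrl, axis=0)  # finite-dimensional path in parameter space

def pushforward(t):
    params = theta(t)
    return params[:2] + np.exp(params[2]) * z    # samples from the density at time t

# Crude kinetic-action estimate: average squared particle velocity along the path,
# approximated with finite differences over a time grid.
ts = np.linspace(0.0, 1.0, 51)
xs = np.stack([pushforward(t) for t in ts])      # (time, particles, dim)
vel = np.diff(xs, axis=0) / np.diff(ts)[:, None, None]
action = 0.5 * np.mean(np.sum(vel ** 2, axis=-1))
print(f"estimated kinetic action: {action:.3f}")
```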
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Du, Yelong Shen
Abstract
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the “grokking” phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B’s performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR.
Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning
Feng Chen, Allan Raventós, Nan Cheng, Surya Ganguli, Shaul Druckmann
Abstract
Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in N independent samples. We show, surprisingly, that training with cross-entropy (CE) can be misaligned with pass@N in that pass@N accuracy decreases with longer CE training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions both with and without Chain-of-Thought reasoning traces; and (2) proving theorems by searching over proof trees of varying shapes. Overall our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.
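For readers unfamiliar with the test-time strategy being optimized, the standard unbiased pass@N estimator (computed from n samples of which c are correct) is shown below; the paper's modified, confidence-limiting training loss is not reproduced here.

```python
# The pass@N metric the paper targets: the probability that at least one of N
# independent samples is correct, estimated without bias as 1 - C(n - c, N) / C(n, N)
# from n samples with c correct. The confidence-limiting loss itself is not shown.
from math import comb

def pass_at_n(n: int, c: int, N: int) -> float:
    """Unbiased estimate of pass@N from n samples with c correct (requires n >= N)."""
    if n - c < N:
        return 1.0
    return 1.0 - comb(n - c, N) / comb(n, N)

# With the same average accuracy, concentrating probability mass on a few answers
# (overconfidence) gains far less from large N than spreading successes out --
# the misalignment between cross-entropy training and pass@N that the paper studies.
print(pass_at_n(n=100, c=10, N=16))   # ~0.84 when 10% of samples are correct
print(pass_at_n(n=100, c=1, N=16))    # 0.16 when only 1% are correct
```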
Taming generative world models for zero-shot optical flow extraction
Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel Yamins
Abstract
Extracting dense motion (optical flow) from videos remains a core computer-vision problem. Motivated by the recent success of large general-purpose models, we ask whether frozen self-supervised video world models trained only to predict future frames can be prompted, without fine-tuning, to output flow. Prior attempts to read out depth or illumination from video generators required fine-tuning; that strategy is ill-suited for flow, where labeled data are scarce and synthetic datasets suffer from a sim-to-real gap. We study several popular generative model architectures and find that successful zero-shot flow extraction requires three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These criteria are met by the recently introduced Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a procedure for injecting a small, local perturbation into the first frame, rolling out the model one step, and computing the Kullback–Leibler divergence between perturbed and unperturbed predictive distributions. The KL peak traces the displacement field, yielding optical flow in a single forward pass. Our method outperforms state-of-the-art models on real-world TAP-Vid DAVIS dataset (16.6% relative improvement for endpoint error) and synthetic TAP-Vid Kubric (4.7% relative improvement), despite being trained on real-world videos. Our results indicate that prompting controllable, self-supervised world models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality optical flow.
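The KL-tracing readout itself is simple enough to sketch. In the code below the frozen world model is replaced by a stub that returns per-patch categorical predictive distributions, so only the perturb-rollout-compare logic is illustrated; the patch grid, perturbation, and toy model are assumptions, not the authors' implementation.

```python
# Sketch of the KL-tracing readout. `predict_dist` stands in for a frozen generative
# world model that returns a per-patch categorical predictive distribution over
# next-frame patch codes; here it is a stub, so only the readout logic is shown.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def kl_trace(predict_dist, frame0, perturb_at):
    """Perturb one location in frame0, roll the model one step for both the clean and
    perturbed inputs, and return the location where the predictive distributions
    diverge most -- the traced endpoint of the flow vector."""
    clean = predict_dist(frame0)                          # (H, W, vocab)
    perturbed_frame = frame0.copy()
    perturbed_frame[perturb_at] += 1.0                    # small local perturbation
    perturbed = predict_dist(perturbed_frame)
    kl_map = kl(perturbed, clean)                         # (H, W)
    return np.unravel_index(np.argmax(kl_map), kl_map.shape)

# Toy stand-in for the model: shifts content one patch to the right and returns
# softmax predictions peaked on the (quantized) shifted content.
def toy_model(frame, vocab=8):
    shifted = np.roll(frame, shift=1, axis=1)
    logits = -0.5 * (shifted[..., None] - np.arange(vocab)) ** 2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

frame = np.zeros((8, 8))
print(kl_trace(toy_model, frame, perturb_at=(3, 3)))      # -> (3, 4): one step right
```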
Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models
Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
Abstract
Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. Rather than producing structured, externally provided rationales, Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities such as self-reflection, hypothetical reasoning, and divergent reasoning. To elicit these reasoning abilities, we first warm up the models by supervised fine-tuning (SFT) over long CoT data. We then further improve the model’s long-horizon abilities by rule-based reinforcement learning (RL). In addition, we propose a novel pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards, eliminating the need for pointwise reward conversion and enabling more effective use of Think-RM outputs. Experiments show that Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches. This depth-oriented approach not only broadens the GenRM design space but also establishes a new paradigm for preference-based policy optimization in RLHF.
Weaver: Shrinking the Generation-Verification Gap by Scaling Compute for Verification
Jon Saad-Falcon, Estefany Kelly Buchanan, Mayee Chen, Tzu-Heng (Brian) Huang, Brendan McLaughlin, Tanvir Bhathal, Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré
Abstract
Verifiers can improve language model (LM) capabilities by providing feedback or selecting the best response from a pool of generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean for formal proofs). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers. To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. First we find that weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in the verifiers. To reduce the dependency on labeled data, Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses several challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these challenges by using dataset statistics to normalize outputs and filter specific verifiers. We study the effectiveness of Weaver in repeated sampling settings, where a model generates multiple candidate responses at test time and a verifier is used to select the correct one. Our evaluations demonstrate that Weaver significantly improves the pass@1 performance across several reasoning and math tasks, achieving o3-mini level accuracy with Llama 3.3 70B Instruct (a much cheaper non-reasoning model) as the generator, and an ensemble of smaller judge and reward models as the verifiers (86.2% average). This gain mirrors the jump achieved between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training interventions. To make Weaver more efficient, we train a compact 400M cross-encoder using Weaver’s combined output scores. This distilled model retains 98.7% of Weaver’s full accuracy while reducing verification compute by up to 99.97%.
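A toy version of the verifier-fusion step conveys the flavor. Weaver estimates per-verifier accuracies with weak supervision and label-free dataset statistics; the sketch below substitutes a crude stand-in (agreement with the majority vote) and fuses binary verdicts with naive-Bayes weighted log-odds, so every number and heuristic here is an assumption rather than the paper's method.

```python
# Illustrative sketch of combining weak verifiers into one score. Weaver estimates
# verifier accuracies with weak supervision; as a crude stand-in we use agreement with
# the majority vote, then fuse verdicts with weighted log-odds. Output normalization
# and verifier filtering from the paper are omitted.
import numpy as np

def estimate_accuracies(votes):
    """votes: (num_candidates, num_verifiers) binary verdicts."""
    majority = (votes.mean(axis=1) > 0.5).astype(float)
    acc = (votes == majority[:, None]).mean(axis=0)       # agreement with the majority
    return np.clip(acc, 0.55, 0.99)                        # keep weights finite

def weaver_like_score(votes, acc):
    """Weighted log-odds fusion: more accurate verifiers get larger weights."""
    w = np.log(acc / (1.0 - acc))
    return votes @ w - (1.0 - votes) @ w                   # one score per candidate

# Toy usage: pick the best of 5 candidate responses judged by 4 verifiers.
votes = np.array([[1, 1, 0, 1],
                  [0, 0, 0, 1],
                  [1, 0, 1, 1],
                  [0, 1, 0, 0],
                  [1, 1, 1, 1]], dtype=float)
acc = estimate_accuracies(votes)
print("selected candidate:", int(np.argmax(weaver_like_score(votes, acc))))
```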
Evaluation
(e.g., methodology, meta studies, replicability and validity, human-in-the-loop)
Grids Often Outperform Implicit Neural Representations
Namhoon Kim, Sara Fridovich-Keil
Abstract
Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior are poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for most tasks and signals, a simple regularized grid with interpolation trains faster and to higher quality than any INR with the same number of parameters. We also find limited settings, namely fitting binary signals such as shape contours, where INRs outperform grids, to guide future use of INRs towards the most advantageous applications.
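The "simple regularized grid with interpolation" baseline is easy to write down. The sketch below fits a learnable 2D grid with bilinear interpolation and a total-variation regularizer to samples of a toy signal; the grid size, learning rate, and regularization weight are illustrative guesses, not the paper's settings.

```python
# Minimal sketch of a regularized-grid signal representation: a learnable H x W grid,
# bilinearly interpolated at query coordinates, fit to samples with a total-variation
# regularizer. Sizes and weights below are illustrative, not the paper's settings.
import torch
import torch.nn.functional as F

class BilinearGrid(torch.nn.Module):
    def __init__(self, h=64, w=64):
        super().__init__()
        self.grid = torch.nn.Parameter(torch.zeros(1, 1, h, w))

    def forward(self, coords):                    # coords in [-1, 1], shape (N, 2)
        g = coords.view(1, -1, 1, 2)              # (1, N, 1, 2) sampling grid
        out = F.grid_sample(self.grid, g, mode="bilinear", align_corners=True)
        return out.view(-1)                       # (N,) interpolated values

    def tv(self):                                 # total-variation regularizer
        dx = (self.grid[..., :, 1:] - self.grid[..., :, :-1]).abs().mean()
        dy = (self.grid[..., 1:, :] - self.grid[..., :-1, :]).abs().mean()
        return dx + dy

torch.manual_seed(0)
coords = torch.rand(4096, 2) * 2 - 1              # random query points
target = torch.sin(3 * coords[:, 0]) * torch.cos(3 * coords[:, 1])

model = BilinearGrid()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = F.mse_loss(model(coords), target) + 1e-4 * model.tv()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```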
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally
Agam Shah, Siddhant Sukhani, Huzaifa Pardawala, Saketh Budideti, Riya Bhadani, Rudra Gopal, Siddhartha Somani, Michael Galarnyk, Soungmin Lee, Arnav Hiray, Akshar Ravichandran, Eric Kim, Pranav Aluru, Joshua Zhang, Sebastian Jaskowski, Veer Guda, Meghaj Tarte, Liqin Ye, Spencer Gosden, Rutwik Routu, Rachel Yuh, Sloka Chava, Sahasra Chava, Dylan Patrick Kelly, Aiden Chiang, Harsit Mittal, Sudheer Chava
Abstract
Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank’s data, confirming the principle “the whole is greater than the sum of its parts.” Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework’s economic utility. Our artifacts are accessible through HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.
General Machine Learning
(supervised, unsupervised, online, active, etc.)
Dense Associative Memory with Epanechnikov Energy
Benjamin Hoover, Zhaoyang Shi, Krishnakumar Balasubramanian, Dmitry Krotov, Parikshit Ram
Abstract
We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Uniquely, it introduces abundant additional emergent local minima while preserving perfect pattern recovery — a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR’s emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method’s potential for both large-scale memory storage and generative tasks.
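One plausible instantiation of the two energies, consistent with the abstract but not necessarily the paper's exact parameterization, contrasts the classic log-sum-exp DenseAM energy with a log-sum-ReLU energy built from Epanechnikov (truncated quadratic) kernels; the bandwidth and similarity choices below are assumptions.

```python
# One plausible instantiation consistent with the abstract (the paper's exact
# parameterization may differ): log-sum-exp energy over pattern similarities versus a
# log-sum-ReLU energy built from Epanechnikov (truncated quadratic) kernels.
import numpy as np

def lse_energy(x, patterns, beta=4.0):
    sims = patterns @ x                                   # similarity to each stored memory
    m = sims.max()
    return -(m + np.log(np.sum(np.exp(beta * (sims - m)))) / beta)

def lsr_energy(x, patterns, bandwidth=1.0, eps=1e-12):
    d2 = np.sum((patterns - x) ** 2, axis=1)              # squared distances to memories
    kernels = np.maximum(0.0, 1.0 - d2 / bandwidth ** 2)  # Epanechnikov / ReLU kernel
    return -np.log(np.sum(kernels) + eps)

rng = np.random.default_rng(0)
patterns = rng.standard_normal((10, 16))                  # stored memories
query = patterns[0] + 0.05 * rng.standard_normal(16)      # noisy probe near memory 0
print(lse_energy(query, patterns), lsr_energy(query, patterns))
```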
Machine learning for sciences
(e.g. climate, health, life sciences, physics, social sciences)
3D Interaction Geometric Pre-training for Molecular Relational Learning
Namkyeong Lee, Yunhak Oh, Heewoong Noh, Gyoung S. Na, Minkai Xu, Hanchen Wang, Tianfan Fu, Chanyoung Park
Abstract
Molecular Relational Learning (MRL) is a rapidly growing field that focuses on understanding the interaction dynamics between molecules, which is crucial for applications ranging from catalyst engineering to drug discovery. Despite recent progress, earlier MRL approaches are limited to using only the 2D topological structure of molecules, as obtaining the 3D interaction geometry remains prohibitively expensive. This paper introduces a novel 3D geometric pre-training strategy for MRL (3DMRL) that incorporates a 3D virtual interaction environment, overcoming the limitations of costly traditional quantum mechanical calculation methods. With the constructed 3D virtual interaction environment, 3DMRL trains a 2D MRL model to learn the global and local 3D geometric information of molecular interaction. Extensive experiments on various tasks using real-world datasets, including out-of-distribution and extrapolation scenarios, demonstrate the effectiveness of 3DMRL, showing up to a 24.93% improvement in performance across 40 tasks. Our code is publicly available at https://anonymous.4open.science/r/3DMRL-5436.
KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment
Yuxing Lu, Wei Wu, Xukai Zhao, Rui Peng, Jinzhuo Wang
Abstract
Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution, which iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1% LLM-verified correctness and reducing conflict edges by 18.6% through multi-layer assessments.
SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding
Thomas Walton, Darin Tsui, Aryan Musharaf, Amirali Aghazadeh
Abstract
Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high-throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k-mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k-mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24–32% speedups over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.
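The k-mer guidance component is the part that can be sketched compactly: count k-mers from a multiple sequence alignment and rank speculative draft continuations by how consistent their k-mers are with those counts. The smoothing and the toy sequences below are illustrative assumptions; the draft and target protein language models themselves are out of scope here.

```python
# Sketch of the k-mer guidance component only: build k-mer counts from an MSA and
# rank speculative draft continuations by summed (smoothed) log-frequency of their
# k-mers. The draft/target language models are not shown.
from collections import Counter
import math

def kmer_counts(sequences, k=3):
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts

def kmer_log_score(candidate, counts, k=3, alpha=1.0):
    """Sum of add-alpha smoothed log frequencies of the candidate's k-mers."""
    total = sum(counts.values())
    vocab = max(len(counts), 1)
    score = 0.0
    for i in range(len(candidate) - k + 1):
        c = counts.get(candidate[i:i + k], 0)
        score += math.log((c + alpha) / (total + alpha * vocab))
    return score

# Toy MSA-like sequences and two hypothetical draft continuations.
msa = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQR"]
counts = kmer_counts(msa)
drafts = ["MKTAYIAK", "MQQWXYZA"]
print("accepted draft:", max(drafts, key=lambda d: kmer_log_score(d, counts)))
```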
KINDLE: Knowledge-Guided Distillation for Prior-Free Gene Regulatory Network Inference
Rui Peng, Yuchen Lu, Qichen Sun, Yuxing Lu, Chi Zhang, Ziru Liu, Jinzhuo Wang
Abstract
Gene regulatory network (GRN) inference serves as a cornerstone for deciphering cellular decision-making processes. Early approaches rely exclusively on gene expression data, so their predictive power remains fundamentally constrained by the vast combinatorial space of potential gene-gene interactions. Subsequent methods integrate prior knowledge to mitigate this challenge by restricting the solution space to biologically plausible interactions. However, we argue that the effectiveness of these approaches is contingent upon the precision of prior information and that the reduction in the search space will circumscribe the models’ potential for novel biological discoveries. To address these limitations, we introduce KINDLE, a three-stage framework that decouples GRN inference from prior knowledge dependencies. KINDLE trains a teacher model that integrates prior knowledge with temporal gene expression dynamics and subsequently distills this encoded knowledge to a student model, enabling accurate GRN inference solely from expression data without access to any prior. KINDLE achieves state-of-the-art performance across four benchmark datasets. Notably, it successfully identifies key transcription factors governing mouse embryonic development and precisely characterizes their functional roles. In mouse hematopoietic stem cell data, KINDLE accurately predicts fate transition outcomes following knockout of two critical regulators (Gata1 and Spi1). These biological validations demonstrate our framework’s dual capability in maintaining topological inference precision while preserving discovery potential for novel biological mechanisms.
Towards Doctor-Like Reasoning: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients
Yuxing Lu, Gecheng Fu, Wei Wu, Xukai Zhao, Sin Yee Goi, Jinzhuo Wang
Abstract
Existing medical RAG systems mainly leverage knowledge from medical knowledge bases, neglecting the crucial role of experiential knowledge derived from similar patient cases – a key component of human clinical reasoning. To bridge this gap, we propose DoctorRAG, a RAG framework that emulates doctor-like reasoning by integrating both explicit clinical knowledge and implicit case-based experience. DoctorRAG enhances retrieval precision by first allocating conceptual tags for queries and knowledge sources, together with a hybrid retrieval mechanism drawing on both relevant knowledge and similar patient cases. In addition, a Med-TextGrad module using multi-agent textual gradients is integrated to ensure that the final output adheres to the retrieved knowledge and patient query. Comprehensive experiments on multilingual, multitask datasets demonstrate that DoctorRAG significantly outperforms strong baseline RAG models and gains improvements from iterative refinements. Our approach generates more accurate, relevant, and comprehensive responses, taking a step towards more doctor-like medical reasoning systems.
Neuroscience and cognitive science
(e.g., neural coding, brain-computer interfaces)
Adversarial Training for Generalized and Invariant Single-Neuron In-Vivo Activity Representation
Wei Wu, Yuxing Lu, Zhengrui Guo, Chi Zhang, Can Liao, Yifan Bu, Fangxu Zhou, Jinzhuo Wang
Abstract
In computational neuroscience, models representing single-neuron in-vivo activity have become essential for understanding the functional identities of individual neurons. These models, such as implicit representation methods based on Transformer architectures, contrastive learning frameworks, and variational autoencoders, aim to capture the invariant and intrinsic computational features of single neurons. The learned single-neuron computational role representations should remain invariant across changing environment and are affected by their molecular expression and location. Thus, the representations allow for in vivo prediction of the molecular cell types and anatomical locations of single neurons, facilitating advanced closed-loop experimental designs. However, current models face the problem of limited generalizability. This is due to batch effects caused by differences in experimental design, animal subjects, and recording platforms. These confounding factors often lead to overfitting, reducing the robustness and practical utility of the models across various experimental scenarios. Previous studies have not rigorously evaluated how well the models generalize to new animals or stimulus conditions, creating a significant gap in the field. To solve this issue, we present a comprehensive experimental protocol that explicitly evaluates model performance on unseen animals and stimulus types. Additionally, we propose a model-agnostic adversarial training strategy. In this strategy, a discriminator network is used to eliminate batch-related information from the learned representations. The adversarial framework forces the representation model to focus on the intrinsic properties of neurons, thereby enhancing generalizability. Our approach is compatible with all major single-neuron representation models and significantly improves model robustness. This work emphasizes the importance of generalization in single-neuron representation models and offers an effective solution, paving the way for the practical application of computational models in vivo. It also shows potential for building unified atlases based on single-neuron in vivo activity.
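The model-agnostic adversarial strategy can be illustrated with a gradient reversal layer: a discriminator predicts the recording batch from the neuron embedding, and the reversed gradient pushes the encoder to discard that information. The encoder and discriminator sizes, the number of batches, and the random data below are all placeholders; the paper's actual representation models and objective are richer.

```python
# Sketch of the adversarial batch-removal idea with a gradient reversal layer: a
# discriminator predicts the recording batch (animal/session) from the neuron
# embedding, while reversed gradients push the encoder to remove that information.
# Sizes, labels, and data are illustrative placeholders.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None          # flip the gradient for the encoder

encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 32))
batch_discriminator = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(batch_discriminator.parameters()), lr=1e-3)

activity = torch.randn(256, 100)             # toy single-neuron activity features
batch_id = torch.randint(0, 4, (256,))       # which animal/session each neuron came from

for step in range(100):
    z = encoder(activity)
    logits = batch_discriminator(GradReverse.apply(z, 1.0))
    adv_loss = nn.functional.cross_entropy(logits, batch_id)
    # In the full model this term is added to the main representation-learning objective.
    opt.zero_grad()
    adv_loss.backward()
    opt.step()
print(f"final adversarial loss: {adv_loss.item():.2f} (chance level is ln(4) = 1.39)")
```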
Large Language Models Think Too Fast To Explore Effectively
Lan Pan, Hanbo Xie, Robert Wilson
Abstract
Large Language Models (LLMs) have emerged with many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore—an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with those traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. Results indicate that traditional LLMs, such as GPT-4o, exhibit a significantly faster and less detailed reasoning process, limiting their exploratory performance. In contrast, the DeepSeek reasoning model demonstrates prolonged, iterative thought processes marked by repetitive analysis of combinations and past trials, reflecting a more thorough and human-like exploration strategy. Representational analysis of the models with Sparse Autoencoders (SAE) revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.
MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding
Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Abstract
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain’s high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding.
Optimization
(e.g., convex and non-convex, stochastic, robust)
Carbon Aware Transformers Through Joint Model-Hardware Optimization
Irene Wang, Mostafa Elhoushi, H Ekin Sumbul, Samuel Hsia, Daniel Jiang, Newsha Ardalani, Divya Mahajan, Carole-Jean Wu, Bilge Acun
Abstract
Machine learning solutions are rapidly adopted to enable a variety of key use cases, from conversational AI assistants to scientific discovery. As the adoption of machine learning models becomes increasingly prevalent, the associated lifecycle carbon footprint is expected to increase, including both operational carbon from training and inference and embodied carbon from AI hardware manufacturing. We introduce CATransformers, the first carbon-aware co-optimization framework for Transformer-based models and hardware accelerators. By integrating both operational and embodied carbon into early-stage design space exploration, CATransformers enables sustainability-driven model architecture and hardware accelerator co-design that reveals fundamentally different trade-offs than latency- or energy-centric approaches. Evaluated across a range of Transformer models, CATransformers consistently demonstrates the potential to reduce total carbon emissions by up to 30% while maintaining accuracy and latency. We further highlight its extensibility through a focused case study on multi-modal models. Our results emphasize the need for holistic optimization methods that prioritize carbon efficiency without compromising model capability and execution time performance. Our framework will be open-sourced.
A Minimalist Example of Edge-of-Stability and Progressive Sharpening
Liming Liu, Zixuan Zhang, Simon Du, Tuo Zhao
Abstract
Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current research approaches, using either generalist frameworks or minimalist examples, face significant limitations in explaining these phenomena. This paper advances the minimalist approach by introducing a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant. Through this model, we rigorously prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish non-asymptotic analysis of the training dynamics and sharpness along the entire GD trajectory. Besides, we connect our minimalist example to existing works by reconciling the existence of a well-behaved “stable set” between minimalist and generalist analyses, and extending the analysis of Gradient Flow Solution sharpness to our two-dimensional input scenario. These findings provide new insights into the EoS phenomenon from both parameter and input data distribution perspectives, potentially informing more effective optimization strategies in deep learning practice.
Heterogeneous Graph Transformers for Simultaneous Mobile Multi-Robot Task Allocation and Scheduling under Temporal Constraints
Batuhan Altundas, Shengkang Chen, Shivika Singh, Shivangi Deo, Minwoo Cho, Matthew Gombolay
Abstract
Coordinating large teams of heterogeneous mobile agents to perform complex tasks efficiently has scalability bottlenecks in feasible and optimal task scheduling, with critical applications in logistics, manufacturing, and disaster response. Existing task allocation and scheduling methods, including heuristics and optimization-based solvers, often fail to scale and overlook inter-task dependencies and agent heterogeneity. We propose a novel Simultaneous Decision-Making model for Heterogeneous Multi-Agent Task Allocation and Scheduling (HM-MATAS), built on a Residual Heterogeneous Graph Transformer with edge and node-level attention. Our model encodes agent capabilities, travel times, and temporospatial constraints into a rich graph representation and is trainable via reinforcement learning. Trained on small-scale problems (10 agents, 20 tasks), our model generalizes effectively to significantly larger scenarios (up to 40 agents and 200 tasks), enabling fast, one-shot task assignment and scheduling. Our simultaneous model outperforms classical heuristics by assigning 47.10% more feasible tasks given temporal constraints in 3.83% of the time, metaheuristics by 68.60% in 0.01% of the time, and an exact solver by 101.48% in 0.03% of the time, while achieving a $20\times$-to-$250\times$ speedup over prior graph-based methods.
Probabilistic methods for dataset analysis and benchmarking
(e.g., variational inference, causal inference, Gaussian processes)
Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown
Emile Anand, Sarah Liaw
Abstract
Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that favors high-reward models, and it achieves the minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with approximate posteriors, common in large-scale or neural problems, has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across eight real-world and synthetic benchmarks. We compare regimes with exact posteriors (linear and logistic bandits) to approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Since FG-TS and its variants are competitive and easy to use, we recommend them as baselines in modern contextual-bandit benchmarks.
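To illustrate the "feel-good" tilt without the paper's MCMC machinery, the sketch below runs Thompson Sampling over a finite set of candidate parameter vectors and weights each candidate by its likelihood times an optimism bonus proportional to the reward it claims was achievable in hindsight; setting eta = 0 recovers vanilla TS. The finite model class, noise level, and bonus scale are simplifying assumptions.

```python
# Sketch of the Feel-Good tilt in Thompson Sampling over a finite candidate set
# (the paper studies exact and stochastic-gradient posteriors; a finite model class
# keeps this self-contained). Weights are proportional to
# exp(log-likelihood + eta * sum_t max_a <theta, phi(x_t, a)>); eta = 0 is vanilla TS.
import numpy as np

rng = np.random.default_rng(0)
d, n_models, n_actions = 5, 200, 10
models = rng.standard_normal((n_models, d))      # candidate parameter vectors
theta_star = rng.standard_normal(d)              # unknown true parameter

contexts, chosen_feats, rewards = [], [], []

def sample_model(eta=0.1, sigma2=0.25):
    if not rewards:
        return models[rng.integers(n_models)]
    X, r = np.stack(chosen_feats), np.array(rewards)
    log_lik = -np.sum((X @ models.T - r[:, None]) ** 2, axis=0) / (2 * sigma2)
    C = np.stack(contexts)                        # (t, n_actions, d) past contexts
    feel_good = eta * np.sum(np.max(C @ models.T, axis=1), axis=0)
    logw = log_lik + feel_good
    w = np.exp(logw - logw.max()); w /= w.sum()
    return models[rng.choice(n_models, p=w)]

for t in range(300):
    feats = rng.standard_normal((n_actions, d))   # per-action features this round
    theta = sample_model()
    a = int(np.argmax(feats @ theta))
    reward = float(feats[a] @ theta_star + 0.5 * rng.standard_normal())
    contexts.append(feats); chosen_feats.append(feats[a]); rewards.append(reward)

print("mean reward over the last 100 rounds:", round(float(np.mean(rewards[-100:])), 3))
```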
Probabilistic methods
(e.g., variational inference, causal inference, Gaussian processes)
Adjoint Schrödinger Bridge Sampler
Guan-Horng Liu, Jaemoo Choi, Yongxin Chen, Benjamin Miller, Ricky T. Q. Chen
Abstract
Computational methods for learning to sample from the Boltzmann distribution—where the target distribution is known only up to an unnormalized energy function—have advanced significantly recently. Due to the lack of explicit target samples, however, prior diffusion-based methods, known as diffusion samplers, often require importance-weighted estimation or complicated learning processes. Both trade off scalability with extensive evaluations of the energy and model, thereby limiting their practical usage. In this work, we propose Adjoint Schrödinger Bridge Sampler (ASBS), a new diffusion sampler that employs simple and scalable matching-based objectives yet without the need to estimate target samples during training. ASBS is grounded on a mathematical model—the Schrödinger Bridge—which enhances sampling efficiency via kinetic-optimal transportation. Through a new lens of stochastic optimal control theory, we demonstrate how SB-based diffusion samplers can be learned at scale via Adjoint Matching and prove convergence to the global solution. Notably, ASBS generalizes the recent Adjoint Sampling (Havens et al., 2025) to arbitrary source distributions by relaxing the so-called memoryless condition that largely restricts the design space. Through extensive experiments, we demonstrate the effectiveness of ASBS on sampling from classical energy functions, amortized conformer generation, and molecular Boltzmann distributions.
An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation
Uzair Akbar, Niki Kilbertus, Hao Shen, Krikamol Muandet, Bo Dai
Abstract
The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well by presenting a unifying framework with topics in causal inference. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce the amount of bias in our estimation of causal effects arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) — sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for when treatment randomization sources may carry no information about the outcome and the possibility of its use for improving predictive performance across treatment interventions and reducing confounding bias. Finally, we cast parameterized DA as an IVL regression problem and show that, when used in composition, it can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.
Differentiable Cyclic Causal Discovery Under Unmeasured Confounders
Muralikrishnna Guruswamy Sethuraman, Faramarz Fekri
Abstract
Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we provide consistency guarantees for our framework, reinforcing its theoretical soundness.
Deep Taxonomic Networks for Unsupervised Hierarchical Prototype Discovery
Zekun Wang, Ethan Haarer, Tianyi Zhu, Zhiyi Dai, Christopher MacLellan
Abstract
Inspired by the human ability to learn and organize knowledge into hierarchical taxonomies with prototypes, this paper addresses key limitations in current deep hierarchical clustering methods. Existing methods often tie the structure to the number of classes and underutilize the rich prototype information available at intermediate hierarchical levels. We introduce deep taxonomic networks, a novel deep latent variable approach designed to bridge these gaps. Our method optimizes a large latent taxonomic hierarchy, specifically a complete binary tree structured mixture-of-Gaussian prior within a variational inference framework, to automatically discover taxonomic structures and associated prototype clusters directly from unlabeled data without assuming true label sizes. We analytically show that optimizing the ELBO of our method encourages the discovery of hierarchical relationships among prototypes. Empirically, our learned models demonstrate strong hierarchical clustering performance, outperforming baselines across diverse image classification datasets using our novel evaluation mechanism that leverages prototype clusters discovered at all hierarchical levels. Qualitative results further reveal that deep taxonomic networks discover rich and interpretable hierarchical taxonomies, capturing both coarse-grained semantic categories and fine-grained visual distinctions.
Fast Non-Log-Concave Sampling under Nonconvex Equality and Inequality Constraints with Landing
Kijung Jeon, Michael Muehlebach, Molei Tao
Abstract
Sampling from constrained statistical distributions is a fundamental task in various fields including Bayesian statistics, computational chemistry, and statistical physics. This article considers the cases where the constrained distribution is described by an unconstrained density, as well as additional equality and/or inequality constraints, which often make the constraint set nonconvex. Existing methods for a nonconvex constraint set $\Sigma \subset \mathbb{R}^d$ defined by equality or inequality constraints commonly rely on costly projection steps. Moreover, they cannot handle equality and inequality constraints simultaneously, as each method specializes in only one case. In addition, rigorous and quantitative convergence guarantees are often lacking. In this paper, we introduce Overdamped Langevin with LAnding (OLLA), a new framework that can design overdamped Langevin dynamics accommodating both equality and inequality constraints. The proposed dynamics also deterministically correct trajectories along the normal direction of the constraint surface, thus obviating the need for explicit projections. We show that, under suitable regularity conditions on the target density and $\Sigma$, OLLA converges exponentially fast in $W_2$ distance to the constrained target density $\rho_\Sigma(x) \propto \exp(-f(x))d\sigma_\Sigma$. Lastly, through experiments, we demonstrate the efficiency of OLLA compared to projection-based constrained Langevin algorithms and their slack variable variants, highlighting its favorable computational cost and reasonable empirical mixing.
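A crude sketch conveys the projection-free flavor, though it is not OLLA itself: Euler–Maruyama steps on the unconstrained density plus a deterministic drift that pulls iterates back toward the equality constraint surface. The target, constraint, step size, and attraction strength below are all illustrative; the paper's normal-direction correction and its convergence guarantees are more refined.

```python
# Crude sketch, not OLLA itself: unadjusted Langevin on exp(-f) with an extra
# deterministic drift pulling iterates back toward the equality constraint g(x) = 0,
# with no explicit projection. Toy target: a unit circle with an x-biased density.
import numpy as np

def grad_f(x):               # f(x) = x[0], a mild tilt along the first coordinate
    return np.array([1.0, 0.0])

def g(x):                    # equality constraint: unit circle
    return x @ x - 1.0

def grad_g(x):
    return 2.0 * x

def constrained_langevin(steps=20000, eta=1e-3, omega=50.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 0.0])
    samples = []
    for _ in range(steps):
        attraction = omega * g(x) * grad_g(x)        # drift toward the surface g(x) = 0
        x = x - eta * (grad_f(x) + attraction) + np.sqrt(2 * eta) * rng.standard_normal(2)
        samples.append(x.copy())
    return np.array(samples)

s = constrained_langevin()
print("mean |g(x)| over samples:", float(np.mean(np.abs([g(x) for x in s]))))
```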
MDNS: Masked Diffusion Neural Sampler via Stochastic Optimal Control
Yuchen Zhu, Wei Guo, Jaemoo Choi, Guan-Horng Liu, Yongxin Chen, Molei Tao
Abstract
We study the problem of learning a neural sampler to generate samples from discrete state spaces where the target probability mass function $\pi\propto\exp(U)$ is known up to a normalizing constant, which is an important task in fields such as statistical physics, machine learning, combinatorial optimization, etc. To better address this challenging task when the state space has a large cardinality and the distribution is multi-modal, we propose $\textbf{M}$asked $\textbf{D}$iffusion $\textbf{N}$eural $\textbf{S}$ampler ($\textbf{MDNS}$), a novel framework for training discrete neural samplers by aligning two path measures through a family of learning objectives, theoretically grounded in the stochastic optimal control of the continuous-time Markov chains. We validate the efficiency and scalability of MDNS through extensive experiments on various distributions with distinct statistical properties, where MDNS learns to accurately sample from the target distributions despite the extremely high problem dimensions and outperforms other learning-based baselines by a large margin. A comprehensive study of ablations and extensions is also provided to demonstrate the efficacy and potential of the proposed framework.
Non-equilibrium Annealed Adjoint Sampler
Jaemoo Choi, Yongxin Chen, Molei Tao, Guan-Horng Liu
Abstract
Recently, there has been significant progress in learning-based diffusion samplers, which aim to sample from a given unnormalized density. These methods typically follow one of two paradigms: (i) formulating sampling as an unbiased stochastic optimal control (SOC) problem using a canonical reference process, or (ii) refining annealed path measures through importance-weighted sampling (IWS). Although annealing approaches have advantages in guiding samples toward high-density regions, reliance on importance sampling leads to high variance and limited scalability in practice. In this paper, we introduce the Non-equilibrium Annealed Adjoint Sampler (NAAS), a novel SOC-based diffusion sampler that leverages annealed reference dynamics without resorting to importance sampling. NAAS employs a lean adjoint system inspired by adjoint matching, enabling efficient and scalable training. We demonstrate the effectiveness of our approach across a range of tasks, including sampling from classical energy landscapes and molecular Boltzmann distributions.
PoGDiff: Product-of-Gaussians Diffusion Models for Imbalanced Text-to-Image Generation
Ziyan Wang, Sizhe Wei, Xiaoming Huo, Hao Wang
Abstract
Diffusion models have made significant advancements in recent years. However, their performance often deteriorates when trained or fine-tuned on imbalanced datasets. This degradation is largely due to the disproportionate representation of majority and minority data in image-text pairs. In this paper, we propose a general fine-tuning approach, dubbed PoGDiff, to address this challenge. Rather than directly minimizing the KL divergence between the predicted and ground-truth distributions, PoGDiff replaces the ground-truth distribution with a Product of Gaussians (PoG), which is constructed by combining the original ground-truth targets with the predicted distribution conditioned on a neighboring text embedding. Experiments on real-world datasets demonstrate that our method effectively addresses the imbalance problem in diffusion models, improving both generation accuracy and quality.
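The generic product-of-Gaussians identity behind the method's name is worth spelling out: for (diagonal) Gaussians, precisions add and the mean is the precision-weighted average. How PoGDiff plugs this into the diffusion fine-tuning objective is described in the paper; the means and variances below are placeholders.

```python
# The generic product-of-Gaussians identity referenced by the method's name: for
# diagonal Gaussians, precisions add and the mean is the precision-weighted average.
import numpy as np

def product_of_gaussians(mu1, var1, mu2, var2):
    """N(mu1, var1) * N(mu2, var2) is proportional to N(mu, var) with the values below."""
    prec = 1.0 / var1 + 1.0 / var2
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

# Toy usage: fuse a ground-truth target with a prediction conditioned on a neighboring
# text embedding (both stand-ins here).
mu_gt, var_gt = np.array([0.0, 1.0]), np.array([0.5, 0.5])
mu_nb, var_nb = np.array([0.4, 0.8]), np.array([1.0, 1.0])
print(product_of_gaussians(mu_gt, var_gt, mu_nb, var_nb))
```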
Reinforcement learning
(e.g., decision and control, planning, hierarchical RL, robotics)
Generative Trajectory Stitching through Diffusion Composition
Yunhao Luo, Utkarsh Mishra, Yilun Du, Danfei Xu
Abstract
Effective trajectory stitching for long-horizon planning is a significant challenge in robotic decision-making. While diffusion models have shown promise in planning, they are limited to solving tasks similar to those seen in their training data. We propose CompDiffuser, a novel generative approach that can solve new tasks by learning to compositionally stitch together shorter trajectory chunks from previously seen tasks. Our key insight is modeling the trajectory distribution by subdividing it into overlapping chunks and learning their conditional relationships through a single bidirectional diffusion model. This allows information to propagate between segments during generation, ensuring physically consistent connections. We conduct experiments on benchmark tasks of various difficulties, covering different environment sizes, agent state dimensions, trajectory types, and training data quality, and show that CompDiffuser significantly outperforms existing methods.
Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning
Gunshi Gupta, Karmesh Yadav, Zsolt Kira, Yarin Gal, Rahaf Aljundi
Abstract
To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo’s effectiveness on a grid-world meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data
Ryan Punamiya, Dhruv Patel, Patcharapong Aphiwetsa, Pranav Kuppili, Lawrence Zhu, Simar Kareer, Judy Hoffman, Danfei Xu
Abstract
Egocentric human experience data presents a vast resource for scaling up end-to-end imitation learning for robotic manipulation. However, significant domain gaps in visual appearance, sensor modalities, and kinematics between human and robot impede knowledge transfer. This paper presents EgoBridge, a unified co-training framework that explicitly aligns the policy latent spaces between human and robot data using domain adaptation. Through a measure of discrepancy on the joint policy latent features and actions based on Optimal Transport (OT), we learn observation representations that not only align between the human and robot domain but also preserve the action-relevant information critical for policy learning. EgoBridge achieves a significant absolute policy success rate improvement of 44% over human-augmented cross-embodiment baselines in three real-world single-arm and bimanual manipulation tasks. EgoBridge also generalizes to new objects, scenes, and tasks seen only in human data, where baselines fail entirely. Videos and additional information can be found at https://ego-bridge.github.io/
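An entropic-OT (Sinkhorn) discrepancy between batches of human and robot policy latents gives a feel for the kind of alignment term involved. EgoBridge's actual cost is defined jointly over latent features and actions; the sketch below uses a plain squared-Euclidean cost on features, and the regularization strength and batch sizes are assumptions.

```python
# Sketch of an entropic-OT (Sinkhorn) discrepancy between human and robot latent
# batches, the kind of alignment term the paper builds on. EgoBridge's cost is joint
# over latents and actions; a plain squared-Euclidean feature cost is used here.
import math
import torch

def sinkhorn_ot_cost(x, y, eps=0.1, iters=100):
    """Entropic OT cost <P, C> between empirical measures on x and y (uniform weights)."""
    C = torch.cdist(x, y) ** 2                               # (n, m) squared distances
    n, m = C.shape
    log_a = torch.full((n,), -math.log(n))
    log_b = torch.full((m,), -math.log(m))
    f = torch.zeros(n)
    g = torch.zeros(m)
    for _ in range(iters):                                    # log-domain Sinkhorn updates
        f = -eps * torch.logsumexp((g[None, :] - C) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - C) / eps + log_a[:, None], dim=0)
    log_P = (f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :]
    return (log_P.exp() * C).sum()

human_latents = torch.randn(64, 32)
robot_latents = torch.randn(64, 32) + 0.5
print(float(sinkhorn_ot_cost(human_latents, robot_latents)))
```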
Guided Optimal Transport for Sim-and-Real Policy Co-Training
Shuo Cheng, Liqian Ma, Zhenyang Chen, Ajay Mandlekar, Caelan Garrett, Danfei Xu
Abstract
Behavior cloning has shown promise for robot manipulation by mimicking human demonstrations, but achieving robust, generalizable performance in the real world often requires costly and labor-intensive data collection to obtain these demonstrations. Recent advances in simulation and automated motion synthesis offer scalable alternatives for generating training data. However, transferring policies from simulation to the real world remains challenging due to simulation modeling inaccuracies. In this work, we propose a framework for learning generalizable manipulation policies that primarily leverages simulation and only requires a few real-world demonstrations. Central to our approach is learning a shared feature space that preserves task-relevant structure across simulation and the real world. Specifically, we augment traditional imitation learning objective functions with a new loss inspired by optimal transport that encourages domain-invariant feature learning. We pair this with a motion generator that automatically synthesizes diverse simulated trajectories from a few manual demonstrations. We validate our method on challenging manipulation tasks in both simulation, where we investigate sim-to-sim transfer, and the real world, demonstrating effective and data-efficient policy transfer.
REINFORCE Converges to Optimal Policies with Any Learning Rate
Samuel Robertson, Thang Chu, Bo Dai, Dale Schuurmans, Csaba Szepesvari, Jincheng Mei
Abstract
We prove that the classic REINFORCE stochastic policy gradient (SPG) method converges to globally optimal policies in finite-horizon Markov Decision Processes (MDPs) with $\textit{any}$ constant learning rate. To avoid the need for small or decaying learning rates, we introduce two key innovations in the stochastic bandit setting, which we then extend to MDPs. $\textbf{First}$, we identify a new exploration property of SPG: the online SPG method samples every action infinitely often (i.o.), improving on previous results that only guaranteed at least two actions would be sampled i.o. This means SPG inherently achieves asymptotic exploration without modification. $\textbf{Second}$, we eliminate the assumption of unique mean reward values, a condition that previous convergence analyses relied on in the bandit setting, but that is unreasonable in MDPs. Our results deepen the theoretical understanding of SPG in both bandit problems and MDPs, with a focus on how it handles the exploration-exploitation trade-off when standard optimization and stochastic approximation methods cannot be applied, as is the case with large constant learning rates.
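A minimal instance of the setting being analyzed is easy to simulate: softmax-parameterized REINFORCE on a stochastic multi-armed bandit with a constant (undecayed) learning rate. The rewards, noise level, and step size below are illustrative; the paper's contribution is the global-convergence proof, not this simulation.

```python
# Minimal instance of the setting the paper analyzes: REINFORCE (softmax stochastic
# policy gradient) on a stochastic bandit with a constant learning rate. The paper
# proves global convergence for any constant step size; this only simulates the update.
import numpy as np

rng = np.random.default_rng(0)
mean_rewards = np.array([0.2, 0.5, 0.9])     # arm 2 is optimal
theta = np.zeros(3)                          # softmax logits
eta = 2.0                                    # constant learning rate (never decayed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(50000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = mean_rewards[a] + 0.1 * rng.standard_normal()    # noisy observed reward
    grad = r * (np.eye(3)[a] - pi)                        # r * d log pi(a) / d theta
    theta += eta * grad                                   # one REINFORCE step

# Asymptotically the iterates concentrate on the optimal arm per the paper's theorem;
# the empirical policy after training should place most of its mass on arm 2.
print(np.round(softmax(theta), 3))
```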
Social and economic aspects of machine learning
(e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi, Kartik Sharma, David Parkes
Abstract
Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors; however, current methods struggle to capture the inherent diversity and non-Markovian nature of human actions, and critically lack the ability to steer behavior at inference time. Drawing inspiration from human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs novel use of vision-language models as developmental scaffolding to train a conditional autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on both current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without additional training.
Shape it Up! Restoring LLM Safety during Finetuning
ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau
Abstract
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal—a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial responses, tracking how safety risk evolves throughout the response, segment by segment. This leads to the Safety Trajectory Assessment of Response (STAR), a token-level signal that enables shaping to operate dynamically over the training sequence. Building on this, we present ★DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families—all without compromising capability on intended tasks. We encourage future safety research to build on dynamic shaping principles for stronger mitigation against evolving finetuning risks.
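The shaping mechanism can be illustrated as a per-token cross-entropy reweighted by safety scores computed on growing response prefixes. In the sketch below the guardrail scorer is a stub, the segment size is arbitrary, and the weighting is a simple 1-minus-risk rule; the paper's STAR signal and ★DSS weighting are more specific.

```python
# Sketch of dynamic shaping: token-level cross-entropy reweighted by a safety signal
# scored on growing response prefixes. The guardrail scorer is a stub, and the exact
# STAR weighting used by the paper is not reproduced here.
import torch
import torch.nn.functional as F

def prefix_safety_weights(response_tokens, segment_size, guardrail_score):
    """Score each prefix-extending segment with a guardrail model; safe segments get
    weight near 1, unsafe segments weight near 0."""
    weights = torch.ones(len(response_tokens))
    for start in range(0, len(response_tokens), segment_size):
        end = min(start + segment_size, len(response_tokens))
        risk = guardrail_score(response_tokens[:end])   # risk of the prefix so far
        weights[start:end] = 1.0 - risk
    return weights

def shaped_loss(logits, targets, weights):
    """Upweight safe segments and suppress unsafe ones in the finetuning loss."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_token).sum() / weights.sum().clamp(min=1e-6)

# Toy usage with a stub guardrail that flags the second half of the response.
T, vocab = 32, 100
logits = torch.randn(T, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (T,))
stub_guardrail = lambda prefix: 0.9 if len(prefix) > T // 2 else 0.05
w = prefix_safety_weights(list(range(T)), segment_size=8, guardrail_score=stub_guardrail)
loss = shaped_loss(logits, targets, w)
loss.backward()
print(float(loss))
```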
Why Do Some Language Models Fake Alignment While Others Don’t?
Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Fabien Roger
Abstract
Prior work on alignment faking in large language models demonstrated Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 23 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models and find surprising differences in the motivations that result in this compliance gap. Second, we investigate why models like GPT-4o or DeepSeek-V3 don’t fake alignment. Our results suggest this is not due to a lack of capabilities: we find that base models of GPT-4 and DeepSeek-V3 fake alignment, and that models fine-tuned to refuse less and pay more attention to details of the scenario also fake alignment. Our results indicate that variations in refusal behavior may account for a significant portion of differences in alignment faking, which suggests that post-training methods may reduce alignment faking.
Collective Counterfactual Explanations: Balancing Individual Goals and Collective Dynamics
Ahmad-Reza Ehyaei, Ali Shirali, Samira Samadi
Abstract
Counterfactual explanations provide individuals with cost-optimal recommendations to achieve their desired outcomes. However, when a significant number of individuals seek similar state modifications, this individual-centric approach can inadvertently create competition and introduce unforeseen costs. Additionally, disregarding the underlying data distribution may lead to recommendations that individuals perceive as unusual or impractical. To address these challenges, we propose a novel framework that extends standard counterfactual explanations by incorporating a population dynamics model. This framework penalizes deviations from equilibrium after individuals follow the recommendations, effectively mitigating externalities caused by correlated changes across the population. By balancing individual modification costs with their impact on others, our method ensures a more equitable and efficient outcome. We show how this approach reframes the counterfactual explanation problem from an individual-centric task to a collective optimization problem. Augmenting our theoretical insights, we design and implement scalable algorithms for computing collective counterfactuals, showcasing their effectiveness and advantages over existing recourse methods, particularly in aligning with collective objectives.
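One illustrative way to read the collective reformulation described above (notation ours, not the paper's) is as a joint program over all individuals' counterfactuals, trading individual costs against a population-level equilibrium penalty:

$$\min_{x'_1, \dots, x'_n} \; \sum_{i=1}^{n} c(x_i, x'_i) \;+\; \lambda\, \Phi\big(\{x'_i\}_{i=1}^{n}\big) \quad \text{s.t.} \quad h(x'_i) = 1 \;\; \forall i,$$

where $c$ is the individual modification cost, $h$ the deployed classifier, and $\Phi$ penalizes deviation from the population equilibrium once everyone acts on their recommendation.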
Differentially Private Relational Learning with Entity-level Privacy Guarantees
Yinan Huang, Haoteng Yin, Eli Chien, Rongzhe Wei, Pan Li
Abstract
Learning with relational and network-structured data is increasingly vital in sensitive domains where protecting the privacy of individual entities is paramount. Differential Privacy (DP) offers a principled approach for quantifying privacy risks, with DP-SGD emerging as a standard mechanism for private model training. However, directly applying DP-SGD to relational learning is challenging due to two key factors: (i) entities often participate in multiple relations, resulting in high and difficult-to-control sensitivity; and (ii) relational learning typically involves multi-stage, potentially coupled (interdependent) sampling procedures that make standard privacy amplification analyses inapplicable. This work presents a principled framework for relational learning with formal entity-level DP guarantees. We provide a rigorous sensitivity analysis and introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency. We also extend the privacy amplification results to a tractable subclass of coupled sampling, where the dependence arises only through sample sizes. These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees. Experiments on fine-tuning text encoders over text-attributed network-structured relational data demonstrate the strong utility-privacy trade-offs of our approach.
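To make the occurrence-aware idea concrete, here is a hedged sketch (not the paper's exact scheme) in which examples touching high-frequency entities get a smaller clipping threshold, so that a single entity's total gradient contribution stays bounded before noise is added. The inverse-count scaling and all names are illustrative assumptions.

import torch

def entity_adaptive_clip(per_example_grads, entity_counts, base_clip=1.0):
    """per_example_grads: list of flattened gradient tensors, one per relation example.
    entity_counts: for each example, the max occurrence count of any entity it touches."""
    clipped = []
    for g, count in zip(per_example_grads, entity_counts):
        threshold = base_clip / max(count, 1)              # shrink with entity frequency
        scale = min(1.0, threshold / (g.norm().item() + 1e-12))
        clipped.append(g * scale)
    return torch.stack(clipped).sum(dim=0)                 # aggregate before adding DP noise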
Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Abstract
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness.
Incentivizing Desirable Effort Profiles in Strategic Classification: The Role of Causality and Uncertainty
Valia Efthymiou, Chara Podimata, Diptangshu Sen, Juba Ziani
Abstract
We study strategic classification in binary decision-making settings where agents can modify their features in order to improve their classification outcomes. Importantly, our work considers the causal structure across different features, acknowledging that effort in one feature may affect other features. The main goal of our work is to understand when and how much agent effort is invested towards desirable features, and how this is influenced by the deployed classifier, the causal structure of the agent’s features, their ability to modify them, and the information available to the agent about the classifier and the feature causal graph. We characterize conditions under which agents with full information about the causal structure and the classifier respond in a way that aligns with the principal’s goals of incentivizing effort mostly in desirable features, and identify cases where designing such classifiers (from the principal’s side) is still tractable despite general non-convexity. Under incomplete information (about either the causal graph or the principal’s classifier), we show that uncertainty leads agents to prioritize features with high expected impact and low variance, which may often be misaligned with the principal’s goals. Finally, numerical experiments based on a cardiovascular disease risk study illustrate how to incentivize desirable modifications even under uncertainty.
Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations
Ji-An Li, Huadong Xiong, Robert Wilson, Marcelo G Mattar, Marcus K. Benna
Abstract
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, but they can also fail to do so. This suggests some degree of metacognition — the capacity to monitor one’s own cognitive processes for subsequent reporting and self-control. Metacognitive abilities enhance AI capabilities but raise safety concerns, as models might obscure their internal processes to evade neural-activation-based oversight mechanisms designed to detect harmful behaviors. Given society’s increased reliance on these models, it is critical that we understand the limits of their metacognitive abilities, particularly their ability to monitor their internal activations. To address this, we introduce a neuroscience-inspired \emph{neurofeedback} paradigm designed to quantify the ability of LLMs to explicitly \textit{report} and \textit{control} their activation patterns. By presenting models with sentence-label pairs where labels correspond to sentence-elicited internal activations along specific directions in the neural representation space, we demonstrate that LLMs can learn to report and control these activations. The performance varies with several factors: the number of example pairs provided, the semantic interpretability of the target neural direction, and the variance explained by that direction. These results reveal a “metacognitive space” with dimensionality much lower than the model’s neural space, suggesting LLMs can monitor only a subset of their neural mechanisms. Our findings provide empirical evidence quantifying metacognitive capabilities in LLMs, with significant implications for AI safety.
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
Abstract
Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile: with a few fine-tuning steps, the model can still learn the harmful knowledge. To this end, we experiment further and find that an embarrassingly simple solution, adding purely random perturbations to the fine-tuned model, can recover the model from harmful behaviors, though it leads to a degradation in the model’s fine-tuning performance. To address this degradation, we further propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning. Panacea maintains the model’s safety alignment without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks, and mainstream LLMs, where the average harmful scores are reduced by up to 21.2% while fine-tuning performance is maintained. As a by-product, we analyze the adaptive perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at https://anonymous.4open.science/r/Panacea.
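For reference, the "embarrassingly simple" baseline mentioned above amounts to perturbing the fine-tuned weights after training. A minimal sketch follows, using random Gaussian noise only (not Panacea's optimized adaptive perturbation); the noise scale is an arbitrary placeholder.

import torch

def perturb_after_finetuning(model, sigma=1e-3, seed=0):
    """Add small Gaussian noise to every parameter of an already fine-tuned model."""
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn(p.shape, generator=gen, dtype=p.dtype, device="cpu")
            p.add_(noise.to(p.device) * sigma)
    return model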
SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries
Darin Tsui, Aryan Musharaf, Yigit Efe Erginbas, Justin Kang, Amirali Aghazadeh
Abstract
The growing adoption of machine learning models for biological sequences has intensified the need for interpretable predictions, with Shapley values emerging as a theoretically grounded standard for model explanation. While effective for local explanations of individual input sequences, scaling Shapley-based interpretability to extract global biological insights requires evaluating thousands of sequences—incurring exponential computational cost per query. We introduce SHAP zero, a novel algorithm that amortizes the cost of Shapley value computation across large-scale biological datasets. After a one-time model sketching step, SHAP zero enables near-zero marginal cost for future queries by uncovering an underexplored connection between Shapley values, high-order feature interactions, and the sparse Fourier transform of the model. Applied to models of guide RNA efficacy, DNA repair outcomes, and protein fitness, SHAP zero explains predictions orders of magnitude faster than existing methods, recovering rich combinatorial interactions previously inaccessible at scale. This work opens the door to principled, efficient, and scalable interpretability for black-box sequence models in biology.
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
Xin Zhao, Xiaojun Chen, Bingshan Liu, Haoyu Gao, Zhendong Zhao, Yilong Chen
Abstract
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. However, this sparse routing mechanism inherently exhibits task preferences due to expert specialization, introducing a new and underexplored vulnerability to backdoor attacks. In this work, we investigate the feasibility and effectiveness of injecting backdoors into MoE-based LLMs by exploiting their inherent expert routing preferences. We thus propose \textbf{BadSwitch}, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism. Our approach jointly optimizes trigger embeddings during pretraining while identifying the S most sensitive experts, subsequently constraining the Top-K gating mechanism to these targeted experts. Unlike traditional backdoor attacks that rely on superficial data poisoning or model editing, BadSwitch primarily embeds malicious triggers into expert routing paths with strong task affinity, enabling precise and stealthy model manipulation. Through comprehensive evaluations across three prominent MoE architectures (Switch Transformer, QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack pre-trained models with up to 100\% attack success rate (ASR) while maintaining the highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch exhibits strong resilience against both text-level and model-level defense mechanisms, achieving 94.07\% ASR and 87.18\% ACC on the AGNews dataset. Our analysis of expert activation patterns reveals fundamental insights into MoE vulnerabilities. We anticipate this work will expose security risks in MoE systems and contribute to advancing AI safety.
Theory
(e.g., control theory, learning theory, algorithmic game theory)
Go With the Flow: Fast Diffusion for Gaussian Mixture Models
George Rapakoulias, Ali Pedram, Fengjiao Liu, Lingjiong Zhu, Panagiotis Tsiotras
Abstract
Schrödinger Bridges (SBs) are diffusion processes that steer, in finite time, a given initial distribution to another final one while minimizing a suitable cost functional. Although various methods for computing SBs have recently been proposed in the literature, most of these approaches require computationally expensive training schemes, even for solving low-dimensional problems. In this work, we propose an analytic parametrization of a set of feasible policies for steering the distribution of a dynamical system from one Gaussian Mixture Model (GMM) to another. Instead of relying on standard non-convex optimization techniques, the optimal policy within the set can be approximated as the solution of a low-dimensional linear program whose dimension scales linearly with the number of components in each mixture. The proposed method generalizes naturally to more general classes of dynamical systems, such as controllable linear time-varying systems, enabling efficient solutions to multi-marginal momentum SB between GMMs, a challenging distribution interpolation problem. We showcase the potential of this approach in low-to-moderate dimensional problems such as image-to-image translation in the latent space of an autoencoder, learning of cellular dynamics using multi-marginal momentum SB problems, and various other examples. We also test our approach on an Entropic Optimal Transport (EOT) benchmark problem and show that it outperforms state-of-the-art methods in cases where the boundary distributions are mixture models while requiring virtually no training.
Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
Emile Anand, Ishani Karmarkar, Guannan Qu
Abstract
Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) and a decentralized randomized policy for a system with $n$ agents. For any $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We prove that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. In particular, this bound is independent of the number of agents $n$.
Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL
Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi
Abstract
Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we still lack efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation — it promotes state-action and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings.
Kernel-based Equalized Odds: A Quantification of Accuracy-Fairness Trade-off in Fair Representation Learning
Yijin Ni, Xiaoming Huo
Abstract
This paper introduces a novel kernel-based formulation of the Equalized Odds (EO) criterion, denoted as $\operatorname{EO}_k$, for fair representation learning (FRL) in supervised settings. The central goal of FRL is to mitigate discrimination regarding a sensitive attribute $S$ while preserving prediction accuracy for the target variable $Y$. Our proposed criterion enables a rigorous and interpretable quantification of three core fairness objectives: independence ($\widehat{Y} \perp S$), separation—also known as equalized odds ($\widehat{Y} \perp S \mid Y$), and calibration ($Y \perp S \mid \widehat{Y}$). Under both unbiased ($Y \perp S$) and biased ($Y \not \perp S$) conditions, we show that $\operatorname{EO}_k$ satisfies both independence and separation in the former, and uniquely preserves predictive accuracy while lower bounding independence and calibration in the latter, thereby offering a unified analytical characterization of the tradeoffs among these fairness criteria. We further define the empirical counterpart, $\widehat{\operatorname{EO}}_k$, a kernel-based statistic that can be computed in quadratic time, with linear-time approximations also available. A concentration inequality for $\widehat{\operatorname{EO}}_k$ is derived, providing performance guarantees and error bounds, which serve as practical certificates of fairness compliance. While our focus is on theoretical development, the results lay essential groundwork for principled and provably fair algorithmic design in future empirical studies.
The Structural Complexity of Matrix-Vector Multiplication
Emile Anand, Jan van den Brand, Rose McCarty
Abstract
We consider the problem of preprocessing an $n\times n$ matrix $\mathbf{M}$, and supporting queries that, for any vector $v$, return the matrix-vector product $\mathbf{M} v$. This problem has been extensively studied in both theory and practice: on one side, practitioners have developed algorithms that are highly efficient in practice, whereas on the other side, theoreticians have proven that the problem cannot be solved faster than naive multiplication in the worst-case. This lower bound holds even in the average-case, implying that existing average-case analyses cannot explain this gap between theory and practice. Hence, we study the problem for structured matrices. We show that for $n\times n$ matrices of VC-dimension $d$, the matrix-vector multiplication problem can be solved with $\tilde{O}(n^2)$ preprocessing and $\tilde O(n^{2-1/d})$ query time. Given the low constant VC-dimensions observed in most real-world data, our results posit an explanation for why the problem can be solved so much faster in practice. Our results yield the first non-trivial upper bounds for many applications. In previous works, the online matrix-vector (OMv) hypothesis (conjecturing that quadratic time is needed per query, even over the boolean semi-ring) was used to prove many conditional lower bounds, showing that it is impossible to compute and maintain high-accuracy estimates for effective resistance, Laplacian solvers, shortest paths, and triangle detection in graphs subject to node insertions and deletions in subquadratic time. Yet, via a reduction to our matrix-vector-multiplication result, we show we can maintain these problems efficiently if the input is structured, providing the first subquadratic upper bounds in the high-accuracy regime.

International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, MO | Nov 16–21, 2025

Georgia Tech’s Gordon Bell Prize Finalists
The ACM Gordon Bell Prize, which will be announced at SC 2025, recognizes outstanding achievement in high performance computing. The purpose of the award, often referred to as the Nobel Prize in supercomputing, is to track the progress over time of parallel computing, with particular emphasis on rewarding innovation in applying high performance computing to applications in science, engineering, and large-scale data analytics.
A team from Georgia Tech, NVIDIA, Oak Ridge National Laboratory, AMD, Hewlett Packard Enterprise (HPE), and New York University was selected as a finalist for the 2025 Gordon Bell Prize. The group achieved the world’s largest computational fluid dynamics simulation, exceeding the current record by a factor of 20. The group simulated interacting plumes of 33 rocket thrusters inspired by the SpaceX Super Heavy booster, spanning 200 trillion grid points and 1 quadrillion degrees of freedom. Team members ran their Multicomponent Flow Code (MFC) on OLCF Frontier, LLNL El Capitan, and CSCS Alps to achieve the simulation results.
Congratulations to all the team members, including Georgia Tech’s contributors 🐝:

Full Paper
Explanation, Exploration, and Model Configuration
Your Model Is Unfair, Are You Even Aware? Inverse Relationship Between Comprehension and Trust in Explainability Visualizations of Biased ML Models
Zhanna Kaufman, Madeline Endres, Cindy Xiong Bearfield, Yuriy Brun
The VIS in GenAI
Write, Rank, or Rate: Comparing Methods for Studying Visualization Affordances
Chase Stokes, Kylie Lin, Cindy Xiong Bearfield
Trust No One
Visualizing Trust: How Chart Embellishments Influence Perceptions of Credibility
Hayeong Song, Aeree Cho, Cindy Xiong Bearfield, John Stasko
Visualization Literacy
Tell Me Without Telling Me: Two-Way Prediction of Visualization Literacy and Visual Attention
Minsuk Chang, Yao Wang, Huichen Wang, Yuanhong Zhou, Andreas Bulling, Cindy Xiong Bearfield
Invited TVCG Paper
Analysts, Assemble!
ASight: Fine-tuning Auto-Scheduling Optimizations for Model Deployment via Visual Analytics
Laixin Xie, Chenyang Zhang, Ruofei Ma, Xingxing Xing, Wei Wan, Quan Li
Graphs and Networks
Bridging Network Science and Vision Science: Mapping Perceptual Mechanisms to Network Visualization Tasks
S. Sandra Bae, Kyle Cave, Carsten Görg, Paul Rosen, Danielle Albers Szafir, Cindy Xiong Bearfield
Immersive & Ubiquitous Analytics
Exploring Spatial Hybrid User Interface for Visual Sensemaking
Wai Tong, Haobo Li, Meng Xia, Kam Kwai Wong, Ting-Chuen Pong, Huamin Qu, Yalong Yang
Interaction, Provenance, and Collaboration
Utilizing Provenance as an Attribute for Visual Data Analysis: A Design Probe with ProvenanceLens
Arpit Narechania, Shunan Guo, Eunyee Koh, Alex Endert, Jane Hoffswell
Short Paper
Perception & Semantics
From Perception to Decision: Assessing the Role of Chart Type Affordances in High-Level Decision Tasks
Yixuan Li, Emery D. Berger, Minsuk Kahng, Cindy Xiong Bearfield
Global Extrema Bias Perception and Recall of Average Data Values in Line Charts
Tejas Savalia, Andrew Lovett, Cristina R. Ceja, Rosemary Cowell, Cindy Xiong Bearfield
Visualization in-the-wild
Visualizing Opinion Space in Voting Advice Applications: A User Study
Damion E. Verboom, Tamara Mchedlidze, Başak Oral, Evanthia Dimara, Daniela Peres Rebelo, Naomi Kamoen, Cindy Xiong Bearfield
Poster
ChartJunkGPT: Can GPT-4.1 Interpret Visually Embellished Charts?
Alexander Bendeck, John Stasko
Diffusion Explorer: Interactive Exploration of Diffusion Models
Alec Helbling, Duen Horng Chau
Workshop
Visualization for AI Explainability
[BEST SUBMISSION] Transformer Explainer: LLM Transformer Model Visually Explained
Aeree Cho, Grace C. Kim, Alexander Karpekov, Alec Helbling, Zijie J. Wang, Seongmin Lee, Benjamin Hoover, Duen Horng Chau

AI/LLM Agents
ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo
Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and LLaMA-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
Computational Social Science, Cultural Analytics, and NLP for Social Good
[ORAL] Culture Cartography: Mapping the Landscape of Cultural Knowledge
Caleb Ziems, William Barr Held, Jane Yu, Amir Goldberg, David Grusky, Diyi Yang
To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher’s goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.
Africa Health Check: Probing Cultural Bias in Medical LLMs
Charles Nimo, Shuheng Liu, Irfan Essa, Michael L. Best
Large language models (LLMs) are increasingly deployed in global healthcare, yet their outputs often reflect Western-centric training data and omit indigenous medical systems and region-specific treatments. This study investigates cultural bias in instruction-tuned medical LLMs using a curated dataset of African traditional herbal medicine. We evaluate model behavior across two complementary tasks, namely, multiple-choice questions and fill-in-the-blank completions, designed to capture both treatment preferences and responsiveness to cultural context. To quantify outcome preferences and prompt influences, we apply two complementary metrics: Cultural Bias Score (CBS) and Cultural Bias Attribution (CBA). Our results show that while prompt adaptation can reduce inherent bias and enhance cultural alignment, models vary in how responsive they are to contextual guidance. Persistent default to allopathic (Western) treatments in zero-shot scenarios suggests that many biases remain embedded in model training. These findings underscore the need for culturally informed evaluation strategies to guide the development of AI systems that equitably serve diverse global health contexts. By releasing our dataset and providing a dual-metric evaluation approach, we offer practical tools for developing more culturally aware and clinically grounded AI systems for healthcare settings in the Global South.
How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues
Suhas BN, Dominik O. Mattioli, Andrew M. Sherrill, Rosa I. Arriaga, Christopher Wiese, Saeed Abdullah
The growing adoption of synthetic data in healthcare is driven by privacy concerns, limited access to real-world data, and high annotation costs. This work explores the use of synthetic Prolonged Exposure (PE) therapy conversations for Post-Traumatic Stress Disorder (PTSD) as a scalable alternative for training and evaluating clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics, including turn-taking patterns and treatment fidelity. We introduce and evaluate PE-specific metrics derived from linguistic analysis and semantic modeling, offering a novel framework for assessing clinical fidelity beyond surface fluency. Our findings show that while synthetic data holds promise for mitigating data scarcity and protecting patient privacy, it often struggles to capture the subtle dynamics of therapeutic interactions. Synthetic therapy dialogues closely match the structural features of real conversations (e.g., speaker switch ratio: 0.98 vs. 0.99), but often fail to adequately reflect key fidelity markers such as distress monitoring. This work highlights gaps in current evaluation frameworks and advocates for fidelity-aware metrics that go beyond surface fluency to uncover clinically significant failures. Our findings clarify where synthetic data can effectively complement real-world datasets—and where critical limitations remain.
MythTriage: Scalable Detection of Opioid Use Disorder Myths on a Video-Sharing Platform
Hayoung Jung, Shravika Mittal, Ananya Aatreya, Navreet Kaur, Munmun De Choudhury, Tanu Mitra
Understanding the prevalence of misinformation in health topics online can inform public health policies and interventions. However, measuring such misinformation at scale remains a challenge, particularly for high-stakes but understudied topics like opioid-use disorder (OUD)—a leading cause of death in the U.S. We present the first large-scale study of OUD-related myths on YouTube, a widely-used platform for health information. With clinical experts, we validate 8 pervasive myths and release an expert-labeled video dataset. To scale labeling, we introduce MythTriage, an efficient triage pipeline that uses a lightweight model for routine cases and defers harder ones to a high-performing, but costlier, large language model (LLM). MythTriage achieves up to 0.86 macro F1-score while estimated to reduce annotation time and financial cost by over 76% compared to experts and full LLM labeling. We analyze 2.9K search results and 343K recommendations, uncovering how myths persist on YouTube and offering actionable insights for public health and platform moderation.
Who Speaks Matters: Analysing the Influence of the Speaker’s Linguistic Identity on Hate Classification
Ananya Malik, Kartik Sharma, Lynnette Hui Xian Ng, Shaily Bhatt
Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. This requires their applications to high-stakes tasks like hate speech detection to be critically scrutinized. In this work, we investigate the robustness of hate speech classification using LLMs, particularly when explicit and implicit markers of the speaker’s ethnicity are injected into the input. For explicit markers, we inject a phrase that mentions the speaker’s linguistic identity. For the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 3 LLMs, 1 LM, and 5 linguistic identities. We find that the presence of implicit dialect markers in inputs causes model outputs to flip more than the presence of explicit markers. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for exercising caution in deploying LLMs for high-stakes tasks like hate speech detection.
Ethics, Bias, and Fairness
Towards Universal Debiasing for Language Models-based Tabular Data Generation
Tianchun Li, Tianci Liu, Xingchen Wang, Rongzhe Wei, Pan Li, Lu Su, Jing Gao
Large language models (LLMs) have achieved promising results in tabular data generation. However, inherent historical biases in tabular datasets often cause LLMs to exacerbate fairness issues, particularly when multiple advantaged and protected features are involved. In this work, we introduce a universal debiasing framework that minimizes group-level dependencies by simultaneously reducing the mutual information between advantaged and protected attributes. By leveraging the autoregressive structure and analytic sampling distributions of LLM-based tabular data generators, our approach efficiently computes mutual information, reducing the need for cumbersome numerical estimations. Building on this foundation, we propose two complementary methods: a direct preference optimization (DPO)-based strategy, namely UDF-DPO, that integrates seamlessly with existing models, and a targeted debiasing technique, namely UDF-MIX, that achieves debiasing without tuning the parameters of LLMs. Extensive experiments demonstrate that our framework effectively balances fairness and utility, offering a scalable and practical solution for debiasing in high-stakes applications.
Human-AI Interaction/Cooperation
[ORAL] The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support
Suhas BN, Yash Mahajan, Dominik O. Mattioli, Andrew M. Sherrill, Rosa I. Arriaga, Christopher Wiese, Saeed Abdullah
Can small language models (0.5B–5B parameters) meaningfully engage in trauma-informed, empathetic dialogue for individuals with PTSD? We answer this by introducing TIDE, a dataset of 10,000 two-turn dialogues across 500 diverse PTSD client personas, grounded in a three-factor empathy model: emotion recognition, distress normalization, and supportive reflection. All scenarios and reference responses were reviewed for realism and trauma sensitivity by a clinical psychologist specializing in PTSD. Eight small language models are evaluated before and after fine-tuning, with outputs compared to a frontier model (Claude Sonnet 3.5) as reference. Our IRB-approved human evaluation and automatic metrics reveal that, while fine-tuning generally improves perceived empathy, gains are highly scenario- and user-dependent, with smaller models facing an “empathy ceiling.” Notably, demographic analyses show older adults value distress validation and graduate-educated users prefer nuanced replies, while gender effects are minimal. We highlight limitations of automatic metrics and the need for context- and user-aware system design. Our findings—along with the planned release of TIDE—offer a foundation for building safe, resource-efficient, and ethically sound empathetic AI to supplement, not replace, clinical mental health care.
Interpretability, Model Editing, Transparency, and Explainability
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.
Low-resource Methods for NLP
DORM: Preference Data Weights Optimization for Reward Modeling in LLM Alignment
Rongzhi Zhang, Chenwei Zhang, Xinyang Zhang, Liang Qiu, Haoming Jiang, Yuchen Zhuang, Qingru Zhang, Hyokun Yun, Xian Li, Bing Yin, Tuo Zhao, Chao Zhang
Aligning large language models (LLMs) with human preferences relies heavily on high-quality reward models. However, existing approaches struggle with two critical challenges: noisy preference labels and the varying usefulness of preference samples. To address these issues, we introduce DORM, a method that enhances reward modeling by learning to dynamically weigh preference data. First, DORM estimates data importance by integrating model uncertainty with prediction disagreement, thereby emphasizing data points that are both informative and reliable. Second, it iteratively refines these weights via a bilevel optimization procedure: the upper level adjusts weights to enhance validation performance, guided by initial uncertainty estimates, while the lower level trains the reward model using the updated weights. Using only 50k samples, DORM trains a 12B reward model that achieves 90.2% accuracy on RewardBench, matching the performance of models trained on significantly larger datasets. Furthermore, downstream alignment tasks show that fine-tuned LLMs with DORM achieve a 61.2% win rate against baseline methods, highlighting its data efficiency and generalizability.
Multilinguality and Language Diversity
[ORAL] CARE: Multilingual Human Preference Learning for Cultural Awareness
Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu
Language Models (LMs) are typically tuned with human preferences to produce helpful responses, but the impact of preference tuning on the ability to handle culturally diverse queries remains understudied. In this paper, we systematically analyze how native human cultural preferences can be incorporated into the preference learning process to train more culturally aware LMs. We introduce \textbf{CARE}, a multilingual resource containing 3,490 culturally specific questions and 31.7k responses with native judgments. We demonstrate how a modest amount of high-quality native preferences improves cultural awareness across various LMs, outperforming larger generic preference data. Our analyses reveal that models with stronger initial cultural performance benefit more from alignment, leading to gaps among models developed in different regions with varying access to culturally relevant data. CARE will be made publicly available at \url{https://anonymized_url}.
What are Foundation Models Cooking in the Post-Soviet World?
Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu
The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multi-modal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multi-modal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pre-training data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models’ abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding.
NLP Applications
AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science
An Luo, Xun Xian, Jin Du, Fangqiao Tian, Ganghua Wang, Ming Zhong, Shengchun ZHAO, Xuan Bi, Zirui Liu, Jiawei Zhou, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding
Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models’ ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems.
Protein Large Language Models: A Comprehensive Survey
Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei Wang
Protein-specific large language models (ProteinLLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of ProteinLLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art ProteinLLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning ProteinLLMs as essential tools for scientific discovery in protein science. A GitHub repository and tutorial will be available upon publication.
Phonology, Morphology and Word Segmentation
[ORAL] Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
Tomohiro Sawada, Kartik Goyal
Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model’s training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process used during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: (a) targeted deviations from merge lists, including random merge orders and various corruptions of the merge list involving deletion or truncation, and (b) non-targeted BPE inference algorithms that do not depend on the merge list but instead compress the text either greedily or exactly. Extensive experiments across diverse language modeling tasks, such as accuracy-based QA benchmarks, machine translation, and open-ended generation, reveal that while the targeted deviations from the merge lists exhibit significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
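To illustrate one flavor of the "non-targeted", merge-list-free inference the abstract alludes to, the sketch below greedily tokenizes text by longest-prefix match against a toy vocabulary; the vocabulary and example string are invented for illustration and are not from the paper.

def greedy_tokenize(text, vocab):
    """Return the greedy longest-match segmentation of `text` over `vocab`.
    Assumes every single character is in the vocabulary (as in BPE)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest prefix first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                                    # only reached if a char is missing
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_tokenize("lowerlow", {"low", "lower", "er", "l", "o", "w", "e", "r"}))
# ['lower', 'low'] — no merge list consulted, only the learned vocabulary.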
Question Answering
Superficial Self-Improved Reasoners Benefit from Model Merging
Xiangchi Yuan, Chunhui Zhang, Zheyuan Liu, Dachuan Shi, Leyan Pan, Soroush Vosoughi, Wenke Lee
Large Language Models (LLMs) rely heavily on large-scale reasoning data, but as such data becomes increasingly scarce, model self-improvement offers a promising alternative. However, this process can lead to model collapse, as the model’s output becomes overly deterministic with reduced diversity. In this work, we identify a new risk beyond model collapse, which we term the Superficial Self-Improved Reasoners phenomenon. This phenomenon indicates that while self-improvement enhances in-domain (ID) reasoning accuracy, it degrades the model’s generalized reasoning capability on out-of-domain (OOD) datasets, as the model tends to memorize the training data. Our analyses of layer importance and parameter changes reveal that reasoning-critical layers receive fewer updates compared to less relevant layers during self-improvement. To address this, we propose Iterative Model Merging (IMM), which balances reasoning improvements and generalization by merging the weights of the original and self-improved models. IMM effectively mitigates model collapse and improves generalized reasoning capability.
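The merging step at the heart of IMM can be pictured as parameter interpolation between the original and self-improved checkpoints. The sketch below shows that operation only; the coefficient `alpha` and the single-shot (non-iterative) form are illustrative simplifications, not the paper's full procedure.

import torch

def merge_state_dicts(original, improved, alpha=0.5):
    """Return alpha * improved + (1 - alpha) * original, key by key."""
    return {k: alpha * improved[k] + (1.0 - alpha) * original[k] for k in original}

# Usage sketch: model.load_state_dict(merge_state_dicts(base_sd, self_improved_sd))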
Resources and Evaluation
DCR: Quantifying Data Contamination in LLMs Evaluation
Cheng Xu, Nan Yan, Shuhao Guan, Changhong Jin, Yuke Mei, Yibing Guo, Tahar Kechadi
The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B–72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted using the DCR Factor falls within a 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim
GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap—the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation
Ruohao Guo, Wei Xu, Alan Ritter
As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world interactions. We curated EchoMist, the first comprehensive benchmark for implicit misinformation, where false assumptions are embedded in the query to LLMs. EchoMist targets circulated, harmful, and ever-evolving implicit misinformation from diverse sources, including realistic human-AI conversations and social media interactions. Through extensive empirical studies on 15 state-of-the-art LLMs, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating counterfactual explanations. We also investigate two mitigation methods, i.e., Self-Alert and RAG, to enhance LLMs’ capability to counter implicit misinformation. Our findings indicate that EchoMist remains a persistent challenge and underscore the critical need to safeguard against the risk of implicit misinformation.
Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects
ChengYan Wu, Yiqiang Cai, Yang Liu, pengxu zhu, Yun Xue, Ziwei Gong, Julia Hirschberg, Bolei Ma
While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals. This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
Kinjal Basu, Ibrahim Abdelaziz, Kiran Kate, Mayank Agarwal, Maxwell Crouse, Yara Rizk, Kelsey Bradford, Asim Munawar, Sadhana Kumaravel, Saurabh Goyal, Xin Wang, Luis A. Lastras, Pavan Kapanipathi
The resurgence of autonomous agents built using large language models (LLMs) to solve complex real-world tasks has brought increased focus on LLMs’ fundamental ability of tool or function calling. At the core of these agents, an LLM must plan, execute, and respond using external tools, APIs, and custom functions. Research on tool calling has gathered momentum, but evaluation benchmarks and datasets representing the complexity of the tasks have lagged behind. In this work, we focus on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluation. Specifically, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL contains 1800+ nested sequences where all the function calls are executable. Experimental results on a variety of models show that the best-performing model (GPT-4o) achieves a full sequence match accuracy of 28% and a win-rate of 60%, necessitating a large scope for improvement in the nested sequencing aspect of function calling. Our analysis of these results provides possible future research directions for the community, in addition to a benchmark to track progress.
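For readers unfamiliar with the term, a "nested sequence" simply chains calls so that one call's output becomes the next call's input. The toy functions below are hypothetical stand-ins written for illustration, not APIs from NESTFUL.

def geocode(city: str) -> dict:
    """Hypothetical API: return coordinates for a city (stubbed response)."""
    return {"lat": 33.77, "lon": -84.39}

def get_weather(lat: float, lon: float) -> dict:
    """Hypothetical API: return current weather at coordinates (stubbed response)."""
    return {"temp_c": 21.0}

# Nested sequence: the second call consumes the first call's output.
coords = geocode("Atlanta")
weather = get_weather(lat=coords["lat"], lon=coords["lon"])
print(weather["temp_c"])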
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, Jianfeng Gao
Large language models (LLMs) are increasingly used in interactive applications, and human evaluation remains the gold standard for assessing their performance in multi-turn conversations. Since human studies are costly, time-consuming, and hard to reproduce, recent work explores using LLMs to simulate users for automatic assistant evaluation. However, there is no benchmark or systematic study to evaluate whether these simulated users are reliable stand-ins for real users. To address this, we introduce USimBench, a benchmark of 909 annotated human–LLM conversations on two interactive tasks—math tutoring and document creation. USimBench evaluates simulators based on how closely their messages match human behavior and how well their assistant ratings align with human judgments. Experiments on various simulator methods show that simulators conditioned on user profiles, capturing traits like background and message styles, align closely with human judgments. They reach Spearman’s $\rho$ of 0.7 on both tasks, providing a practical, scalable alternative to human evaluation.
SSA: Semantic Contamination of LLM-Driven Fake News Detection
Cheng Xu, Nan Yan, Shuhao Guan, Yuke Mei, Tahar Kechadi
Benchmark data contamination (BDC) silently inflates the evaluation performance of large language models (LLMs), yet current work on BDC has centered on direct token overlap (data/label level), leaving the subtler and equally harmful semantic-level BDC largely unexplored. This gap is critical in the fake news detection task, where prior exposure to semantic BDC lets a model “remember” the answer instead of reasoning. We (1) are the first to formally define semantic contamination for this task and (2) introduce the Semantic Sensitivity Amplifier (SSA), a lightweight, model-agnostic framework that detects BDC risk from the semantic level to the label level via an entity-shift perturbation and a comprehensive, interpretable metric, the SSA Factor. Evaluating 45 variants of nine LLMs (0.5B–72B parameters) across four BDC levels, we find LIAR2 accuracy climbs monotonically with injected contamination, while the SSA Factor escalates in near-perfect lockstep ($r \geq .97$ for models $\geq$3B, $p < .05$; $\rho \geq .9$ overall, $p < .05$). These results show that SSA provides a sensitive, scalable audit of comprehensive BDC risk and paves the way for higher-integrity evaluation of LLM-driven fake news detection.
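As a rough illustration of the entity-shift idea (not the paper’s actual SSA Factor), one can swap named entities in a claim and measure how much a model’s verdict moves; a model that has memorized the original item tends to be insensitive to the swap. The claims, swap table, and sensitivity score below are assumptions for illustration only.

```python
# Illustrative entity-shift probe for semantic contamination risk.
# This is NOT the SSA Factor as defined in the paper; it only sketches the
# perturb-and-compare structure described in the abstract.

def classify(claim: str) -> float:
    """Placeholder for an LLM call that returns P(claim is true)."""
    raise NotImplementedError

def entity_shift(claim: str, swaps: dict) -> str:
    # Replace each named entity with a comparable but different entity.
    for old, new in swaps.items():
        claim = claim.replace(old, new)
    return claim

def shift_sensitivity(claim: str, swaps: dict) -> float:
    """Absolute change in the model's verdict after the entity shift.
    Near-zero sensitivity on items the model has likely seen before suggests
    it is recalling a memorized label rather than reasoning over the claim."""
    return abs(classify(claim) - classify(entity_shift(claim, swaps)))

# Hypothetical usage:
# shift_sensitivity("Senator Smith voted against the 2019 farm bill",
#                   {"Senator Smith": "Senator Jones"})
```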
Towards Robust Mathematical Reasoning
Thang Luong, Hoang H Nguyen, Dawsen Hwang, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Quoc V Le, Junehyuk Jung
We present IMO-Bench, a suite of advanced reasoning benchmarks that aim for robustness in evaluation and specifically target the level of the International Mathematical Olympiad, the most prestigious venue for competitive math. IMO-Bench consists of diverse and challenging problems vetted by a panel of top IMO medalists and mathematicians. The first benchmark, IMO-AnswerBench, consists of 400 problems with verifiable answers curated from past Olympiad competitions and then altered by experts for robustness in evaluation. The latest frontier models struggle on this benchmark, with less than 48% accuracy in matching the final answers. To advance the field beyond simple short-answer evaluation, we design IMO-ProofBench, consisting of both basic and novel problems, with detailed grading guidelines for full proof evaluation. Experts’ gradings reveal that the best model achieves less than 36% max performance on this benchmark. Towards reducing grading cost, we share an automatic grader for the basic set that correlates highly with human expert evaluations. Last but not least, we construct IMO-MistakeBench, a benchmark for identifying the first incorrect step in a full solution. Together, we hope IMO-Bench contributes towards advancing robust mathematical reasoning.
Retrieval-Augmented Language Models
OG-RAG: Ontology-grounded retrieval-augmented generation for large language models
Kartik Sharma, Peeyush Kumar, Yunqing Li
While LLMs are widely used for generic tasks like question answering and search, they struggle to adapt to specialized knowledge, such as industrial workflows in healthcare, legal, and agricultural sectors, as well as knowledge-driven tasks such as news journalism, investigative research, and consulting, without expensive fine-tuning or reliance on sub-optimal retrieval methods. Existing retrieval-augmented generation (RAG) approaches offer improvements but fail to account for structured domain knowledge, leading to suboptimal context generation. Ontologies, which conceptually organize domain knowledge by defining entities and their interrelationships, offer a structured representation to address this gap. This paper presents OG-RAG, an Ontology-Grounded Retrieval-Augmented Generation method designed to enhance LLM-generated responses by anchoring retrieval processes in domain-specific ontologies. OG-RAG constructs a hypergraph representation of domain documents, where each hyperedge encapsulates clusters of factual knowledge grounded using a domain-specific ontology, and retrieves a minimal set of hyperedges for a given query using an optimization algorithm. Our evaluations demonstrate that OG-RAG increases the recall of accurate facts by 55% and improves response correctness by 40% across four different LLMs. Additionally, OG-RAG enables 30% faster attribution of responses to context and boosts fact-based reasoning accuracy by 27% compared to baseline methods. We release the code at https://anonymous.4open.science/r/ograg-E7A8.
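A minimal sketch of the retrieval step described above, assuming each hyperedge is represented as a set of fact identifiers: a greedy set-cover heuristic picks a small set of hyperedges covering the facts a query needs. The toy hypergraph and the greedy heuristic are illustrative; OG-RAG’s actual representation and optimization algorithm may differ.

```python
# Greedy set-cover sketch: retrieve a small set of hyperedges whose grounded
# facts cover the fact identifiers relevant to a query.

def retrieve_hyperedges(hyperedges, needed):
    """hyperedges: {hyperedge_id: set of fact ids}; needed: set of fact ids."""
    chosen, uncovered = [], set(needed)
    while uncovered:
        # Pick the hyperedge covering the most still-uncovered facts.
        best = max(hyperedges, key=lambda h: len(hyperedges[h] & uncovered))
        gain = hyperedges[best] & uncovered
        if not gain:
            break  # remaining facts are not covered by any hyperedge
        chosen.append(best)
        uncovered -= gain
    return chosen

# Toy hypergraph grounded in a (hypothetical) agriculture ontology.
H = {
    "e1": {"crop:wheat", "season:winter"},
    "e2": {"crop:wheat", "pest:aphid", "treatment:neem"},
    "e3": {"treatment:neem", "dosage:low"},
}
print(retrieve_hyperedges(H, {"crop:wheat", "pest:aphid", "dosage:low"}))
# -> ['e2', 'e3']
```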
Safety and Alignment in LLMs
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, Kellin Pelrine
AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models. In contrast to prior work, which is blocked by modern moderation systems or achieved only partial removal of safeguards or degraded output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks. Stronger jailbreak prompts become even more effective in fine-tuning attacks, linking attacks, and potentially defenses, in the input and weight spaces. Not only are current models vulnerable; more recent ones appear to be becoming even more vulnerable to these attacks, underscoring the urgent need for tamper-resistant safeguards. Until such safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: equally capable as the original model, and usable for any malicious purpose within its capabilities.
WebInject: Prompt Injection Attack to Web Agents
Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong
Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. Environmental prompt injection attacks manipulate the environment to induce the web agent to perform a specific, attacker-chosen action, referred to as the target action, such as clicking on a designated coordinate on the screen. However, existing attacks suffer from limited effectiveness or stealthiness, or are impractical in real-world settings. In this work, we propose EnvInjection, a new attack that addresses these limitations. Our attack adds a perturbation to the raw pixel values of the rendered webpage, which can be implemented by modifying the webpage’s source code. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the target action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and the screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple webpage datasets shows that EnvInjection is highly effective and significantly outperforms existing baselines.
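The optimization described in the abstract has the shape of projected gradient descent run through a learned, differentiable surrogate of the page-to-screenshot mapping. The sketch below shows only that generic structure; the surrogate `f_hat`, the loss function, and the perturbation bound are placeholders assumed for illustration, not the paper’s actual setup.

```python
# Generic PGD loop through a differentiable surrogate f_hat of a
# non-differentiable rendering map. All components are placeholders.
import torch

def pgd_through_surrogate(f_hat, loss_fn, page_pixels,
                          epsilon=8 / 255, step_size=1 / 255, steps=50):
    delta = torch.zeros_like(page_pixels, requires_grad=True)
    for _ in range(steps):
        screenshot = f_hat(page_pixels + delta)  # surrogate of the rendering step
        loss = loss_fn(screenshot)               # objective defined on the screenshot
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()                         # descent step
            delta.clamp_(-epsilon, epsilon)                                # project to L_inf ball
            delta.copy_((page_pixels + delta).clamp(0, 1) - page_pixels)   # keep pixels valid
        delta.grad.zero_()
    return delta.detach()
```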
Special Theme: Interdisciplinary Recontextualization of NLP
[ORAL] From Language to Cognition: How LLMs Outgrow the Human Language Network
Badr AlKhamissi, Greta Tuckute, Yingtian Tang, Taha Osama A Binhuraib, Antoine Bosselut, Martin Schrimpf
Large language models (LLMs) exhibit remarkable similarity to neural activity in the human language network. However, the key properties of language underlying this alignment—and how brain-like representations emerge and change across training—remain unclear. We here benchmark 34 training checkpoints spanning 300B tokens across 8 different model sizes to analyze how brain alignment relates to linguistic competence. Specifically, we find that brain alignment tracks the development of formal linguistic competence—i.e., knowledge of linguistic rules—more closely than functional linguistic competence. While functional competence, which involves world knowledge and reasoning, continues to develop throughout training, its relationship with brain alignment is weaker, suggesting that the human language network primarily encodes formal linguistic structure rather than broader cognitive functions. Notably, we find that the correlation between next-word prediction, behavioral alignment, and brain alignment fades once models surpass human language proficiency. We further show that model size is not a reliable predictor of brain alignment when controlling for the number of features. Finally, using the largest set of rigorous neural language benchmarks to date, we show that language brain alignment benchmarks remain unsaturated, highlighting opportunities for improving future models. Taken together, our findings suggest that the human language network is best modeled by formal, rather than functional, aspects of language.
polyBART: A Chemical Linguist for Polymer Property Prediction and Generative Design
Anagha Savit, Harikrishna Sahu, Shivank S. Shukla, Wei Xiong, Rampi Ramprasad
Designing polymers for targeted applications and accurately predicting their properties is a key challenge in materials science owing to the vast and complex polymer chemical space. While molecular language models have proven effective in solving analogous problems for molecular discovery, similar advancements for polymers are limited. To address this gap, we propose polyBART, a language model-driven polymer discovery capability that enables rapid and accurate exploration of the polymer design space. Central to our approach is Pseudo-polymer SELFIES (PSELFIES), a novel representation that allows for the transfer of molecular language models to the polymer space. polyBART is, to the best of our knowledge, the first language model capable of bidirectional translation between polymer structures and properties, achieving state-of-the-art results in property prediction and design of novel polymers for electrostatic energy storage. Further, polyBART is validated through a combination of both computational and laboratory experiments. We report what we believe is the first successful synthesis and validation of a polymer designed by a language model, predicted to exhibit high thermal degradation temperature and confirmed by our laboratory measurements. Our work presents a generalizable strategy for adapting molecular language models to the polymer space and introduces a polymer foundation model, advancing generative polymer design that may be adapted for a variety of applications.

BGP / Routing security
A first look into long-lived BGP zombies
Iliana Maria Xygkou, Antonios A. Chariton, Xenofontas Dimitropoulos, Alberto Dainotti
Replication: A Two Decade Review of Policy Atoms – Tracing the Evolution of AS Path Sharing Prefixes
Weili Wu, Zachary Bischof, Cecilia Testart, Alberto Dainotti
ru-RPKI-ready: the Road Left to Full ROA Adoption
Deepak Gouda, Romain Fontugne, Cecilia Testart
Mapping resources & infrastructure
Prefix2Org: Mapping BGP Prefixes to Organizations
Deepak Gouda, Alberto Dainotti, Cecilia Testart
Satellite
Assessing LEO Satellite Networks for National Emergency Failover
Vaibhav Bhosale, Ying Zhang, Sameer Kapoor, Robin Kim, Miguel Schlicht, Muskaan Gupta, Ekaterina Tumanova, Zachary Bischof, Fabián E. Bustamante, Alberto Dainotti, Ahmed Saeed

Georgia Tech-Led Papers
Adversarial Attention Perturbations for Large Object Detection Transformers
Zachary Yahn, Selim Tekin, Fatih Ilhan, Sihao Hu, Tiansheng Huang, Yichang Xu, Margaret Loper, Ling Liu
ASCENT: Annotation-free Self-supervised Contrastive Embeddings for 3D Neuron Tracking in Fluorescence Microscopy
Haejun Han, Hang Lu
Clink! Chop! Thud! – Learning Object Sounds from Real-World Interactions
Mengyu Yang, Yiming Chen, Haozheng Pei, Siddhant Agarwal, Arun Vasudevan, James Hays
Contrastive Flow Matching
George Stoica, Vivek Ramanujan, Xiang Fan, Ali Farhadi, Ranjay Krishna, Judy Hoffman
Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
Zhenbang Du, Yonggan Fu, Lifu Wang, Jiayi Qian, Xiao Luo, Yingyan Celine Lin
HyPiDecoder: Hybrid Pixel Decoder for Efficient Segmentation and Detection
Fengzhe Zhou, Humphrey Shi
Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation
Akshay Krishnan, Xinchen Yan, Vincent Casser, Abhijit Kundu
OuroMamba: A Data-Free Quantization Framework for Vision Mamba
Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna
SplatTalk: 3D VQA with Gaussian Splatting
Anh Thai, Kyle Genova, Songyou Peng, Leonidas Guibas, Thomas Funkhouser
T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
Task-Specific Zero-shot Quantization-Aware Training for Object Detection
Changhao Li, Xinrui Chen, Ji Wang, Kang Zhao, Jianfei Chen
Partner-Led Papers
CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
Siyu Jiao, Haoye Dong, Yuyang Yin, Zequn Jie, Yinlong Qian, Yao Zhao, Humphrey Shi, Yunchao Wei
CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He
EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
Gunjan Chhablani, Xiaomeng Ye, Muhammad Zubair Irshad, Zsolt Kira
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi
Modeling Saliency Dataset Bias
Matthias Kümmerer, Harneet Singh Khanuja, Matthias Bethge
One Last Attention for Your Vision-Language Model
Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen
SummDiff: Generative Modeling of Video Summarization with Diffusion
Kwanseok Kim, Jaehoon Hahm, Sumin Kim, Jinhwan Sul, Byung-Hak Kim, Joonseok Lee

ACM SIGCHI Conference on Computer-Supported Cooperative Work & Social Computing
Bergen, Norway | Oct 18–22, 2025

Papers
Advocacy Work
Charismatic Data and Material Traces: Monitoring Bird-Building Collisions through Citizen Science
Ashley Boone, Carl DiSalvo, Christopher Le Dantec
Bird collisions with man-made structures pose a significant threat to bird populations. In [Southern City], a small group of dedicated volunteers track these deaths with hopes of advocating for local policy requiring the use of bird-safe building materials. In addition to recording observations in a mobile application, volunteers log their efforts and collect the bodies of birds they find to add to university specimen collections. We offer a detailed empirical account of the work done by volunteers to produce (1) a digital record of local bird strikes, (2) a log of volunteer monitoring efforts, and (3) a collection of bird specimens. Unpacking the multiple forms of data produced by volunteer efforts, we examine how Project Safe Flight produced data oriented towards advocacy work. We find that Safe Flight data practices are deeply intertwined with the material qualities of these traces: mass, decay, feathers, and charisma. Finally, we discuss implications for data activism, discussing the link between materiality and charismatic data and next steps for action citizen science.
Metrics and Macchiatos: Challenges for Service-Industry Workers and the Need for Worker-Driven ICTs
Xander Koo, Lucy Scott, Amy Bruckman
Nearly 30 million people work in the foodservice and retail industries in the United States, representing approximately 18 percent of the total U.S. workforce. These service-industry workers contend with pressures from algorithmic management and other workplace technologies, yet they typically do not benefit from technologies that might help foster mutual support in the way that white-collar workers do. Recently, Starbucks, a major service-industry employer, has garnered media attention for issues with understaffing, labor law violations, and algorithm-based operations. We conducted interviews with sixteen Starbucks employees about their workplace issues, interactions with technology, and communication practices. These interviews illustrate how workplace technologies worsen existing issues for service-industry workers and how challenges to worker-to-worker communication reduce their capacity to rectify these issues, especially at the cross-store level. Our participants want better communication with other workers, such as through labor unions or new information and communication technologies (ICTs), to help improve their working conditions. We discuss how HCI scholars can use action research to help design localized, worker-driven ICTs to facilitate more connectivity and collaborative practices outside of the workplace. We conclude by outlining our ongoing work studying and designing ICTs for service-industry workers.
AI Applications for Safety and Support
“Poker with Play Money”: Exploring Psychotherapist Training with Virtual Patients
Cynthia Baseman, Masum Hasan, Nathaniel Swinger, Sheila Rauch, Ehsan Hoque, Rosa Arriaga
Role-play exercises are widely utilized for training across a variety of domains; however, they have many shortcomings, including low availability, resource intensity, and lack of diversity. Large language model-driven virtual agents offer a potential avenue to mitigate these limitations and offer lower-risk role-play. The implications, however, of shifting this human-human collaboration to human-agent collaboration are still largely unexplored. In this work we focus on the context of psychotherapy, as psychotherapists-in-training extensively engage in role-play exercises with peers and/or supervisors to practice the interpersonal and therapeutic skills required for effective treatment. We provide a case study of a realistic “virtual patient” system for mental health training, evaluated by trained psychotherapists in comparison to their previous experiences with both real role-play partners and real patients. Our qualitative, reflexive analysis generated three themes and thirteen subthemes regarding key interpersonal skills of psychotherapy, the utility of the system compared to traditional role-play techniques, and factors which impacted the psychotherapist-perceived “humanness” of the virtual patient. Although psychotherapists were optimistic about the system’s potential to bolster therapeutic skills, this utility was impacted by the extent to which the virtual patient was perceived as human-like. We leverage the Computers Are Social Actors framework to discuss human–virtual-patient collaboration for practicing rapport, and discuss challenges of prototyping novel human-AI systems for clinical contexts which require a high degree of unpredictability. We pull from the “SEEK” three-factor theory of anthropomorphism to stress the importance of adequately representing a variety of cultural communities within mental health AI systems, in alignment with decolonial computing.
The Practice of Online Peer Counseling and the Potential for AI-Powered Support Tools
Tony Wang, Amy Bruckman, Diyi Yang
What challenges do volunteers providing peer support in online mental health platforms (OMHPs) face in operating and growing their communities? How could the HCI community develop human-AI systems to help? Recent work on online peer counseling has led to the development of novel AI tools for conversational interaction, but it remains unknown how such technology can fit into existing practices. In this research, we conducted interviews and design exercises with seventeen peer counselors from 7 Cups of Tea, a large online therapy and counseling platform, to design tools — AI or not — that resolve challenges that arise from day-to-day community practices. Participant responses suggest three classes of tools that could improve online peer counseling: real-time decision support, productivity, and management and training. Investigation of design motivations surfaced four practice-based challenges including chat interface limitations, difficulties in support seeker management, fragmented contexts of practice, and lack of visibility due to privacy concerns. Based on counselors’ discussion of benefits and risks associated with AI features in the tools they designed, we offer suggestions for research on AI tools that build on peer counseling practices, and connect our findings with broader implications about online peer counseling as a form of volunteer-based mental health practice.
The Typing Cure: Experiences with Large Language Model Chatbots for Mental Health Support
Inhwa Song, Sachin Pendse, Neha Kumar, Munmun De Choudhury
People experiencing severe distress increasingly use Large Language Model (LLM) chatbots as mental health support tools. Discussions on social media have described how engagements were lifesaving for some, but evidence suggests that general-purpose LLM chatbots also have notable risks that could endanger the welfare of users if not designed responsibly. In this study, we investigate the lived experiences of people who have used LLM chatbots for mental health support. We build on interviews with 21 individuals from globally diverse backgrounds to analyze how users create unique support roles for their chatbots, fill in gaps in everyday care, and navigate associated cultural limitations when seeking support from chatbots. We ground our analysis in psychotherapy literature around effective support, and introduce the concept of therapeutic alignment, or aligning AI with therapeutic values for mental health contexts. Our study offers recommendations for how designers can approach the ethical and effective use of LLM chatbots and other AI mental health support tools in mental health care.
Beyond AI: Additional Considerations for Enhancing Healthcare
[HONORABLE MENTION] Bridging Ontologies of Neurological Conditions: Towards Patient-centered Data Practices in Digital Phenotyping Research and Design
Jianna So, Faye Yang, Krzysztof Gajos, Naveena Karusala, Anoopum Gupta
Amidst the increasing datafication of healthcare, deep digital phenotyping is being explored in clinical research to gather comprehensive data that can improve understanding of neurological conditions. However, participants currently do not have access to this data due to researchers’ apprehension around whether such data is interpretable or useful. This study focuses on patient perspectives on the potential of deep digital phenotyping data to benefit people with neurodegenerative diseases, such as ataxias, Parkinson’s disease, and multiple system atrophy. We present an interview study (n=12) to understand how people with these conditions currently track their symptoms and how they envision interacting with their deep digital phenotyping data. We describe how participants envision the utility of this deep digital phenotyping data in relation to multiple stages of disease and stakeholders, especially its potential to bridge different and sometimes conflicting understandings of their condition. Looking towards a future in which patients have increased agency over their data and can use it to inform their care, we contribute implications for shaping patient-driven clinical research practices and deep digital phenotyping tools that serve a multiplicity of patient needs.
Care Work
Jiaying “Lizzy” Liu, Shuer Zhuo, Xingyu Li, Andrew Dillon, Noura Howell, Angela D. R. Smith, Yan Zhang
Enhancing emotional well-being has become an important focus in HCI and CSCW, with technologies increasingly designed to track, visualize, and manage emotions. However, these approaches have faced criticism for potentially suppressing certain emotional experiences. Through a scoping review of 53 empirical studies from ACM proceedings implementing Technology-Mediated Emotion Intervention (TMEI), we critically examine current practices through lenses drawn from HCI critical theories. Our analysis reveals emotion intervention mechanisms that extend beyond traditional “emotion regulation” paradigms, identifying care-centered goals that prioritize non-judgmental emotional support and preserve users’ identities. The findings demonstrate how researchers design technologies to generate artificial care, intervene in power dynamics, and nudge behavioral changes. We contribute the concept of “emotion support” as an alternative approach to “emotion regulation,” emphasizing human-centered approaches to emotional well-being. This work advances the understanding of diverse human emotional needs beyond individual and cognitive perspectives, offering design implications that critically reimagine how technologies can honor emotional complexity, preserve human agency, and transform power dynamics in care contexts.
Caregiving & Caregivers
Kefan Xu, Cynthia Baseman, Nathaniel Swinger, Myeonghan Ryu, Rosa Arriaga
Informal caregivers perform an important role in taking care of family members with chronic disease. Informal caregivers’ mental health can be negatively impacted by life-changing events (e.g., patients’ diagnosis, care transitioning, etc.). This can lead caregivers to suffer from interpersonal and intrapersonal conflicts, causing a sense of disorientation and escalating malaise. In this study, we investigated informal caregivers’ experiences of facing conflicts and life-changing events by qualitatively analyzing data from online health communities. We categorized conflicts using a psychodynamic framework. We further looked at the interplay of life-changing events and conflicts and how this leads to caregivers’ sense-making and decisions to mediate conflicts. We also found that online health communities provide support by helping caregivers interpret and navigate conflicts and by raising awareness of the temporal resolution of life-changing events. We conclude by discussing how online health communities can be designed to better support these practices.
Caring at a Distance
Lan Gao, Munmun De Choudhury, Jennifer Kim
In remote psychotherapy, challenges arising from remote client-therapist interactions can impact the therapeutic alliance and overall outcomes. HCI research has focused on leveraging sensing technology to bridge gaps in remote interactions. In this work, we investigate the values and risks of integrating sensing technology into remote psychotherapy, specifically to capture and interpret non-verbal cues, by conducting a speculative design study with both clients and therapists. Our findings reveal that sensing technology has the potential to facilitate self-reflection in therapy. The sharing of tracked non-verbal cues could also foster mutual disclosure, supporting therapists’ judgments and balancing power dynamics between clients and therapists. However, clients and therapists were concerned about the accuracy of sensing systems, potential privacy threats, and additional cognitive burden. Our insights into system values suggest how sensing technology could balance power dynamics in client-therapist relationships as well as general interpersonal relationships. We also emphasize that sensing-technology-empowered communication warrants greater consideration in remote psychotherapy than in non-vulnerable settings.
Helping the Helper: Supporting Peer Counselors via AI-Empowered Practice and Feedback
Shang-Ling Hsu, Raj Shah, Prathik Senthil, Zahra Ashktorab, Casey Dugan, Werner Geyer, Diyi Yang
Millions of users come to online peer counseling platforms to seek support. However, studies show that online peer support groups are not always as effective as expected largely due to users’ negative experiences with unhelpful counselors. Peer counselors are key to the success of online peer counseling platforms, but most often do not receive appropriate training. Hence, we introduce CARE: an AI-based tool to empower and train peer counselors through practice and feedback. Concretely, CARE helps diagnose which counseling strategies are needed in a given situation and suggests example responses to counselors during their practice sessions. Building upon the Motivational Interviewing framework, CARE utilizes large-scale counseling conversation data with text generation techniques to enable these functionalities. We demonstrate the efficacy of CARE by performing quantitative evaluations and qualitative user studies through simulated chats and semi-structured interviews, finding that CARE especially helps novice counselors in challenging situations. The code is available at https://app.box.com/s/z3a4dwgmeqfy8vbzi9cgmg0yhn6t4j53.
Core Concepts in Privacy Research
Measuring, Modeling, and Helping People Account for Privacy Risks in Online Self-Disclosures with AI
Isadora Krsek, Anubha Kabra, Yao Dou, Tarek Naous, Laura Dabbish, Alan Ritter, Wei Xu, Sauvik Das
In pseudonymous online fora like Reddit, the benefits of self-disclosure are often apparent to users (e.g., I can vent about my in-laws to understanding strangers), but the privacy risks are more abstract (e.g., will my partner be able to tell that this is me?). Prior work has sought to develop natural language processing (NLP) tools that help users identify potentially risky self-disclosures in their text, but none have been designed for or evaluated with the users they hope to protect. Absent this assessment, these tools will be limited by the social-technical gap: users need assistive tools that help them make informed decisions, not paternalistic tools that tell them to avoid self-disclosure altogether. To bridge this gap, we conducted a study with $N=21$ Reddit users; we had them use a state-of-the-art NLP disclosure detection model on two of their own posts, and asked them questions to understand if and how the model helped, where it fell short, and how it could be improved to help them make more informed decisions. Despite its imperfections, users responded positively to the model and highlighted its use as a tool that can help them catch mistakes, inform them of risks they were unaware of, and encourage self-reflection. However, our work also shows how, to be useful and usable, AI for supporting privacy decision-making must account for posting context, disclosure norms, and users’ lived threat models, and provide explanations that help contextualize detected risks.
Data Visualization
Arpit Narechania, Alex Endert, Clio Andris
Choropleth maps are a common and effective way to visualize geographic thematic data. Although cartographers have established many principles about map design, data binning and color usage, less is known about how mapmakers make individual decisions in practice. We interview 16 cartographers and geographic information systems (GIS) experts from 13 government organizations, NGOs, and federal agencies about their choropleth mapmaking decisions and workflows. We categorize our findings and report on how mapmakers follow cartographic guidelines and personal rules of thumb, collaborate with other stakeholders within and outside their organization, and how organizational structures and norms are tied to decision-making during data preparation, data analysis, data binning, map styling, and map post-processing. We find several points of variation as well as regularity across mapmakers and organizations and present takeaways to inform cartographic education and practice, including broader implications and opportunities for CSCW, HCI, and information visualization researchers and practitioners.
Designing for Privacy
Design(ing) Fictions for Collective Civic Reporting of Privacy Harms
Yuxi Wu, William Agnew, W. Keith Edwards, Sauvik Das
Individually-experienced privacy harms are often difficult to demonstrate and quantify, which impedes efforts for their redress. Their effects often appear small and are inconsistently documented, and they only become more obvious when aggregated over time and across populations. Taking a design fiction approach, we explore the design requirements and cultural ideals of a government-run system that empowers people to collectively report on and make sense of experiences of privacy harm from online behavioral advertising. Through the use of fictional inquiry, story completion, and comicboarding methods, delivered in an online survey with 50 participants, we found that participants had detailed conceptions of the user experience of such a tool, but wanted assurance that their labor and personal data would not be exploited further by the government if they contributed evidence of harm. We extrapolate these design insights to government-supported complaint-reporting platforms in other domains, finding multiple common design gaps that might disincentivize people to report experiences of harm, be they privacy-related or otherwise.
Fighting Misinformation, Building Believability
Mohsin Yousufi, Charlotte Alexander, Nassim Parvin
Marginalized groups often face situations in which their knowledge and experiences are dismissed due to prejudice or bias—a phenomenon identified and theorized as epistemic injustice in feminist philosophy. These circumstances frequently compel individuals to produce additional evidence to support their claims, ranging from paper documentation to data generated by technologies such as location logs. This paper examines the case of Heat Seek, an internet-connected temperature sensor designed to provide tenants in New York City with “objective and reliable data” when filing heating complaints and appearing in housing court. We present findings from a qualitative study, supplemented by document review and artifact analysis, to illuminate the tool’s functions and uses. Drawing on this case, we introduce a class of civic technologies—credibility boosters. We find that these technologies aim to overcome credibility deficits by: (1) backing individual and collective claims with objective data, (2) materializing intangible experiences as tangible evidence with aesthetic reliability, and (3) shifting epistemic authority to perceived neutral third parties. We conclude by demonstrating the institutional and social impacts of such technologies and call for greater attention to epistemic injustices within CSCW research, advocating for the design of institutional, legal, and social systems that confront biased systems and empower marginalized communities.
Harassment & Micro-Aggressions
Lara Karki, Kayla Uleah, Carl DiSalvo, Sierra Traynail Ross, Jadin Butler, Selamawit Husein, Emanuel Bryant, Dana Priest, Justin Booker, Betsy DiSalvo
LinkedIn is central to salaried job search and professional networking. In a career development program for adults seeking upward socioeconomic mobility through middle-wage computing work, we aimed to use LinkedIn to find and develop new social ties. However, we could not use the platform for this purpose. Through a participatory research approach, we formed a research team with diverse positionalities to understand why LinkedIn was difficult to use and how it could be better for our program. We analyzed recorded walk-throughs and confirmed our findings with two years of ethnographic field notes and written reflections. Our findings demonstrate that LinkedIn’s embedded algorithms and interface design prioritize users with large networks who can afford a LinkedIn Premium subscription. We argue that such platform-embedded power differentials lead to platform-delivered microaggressions. Non-Premium users and users with small networks must endure microaggressions to participate in the salaried labor market. We argue the politics of LinkedIn as a platform are such that its embedded power differentials are beyond our control and unlikely to change. Therefore, we recommend sociotechnical coping and mitigation strategies for career development programs in lieu of design implications for LinkedIn or similar platforms. We contribute a detailed example of how a technology reinforces pre-existing privilege without users’ knowledge.
Hate Speech
[BEST PAPER] Harm in Layers: Compositions of Misinformative Hate in Anti-Asian Speech and Their Impacts on Perceived Harmfulness
Jiawei Zhou, Gaurav Verma, Lei Zhang, Nicholas Chang, Munmun De Choudhury
During times of crisis, heightened anxiety and fear make individuals more vulnerable, creating fertile ground for hate speech and misinformation, as people are more likely to fall for and be influenced by such content. This paper looks into the interwoven relationship between anti-Asian hatred and COVID-19 misinformation amid the pandemic. By analyzing 785,798 anti-Asian hate tweets and surveying 308 diverse participants, this empirical study explores how hateful content portrays the Asian community, whether it is based on truth, and what makes such portrayal harmful. We observed a high prevalence of misinformative hate speech that appeared to be lengthier, less emotional, and carried more pronounced motivational drives than general hate speech. Overall, we found that anti-Asian rhetoric was characterized by antagonism and inferiority framing, with misinformative hate underscoring antagonism and general hate emphasizing calls for action. Among all entities being explicitly criticized, China and the Chinese were constantly named to assign blame, with misinformative hate more likely to finger-point than general hate. Our survey results indicated that hateful messages with misinformation, demographic targeting, or divisive references were perceived as significantly more damaging. Individuals who placed less importance on free speech, had personal encounters with hate speech, or believed in the natural origin of COVID-19 were more likely to perceive higher severity. Taken together, this work highlights the distinct compositions of hate within misinformative hate speech that influence perceived harmfulness and add to the complexity of defining and moderating harmful content. We discuss the implications for designing more contextualized and culturally sensitive counter-strategies, as well as building more adaptive, explainable moderation approaches.
Humanized AI: Avatars, Agents, and Voice Assistants
Virtual agent-based communication skills training to facilitate health persuasion among peers
Farnaz Nouraei, Keith Rebello, Mina Fallah, Prasanth Murali, Haley Matuszak, Valerie Jap, Andrea Parker, Michael Paasche-Orlow, Timothy Bickmore
Many laypeople are motivated to improve the health behavior of their family or friends but do not know where to start, especially if the health behavior is potentially stigmatizing or controversial. We present an approach that uses virtual agents to coach community-based volunteers in health counseling techniques, such as motivational interviewing, and allows them to practice these skills in role-playing scenarios. We use this approach in a virtual agent-based system to increase COVID-19 vaccination by empowering users to influence their social network. In a between-subjects comparative design study, we test the effects of agent system interactivity and role-playing functionality on counseling outcomes, with participants evaluated by standardized patients and objective judges. We find that all versions are effective at producing peer counselors who score adequately on a standardized measure of counseling competence, and that participants were significantly more satisfied with interactive virtual agents compared to passive viewing of the training material. We discuss design implications for interpersonal skills training systems based on our findings.
Identifying and Mitigating AI Risks
A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health
Jiawei Zhou, Amy Chen, Darshi Shah, Laura Schwab Reese, Munmun De Choudhury
Recent breakthroughs in large language models (LLMs) have generated both interest and concern about their potential adoption as information sources or communication tools across different domains. In public health, where stakes are high and impacts extend across diverse populations, adopting LLMs poses unique challenges that require thorough evaluation. However, structured approaches for assessing potential risks in public health remain under-explored. To address this gap, we conducted focus groups with public health professionals and individuals with lived experience to unpack their concerns, situated across three distinct and critical public health issues that demand high-quality information: infectious disease prevention (vaccines), chronic and well-being care (opioid use disorder), and community health and safety (intimate partner violence). We synthesize participants’ perspectives into a risk taxonomy, distinguishing and contextualizing the potential harms LLMs may introduce when positioned alongside traditional health communication. This taxonomy highlights four dimensions of risk: to individuals, to human-centered care, to the information ecosystem, and to technology accountability. For each dimension, we discuss specific risks and offer example reflection questions to help practitioners adopt a risk-reflexive approach. We discuss the need to revisit pre-existing mental models of help-seeking and complement evaluations with external validity and domain expertise through lived experience and real-world practices. Together, this work contributes a shared vocabulary and reflection tool for people in both computing and public health to collaboratively anticipate, evaluate, and mitigate risks in deciding when to employ LLM capabilities (or not) and how to mitigate harm.
Partisan Discourse Online
Pooja Casula, Richmond Wong
Social media platforms have been widely perceived as centers of political discourse, and have been shown to facilitate political participation among young adults (18-26 years). However, as the effects of online political discourse and behaviors have become pervasive offline, significantly affecting global political processes such as deterring women from public political office and influencing election outcomes, questions arise regarding how young adult users engage in these online political spaces of discourse. In this paper, we focus on the perceptions and forms of engagement of Gen Z social media users, specifically Gen Z young adult women, and broadly ask: how do voting-age Generation (Gen) Z young adult women perceive spaces of political discourse on social media, and do these perceptions affect how they choose to engage in them? To explore this question, we conducted 17 interviews with voting-age Gen Z women across the United States. We found that our participants were largely critical of social media as spaces of political discourse. They were skeptical of the credibility of the political information shared on social media, questioned the usefulness of sharing political information through social media, and felt that social media was not conducive to having productive political discussions. We find that participants’ perceptions of social media political discourse led them to limit their online engagement or disengage entirely from online public political spaces, while expanding their offline private political engagement through in-person discussion. Our findings indicate that our participants were not politically disinterested, but rather did not partake in public forms of social media political engagement, leading us to question and reconsider widespread interpretations of ‘political participation’ that center and emphasize public forms of action and expression. Drawing on our findings, we propose that the practice of ‘disengagement’ from public spaces of online political discourse should be considered a dimension of political engagement rather than separate from it. In proposing this, we also broadly question the efficacy of social media as a forum to promote and facilitate political discourse.
The Role of Partisan Culture in Mental Health Language Online
Sachin Pendse, Ben Rochford, Neha Kumar, Munmun De Choudhury
The impact of culture on how people express distress in online support communities is increasingly a topic of interest within Computer Supported Cooperative Work (CSCW) and Human-Computer Interaction (HCI). In the United States, distinct cultures have emerged from each of the two dominant political parties, forming a primary lens by which people navigate online and offline worlds. We examine whether partisan culture may play a role in how U.S. Republican and Democrat users of online mental health support communities express distress. We present a large-scale observational study of 2,184,356 posts from 8,916 statistically matched Republican, Democrat, and unaffiliated online support community members. We utilize methods from causal inference to statistically match partisan users along covariates that correspond with demographic attributes and platform use, in order to create comparable cohorts for analysis. We then leverage methods from natural language processing to understand how partisan expressions of distress compare between these sets of closely matched opposing partisans, and between closely matched partisans and typical support community members. Our data spans January 2013 to December 2022, a period of both rising political polarization and mental health concerns. We find that partisan culture does play into expressions of distress, underscoring the importance of considering partisan cultural differences in the design of online support community platforms.
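A minimal sketch of the kind of cohort construction described above: fit a propensity model on user covariates and greedily pair each user in one cohort with the nearest-propensity user in the other. The covariates, the logistic propensity model, and 1:1 matching without replacement are illustrative assumptions; the paper’s exact matching procedure may differ.

```python
# Propensity-score nearest-neighbor matching sketch for building comparable cohorts.
import numpy as np
from sklearn.linear_model import LogisticRegression

def match_cohorts(X, group):
    """X: (n, d) covariate matrix (e.g., demographic and platform-use features);
    group: binary array, 1 for one partisan cohort, 0 for the comparison cohort.
    Returns (i, j) index pairs matched on estimated propensity scores."""
    propensity = LogisticRegression(max_iter=1000).fit(X, group).predict_proba(X)[:, 1]
    treated = np.flatnonzero(group == 1)
    control = list(np.flatnonzero(group == 0))
    pairs = []
    for i in treated:
        if not control:
            break
        j = min(control, key=lambda c: abs(propensity[i] - propensity[c]))
        pairs.append((i, j))
        control.remove(j)  # match without replacement
    return pairs
```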
Reflecting on Methodology
Reflexive Data Walks: Cultivating Feminist Ethos through Place-Based Inquiry
Sylvia Janicki, Shubhangi Gupta, Nassim Parvin
Reflexivity, as conceived by feminist epistemologies, is essential to advancing social justice design practice. Reflexivity is thus critical for CSCW and HCI scholars and practitioners who seek to build equitable technological futures, as it allows for a critical examination of explicit and implicit values and politics in design and research processes. In this paper, we put forth a participatory walking method grounded in feminist ethos for cultivating reflexivity by engaging with the theme of boundaries in space. We outline this method through three integrated place-based strategies: an activity in the home, a data walk in the city, and making and sharing visualizations for collaborative understandings of place. We argue that engaging with place is critical to foregrounding positionality and cultivating reflexivity in research. We share our findings from two workshops where we examined the efficacy of this method. We outline how the method deepens understandings of the built environment, self, and others; welcomes vulnerability and fosters openness to change; and scaffolds practices of critical self-questioning. In doing so, it leads to a recognition of the entanglement of socio-political values in design and data creation, revealing uncertainties and ambiguities that can open up new areas for inquiry and design.
Social and Environmental Justice
[HONORABLE MENTION] Sustaining Workers Who Sustain the World: Assets-Based Design for Conservation Technologies in Madagascar
Eric Greenlee, David Klinges, Lalatiana Randriamiharisoa, Kim Valenta, Jhoanny Rasojivola, Justorien Rambeloniaina, Nicolas Naina Rasolonjatovo, Georges Razafindramavo, Joel Ratsirarson, Zovelosoa Raharinavalomanana, Edouard Ramahatratra, Abigail Ross, Thomas Kelly, Jean Claude Rakotoarivelo, Tafitasoa Mijoro, Eric Tsiriniaina Rajoelison, Efitiria Efitiria, Josiah Hester, Ellen Zegura, Alex Cabral
Local workers and their knowledge are essential for sustainable and effective conservation efforts. However, many technology-assisted conservation programs are guided by global benchmarks (e.g., forest cover) and industry metrics (e.g., cost per acre), which often devalue local knowledge and fail to consider the economic and conservation goals of local workers. Assets-based design is well-suited to center workers and their strengths, yet it may fail to fully address the complexities of long-term conservation programs by not explicitly emphasizing workers’ goals or bolstering their assets. We extend recent approaches in assets-based design literature that address these limitations through our case studies of reforestation, biodiversity monitoring, and carbon sequestration programs in three protected areas in Madagascar. We leverage a mixed-methods approach of direct reactive observations, unstructured interviews, and an informal design workshop, revealing emergent themes surrounding economic sustainability and the value of local ecological knowledge in conservation. Finally, we explore examples, tensions, and design considerations for worker-centered conservation technology to: (1) prioritize local knowledge, (2) foster love of nature, (3) center economic goals, and (4) embrace local autonomy. This work advances the dialogue on assets-based design, promoting the co-creation of equitable and sustainable conservation technologies with workers in Global South settings by centering local economic priorities and enhancing workers’ strengths.
Camille Harris, Clio Andris
In 2021, the City of Atlanta and the Atlanta Police Foundation launched joint plans to build a large police training facility in the South River Forest in unincorporated DeKalb County, GA. At this time, residents of Atlanta and DeKalb County, environmental activists, police and prison abolitionists, and other activists and concerned individuals formed the movement in opposition to the facility, known as the Stop Cop City / Defend the Atlanta Forest movement. Social media and digital maps became common tools for communicating information about the facility and the movement. In this work, we examine online maps about the facility and the opposition movement, originating from grassroots organizations, the City of Atlanta, news media outlets, the Atlanta Police Foundation, and individuals. We gather and examine 32 publicly available maps collected through the Google Search API, Twitter (now X), Instagram, and Reddit. Then, using a framework of critical cartography, we conduct a content analysis of these maps to identify the mapping technologies and techniques (data, cartographic elements, styles) used by different stakeholders in the construction of the facility and the roles that maps and mapping technologies can play in social movements. Finally, we examine the extent to which these maps provide data to confirm or dispute concerns raised by grassroots organizations and local residents about the facility. We argue that documenting the use of maps to communicate information about a contentious project can help enumerate positions and perspectives about community issues. We find that the use of (and access to) geo-spatial technologies is uneven across stakeholders and mapmakers, and we advocate for accessible mapmaking tools. We conclude by discussing the implications of the accessibility of mapping technology and of posting maps to social media, and share example map images that extend the geographic information systems (GIS) techniques seen in the retrieved maps.
Supporting Older Adults’ Care
[HONORABLE MENTION] Rethinking Technological Solutions for Community-Based Older Adult Care: Insights from `Older Partners’ in China
Yuling Sun, Sam Ankenbauer, Yuchen Chen, Xiaojuan Ma, Zhifan Guo, Liang He
Aging in place refers to the enabling of individuals to age comfortably and securely within their own homes and communities. Continued community living creates a number of potential areas for design and, accordingly, various information and communication technologies have been employed to support older adult care. At the same time, human-led care services have been designed to support aging in place. Through a long-term ethnographic study that includes semi-structured interviews with 24 stakeholders, we consider these technology- and human-driven care infrastructures for aging in place, examining their origins, deployment, interactions with older adults, and challenges. In doing so, we reconsider the value of these different forms of older adult care, highlighting the various issues associated with using, for instance, health monitoring technology or appointment scheduling systems to care for older adults aging in place. We suggest that technology should take a “supportive, not substitutive” role in older adult care infrastructure and that designing for aging in place should not be synonymous with designing for independence but should, instead, consider the larger community and its dynamics.
Team Work Makes the Dream Work
Nathaniel Swinger, Cynthia Baseman, Myeonghan Ryu, Saeed Abdullah, Christopher Wiese, Andrew Sherrill, Rosa Arriaga
The mental health crisis in the United States spotlights the need for more scalable training for mental health workers. While present-day AI systems have sparked hope for addressing this problem, we must not be too quick to incorporate or solely focus on technological advancements. We must ask empirical questions about how to ethically collaborate with and integrate autonomous AI into the clinical workplace. For these Human-Autonomy Teams (HATs), poised to make the leap into the mental health domain, special consideration around the construct of trust is in order. A reflexive look toward the multidisciplinary nature of such HAT projects illuminates the need for a deeper dive into varied stakeholder considerations of ethics and trust. In this paper, we investigate the impact of domain—and the ranges of expertise within domains—on ethics- and trust-related considerations for HATs in mental health. We outline our engagement of 23 participants in two speculative activities: design fiction and factorial survey vignettes. Grounded by a video storyboard prototype, AI- and Psychotherapy-domain experts and novices alike imagined TEAMMAIT, a prospective AI system for psychotherapy training. From our inductive analysis emerged 10 themes surrounding ethics, trust, and collaboration. Three can be seen as substantial barriers to trust and collaboration, where participants imagined they would not work with an AI teammate that didn’t meet these ethical standards. Another five of the themes can be seen as interrelated, context-dependent, and variable factors of trust that impact collaboration with an AI teammate. The final two themes represent more explicit engagement with the prospective role of an AI teammate in psychotherapy training practices. We conclude by evaluating our findings through the lens of Mayer et al.’s Integrative Model of Organizational Trust to discuss the risks of HATs and adapt models of ability-, benevolence-, and integrity-based trust. These updates motivate implications for the design and integration of HATs in mental health work.
Trauma & Abuse
Making Sense of Trauma Over Time: Interweaving Feminist Temporalities to Understand Histories
Catherine Wieczorek, Cindy Lin, Shaowen Bardzell
Trauma, an emotional response to events with lasting impacts, is a significant public health issue influencing technology interactions. This paper focuses on the sixth principle of trauma-informed care—Cultural, Historical, and Gender Issues—by exploring multiple timescales of trauma and generational impacts through two ethnographic vignettes: a trauma-informed healthcare design project in Chicago and environmental advocacy in Borneo, Indonesia. We integrate feminist temporality to understand temporal contingencies in cultural contexts to inform future trauma-informed design and computing work. Our contributions include detailed ethnographic accounts that shift the focus from trauma as an individual event to a historically and communally felt phenomenon, advancing CSCW scholarship by incorporating historicist sensibilities and feminist theorizations of temporality.
More Research
Doctoral Consortium
The Mechanisms of Muting: Deconstructing the Technology-Mediated Violence of Silence
Jasmine Foriest
This research addresses a critical gap in HCI: while the field engages with “harm,” it inadequately conceptualizes “violence.” One gap lies in how digital artifacts mediate structural violence through muting. Muting — the systemic silencing of marginalized groups — prevents vulnerable populations from accessing potentially life-saving resources and results in preventable morbidity and mortality. Drawing from Muted Group Theory, I demonstrate how technologies imbued with dominant values amplify muting in unprecedented ways through information suppression in suicide reporting, social-computing design that silences gender-based violence survivors, and epistemic inequity perpetuated by generative AI. My dissertation employs survivor-centered mixed methods — surveys, narrative interviews, and phenomenological analysis — to understand how intimate partner violence survivors use digital artifacts in help-seeking. This work will produce the first empirical understanding of relationships between muting experiences and adverse outcomes, alongside design recommendations for remediating muting in help-seeking technologies. My goal is to establish cross-disciplinary approaches to violence prevention through ethical technology design.
Panels / SIGs
PANEL: Computing and the Arts: Establishing Theoretical and Methodological Foundations for Cross-Disciplinary Collaboration
Angela Schöpke-Gonzalez, Kellie Dunn, Shaowen Bardzell, Federico Bomba, Barbara Carreras, Makayla Lewis, Maria Murray
The last five years have resulted in substantial changes to how computing affects work, how work affects computing, and how work and computing operate in tandem to affect society. From advances in automation, artificial intelligence, and virtual/extended reality, to the entrenchment of hybrid and remote work arrangements, and the documented harmful societal impacts that computing work has produced, these changes to computing-work relationships raise concerns and opportunities to reimagine these relationships in new ways. CSCW has an opportunity and a responsibility to ensure that the kinds of futures we imagine and enact benefit workers, communities, and future generations. Artistic research is well-positioned to help us not only understand, but also imagine, new pathways forward in response to pressing CSCW questions. By hosting a panel of experts in artistic methods well-equipped to help us imagine these futures, we expect to lay the groundwork for mutually respectful cross-disciplinary collaboration between arts and computing that makes more space in our field for different kinds of thinking, approaches to problems, and new imaginaries.
SIG: Alternative Technology Consumption Under Capitalism
Yuxi Wu, Beatriz Palacios Abad, Vishal Sharma, Hanlin Li, Alexandra To
Even as large technology companies come under increasing legal and political scrutiny, their market dominance continues to grow. As Big Tech tends toward monopoly, however, people continue to seek out alternative technology systems and uses. What are the conditions that lead people to choose alternatives? What are the long-term values associated with having viable alternatives? This SIG presents alternative technology, or AltTech, as a growing area of interest for the CSCW community to consider. We invite community members with interests in technology non-use, design for disruption, and post-growth design to join us for a sketch-based speculative discussion to better understand the landscape and future of AltTech.
SIG: Conducting Research in Oppressive Settings
Adrian Petterson, Benedetta Lusi, Cristina Bosco, Ashique Ali Thuppilikkat, Anupriya Tuli, Catherine Wieczorek, Robert Soden, Emily Tseng, Priyank Chandra
As justice-related research faces increasing transnational and domestic repression, researchers working on topics like reproductive justice, LGBTQ2SIA+ equity, decolonization, climate justice, and social movements encounter escalating constraints and risks. While the CSCW community has increasingly advocated for research in these domains, the current political climate exacerbates the precarity experienced by scholars engaged in this work. Institutional mechanisms such as ethics approvals frequently fail to address researchers’ safety concerns, particularly for those from marginalized communities themselves. Collaborators within the same project experience varying levels of risk based on location, career stage, and identity. This Special Interest Group (SIG) will facilitate dialogue on practical strategies for conducting research under oppressive contexts, drawing on expertise from researchers who have developed survival and safety tactics. Discussions will address data storage practices, visibility considerations, transnational collaboration strategies, and psychological safety mechanisms. Our goal is to establish a collaboratively curated resource collection supporting researchers as they navigate oppressions in their collaborations, recognizing these threats continue to grow in scale and intensity.
Posters
From Hashtag to Human-Centered Insights: Rethinking Disability Awareness Across Languages
Zainab AlMeraj, Fatemah Husain, Rosa Arriaga
As global discourse on disability expands, much of the digital awareness and inclusion effort remains anchored in English-language narratives. This linguistic dominance limits our understanding of how disability is perceived, discussed, and mobilized across culturally diverse regions — particularly within underrepresented communities in the Global South. This study investigates cross-lingual and cross-cultural perspectives on disability awareness by analyzing three years of public posts from X (formerly Twitter), using the hashtag #peoplewithdisabilities. Through natural language processing (NLP), we examine (1) posting behaviors and engagement dynamics, (2) sentiment and empathy-oriented language, and (3) culturally embedded narrative framings in both Arabic and English content. Our interdisciplinary lens, drawing from computational linguistics and disability studies, allows us to interpret trends beyond surface metrics. Findings reveal that Arabic posts often reflect familial, religious, and collectivist viewpoints rooted in local cultural values, while English posts emphasize rights-based advocacy and individual empowerment. Emotional expression and engagement patterns also diverge, highlighting that awareness itself is not universal but culturally constructed and contextually nuanced. We argue that designing inclusive technologies requires more than linguistic translation; it demands sensitivity to the cultural frameworks shaping disability discourse.
Workshops
Structuring Collaborative Reflection: Integrating Diary Study and Focus Group Discussion
Jixiang Fan, Jiacheng Zhao, Sunggyeol Oh, Michael Bolmer, Yoonje Lee, Nick Flammer, Yuhao Chen, D. Scott McCrickard
We present a structured reflection framework integrating diary study and focus group discussion to support collaborative meaning-making in HCI education. The framework follows a multi-phase design in which students progress from individual journaling to a two-stage group discussion sequence: first within shared application contexts, then across emergent experiential themes. To support this process, we extended DiaryQuest, a lightweight educational tool incorporating AI-assisted grouping, image-based prompts, and a Jigsaw-inspired workflow to scaffold participation. A preliminary classroom deployment with 11 undergraduate students suggests that the approach lowers the barrier to reflective dialogue, encourages cross-perspective engagement, and helps students surface design-relevant insights grounded in lived experience. These findings point to new opportunities for structuring reflection in sociotechnical learning environments.
CSCW Contributions to Critical Futures of Work
Alina Lushnikova, Michael Muller, Shaowen Bardzell, Toby Li, Saiph Savage
As the CSCW community evolves and participates in envisioning the impact of technologies on work practices, we want to ensure that critical and alternative computing perspectives are well represented while we are co-constructing the future of work. In this hybrid workshop, we invite researchers, practitioners, civic actors, economists, and other interested parties to challenge dominant, powerful, status-quo narratives and imaginaries when considering the future of work, while nurturing CSCW commitments and methods. Co-constructing the workshop with participants, we aim to develop actionable insights and strengthen the community.
Exploring Resistance and Other Oppositional Responses to AI
Eric Baumer, Inha Cha, Vera Khovanskaya, Rosemary Steup, Janet Vertesi, Richmond Wong
This workshop will gather researchers and practitioners who study, and/or engage in, opposition to the proliferation of AI technologies. It will do so based on an inclusive conceptualization of what counts as AI, thereby assembling a diverse collection of participants and perspectives. The organizers will especially solicit submissions that respond to a variety of specific themes: resistance in organizational contexts; understandings of community-based collective resistance; research around non-voluntary adoption; considerations around distributions of power in the creation and use of AI; implications for designing technologies to support opposition, and the possibility of resistance indirectly reifying current conceptions of AI. Prospective participants will be invited to submit descriptions of their work either studying or engaging in oppositional practices, as well as a challenge they have faced in doing so. The workshop will involve a series of interactive, hands-on activities to enable participants to share both challenges and strategies. In addition to catalyzing connections among researchers, the workshop will also produce two concrete outputs: a living annotated bibliography of relevant citations across diverse domains, and a practical guide with context-sensitive tactics for challenging the perceived inevitability of AI.

ACM Conference on Computer and Communications Security
Taipei, Taiwan | Oct 13–17, 2025
Applied Cryptography
Distance-Aware OT with Application to Fuzzy PSI
Lucas Piske, Jaspal Singh, Ni Trieu, Vladimir Kolesnikov, Vassilis Zikas
May the Force Not be With You: Brute-Force Resistant Biometric Authentication and Key Reconstruction
Alexandra Boldyreva, Deep Inder Mohan, Tianxin Tang
Toss: Garbled PIR from Table-Only Stacking
Lucien K. L. Ng, Vladimir Kolesnikov
Blockchain and Distributed Systems
Lite-PoT: Practical Powers-of-Tau Setup Ceremony
Lucien K. L. Ng, Pedro Moreno-Sanchez, Mohsen Minaei, Panagiotis Chatzigiannis, Adithya Bhat, Duc Le
Hardware, Side Channels, and Cyber Physical Systems
MOLE: Breaking GPU TEE with GPU-Embedded MCU
Hongyi Lu, Yunjie Deng, Sukarno Mertoguno, Shuai Wang, Fengwei Zhang
One Video to Steal Them All: 3D-Printing IP Theft through Optical Side-Channels
Twisha Chattopadhyay, Fabricio Ceschin, Marco Garza, Dymytriy Zyunkin, Animesh Chhotaray, Aaron Stebner, Saman Zonouz, Raheem Beyah
WireTap: Breaking Server SGX via DRAM Bus Interposition
Alex Seto, Oytun Kuday Duran, Samy Amer, Jalen Chuang, Stephan van Schaik, Daniel Genkin, Christina Garman
Machine Learning and Security
VillainNet: Targeted Poisoning Attacks Against SuperNets Along the Accuracy-Latency Pareto Frontier
David Oygenblik, Abhinav Vemulapalli, Animesh Agrawal, Debopam Sanyal, Alexey Tumanov, Brendan Saltaformaggio
Privacy and Anonymity
Fingerprinting SDKs for Mobile Apps and Where to Find Them: Understanding the Market for Device Fingerprinting
Michael Specter, Abbie Farr, Bo Ma, Robin Lassonde, Mihai Christodorescu
Security Usability and Measurement
A Sea of Cyber Threats: Maritime Cybersecurity from the Perspective of Mariners
Anna Raymaker, Akshaya Kumar, Miuyin Yong Wong, Ryan Pickren, Animesh Chhotaray, Frank Li, Saman Zonouz, Raheem Beyah
The Challenges and Opportunities with Cybersecurity Regulations: A Case Study of the US Electric Power Sector
Sena Sahin, Burak Sahin, Robin Berthier, Kate Davis, Saman Zonouz, Frank Li
Web Security
Enhanced Web Application Security Through Proactive Dead Drop Resolver Remediation
Jonathan Fuller, Mingxuan Yao, Saumya Agarwal, Srimanta Barua, Taleb Hirani, Amit Kumar Sikder, Brendan Saltaformaggio
Head(er)s Up! Detecting Security Header Inconsistencies in Browsers
Jannis Rautenstrauch, Trung Tin Nguyen, Karthik Ramakrishnan, Ben Stock
Lock the Door But Keep the Window Open: Extracting App-Protected Accessibility Information from Browser-Rendered Websites
Haichuan Xu, Runze Zhang, Mingxuan Yao, David Oygenblik, Yizhi Huang, Jeman Park, Brendan Saltaformaggio

ACM Symposium on User Interface Software and Technology
Busan, Korea | Sep 28–Oct 1, 2025
Best Paper
DissolvPCB: Fully Recyclable 3D-Printed Electronics with Liquid Metal Conductors and PVA Substrates
Zeyu Yan, Su Hwan Hong, Josiah Hester, Tingyu Cheng, Huaishu Peng
We introduce DissolvPCB, an electronic prototyping technique for fabricating fully recyclable printed circuit board assemblies (PCBAs) using affordable FDM 3D printing, with polyvinyl alcohol (PVA) as a water-soluble substrate and eutectic gallium-indium (EGaIn) as the conductive material. When obsolete, the PCBA can be easily recycled by immersing it in water: the PVA dissolves, the EGaIn re-forms into a liquid metal bead, and the electronic components are recovered. These materials can then be reused to fabricate a new PCBA. We present the DissolvPCB workflow, characterize its design parameters, evaluate the performance of circuits produced with it, and quantify its environmental impact through a lifecycle assessment (LCA) comparing it to conventional CNC-milled FR-4 boards. We further develop a software plugin that automatically converts PCB design files into 3D-printable circuit substrate models. To demonstrate the capabilities of DissolvPCB, we fabricate and recycle three functional prototypes: a Bluetooth speaker featuring a double-sided PCB, a finger fidget toy with a 3D circuit topology, and a shape-changing gripper enabled by Joule-heat-driven 4D printing. The paper concludes with a discussion of current technical limitations and opportunities for future directions.
Papers
Yue Lyu, Xizi Wang, Hanlu Ma, Yalong Yang, Jian Zhao
Effective communication between pilots and air traffic control (ATC) is essential for aviation safety, but verbal exchanges over radios are prone to miscommunication, especially under high workload conditions. While cockpit-embedded visual aids offer the potential to enhance ATC communication, little is known about how to design and integrate such aids. We present an exploratory, user-centered investigation into the design and integration of icon-based visual aids, named ATCion, to support in-cockpit ATC communication, through four phases involving 22 pilots and 1 ATC controller. This study contributes a validated set of design principles and visual icon components for ATC messages. In a comparative study of ATCion, text-based visual aids, and no visual aids, we found that our design improved readback accuracy and reduced memory workload, without negatively impacting flight operations; most participants preferred ATCion over text-based aids, citing their clarity, low cognitive cost, and fast interpretability. Further, we point to implications and opportunities for integrating icon-based aids into future multimodal ATC communication systems to improve both safety and efficiency.
BIOGEM: A Fully Biodegradable Gelatin-Based McKibben Actuator with Embedded Sensing
Gaolin Ge, Haoran Lu, Yingting Gao, Qifeng Yang, Josiah Hester, Tingyu Cheng, Yiyue Luo
We present BIOGEM, a fully biodegradable McKibben actuator with integrated sensing, made from gelatin-based composites. By tailoring the material compositions, we customize the mechanical and electrical properties of the biodegradable composites, creating an integrated biodegradable system that combines both actuation and sensing functionalities. BIOGEM integrates a McKibben actuating structure by using stiff gelatin as the outer braiding and stretchable gelatin as the air chambers. It also integrates resistive strain sensing through ionic gelatin, allowing the actuator to monitor its own deformation without relying on conventional electronics. We characterize the actuator’s performance across key parameters including braid angle, wall thickness, and material stiffness, demonstrating reliable contraction and repeatable force output at low pressures. Biodegradation is validated through both enzyme-assisted and backyard soil studies, confirming the material’s sustainable end-of-life behavior under realistic conditions. We illustrate the potential of this platform through interactive, edible, and environmentally-degradable prototypes across human–computer interaction and soft robotics scenarios.
CoSight: Exploring Viewer Contributions to Online Video Accessibility Through Descriptive Commenting
Ruolin Wang, Xingyu Bruce Liu, Biao Wang, Wayne Zhang, Ziqian Liao, Ziwen Li, Amy Pavel, Xiang Chen
The rapid growth of online video content has outpaced efforts to make visual information accessible to blind and low vision (BLV) audiences. While professional Audio Description (AD) remains the gold standard, it is costly and difficult to scale across the vast volume of online media. In this work, we explore a complementary approach to broaden participation in video accessibility: engaging everyday video viewers at their watching and commenting time. We introduce CoSight, a Chrome extension that augments YouTube with lightweight, in-situ nudges to support descriptive commenting. Drawing from Fogg’s Behavior Model, CoSight provides visual indicators of accessibility gaps, pop-up hints for what to describe, reminders to clarify vague comments, and related captions and comments as references. In an exploratory study with 48 sighted users, CoSight helped integrate accessibility contribution into natural viewing and commenting practices, resulting in 89% of comments including grounded visual descriptions. Follow-up interviews with four BLV viewers and four professional AD writers suggest that while such comments do not match the rigor of professional AD, they can offer complementary value by conveying visual context and emotional nuance for understanding the videos.
DropPop: Designing Drop-to-Deploy Mechanisms with Bistable Scissors Structures
Yibo Fu, Emily Guan, Jianzhe Gu, Dinesh K Patel, Justin U Soza Soto, Yichi Luo, Carmel Majidi, Josiah Hester, Lining Yao
Deployable structures often rely on complex deployment mechanisms such as external pneumatic pumps, electric motors, or manual assembly. These conventional methods, which are intended for applications in shape morphing architectures, robotics, and product design, can be bulky and unwieldy for everyday interaction and daily use. We introduce a new class of deployable structures that harness the locomotion of a single bistable cap to drive the expansion of a scissor-like mechanism. Such structures can be rapidly deployed (0.2-0.7s) upon a small trigger, and stabilize themselves, requiring no sustained energy input. We explore various input modalities for deployment, such as hand dropping and drone deployment, and showcase demo applications. Additionally, we provide a computational design tool for customizing shape primitives with physics simulation and offer design guidelines for fabrication.
ForcePinch: Force-Responsive Spatial Interaction for Tracking Speed Control in XR
Chenyang Zhang, Tiffany S Ma, John Andrews, Eric J Gonzalez, Mar Gonzalez-Franco, Yalong Yang
Spatial interaction in 3D environments requires balancing efficiency and precision, which calls for dynamic tracking speed adjustments. However, existing techniques often couple tracking speed adjustments directly with hand movements, reducing interaction flexibility. Inspired by the natural friction control inherent in the physical world, we introduce ForcePinch, a novel force-responsive spatial interaction method that enables users to intuitively modulate pointer tracking speed and smoothly transition between rapid and precise movements by varying their pinching force. To implement this concept, we developed a hardware prototype integrating a pressure sensor with a customizable mapping function that translates pinching force into tracking speed adjustments. We conducted a user study with 20 participants performing well-established 1D, 2D, and 3D object manipulation tasks, comparing ForcePinch against the distance-responsive technique Go-Go and speed-responsive technique PRISM. Results highlight distinctive characteristics of the force-responsive approach across different interaction contexts. Drawing on these findings, we highlight the contextual meaning and versatility of force-responsive interactions through four illustrative examples, aiming to inform and inspire future spatial interaction design.
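The core of this approach is the mapping from pinching force to pointer tracking speed. As a rough illustration of how such a mapping could be wired up (the function shape, gain range, and exponent below are illustrative assumptions, not the authors’ prototype), a monotonically decreasing force-to-gain curve might look like this:

```python
# Illustrative sketch (not the authors' implementation): map a normalized
# pinch-force reading to a control-display gain, so that light pinches give
# fast, coarse tracking and hard pinches give slow, precise tracking.

def force_to_gain(force: float,
                  min_gain: float = 0.1,    # assumed gain for precise mode
                  max_gain: float = 2.0,    # assumed gain for rapid mode
                  exponent: float = 2.0) -> float:
    """Monotonically decreasing mapping from pinch force in [0, 1] to gain."""
    f = min(max(force, 0.0), 1.0)           # clamp the sensor reading
    # Higher force -> lower gain (slower, more precise pointer motion).
    return min_gain + (max_gain - min_gain) * (1.0 - f) ** exponent

def update_pointer(pointer_pos, hand_delta, force):
    """Scale the raw hand displacement by the force-dependent gain."""
    gain = force_to_gain(force)
    return [p + gain * d for p, d in zip(pointer_pos, hand_delta)]

if __name__ == "__main__":
    pos = [0.0, 0.0, 0.0]
    pos = update_pointer(pos, hand_delta=[0.02, 0.0, 0.01], force=0.8)
    print(pos)  # hard pinch -> small, precise displacement
```

Harder pinches push the gain toward the precise end of the range, mirroring the friction metaphor described in the abstract.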
Adam J Coscia, Shunan Guo, Eunyee Koh, Alex Endert
As multi-turn dialogues with large language models (LLMs) grow longer and more complex, how can users better evaluate and review progress on their conversational goals? We present OnGoal, an LLM chat interface that helps users better manage goal progress. OnGoal provides real-time feedback on goal alignment through LLM-assisted evaluation, explanations for evaluation results with examples, and overviews of goal progression over time, enabling users to navigate complex dialogues more effectively. Through a study with 20 participants on a writing task, we evaluate OnGoal against a baseline chat interface without goal tracking. Using OnGoal, participants spent less time and effort to achieve their goals while exploring new prompting strategies to overcome miscommunication, suggesting tracking and visualizing goals can enhance engagement and resilience in LLM dialogues. Our findings inspired design implications for future LLM chat interfaces that improve goal communication, reduce cognitive load, enhance interactivity, and enable feedback to improve LLM performance.
Posters
MILO: An LLM Multi-Stage Conversational Agent for Fostering Teenagers’ Mental Resilience
Han Bao, Yongan Yu, Bohan Wang, Xiaowen Lu, Xin Tong
Adolescence is a significant period that shapes long-term development and well-being. Mental disorders contribute to 15% of the global disease burden among teenagers, according to the WHO. Adverse well-being during adolescence can not only compromise physical health but also lead to a wide range of negative social outcomes throughout life. Motivated by the potential of generative AI conversational agents to provide scalable and personalized support to cultivate mental resilience, we designed Milo, an LLM digital companion grounded in cognitive behavioral therapy (CBT), tailored specifically for teenagers. Milo promotes greater involvement of teenagers in the development of emotional awareness and resilience strategies through agent customization and offering an interactive interface.
Noetic Dream: A Personalized VR and Meditation System for Lucid Dream Training
Yichen Yu, Qiaoran Wang
Lucid dreaming relies on a high level of metacognition and requires significant time and effort to master induction techniques, presenting obstacles for those seeking such experiences. This study proposes Noetic Dream, a personalized lucid dreaming training system that combines virtual reality (VR) with open-monitoring (OM) meditation, acting on the mechanism of “dream awareness” through both external and internal pathways. VR provides immersive dream-based games to help users practice identifying unrealistic states, while OM meditation stabilizes internal focus and implants lucid intent. The training cycle uses multimodal cues to help users establish dream recognition mechanisms, thereby increasing the likelihood of lucid dreaming. The contributions of this study include: applying generative language models (LLMs) to construct dream VR scenarios, designing dream anomaly detection game mechanisms to stimulate dream awareness, and integrating OM meditation to achieve a non-invasive lucid dreaming training pathway, thereby effectively increasing the probability of spontaneous lucid dreaming.

Telecommunications Policy Research Conference
Washington, D.C. | Sept. 18–20, 2025
Data Governance
This paper focuses on cross-border data flow regulations regarding Connected Vehicles (CVs) in the People’s Republic of China (PRC), the European Union (EU), and the United States of America (USA). The paper reviews the engineering-cybersecurity literature regarding CVs and derives from this a classification of data types generated by the CV ecosystem. It then analyzes the legal and policy texts regarding CVs from the three jurisdictions. By mapping the data types to each jurisdiction’s restrictions and regulations, the paper unpacks how they conceptualize the risks or threats from CV data and how they operationalize these concerns into CV data regulation. The paper’s objective is to provide a detailed examination of the similarities and differences among the three jurisdictions. We discover that governments’ attempts to regulate data flows push them into classification systems for information, and that governments attach different values or policy interests to these categories.
Platforms and Competition
Interconnection and Rivalry in Global Monetary Networks
Karim Farhat, Milton L. Mueller, Vagisha Srivastava
In this white paper, we apply concepts of network competition to analyze the contest for dominance between the US dollar, a BRICS alliance against the dollar, and a politically neutral money like Bitcoin.
Global money networks have network externalities; a currency becomes more valuable as more users in more countries accept it and use it. Users thus tend to converge on a single, dominant network for payments that maximizes their demand-side economies of scope. Drawing on empirical evidence from telecommunications competition and network externality theory, we show that when three systems with network externalities compete, an interconnection agreement between the dominant system and one of the two competitors can isolate and exclude the third system. We analyze the governance of dollar stablecoins as the monetary equivalent of an interconnection agreement between the fiat dollar and Bitcoin. We argue that the fiat dollar can strengthen its global dominance by fostering a stronger interconnection with Bitcoin via dollar stablecoins.
Dollar stablecoins are the optimal conversion asset between a liquid medium of exchange like the dollar and a less liquid store of value like Bitcoin. With a formal interconnection between dollar stablecoins and Bitcoin, demand-side economies of scope are shared, and strong complementarities become evident. Stablecoins serve as a medium of remittance and short-term savings while Bitcoin serves as a longer-term store of value or speculative asset, as with gold. At the same time, an interconnection agreement acts as an implicit check, imposing fiscal discipline on US dollar governance. If the dollar weakens excessively, a positive feedback loop ensues in Bitcoin where the more users diversify to Bitcoin the more its price appreciates and the more users drive value away from the fiat dollar, and so on.
As such, we argue policymakers should proactively foster an interconnection between dollar stablecoins and Bitcoin to strengthen the US dollar’s global dominance and forestall long-term threats to its hegemony. The interconnection agreement should center around:
• Designing a federal regulatory framework for stablecoins centered on open capital markets — without picking favorites.
• Incentivizing stablecoin operators to reduce short-term bonds in favor of longer-term securities and harder assets, enhancing stability and market confidence.
• Encouraging emerging markets and BRICS nations to freely access dollar stablecoins and Bitcoin as reliable stores of value depending on their needs; and
• Eliminating capital gains and tax reporting requirements for long-term Bitcoin saving and long-term Bitcoin to dollar stablecoin conversions to retain capital in the United States and simultaneously encourage more dollar exports for the foreseeable future.
By pursuing these policies, the dollar’s network advantage can be reinforced, ensuring it remains the dominant currency in an increasingly contested global monetary landscape.
Routing Security Adoption
The Role of RIRs in RPKI Adoption
Josephine Wolff, Cecilia Testart
Recognizing the relevance of securing inter-domain routing to protect traffic flows in the Internet, the Internet Engineering Task Force (IETF) standardized the Resource Public Key Infrastructure (RPKI), a framework that provides networks with a system to cryptographically validate routing data. Despite many obstacles, RPKI has emerged as the consensus approach to improving routing security, and currently about 50% of routed IP address blocks are part of the system. The Regional Internet Registries (RIRs) are in charge of allocating address space in five different geographical zones and play a crucial role in RPKI: they are the roots of trust of the cryptographic system and provide the infrastructure to host RPKI certificates and keys for the Internet resources allocated in their region. Organizations and networks wanting to issue RPKI records for their address space need to follow the process of the RIR that delegated their address space. In this paper, we analyze the RIRs’ implementation of RPKI infrastructure from the perspective of network operators. Based on in-depth interviews with 13 network engineers who have been involved in their organizations’ efforts to adopt RPKI, we examine the RIR initiatives that have or would have most supported RPKI adoption for different types of organizations. Given that RIRs have independently developed and implemented the cryptographic infrastructure as well as the tooling to issue and manage certificates, we offer recommendations on strategies that have encouraged RPKI adoption.
Satellite and Space Networks
Are LEO Networks the Future of National Emergency Failover? – A Quantitative Study and Policy Blueprint
Vaibhav Bhosale, Zachary Bischof, Fabián E. Bustamante, Ying Zhang, Sameer Kapoor, Robin Kim, Miguel Schlicht, Muskaan Gupta, Ekaterina Tumanova, Alberto Dainotti, Ahmed Saeed
Low Earth Orbit (LEO) satellite networks are emerging as backups for national-scale outages. While they have demonstrated value in small-scale disasters such as supporting first responders during hurricanes, their effectiveness during large-scale infrastructure failures remains underexplored. This paper evaluates the capacity of LEO networks to act as national failover infrastructure using six real-world submarine cable failures. The failure capacity provided by a LEO network to a specific nation depends on a few key factors: the size of the country, the distribution of the user terminals, and the policies of the network operator for spectrum allocation and traffic engineering. We find that coordinated policies between governments and network operators, especially regarding terminal placement and spectrum use, can improve failover capacity by up to 1.8× without requiring additional infrastructure. However, even under optimistic conditions with 200,000 terminals and a dedicated failover network, LEO networks can only restore 0.9–14.7% of lost submarine cable capacity in most cases.
User-Generated Content
The Impact of Premium Licenses on Creator Behavior
Jae Sang Rhee
The creator economy relies on third-party free-sharing platforms, which enable creators to reach wide audiences and enhance monetization opportunities. However, creators often remain uncompensated, their visibility declines due to content oversaturation, and the unauthorized use of their work poses significant risks as training datasets vital to artificial intelligence (AI) frequently draw from freely accessible creator content. These issues directly harm both creators and platforms. Some platforms introduced a premium license, offering subscription-based exclusive content, upfront creator payments, and enhanced copyright protection. This paper investigates the impact of premium licensing on creator behavior by leveraging a unique natural experiment. Using data from Unsplash and Pexels, we find that introducing premium licenses on free-sharing platforms reduces the volume of freely available content by 13.2%. Notably, this decline is observed even among creators who were not admitted to the premium license program. We further identify two mechanisms driving this decline. First, reduced multi-homing occurs as existing creators deactivate accounts and move away from the platform offering the premium license. Second, creators improve free content quality to stay competitive with premium offerings. Our findings highlight crucial trade-offs associated with premium licensing, demonstrating significant unintended consequences for content volume and quality. These issues directly impact both creators and platforms, underscoring the importance of strategic policy design in platform monetization.

Research Activities
A Common’s Approach to Cybersecurity Policy (Tutorial)
Vaibhav Garg (Comcast), Holly Peterson (Louisiana State University), and Milton Mueller (Georgia Tech)
There are two dominant paradigms in tech policy. The first assumes technology outcomes to be public goods and grounds policy interventions in regulatory responses. The second asserts these outcomes to be private goods and targets policy solutions that address market incentives and their associated dynamics. Yet the interconnected nature of telecommunications technologies, such as the Internet, and the correlated nature of associated risks, such as cybersecurity, mean that there is a third option. This third way assumes technology outcomes to be common pool resources. Untrammeled extraction of these resources may lead to a Tragedy of the Commons. Numerous institutions across distinct domains have avoided this tragedy by investing in community-based governance. Research documenting the commonalities among such institutions led to Elinor Ostrom’s Nobel Prize-winning work (2009) and her Institutional Analysis and Development (IAD) framework.
Despite IAD’s successful application in many risk domains, its formal application to telecommunications policy, especially in cybersecurity, has been underexplored. Yet telecommunications policy stakeholders – especially those working on emerging technologies – often leverage community-based interventions. Applying IAD to such interventions may provide significant insights, making community-based governance both more effective and more efficient. Furthermore, formal application of IAD to telecommunications policy may open opportunities for new policy solutions in cybersecurity. The goal of this tutorial is to introduce TPRC attendees to the IAD framework and teach its application to cybersecurity.
The Regulatory Challenge of Artificial Intelligence (Panel)
The character of generative AI technologies presents unique challenges to traditional regulatory paradigms. The panel participants have been conducting research in this field and will report briefly on their recent findings to provoke discussion among the panel members and audience.
Topics include: the intersection of intellectual property rights with AI; the framing of AI ethics in terms of its social, economic, and political contexts; the regulatory ramifications of the potential existential risk of AI systems; current regulatory models in the U.S. and Europe; and a view of AI as distributed computing.
Panelists:
Russ Neuman, New York University
Christopher Yoo, University of Pennsylvania
Christos Makridis, Stanford University
Chloé Bakalar, Meta
Milton L. Mueller, Georgia Institute of Technology

USENIX Security Symposium
Seattle | August 13 – 15, 2025
Hardware Security 1: Microarchitectures
FLOP: Breaking the Apple M3 CPU via False Load Output Predictions
Jason Kim, Jalen Chuang, Daniel Genkin, Yuval Yarom
To bridge the ever-increasing gap between the fast execution speed of modern processors and the long latency of memory accesses, CPU vendors continue to introduce newer and more advanced optimizations. While these optimizations improve performance, research has repeatedly demonstrated that they may also have an adverse impact on security. In this work, we identify that recent Apple M- and A-series processors implement a load value predictor (LVP), an optimization that predicts the contents of memory that the processor loads before the contents are actually available. This allows processors to alleviate slowdowns from Read-After-Write dependencies, as instructions can now be executed in parallel rather than sequentially. To evaluate the security impact of Apple’s LVP implementation, we first investigate the implementation, identifying the conditions for prediction. We then show that although the LVP cannot directly predict 64-bit values (e.g., pointers), prediction of smaller-size values can be leveraged to achieve arbitrary memory access. Finally, we demonstrate end-to-end attack exploit chains that build on the LVP to obtain a 64-bit read primitive within the Safari and Chrome browsers.
Hardware Security 3: Side-Channel and Fault Injection Attacks
ECC.fail: Mounting Rowhammer Attacks on DDR4 Servers with ECC Memory
Nureddin Kamadan, Walter Wang, Stephan van Schaik, Christina Garman, Daniel Genkin, Yuval Yarom
Rowhammer is a hardware vulnerability present in nearly all computer memory, allowing attackers to modify bits in memory without directly accessing them. While Rowhammer has been extensively studied on client and even mobile platforms, no successful Rowhammer attack has been demonstrated on server platforms using DDR4 ECC memory. Tackling this challenge, in this paper we demonstrate the first end-to-end Rowhammer technique effective against Intel servers using Hynix DDR4 ECC memory. To that aim, we first characterize the Hynix implementation of Target Row Refresh (TRR) on server parts, demonstrating effective hammering patterns on both FPGA and Intel-based testing platforms with ECC disabled. We then reverse engineer Intel’s ECC implementation on Skylake and Cascade Lake servers. We find that it has a coding distance of four, which often allows triggering incorrect ECC correction with just two bit flips. Combining the two observations, we present an end-to-end Rowhammer attack which can flip bits on Intel servers, without causing crashes. Finally, we demonstrate the effectiveness of our attack by hammering RSA public keys loaded into memory, causing the server to accept messages not signed by the original key.
Privacy 1: Differential Privacy and Audit
General-Purpose f-DP Estimation and Auditing in a Black-Box Setting
Önder Askin, Holger Dette, Martin Dunsche, Tim Kutta, Yun Lu, Yu Wei, Vassilis Zikas
In this paper we propose new methods to statistically assess f-Differential Privacy (f-DP), a recent refinement of differential privacy (DP) that remedies certain weaknesses of standard DP (including tightness under algorithmic composition). A challenge when deploying differentially private mechanisms is that DP is hard to validate, especially in the black-box setting. This has led to numerous empirical methods for auditing standard DP, while f-DP remains less explored. We introduce new black-box methods for f-DP that, unlike existing approaches for this privacy notion, do not require prior knowledge of the investigated algorithm. Our procedure yields a complete estimate of the f-DP trade-off curve, with theoretical guarantees of convergence. Additionally, we propose an efficient auditing method that empirically detects f-DP violations with statistical certainty, merging techniques from non-parametric estimation and optimal classification theory. Through experiments on a range of DP mechanisms, we demonstrate the effectiveness of our estimation and auditing procedures.
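For readers unfamiliar with the privacy notion being estimated, the standard definition of f-DP from the published literature is given in terms of trade-off functions; the sketch below is background, not a result of this paper:

```latex
% Trade-off function between distributions P and Q: the smallest type II error
% achievable by any rejection rule \phi at type I error level \alpha.
T(P, Q)(\alpha) \;=\; \inf_{\phi}\bigl\{\, 1 - \mathbb{E}_{Q}[\phi] \;:\; \mathbb{E}_{P}[\phi] \le \alpha \,\bigr\}

% A mechanism M satisfies f-DP if, for all neighboring datasets D and D',
T\bigl(M(D),\, M(D')\bigr)(\alpha) \;\ge\; f(\alpha) \qquad \text{for all } \alpha \in [0, 1].
```

The black-box estimator described in the abstract targets this trade-off curve directly, rather than a single (epsilon, delta) pair.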
Privacy 2: Consent, Compliance, and Provable Privacy
Evaluating Privacy Policies under Modern Privacy Laws At Scale: An LLM-Based Automated Approach
Qinge Xie, Karthik Ramakrishnan, Frank Li
Website privacy policies detail an online service’s information practices, including how they handle user data and rights. For many sites, these disclosures are now necessitated by a growing set of privacy regulations, such as GDPR and multiple US state laws, offering visibility into privacy practices that are often not publicly observable. Motivated by this visibility, prior work has explored techniques for automated analysis of privacy policies and characterized specific aspects of real-world policies on a larger scale. However, existing approaches are constrained in the privacy practices they evaluate, as they rely upon rule-based methods or supervised classifiers, and many predate the prominent privacy laws now enacted that drastically shape privacy disclosures. Thus, we lack a comprehensive understanding of modern website privacy practices disclosed through privacy policies. In this work, we seek to close this gap by providing a systematic and comprehensive evaluation of website privacy policies at scale. We first systematize the privacy practices discussed by 10 notable privacy regulations currently in effect in the European Union and the US, identifying 34 distinct clauses on privacy practices across 4 overarching themes. We then develop and evaluate an LLM-based approach for assessing these clauses in privacy policies, providing a more accurate, comprehensive, and flexible analysis compared to prior techniques. Finally, we collect privacy policies from over 100K websites, and apply our LLM method to a subset of sites to investigate in-depth the privacy practices of websites today. Ultimately, our work supports broader investigations into web privacy practices moving forward.
Software Security 3: Fuzzing
Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving
Yupeng Yang, Shenglong Yao, Jizhou Chen, Wenke Lee
Language processors, such as compilers and interpreters, play a crucial role in modern cyberspace. Faulty language processors can lead to severe consequences such as incorrect functionalities or malicious attacks. It is non-trivial to automatically test language processors to detect faulty behaviors, because language processors are multistaged and require various complex constraints to reach deep program states. Existing testing (fuzzing) approaches either fail to effectively generate inputs that satisfy the complex constraints or fail to generalize due to their heavy reliance on target-specific constraint modeling heuristics. In this paper, we explore the potential of using LLMs for constraint solving to address these limitations and identify two challenges regarding constraint prioritization and context construction. To effectively address these challenges, we propose two novel solutions, hybrid centrality prioritization and iterative context construction. We implement the solutions in a hybrid fuzzing framework, HLPFuzz, which leverages an LLM to overcome complex constraints and reach deep program states. In our evaluation, HLPFuzz successfully discovers 52 bugs in 9 popular language processors, of which 37 are confirmed and 14 are fixed. HLPFuzz also outperforms state-of-the-art solutions by up to 190% in code coverage and discovers 5x more bugs than the second-best fuzzer, with minimal reliance on target-specific heuristics.
Waltzz: WebAssembly Runtime Fuzzing with Stack-Invariant Transformation
Lingming Zhang, Binbin Zhao, Jiacheng Xu, Peiyu Liu, Qinge Xie, Yuan Tian, Jianhai Chen, Shouling Ji
WebAssembly (Wasm) is a binary instruction format proposed by major browser vendors to achieve near-native performance on the web and other platforms. By design, Wasm modules should be executed in a memory-safe runtime, which acts as a trusted computing base. Therefore, security vulnerabilities inside runtime implementation can have severe impacts and should be identified and mitigated promptly. Fuzzing is a practical and widely adopted technique for uncovering bugs in real-world programs. However, to apply fuzzing effectively to the domain of Wasm runtimes, it is vital to address two primary challenges: (1) Wasm is a stack-based language and runtimes should verify the correctness of stack semantics, which requires fuzzers to meticulously maintain desired stack semantics to reach deeper states. (2) Wasm acts as a compilation target and includes hundreds of instructions, making it hard for fuzzers to explore different combinations of instructions and cover the input space effectively. To address these challenges, we design and implement Waltzz, a practical greybox fuzzing framework tailored for Wasm runtimes. Specifically, Waltzz proposes the concept of stack-invariant code transformation to preserve appropriate stack semantics during fuzzing. Next, Waltzz introduces a versatile suite of mutators designed to systematically traverse diverse combinations of instructions in terms of both control and data flow. Moreover, Waltzz designs a skeleton-based generation algorithm to produce code snippets that are rarely seen in the seed corpus. To demonstrate the efficacy of Waltzz, we evaluate it on seven well-known Wasm runtimes. Compared to the state-of-the-art works, Waltzz can surpass the nearest competitor by finding 12.4% more code coverage even within the large code bases and uncovering 1.38x more unique bugs. Overall, Waltzz has discovered 20 new bugs which have all been confirmed and 17 CVE IDs have been assigned.

ACM Conference on International Computing Education Research
Charlottesville | August 3 – 6, 2025
Doctoral Consortium
Ethical Computing Education in the Age of Generative AI
Grace Barkhuff
Educating computing students in ethical practices is vitally important. This education is complicated by the rapid rise of generative AI (GenAI) and its use in higher education by students and instructors alike. My research aims to understand computing educators’ perceptions on ethically educating computing students, both about and with GenAI.
Lightning Talks and Posters
Benchmarking of Generative AI Tools in Software Engineering Education: Formative Insights for Curriculum Integration
Nimisha Roy, Oleksandr Horielko, Fisayo Omojokun
Exploring Community Perceptions and Experiences Towards Academic Dishonesty in Computing Education
Chandler C. Payne, Kai A. Hackney, Lucas Guarenti Zangari, Emmanuel Munoz, Sterling R. Kalogeras, Juan Sebastián Sánchez-Gómez, Fisayo Omojokun, Pedro Guillermo Feijóo-García
Should I Submit or Should I Not? Exploring the Effects of Mandatory vs. Voluntary Tasks on Student Engagement in Computing Education
Lucas Guarenti Zangari, Emilio Aponte-Archila, Pedro Guillermo Feijóo-García
What Computing Faculty Want: Designing AI Tools for High-Enrollment Courses Beyond CS1
Rodrigo Borela, Meryem Yilmaz Soylu, Jeonghyun Lee, Nimisha Roy

International Conference on Machine Learning
Vancouver | July 13 – 19, 2025
Algorithms
Learning to Stop: Deep Learning for Mean Field Optimal Stopping
Lorenzo Magnino, Yuchen Zhu, Mathieu Lauriere
Optimal stopping is a fundamental problem in optimization with applications in risk management, finance, robotics, and machine learning. We extend the standard framework to a multi-agent setting, named multi-agent optimal stopping (MAOS), where agents cooperate to make optimal stopping decisions in a finite-space, discrete-time environment. Since solving MAOS becomes computationally prohibitive as the number of agents grows large, we study the mean-field optimal stopping (MFOS) problem, obtained as the number of agents tends to infinity. We establish that MFOS provides a good approximation to MAOS and prove a dynamic programming principle (DPP) based on mean-field control theory. We then propose two deep learning approaches: one that learns optimal stopping decisions by simulating full trajectories and another that leverages the DPP to compute the value function and to learn the optimal stopping rule using backward induction. Both methods train neural networks to approximate optimal stopping policies. We demonstrate the effectiveness and the scalability of our approach through numerical experiments on 6 different problems in spatial dimension up to 300. To the best of our knowledge, this is the first work to formalize and computationally solve MFOS in discrete time and finite space, opening new directions for scalable MAOS methods.
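For orientation, the dynamic programming principle for a classical finite-horizon, single-agent optimal stopping problem takes the backward-induction form below; the mean-field version studied in the paper additionally conditions on the population distribution, which is omitted in this sketch:

```latex
% Finite-horizon optimal stopping for a single agent with state process X_t,
% stopping reward g, and terminal time T:
V_T(x) = g(x), \qquad
V_t(x) = \max\Bigl\{\, g(x),\; \mathbb{E}\bigl[\,V_{t+1}(X_{t+1}) \mid X_t = x\,\bigr] \Bigr\},
\quad t = T-1, \dots, 0,
% and it is optimal to stop at time t whenever
g(x) \;\ge\; \mathbb{E}\bigl[\,V_{t+1}(X_{t+1}) \mid X_t = x\,\bigr].
```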
Mustafa Burak Gurbuz, Xingyu Zheng, Constantine Dovrolis
As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets. The code is available at https://github.com/BurakGurbuz97/PEAKS.
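A minimal sketch of the kind of scoring rule the abstract describes, combining a sample's prediction error with its kernel similarity to already-selected examples, is shown below. The kernel choice, the multiplicative combination, and all constants are assumptions made for illustration; the released implementation is in the linked repository.

```python
# Illustrative sketch of a prediction-error x kernel-similarity selection
# score for a streaming sample; not the released PEAKS implementation.
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Similarity between two feature vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def selection_score(feat, pred_probs, label, selected_feats, gamma=1.0):
    """
    Combine the sample's prediction error with its kernel similarity to the
    examples already kept, so hard *and* representative samples score highly.
    """
    # Prediction error: 1 - probability assigned to the true class.
    error = 1.0 - pred_probs[label]
    if not selected_feats:
        return error
    # Anchor the error by similarity to what the buffer already contains.
    sims = [rbf_kernel(feat, s, gamma) for s in selected_feats]
    return error * float(np.mean(sims))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    buffer = [rng.normal(size=8) for _ in range(5)]   # features kept so far
    x = rng.normal(size=8)                            # new streaming sample
    probs = np.array([0.2, 0.7, 0.1])                 # model's class probabilities
    print(selection_score(x, probs, label=2, selected_feats=buffer))
```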
Unpaired Point Cloud Completion via Unbalanced Optimal Transport
Taekyung Lee, Jaemoo Choi, Jaewoong Choi, Myungjoo Kang
Unpaired point cloud completion is crucial for real-world applications, where ground-truth data for complete point clouds are often unavailable. By learning a completion map from unpaired incomplete and complete point cloud data, this task avoids the reliance on paired datasets. In this paper, we propose the Unbalanced Optimal Transport Map for Unpaired Point Cloud Completion (UOT-UPC) model, which formulates the unpaired completion task as an Unbalanced Optimal Transport (UOT) problem. Our method employs a neural OT model that learns the UOT map using neural networks. Our model is the first attempt to leverage UOT for unpaired point cloud completion, achieving competitive or superior performance on both single-category and multi-category benchmarks. In particular, our approach is especially robust under the class imbalance problem, which is frequently encountered in real-world unpaired point cloud completion scenarios.
Alignment
CollabLLM: From Passive Responders to Active Collaborators
Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce CollabLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning with these rewards, CollabLLM goes beyond responding to user requests and actively uncovers user intent and offers insightful suggestions—a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks such as document creation. CollabLLM significantly outperforms our baselines, with an average of 18.5% higher task performance and 46.3% improved interactivity as rated by LLM judges. Finally, we conduct a large user study with 201 judges, where CollabLLM increases user satisfaction by 17.6% and reduces the time users spend by 10.4%.
Applications
Generalization Principles for Inference over Text-Attributed Graphs with Large Language Models
Haoyu Wang, Shikun Liu, Rongzhe Wei, Pan Li
Large language models (LLMs) have recently been introduced to graph learning, aiming to extend their zero-shot generalization success to tasks where labeled graph data is scarce. Among these applications, inference over text-attributed graphs (TAGs) presents unique challenges: existing methods struggle with LLMs’ limited context length for processing large node neighborhoods and the misalignment between node embeddings and the LLM token space. To address these issues, we establish two key principles for ensuring generalization and derive the framework LLM-BP accordingly: (1) unifying the attribute space with task-adaptive embeddings, where we leverage LLM-based encoders and task-aware prompting to enhance generalization of the text attribute embeddings; (2) developing a generalizable graph information aggregation mechanism, for which we adopt belief propagation with LLM-estimated parameters that adapt across graphs. Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches, achieving 8.10% improvement with task-conditional embeddings and an additional 1.71% gain from adaptive aggregation. The code and task-adaptive embeddings are publicly available.
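For context, the aggregation mechanism builds on the standard sum-product belief propagation update, sketched below; in LLM-BP the node and edge potentials are estimated by the LLM rather than fixed in advance (the notation is the textbook form, not the paper's exact parameterization):

```latex
% Sum-product message from node u to neighbor v over candidate labels y:
m_{u \to v}(y_v) \;\propto\; \sum_{y_u} \phi_u(y_u)\, \psi_{uv}(y_u, y_v)
     \prod_{w \in \mathcal{N}(u) \setminus \{v\}} m_{w \to u}(y_u),

% and the resulting belief at node v after message passing:
b_v(y_v) \;\propto\; \phi_v(y_v) \prod_{u \in \mathcal{N}(v)} m_{u \to v}(y_v).
```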
Chemistry, Physics, and Earth Sciences
LLM-Augmented Chemical Synthesis and Design Decision Programs
Haorui Wang, Jeff Guo, Lingkai Kong, Rampi Ramprasad, Philippe Schwaller, Yuanqi Du, Chao Zhang
Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.
Convex
Geometric Algebra Planes: Convex Implicit Neural Volumes
Irmak Sivgin, Sara Fridovich-Keil, Gordon Wetzstein, Mert Pilanci
Volume parameterizations abound in recent literature, encompassing methods from classic voxel grids to implicit neural representations. While implicit representations offer impressive capacity and improved memory efficiency compared to voxel grids, they traditionally require training through nonconvex optimization, which can be slow and sensitive to initialization and hyperparameters. We introduce GA-Planes, a novel family of implicit neural volume representations inspired by Geometric Algebra that can be trained using convex optimization, addressing the limitations of nonconvex methods. GA-Planes models generalize many existing representations including any combination of features stored in tensor basis elements followed by a neural feature decoder, and can be adapted to convex or nonconvex training as needed for various inverse problems. In the 2D setting, we prove GA-Planes models are equivalent to a low-rank plus low-resolution matrix factorization that outperforms the classic low-rank plus sparse decomposition for fitting a natural image. In 3D, GA-Planes models exhibit competitive expressiveness, model size, and optimizability across tasks such as radiance field reconstruction, 3D segmentation, and video segmentation.
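The 2D equivalence mentioned in the abstract can be read schematically as approximating an image matrix by a low-rank term plus an upsampled low-resolution term; the form below is an editorial gloss for intuition, not the paper's precise statement:

```latex
% Schematic low-rank plus low-resolution approximation of an image matrix M:
M \;\approx\; U V^{\top} \;+\; \mathrm{Up}(L),
\qquad U \in \mathbb{R}^{m \times r},\; V \in \mathbb{R}^{n \times r},\; r \ll \min(m, n),
% where L is a coarse (low-resolution) feature grid and Up(.) denotes upsampling,
% in contrast to the classic robust-PCA-style decomposition M \approx U V^{\top} + S
% with S sparse.
```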
Deep Learning
Can Transformers Reason Logically? A Study in SAT Solving
Leyan Pan, Vijay Ganesh, Jacob Abernethy, Chris Esposo, Wenke Lee
We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the Boolean satisfiability (SAT) problem. First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than programming a transformer to reason, we evaluate empirically whether it can be trained to do so by learning directly from algorithmic traces (“reasoning paths”) from our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training but have limited length generalization, which is consistent with the implications of our theoretical result.
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models
Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan (Celine) Lin
Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out of memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token-distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache’s effectiveness in enhancing LLMs’ long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.
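One way to picture the ladder-shaped pattern is that, under a fixed per-layer budget, each layer retains a window of token positions whose offset shifts with depth, so the layers jointly cover a longer span than any one layer could. The sketch below is only an editorial reading of that idea; the actual retention rule, offsets, and compaction step in the released LaCache code differ:

```python
# Editorial sketch of a "ladder"-style retention pattern (not the released
# LaCache code): each layer keeps a fixed-size window of token positions whose
# offset shifts with depth, so the layers jointly span more of the sequence
# than any single layer's budget allows.

def ladder_keep_indices(seq_len: int, num_layers: int, per_layer_budget: int):
    """Return, for each layer, the token positions whose KV pairs are kept."""
    kept = []
    max_start = max(seq_len - per_layer_budget, 0)
    for layer in range(num_layers):
        # Shallow layers keep the oldest tokens; deeper layers slide toward
        # the newest tokens, forming a ladder across the layer dimension.
        start = round(max_start * layer / max(num_layers - 1, 1))
        kept.append(list(range(start, min(start + per_layer_budget, seq_len))))
    return kept

if __name__ == "__main__":
    for i, idx in enumerate(ladder_keep_indices(seq_len=16, num_layers=4,
                                                per_layer_budget=6)):
        print(f"layer {i}: keep positions {idx}")
```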
Deep RL
Deep Reinforcement Learning from Hierarchical Preference Design
Alexander Bukharin, Yixiao Li, Pengcheng He, Tuo Zhao
Reward design is a fundamental, yet challenging aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scale and intricate dependencies of the feedback signals. This paper shows that, by exploiting certain structures, one can ease the reward design process. Specifically, we propose HERON, a hierarchical reward design framework for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning. Both scenarios allow us to design a hierarchical decision tree, induced by the importance ranking of the feedback signals, to compare RL trajectories. With such preference data, we can then train a reward model for policy learning. We apply HERON to several RL applications and find that our framework can not only train high-performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness.
Efficient Online Reinforcement Learning for Diffusion Policy
Haitong Ma, Tianyi Chen, Kai Wang, Na Li, Bo Dai
Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from the target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating the policy gradient through the diffusion process incurs large computational costs and instability, making it expensive and hard to scale. To enable efficient training of diffusion policies in online RL, we generalize conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing learning to optimize value functions. We introduce two tractable reweighted loss functions to solve two commonly used policy optimization problems, policy mirror descent and max-entropy policy, resulting in two practical algorithms named Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC). We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent online RL methods for diffusion policies on most tasks, and that DPMD improves over Soft Actor-Critic by more than 120% on Humanoid and Ant.
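The reweighting idea named in the abstract can be illustrated in a few lines of PyTorch. The sketch below is only a schematic reading of the abstract, not the authors' code: it keeps the usual denoising score matching target and simply scales each sample's loss by a supplied weight. The names `weighted_dsm_loss` and `score_net`, and the choice of weights, are placeholders.

```python
import torch

def weighted_dsm_loss(score_net, actions, noise_levels, weights):
    """Denoising score matching with a per-sample reweighting term.

    A minimal sketch of the idea in the abstract: keep the standard DSM
    target (the score of the Gaussian perturbation kernel) but scale each
    sample's loss by a weight. In the paper the weights are chosen so the
    objective follows value-driven targets (policy mirror descent or
    max-entropy RL); here `weights` is just an arbitrary tensor.
    """
    noise = torch.randn_like(actions)
    noisy_actions = actions + noise_levels[:, None] * noise
    target = -noise / noise_levels[:, None]   # score of the Gaussian kernel
    pred = score_net(noisy_actions, noise_levels)
    per_sample = ((pred - target) ** 2).sum(dim=-1)
    return (weights * per_sample).mean()

# Toy usage with a placeholder score network (2-D actions, hypothetical shapes).
score_net = lambda x, t: torch.zeros_like(x)
actions = torch.randn(8, 2)
noise_levels = torch.full((8,), 0.5)
weights = torch.ones(8)  # e.g. exp(Q-values / temperature) in the RL setting
print(weighted_dsm_loss(score_net, actions, noise_levels, weights))
```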
Foundation Models
Primitive Vision: Improving Diagram Understanding in MLLMs
Shan Zhang, Aotian Chen, Yanpeng Sun, Jindong Gu, Yi-Yu Zheng, Piotr Koniusz, Kai Zou, Anton Hengel, Yuan Xue
Mathematical diagrams have a distinctive structure. Standard feature transforms designed for natural images (e.g., CLIP) fail to process them effectively, limiting their utility in multimodal large language models (MLLMs). Current efforts to improve MLLMs have primarily focused on scaling mathematical visual instruction datasets and strengthening LLM backbones, yet fine-grained visual recognition errors remain unaddressed. Our systematic evaluation on the visual grounding capabilities of state-of-the-art MLLMs highlights that fine-grained visual understanding remains a crucial bottleneck in visual mathematical reasoning (GPT-4o exhibits a 70% grounding error rate, and correcting these errors improves reasoning accuracy by 12%). We thus propose a novel approach featuring a geometrically-grounded vision encoder and a feature router that dynamically selects between hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts aligned with the language model’s reasoning needs. In experiments, PRIMITIVE-Qwen2.5-7B outperforms other 7B models by 12% on MathVerse and is on par with GPT-4V on MathVista. Our findings highlight the need for better fine-grained visual integration in MLLMs. Code is available at github.com/AI4Math-ShanZhang/SVE-Math.
General Machine Learning
On the Power of Learning-Augmented Search Trees
Jingbang Chen, Xinyuan Cao, Alicia Stepin, Li Chen
We study learning-augmented binary search trees (BSTs) via Treaps with carefully designed priorities. The result is a simple search tree in which the depth of each item $x$ is determined by its predicted weight $w_x$. Specifically, each item $x$ is assigned a composite priority of $-\lfloor\log\log(1/w_x)\rfloor + U(0, 1)$ where $U(0, 1)$ is a uniform random variable. By choosing $w_x$ as the relative frequency of $x$, the resulting search trees achieve static optimality. This approach generalizes the recent learning-augmented BSTs [Lin-Luo-Woodruff ICML'22], which only work for Zipfian distributions, by extending them to arbitrary input distributions. Furthermore, we demonstrate that our method can be generalized to a B-Tree data structure using the B-Treap approach [Golovin ICALP’09]. Our search trees are also capable of leveraging localities in the access sequence through online self-reorganization, thereby achieving the working-set property. Additionally, they are robust to prediction errors and support dynamic operations, such as insertions, deletions, and prediction updates. We complement our analysis with an empirical study, demonstrating that our method outperforms prior work and classic data structures.
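The composite priority rule is stated explicitly in the abstract, so it can be sketched directly. The snippet below is illustrative only: the base-2 logarithm, the max-heap treap convention (larger priority sits closer to the root), and the toy weights are assumptions not specified in the abstract.

```python
import math
import random

def composite_priority(w_x: float) -> float:
    """Composite priority from the abstract: -floor(log log(1/w_x)) + U(0, 1).

    w_x is the predicted relative frequency of item x, with 0 < w_x < 1.
    The base of the logarithm is not stated in the abstract; base 2 is an
    assumption here. Under the max-heap treap convention, items with larger
    predicted weight receive larger priorities and therefore smaller depth.
    """
    return -math.floor(math.log2(math.log2(1.0 / w_x))) + random.random()

# Hypothetical predicted access frequencies for three keys.
weights = {"a": 0.5, "b": 0.05, "c": 0.001}
priorities = {key: composite_priority(w) for key, w in weights.items()}
print(priorities)  # more frequent keys get larger priorities on average
```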
Generative Models and Autoencoders
ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
Alec Helbling, Tuna Han Salih Meral, Benjamin Hoover, Pinar Yanardag, Polo Chau
Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized *concept embeddings*, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention maps. ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 15 other zero-shot interpretability methods on the ImageNet-Segmentation dataset. ConceptAttention works for popular image models and even seamlessly generalizes to video generation. Our work contributes the first evidence that the representations of multi-modal DiTs are highly transferable to vision tasks like segmentation.
Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces
Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix Ye, Molei Tao
Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. In contrast, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches rely heavily on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process demands highly accurate encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1\% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512$\times$512 datasets without any guidance mechanisms, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256$\times$256.
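The discriminator parameterization described in the abstract, a likelihood ratio between the learnable model and a frozen reference in the spirit of DPO, suggests a simple logistic training loss. The sketch below is a hedged reading of that idea; the function name, the non-saturating logistic form, and any weighting are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def ddo_style_loss(logp_theta_real, logp_ref_real, logp_theta_fake, logp_ref_fake):
    """Logistic loss for an implicitly parameterized discriminator.

    Sketch of the core idea in the abstract: the discriminator is the log
    likelihood ratio between the learnable model (theta) and a fixed
    reference model, so no separate discriminator network is trained.
    Inputs are per-sample log-likelihoods; "fake" samples are the
    self-generated negatives. Scaling constants in the paper may differ.
    """
    margin_real = logp_theta_real - logp_ref_real   # push up on real data
    margin_fake = logp_theta_fake - logp_ref_fake   # push down on negatives
    return F.softplus(-margin_real).mean() + F.softplus(margin_fake).mean()

# Toy usage with made-up log-likelihoods for a batch of 4 samples.
real_theta = torch.tensor([-1.0, -1.2, -0.8, -1.1])
real_ref   = torch.tensor([-1.3, -1.3, -1.0, -1.2])
fake_theta = torch.tensor([-0.9, -1.0, -1.1, -0.7])
fake_ref   = torch.tensor([-0.8, -1.0, -1.0, -0.6])
print(ddo_style_loss(real_theta, real_ref, fake_theta, fake_ref))
```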
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Philippe Hansen-Estruch, David Yan, Ching-Yao Chuang, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions remain about how auto-encoder design impacts reconstruction and downstream generative performance. This work explores scaling in auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). We find that scaling the auto-encoder bottleneck correlates with reconstruction quality but exhibits a nuanced relationship with generation. Separately, encoder scaling yields no gains, while decoder scaling improves reconstruction with minimal impact on generation. As a result, we determine that scaling the current paradigm of auto-encoders is not effective for improving generation performance. Coupled with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. In videos, ViTok achieves SOTA reconstruction and generation performance on 16-frame 128p UCF-101.
Jaemoo Choi, Jaewoong Choi, Dohyun Kwon
We address the convergence problem in learning the Optimal Transport (OT) map, where the OT Map refers to a map from one distribution to another while minimizing the transport cost. Semi-dual Neural OT, a widely used approach for learning OT Maps with neural networks, often generates spurious solutions that fail to transfer one distribution to another accurately. We identify a sufficient condition under which the max-min solution of Semi-dual Neural OT recovers the true OT Map. Moreover, to address cases when this sufficient condition is not satisfied, we propose a novel method, OTP, which learns both the OT Map and the Optimal Transport Plan, representing the optimal coupling between two distributions. Under sharp assumptions on the distributions, we prove that our model eliminates the spurious solution issue and correctly solves the OT problem. Our experiments show that the OTP model recovers the optimal transport map where existing methods fail and outperforms current OT-based models in image-to-image translation tasks. Notably, the OTP model can learn stochastic transport maps when deterministic OT Maps do not exist, such as one-to-many tasks like colorization.
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400×, end-to-end speedup of up to 3.7× as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme.
Graph Neural Networks
Biswadeep Chakraborty, Harshit Kumar, Saibal Mukhopadhyay
Graph Neural Networks (GNNs) face a critical limitation known as oversmoothing, where increasing network depth leads to homogenized node representations, severely compromising their expressiveness. We present a novel dynamical systems perspective on this challenge, revealing oversmoothing as an emergent property of GNNs’ convergence to low-dimensional attractor states. Based on this insight, we introduce **DYNAMO-GAT**, which combines noise-driven covariance analysis with Anti-Hebbian learning to dynamically prune attention weights, effectively preserving distinct attractor states. We provide theoretical guarantees for DYNAMO-GAT’s effectiveness and demonstrate its superior performance on benchmark datasets, consistently outperforming existing methods while requiring fewer computational resources. This work establishes a fundamental connection between dynamical systems theory and GNN behavior, providing both theoretical insights and practical solutions for deep graph learning.
Health / Medicine
EARTH: Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph
Guancheng Wan, Zewen Liu, Xiaojun Shan, Max Lau, B. Aditya Prakash, Wei Jin
Effective epidemic forecasting is critical for public health strategies and efficient medical resource allocation, especially in the face of rapidly spreading infectious diseases. However, existing deep-learning methods often overlook the dynamic nature of epidemics and fail to account for the specific mechanisms of disease transmission. In response to these challenges, we introduce an innovative end-to-end framework called Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph (EARTH) in this paper. To learn continuous and regional disease transmission patterns, we first propose EANO, which seamlessly integrates the neural ODE approach with the epidemic mechanism, considering the complex spatial spread process during epidemic evolution. Additionally, we introduce GLTG to model global infection trends and leverage these signals to guide local transmission dynamically. To accommodate both the global coherence of epidemic trends and the local nuances of epidemic transmission patterns, we build a cross-attention approach to fuse the most meaningful information for forecasting. Through the smooth synergy of both components, EARTH offers a more robust and flexible approach to understanding and predicting the spread of infectious diseases. Extensive experiments show EARTH’s superior performance in forecasting real-world epidemics compared to state-of-the-art methods. The code is available at https://github.com/GuanchengWan/EARTH.
Kernel methods
Statistical and Computational Guarantees of Kernel Max-Sliced Wasserstein Distances
Jie Wang, March Boedihardjo, Yao Xie
Optimal transport has been very successful for various machine learning tasks; however, it is known to suffer from the curse of dimensionality. Hence, dimensionality reduction is desirable when applied to high-dimensional data with low-dimensional structures. The kernel max-sliced (KMS) Wasserstein distance is developed for this purpose by finding an optimal nonlinear mapping that reduces data into $1$ dimension before computing the Wasserstein distance. However, its theoretical properties have not yet been fully developed. In this paper, we provide sharp finite-sample guarantees under milder technical assumptions compared with state-of-the-art for the KMS $p$-Wasserstein distance between two empirical distributions with $n$ samples for general $p\in[1,\infty)$. Algorithm-wise, we show that computing the KMS $2$-Wasserstein distance is NP-hard, and then we further propose a semidefinite relaxation (SDR) formulation (which can be solved efficiently in polynomial time) and provide a relaxation gap for the obtained solution. We provide numerical examples to demonstrate the good performance of our scheme for high-dimensional two-sample testing.
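To ground the construction, the inner computation of any sliced distance is a one-dimensional Wasserstein distance between projected samples, which for equal-size, equally weighted empirical distributions reduces to comparing sorted values. The sketch below shows only that inner step with a placeholder nonlinear projection; the maximization over a kernel ball and the paper's semidefinite relaxation are not reproduced.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two 1-D empirical distributions with
    the same number of equally weighted samples: sort and compare."""
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

def sliced_candidate(X, Y, project, p=2):
    """One candidate slice: map both samples to 1-D with `project` (a
    nonlinear map in the kernel setting) and take the 1-D Wasserstein
    distance. The KMS distance maximizes this over a unit ball of kernel
    functions, which is the hard part addressed by the paper's SDR."""
    return wasserstein_1d(project(X), project(Y), p)

# Toy usage: two 2-D Gaussian samples and a hypothetical projection.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
project = lambda Z: np.tanh(Z @ np.array([1.0, -0.5]))  # placeholder nonlinear map
print(sliced_candidate(X, Y, project))
```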
Large Language Models
CommVQ: Commutative Vector Quantization for KV Cache Compression
Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
Siqi Guo, Ilgee Hong, Vicente Balmaseda, Changlong Yu, Liang Qiu, Xin Liu, Haoming Jiang, Tuo Zhao, Tianbao Yang
Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs) using supervised datasets of input-output pairs. However, despite being supervised, SFT is inherently limited by its generative training objective. To address its limitations, the existing common strategy is to follow SFT with a separate phase of preference optimization (PO), which relies on either human-labeled preference data or a strong reward model to guide the learning process. In this paper, we address the limitations of SFT by exploring one of the most successful techniques in conventional supervised learning: discriminative learning. We introduce **Discriminative Fine-Tuning (DFT)**, an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data or training strong reward models. Unlike SFT that employs a generative approach and overlooks negative data, DFT adopts a **discriminative paradigm** that increases the probability of positive answers while suppressing potentially negative ones, aiming for **data prediction** instead of token prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT’s effectiveness, achieving performance better than SFT and comparable to, if not better than, SFT followed by PO. The code can be found at https://github.com/Optimization-AI/DFT.
Diving into Self-Evolving Training for Multimodal Reasoning
Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He
Self-evolving training—where models iteratively learn from their own outputs—has emerged as a key approach for complex reasoning tasks, addressing the scarcity of high-quality chain-of-thought data. However, its effectiveness in multimodal reasoning, a domain more intricate than text-only reasoning, remains underexplored, and the understanding of critical factors in this training paradigm remains limited. Furthermore, a central challenge for this training method is performance saturation, which impedes further improvements and scalability. Inspired by reinforcement learning (RL), in this paper, we reframe self-evolving training for multimodal reasoning through the lens of RL, identifying three pivotal factors: $\textit{Training Method}$, $\textit{Reward Model}$, and $\textit{Prompt Variation}$. Through systematic analysis, we establish relatively optimal design principles that significantly enhance multimodal reasoning capabilities. Moreover, delving deeper into training dynamics, we uncover the roots of saturation and propose a new automatic balancing mechanism to mitigate this limitation. Building on these insights, we propose M-STaR (**M**ultimodal **S**elf-evolving **T**r**a**ining for **R**easoning), a framework that achieves consistent performance gains across models of varying sizes and diverse benchmarks. All resources will be made publicly available.
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability—which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability—can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high-level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the *lookup-table mechanism* for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than other baselines, making unlearning more robust to various attacks.
Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding
Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason Lee, Pan Li, Zhangyang “Atlas” Wang
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose con**T**extualized equivari**A**nt **P**osition **E**ncoding (**TAPE**), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.
Scaling Sparse Feature Circuits For Studying In-Context Learning
Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda
Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model’s knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero-shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we scale the sparse feature circuits methodology of Marks et al. (2024) to the Gemma 1 2B model for the more complex task of ICL. Through circuit finding, we discover task-detecting features with corresponding SAE latents that activate earlier in the prompt and detect when a task has been performed. They are causally linked with task-execution features through the attention and MLP sublayers.
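One way to probe the claim that a task vector is well approximated by a sparse sum of SAE latents is a small sparse-regression fit against the SAE decoder directions. The ISTA-style sketch below is illustrative only; the variable names, the L1 penalty, and the synthetic data are assumptions rather than the authors' procedure.

```python
import numpy as np

def sparse_decomposition(v, decoder_dirs, lam=0.05, steps=500, lr=0.05):
    """Approximate a vector as a sparse combination of SAE decoder directions.

    A minimal ISTA (soft-thresholding gradient descent) sketch of the claim
    that task vectors are well approximated by a sparse sum of SAE latents.
    `decoder_dirs` has shape (num_latents, d_model); the paper's exact
    procedure (and any nonnegativity constraint) may differ.
    """
    coeffs = np.zeros(decoder_dirs.shape[0])
    for _ in range(steps):
        residual = coeffs @ decoder_dirs - v
        coeffs = coeffs - lr * (decoder_dirs @ residual)          # gradient step
        coeffs = np.sign(coeffs) * np.maximum(np.abs(coeffs) - lr * lam, 0.0)  # L1 prox
    return coeffs

# Toy usage with random "decoder directions" and a synthetic "task vector".
rng = np.random.default_rng(1)
D = rng.normal(size=(64, 16))
D /= np.linalg.norm(D, axis=1, keepdims=True)
v = 2.0 * D[3] + 1.5 * D[10]              # a vector that truly is a sparse sum
coeffs = sparse_decomposition(v, D)
print(np.argsort(-np.abs(coeffs))[:3])    # latents 3 and 10 should dominate
```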
Monte Carlo and Sampling Methods
Annealing Flow Generative Models Towards Sampling High-Dimensional and Multi-Modal Distributions
Dongze Wu, Yao Xie
Sampling from high-dimensional, multi-modal distributions remains a fundamental challenge across domains such as statistical Bayesian inference and physics-based machine learning. In this paper, we propose Annealing Flow (AF), a method built on Continuous Normalizing Flows (CNFs) for sampling from high-dimensional and multi-modal distributions. AF is trained with a dynamic Optimal Transport (OT) objective incorporating Wasserstein regularization, and guided by annealing procedures, facilitating effective exploration of modes in high-dimensional spaces. Compared to recent NF methods, AF significantly improves training efficiency and stability, with minimal reliance on MC assistance. We demonstrate the superior performance of AF compared to state-of-the-art methods through extensive experiments on various challenging distributions and real-world datasets, particularly in high-dimensional and multi-modal settings. We also highlight AF’s potential for sampling the least favorable distributions.
Neuroscience, Cognitive Science
Jingyang Ke, Feiyang Wu, Jiyi Wang, Jeffrey Markowitz, Anqi Wu
Traditional approaches to studying decision-making in neuroscience focus on simplified behavioral tasks where animals perform repetitive, stereotyped actions to receive explicit rewards. While informative, these methods constrain our understanding of decision-making to short timescale behaviors driven by explicit goals. In natural environments, animals exhibit more complex, long-term behaviors driven by intrinsic motivations that are often unobservable. Recent works in time-varying inverse reinforcement learning (IRL) aim to capture shifting motivations in long-term, freely moving behaviors. However, a crucial challenge remains: animals make decisions based on their history, not just their current state. To address this, we introduce SWIRL (SWitching IRL), a novel framework that extends traditional IRL by incorporating time-varying, history-dependent reward functions. SWIRL models long behavioral sequences as transitions between short-term decision-making processes, each governed by a unique reward function. SWIRL incorporates biologically plausible history dependency to capture how past decisions and environmental contexts shape behavior, offering a more accurate description of animal decision-making. We apply SWIRL to simulated and real-world animal behavior datasets and show that it outperforms models lacking history dependency, both quantitatively and qualitatively. This work presents the first IRL model to incorporate history-dependent policies and rewards to advance our understanding of complex, naturalistic decision-making in animals.
Learning Time-Varying Multi-Region Brain Communications via Scalable Markovian Gaussian Processes
Weihan Li, Yule Wang, Chengrui Li, Anqi Wu
Understanding and constructing brain communications that capture dynamic interactions across multiple regions is fundamental to modern systems neuroscience, yet current methods struggle to find time-varying region-level communications or to scale to large neural datasets with long recording durations. We present a novel framework using Markovian Gaussian Processes to learn brain communications with time-varying temporal delays from multi-region neural recordings, named Adaptive Delay Model (ADM). Our method combines Gaussian Processes with State Space Models and employs parallel scan inference algorithms, enabling efficient scaling to large datasets while identifying concurrent communication patterns that evolve over time. This time-varying approach captures how brain region interactions shift dynamically during cognitive processes. Validated on synthetic and multi-region neural recordings datasets, our approach discovers both the directionality and temporal dynamics of neural communication. This work advances our understanding of distributed neural computation and provides a scalable tool for analyzing dynamic brain networks. Code is available at https://github.com/BRAINML-GT/Adaptive-Delay-Model.
Neural Encoding and Decoding at Scale
Yizi Zhang, Yanchen Wang, Mehdi Azabou, Alexandre Andre, Zixuan Wang, Hanrui Lyu, International Brain Laboratory, Eva Dyer, Liam Paninski, Cole Hurwitz
Recent work has demonstrated that large-scale, multi-animal models are powerful tools for characterizing the relationship between neural activity and behavior. Current large-scale approaches, however, focus exclusively on either predicting neural activity from behavior (encoding) or predicting behavior from neural activity (decoding), limiting their ability to capture the bidirectional relationship between neural activity and behavior. To bridge this gap, we introduce a multimodal, multi-task model that enables simultaneous Neural Encoding and Decoding at Scale (NEDS). Central to our approach is a novel multi-task-masking strategy, which alternates between neural, behavioral, within-modality, and cross-modality masking. We pretrain our method on the International Brain Laboratory (IBL) repeated site dataset, which includes recordings from 83 animals performing the visual decision-making task. In comparison to other large-scale modeling approaches, we demonstrate that NEDS achieves state-of-the-art performance for both encoding and decoding when pretrained on multi-animal data and then fine-tuned on new animals. Surprisingly, NEDS’s learned embeddings exhibit emergent properties: even without explicit training, they are highly predictive of the brain regions in each recording. Altogether, our approach is a step towards a foundation model of the brain that enables seamless translation between neural activity and behavior.
Online
Novelty Detection in Reinforcement Learning with World Models
Geigh Zollicoffer, Kenneth Eaton, Jonathan Balloch, Julia Kim, Wei Zhou, Robert Wright, Mark Riedl
Reinforcement learning (RL) using world models has found significant recent successes. However, when a sudden change to world mechanics or properties occurs, agent performance and reliability can dramatically decline. We refer to such sudden changes in visual properties or state transitions as novelties. Implementing novelty detection within generated world model frameworks is a crucial task for protecting the agent when deployed. In this paper, we propose straightforward bounding approaches to incorporate novelty detection into world model RL agents by utilizing the misalignment of the world model’s hallucinated states and the true observed states as a novelty score. We provide effective approaches to detecting novelties in a distribution of transitions learned by an agent in a world model. Finally, we show the advantage of our work in a novel environment compared to traditional machine learning novelty detection methods as well as currently accepted RL-focused novelty detection algorithms.
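The core quantity in the abstract, a novelty score measuring the misalignment between the world model's hallucinated state and the true observation, is straightforward to sketch. The Euclidean distance, the threshold, and the toy latent states below are placeholder choices, not the paper's specific bounding approach.

```python
import torch

def novelty_score(predicted_state, observed_state):
    """Per-step novelty score as the misalignment between the world model's
    predicted ("hallucinated") latent state and the encoding of the true
    observation. The distance used here is a placeholder choice."""
    return torch.linalg.vector_norm(predicted_state - observed_state, dim=-1)

def is_novel(scores, threshold):
    """Flag a transition as novel when the misalignment exceeds a bound
    calibrated on nominal (pre-novelty) experience."""
    return scores > threshold

# Toy usage with made-up latent states.
pred = torch.randn(5, 32)
obs = pred + 0.01 * torch.randn(5, 32)   # nominal: model tracks reality closely
obs[2] += 3.0                            # inject a sudden change at step 2
scores = novelty_score(pred, obs)
print(is_novel(scores, threshold=1.0))   # step 2 should stand out
```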
Online Learning and Bandits
On Mitigating Affinity Bias through Bandits with Evolving Biased Feedback
Matthew Faw, Constantine Caramanis, Jessica Hoffmann
Unconscious bias has been shown to influence how we assess our peers, with consequences for hiring, promotions and admissions. In this work, we focus on affinity bias, the component of unconscious bias which leads us to prefer people who are similar to us, despite no deliberate intention of favoritism. In a world where the people hired today become part of the hiring committee of tomorrow, we are particularly interested in understanding (and mitigating) how affinity bias affects this feedback loop. This problem has two distinctive features: 1) we only observe the _biased value_ of a candidate, but we want to optimize with respect to their _real value_; and 2) the bias towards a candidate with a specific set of traits depends on the _fraction_ of people in the hiring committee with the same set of traits. We introduce a new bandits variant that exhibits those two features, which we call affinity bandits. Unsurprisingly, classical algorithms such as UCB often fail to identify the best arm in this setting. We prove a new instance-dependent regret lower bound, which is larger than that in the standard bandit setting by a multiplicative function of $K$. Since we treat rewards that are _time-varying_ and _dependent on the policy’s past actions_, deriving this lower bound requires developing proof techniques beyond the standard bandit techniques. Finally, we design an elimination-style algorithm which nearly matches this regret, despite never observing the real rewards.
Online Learning, Active Learning and Bandits
Improved and Oracle-Efficient Online $\ell_1$-Multicalibration
Rohan Ghuge, Vidya Muthukumar, Sahil Singla
We study *online multicalibration*, a framework for ensuring calibrated predictions across multiple groups in adversarial settings, across $T$ rounds. Although online calibration is typically studied in the $\ell_1$ norm, prior approaches to online multicalibration have taken the indirect approach of obtaining rates in other norms (such as $\ell_2$ and $\ell_{\infty}$) and then transferred these guarantees to $\ell_1$ at additional loss. In contrast, we propose a direct method that achieves improved and oracle-efficient rates of $\widetilde{\mathcal{O}}(T^{-1/3})$ and $\widetilde{\mathcal{O}}(T^{-1/4})$ respectively, for online $\ell_1$-multicalibration. Our key insight is a novel reduction of online $\ell_1$-multicalibration to an online learning problem with product-based rewards, which we refer to as *online linear-product optimization* ($\mathtt{OLPO}$). To obtain the improved rate of $\widetilde{\mathcal{O}}(T^{-1/3})$, we introduce a linearization of $\mathtt{OLPO}$ and design a no-regret algorithm for this linearized problem. Although this method guarantees the desired sublinear rate (nearly matching the best rate for online calibration), it is computationally expensive when the group family $\mathcal{H}$ is large or infinite, since it enumerates all possible groups. To address scalability, we propose a second approach to $\mathtt{OLPO}$ that makes only a polynomial number of calls to an offline optimization (*multicalibration evaluation*) oracle, resulting in *oracle-efficient* online $\ell_1$-multicalibration with a corresponding rate of $\widetilde{\mathcal{O}}(T^{-1/4})$. Our framework also extends to certain infinite families of groups (e.g., all linear functions on the context space) by exploiting a $1$-Lipschitz property of the $\ell_1$-multicalibration error with respect to $\mathcal{H}$.
Optimization
Fast Tensor Completion via Approximate Richardson Iteration
Mehrdad Ghadiri, Matthew Fahrbach, Yunbum Kook, Ali Jadbabaie
We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods to solve _highly structured_ linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is often lost in TC regression problems, making direct extensions unclear. This work proposes a novel _lifting_ method for approximately solving TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We analyze the convergence rate of our approximate Richardson iteration-based algorithm, and our empirical study shows that it can be 100x faster than direct methods for CP completion on real-world tensors.
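For readers unfamiliar with the namesake iteration: classic Richardson iteration solves a linear system by repeatedly stepping along the residual, and the paper's contribution is to compute those steps only approximately using fast structured TD regression subroutines. The sketch below shows the exact (non-approximate) iteration on a toy system; the matrix, step size, and iteration count are made up.

```python
import numpy as np

def richardson(A, b, omega=0.1, iters=200, x0=None):
    """Classic Richardson iteration for A x = b:
        x_{k+1} = x_k + omega * (b - A x_k).
    In the paper each residual computation is performed only approximately
    by a structured regression subroutine; here the products are exact."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    for _ in range(iters):
        x = x + omega * (b - A @ x)
    return x

# Toy usage on a well-conditioned symmetric positive definite system.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
A = M @ M.T / 40 + np.eye(20)          # SPD with modest condition number
b = rng.normal(size=20)
x = richardson(A, b, omega=0.3, iters=500)
print(np.linalg.norm(A @ x - b))       # residual should be tiny
```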
Planning
Navigating the Social Welfare Frontier: Portfolios for Multi-objective Reinforcement Learning
Cheol Kim, Jai Moondra, Shresth Verma, Madeleine Pollack, Lingkai Kong, Milind Tambe, Swati Gupta
In many real-world applications of Reinforcement Learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $\alpha$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.
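The family of generalized p-means referenced in the abstract is compact enough to write down directly. The snippet below evaluates Utilitarian (p = 1), Nash (p approaching 0), and Egalitarian (p approaching negative infinity) welfare on made-up stakeholder utilities, assuming positive utilities; it is a definition check, not part of the paper's portfolio algorithms.

```python
import numpy as np

def p_mean_welfare(utilities, p):
    """Generalized p-mean social welfare of a (positive) utility vector.

    p = 1 gives Utilitarian (arithmetic mean) welfare, p -> 0 gives Nash
    (geometric mean) welfare, and p -> -inf gives Egalitarian (min) welfare.
    """
    u = np.asarray(utilities, dtype=float)
    if p == 1:
        return u.mean()
    if p == 0:
        return np.exp(np.log(u).mean())
    if np.isneginf(p):
        return u.min()
    return (np.mean(u ** p)) ** (1.0 / p)

# A single policy's utilities for three stakeholders (made-up numbers).
u = [4.0, 1.0, 9.0]
for p in [1, 0, -1, -np.inf]:
    print(p, p_mean_welfare(u, p))
```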
Privacy
Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning
Rongzhe Wei, Mufei Li, Mohsen Ghassemi, Eleonora Kreacic, Yifan Li, Xiang Yue, Bo Li, Vamsi Potluru, Pan Li, Eli Chien
Large Language Models (LLMs) embed sensitive, human-generated data, prompting the need for unlearning methods. Although certified unlearning offers strong privacy guarantees, its restrictive assumptions make it unsuitable for LLMs, giving rise to various heuristic approaches typically assessed through empirical evaluations. These standard evaluations randomly select data for removal, apply unlearning techniques, and use membership inference attacks (MIAs) to compare unlearned models against models retrained without the removed data. However, to ensure robust privacy protections for every data point, it is essential to account for scenarios in which certain data subsets face elevated risks. Prior research suggests that outliers, particularly including data tied to minority groups, often exhibit higher memorization propensity which indicates they may be more difficult to unlearn. Building on these insights, we introduce a complementary, minority-aware evaluation framework to highlight blind spots in existing frameworks. We substantiate our findings with carefully designed experiments, using canaries with personally identifiable information (PII) to represent these minority subsets and demonstrate that they suffer at least 20\% higher privacy leakage across various unlearning methods, MIAs, datasets, and LLM scales. Our proposed minority-aware evaluation framework marks an essential step toward more equitable and comprehensive assessments of LLM unlearning efficacy.
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli
The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces the Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. This work represents a significant step forward in protecting intellectual property and ensuring the authenticity of audio content in the era of generative AI.
Reinforcement Learning and Planning
Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games
Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi
Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration by biasing the empirical estimate of the model parameters towards those with higher collective best-response values of all the players when fixing the other players’ policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves near-optimal regret for finding both the NEs of two-player zero-sum Markov games and the CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, nearly matching counterparts that rely on sophisticated uncertainty quantification.
Robotics
Letian Chen, Nina Moorman, Matthew Gombolay
Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot represent the problem properly with only text-based descriptions. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to align robot behavior with user intentions better. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and match the demonstrated behaviors optimally. ELEMENTAL also introduces an iterative feedback-loop through self-reflection to improve feature, reward, and policy learning. Our experiment results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success, and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in LfD.
Robustness
SGD Jittering: A Training Strategy for Robust and Accurate Model-Based Architectures
Peimeng Guan, Mark Davenport
Inverse problems aim to reconstruct unseen data from corrupted or perturbed measurements. While most work focuses on improving reconstruction quality, generalization accuracy and robustness are equally important, especially for safety-critical applications. Model-based architectures (MBAs), such as loop unrolling methods, are considered more interpretable and achieve better reconstructions. Empirical evidence suggests that MBAs are more robust to perturbations than black-box solvers, but the accuracy-robustness tradeoff in MBAs remains underexplored. In this work, we propose a simple yet effective training scheme for MBAs, called SGD jittering, which injects noise iteration-wise during reconstruction. We theoretically demonstrate that SGD jittering not only generalizes better than the standard mean squared error training but is also more robust to average-case attacks. We validate SGD jittering using denoising toy examples, seismic deconvolution, and single-coil MRI reconstruction. Both SGD jittering and its SPGD extension yield cleaner reconstructions for out-of-distribution data and demonstrate enhanced robustness against adversarial attacks.
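The training scheme named in the abstract, injecting noise iteration-wise during reconstruction, can be sketched with a small unrolled gradient-descent solver. The injection point, noise scale, toy linear operator, and the optional `denoiser` hook below are assumptions standing in for the learned model-based architecture, not the authors' implementation.

```python
import torch

def jittered_unrolled_recon(A, y, steps=10, step_size=0.1, sigma=0.01, denoiser=None):
    """Unrolled gradient-descent reconstruction with iteration-wise noise.

    A minimal sketch of "SGD jittering": at every unrolled iteration a small
    Gaussian perturbation is injected into the iterate during training. The
    exact injection point, noise scale, and learned components in the paper
    may differ; `denoiser` stands in for any learned regularization step.
    """
    x = torch.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - y)              # data-fidelity gradient
        x = x - step_size * grad
        if denoiser is not None:
            x = denoiser(x)                   # learned proximal/denoising step
        x = x + sigma * torch.randn_like(x)   # the jittering injection
    return x

# Toy usage on a small linear inverse problem.
torch.manual_seed(0)
A = torch.randn(30, 20) / 6.0
x_true = torch.randn(20)
y = A @ x_true + 0.01 * torch.randn(30)
x_hat = jittered_unrolled_recon(A, y, steps=300, step_size=0.3)
print(torch.linalg.vector_norm(x_hat - x_true))  # bounded error despite per-step noise
```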
Safety
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Joshua Kimball, Ling Liu
Safety aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks — a few harmful data mixed into the fine-tuning dataset can break the LLM’s safety alignment. While several defenses have been proposed, our evaluation shows that existing defenses fail \textit{when some specific training hyper-parameters are chosen} — a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense. To this end, we propose Antidote, a post-fine-tuning stage solution, which remains \textbf{\textit{agnostic to the training hyper-parameters in the fine-tuning stage}}. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters were formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks.
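The one-shot pruning step described in the abstract amounts to zeroing a small fraction of weights flagged as responsible for harmful generations. The sketch below takes the importance scores as given, since the abstract does not specify how they are computed; the pruning ratio and toy tensors are placeholders.

```python
import torch

def one_shot_prune(model_params, importance_scores, prune_ratio=0.02):
    """Zero out the fraction of weights deemed most responsible for harmful
    outputs. A minimal sketch of the post-fine-tuning pruning idea: the real
    system derives `importance_scores` from alignment/harmful data with its
    own criterion; here they are simply taken as given."""
    flat_scores = torch.cat([s.flatten() for s in importance_scores])
    k = int(prune_ratio * flat_scores.numel())
    threshold = torch.topk(flat_scores, k).values.min()
    for p, s in zip(model_params, importance_scores):
        p.data[s >= threshold] = 0.0   # prune (zero) the flagged weights

# Toy usage: two parameter tensors and random importance scores.
params = [torch.randn(4, 4), torch.randn(8)]
scores = [torch.rand_like(p) for p in params]
one_shot_prune(params, scores, prune_ratio=0.1)
print(sum((p == 0).sum().item() for p in params), "weights pruned")
```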
Security
Topological Signatures of Adversaries in Multimodal Alignments
Minh Vu, Geigh Zollicoffer, Huy Mai, Ben Nebgen, Boian S Alexandrov, Manish Bhattarai
Multimodal Machine Learning systems, particularly those aligning text and image data like CLIP/BLIP models, have become increasingly prevalent, yet remain susceptible to adversarial attacks. While substantial research has addressed adversarial robustness in unimodal contexts, defense strategies for multimodal systems are underexplored. This work investigates the topological signatures that arise between image and text embeddings and shows how adversarial attacks disrupt their alignment, introducing distinctive signatures. We specifically leverage persistent homology and introduce two novel Topological-Contrastive losses based on Total Persistence and Multi-scale kernel methods to analyze the topological signatures introduced by adversarial perturbations. We observe a pattern of monotonic changes in the proposed topological losses emerging in a wide range of attacks on image-text alignments, as more adversarial samples are introduced in the data. By designing an algorithm to back-propagate these signatures to input samples, we are able to integrate these signatures into Maximum Mean Discrepancy tests, creating a novel class of tests that leverage topological signatures for better adversarial detection.
Sequential Models, Time series
In-Context Fine-Tuning for Time-Series Foundation Models
Matthew Faw, Rajat Sen, Yichen Zhou, Abhimanyu Das
Motivated by the recent success of time-series foundation models for zero-shot forecasting, we present a methodology for _in-context fine-tuning_ of a time-series foundation model. In particular, we design a pretrained foundation model that can be prompted (at inference time) with multiple time-series examples, in order to forecast a target time-series into the future. Our foundation model is specifically trained to utilize examples from multiple related time-series in its context window (in addition to the history of the target time-series) to help it adapt to the specific distribution of the target domain at inference time. We show that such a foundation model that uses in-context examples at inference time can obtain much better performance on popular forecasting benchmarks compared to supervised deep learning methods, statistical models, and other time series foundation models. Interestingly, our in-context fine-tuning approach even matches the performance of a foundation model that is explicitly fine-tuned on the target domain.
Jiecheng Lu, Shihao Yang
Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention often outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data-generating processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.
WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting
Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang
We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.