Conference on Neural Information Processing Systems
San Diego | Dec 2–7, 2025

College of Computing faculty are contributors to more than a quarter of Georgia Tech’s 85 published papers at NeurIPS 2025 in San Diego. As artificial intelligence applications become more common across large sections of society, NeurIPS is focusing on how AI can become more reliable, useful, and responsible in the real world. Among the College of Computing’s work are seven “spotlight” papers, representing top research contributions based on peer review. These papers focus primarily on applications, datasets & benchmarks, probabilistic methods, reinforcement learning, and social and economic aspects of machine learning.
Georgia Tech’s external partners collaborating with computing researchers include:
- CISPA
- Carnegie Mellon University
- Emory University
- ETH Zurich
- Fortiss GmbH
- Google DeepMind
- Harvard
- IBM Research
- Meta
- NVIDIA Research
- OpenAI
- Penn State
- Toyota Motor Europe
- TUM & Helmholtz AI
- University of Michigan
- University of Oxford
Research contributions in the program highlight both core machine learning research and fast-growing areas like foundation models, robotics, and AI for science. Overall, NeurIPS 2025 emphasizes trustworthy AI, cross-disciplinary collaboration, and building technology that can work safely and fairly in many different settings.

More than 150 Georgia Tech researchers at NeurIPS 2025 are part of the main technical papers program. They are collaborating with 358 external partner authors from across 130+ organizations worldwide.
Tech’s 85 papers represent expertise from across the institute, including business, computing, engineering, and the sciences. More than half are led by Tech researchers as first authors.
Explore details of all the papers below ⬇️. They are organized by primary area and then decision type (oral, spotlight, and poster papers).
RESEARCH PAPERS
AI/ML Datasets & Benchmarks for health and life sciences
(e.g. climate, health, life sciences, physics, social sciences)
Thousand Voices of Trauma: A Large-Scale Synthetic Dataset for Modeling Prolonged Exposure Therapy Conversations
Suhas BN, Andrew Sherrill, Rosa I. Arriaga, Christopher Wiese, Saeed Abdullah
Abstract
The advancement of AI systems for mental health support is hindered by limited access to therapeutic conversation data, particularly for trauma treatment. We present Thousand Voices of Trauma, a synthetic benchmark dataset of 3,000 therapy conversations based on Prolonged Exposure therapy protocols for Post-traumatic Stress Disorder (PTSD). The dataset comprises 500 unique cases, each explored through six conversational perspectives that mirror the progression of therapy from initial anxiety to peak distress to emotional processing. We incorporated diverse demographic profiles (ages 18-80, M=49.3, 49.4% male, 44.4% female, 6.2% non-binary), 20 trauma types, and 10 trauma-related behaviors using deterministic and probabilistic generation methods. Analysis reveals realistic distributions of trauma types (witnessing violence 10.6%, bullying 10.2%) and symptoms (nightmares 23.4%, substance abuse 20.8%). Clinical experts validated the dataset’s therapeutic fidelity, highlighting its emotional depth while suggesting refinements for greater authenticity. We also developed an emotional trajectory benchmark with standardized metrics for evaluating model responses. This privacy-preserving dataset addresses critical gaps in trauma-focused mental health data, offering a valuable resource for advancing both patient-facing applications and clinician training tools.
RoFt-Mol: Benchmarking Robust Fine-tuning with Molecular Graph Foundation Models
Shikun Liu, Deyu Zou, Nima Shoghi, Victor Fung, Kai Liu, Pan Li
Abstract
In the era of foundation models, fine-tuning pre-trained models for specific downstream tasks has become crucial. This drives the need for robust fine-tuning methods to address challenges such as model overfitting and sparse labeling. Molecular graph foundation models (MGFMs) face unique difficulties that complicate fine-tuning. These models are limited by smaller pre-training datasets and more severe data scarcity for downstream tasks, both of which require enhanced model generalization. Moreover, MGFMs must accommodate diverse objectives, including both regression and classification tasks. To better understand and improve fine-tuning techniques under these conditions, we classify eight fine-tuning methods into three mechanisms: weight-based, representation-based, and partial fine-tuning. We benchmark these methods on downstream regression and classification tasks across supervised and self-supervised pre-trained models in diverse labeling settings. This extensive evaluation provides valuable insights and informs the design of a refined robust fine-tuning method, RoFt-Mol. This approach combines the strengths of simple post-hoc weight interpolation with more complex weight ensemble fine-tuning methods, delivering improved performance across both task types while maintaining the ease of use inherent in post-hoc weight interpolation.
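The "post-hoc weight interpolation" mentioned in the abstract can be pictured with a minimal sketch: blend the pre-trained and fine-tuned weights after training. The function below is a generic illustration of that idea (hypothetical names and usage, not the RoFt-Mol implementation).

```python
# Minimal sketch of post-hoc weight interpolation between a pre-trained and a
# fine-tuned checkpoint (generic illustration, not the RoFt-Mol code).
import torch

def interpolate_weights(pretrained_state, finetuned_state, alpha=0.5):
    """Return (1 - alpha) * pretrained + alpha * finetuned, key by key."""
    return {
        name: (1.0 - alpha) * pretrained_state[name] + alpha * finetuned_state[name]
        for name in finetuned_state
    }

# Hypothetical usage:
# model.load_state_dict(interpolate_weights(torch.load("pretrained.pt"),
#                                           torch.load("finetuned.pt"), alpha=0.7))
```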
Applications
(e.g., vision, language, speech and audio, Creative AI)
AceRAG: Advancing Reasoning-Intensive Retrieval-Augmented Generation via LLM Self-Play
Ran Xu, Yuchen Zhuang, Zihan Dong, Ruiyu Wang, Yue Yu, Joyce Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, Carl Yang
Abstract
Retrieval-augmented generation (RAG) systems often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceRAG, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceRAG couples supervised fine-tuning on a diverse mixture of retrieval, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceRAG outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level reasoning tasks, AceRAG-32B matches the performance of the giant DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceRAG often surpasses existing RAG models with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks.
Language Models can Self-Improve at State-Value Estimation for Better Search
Ethan Mendes, Alan Ritter
Abstract
Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead (STL), a self-supervised method that leverages state-transition dynamics to improve a value model capable of effectively guiding language model-controlled search without any labeled data. We find that moderately sized (8 billion parameters) open-weight value models improved with STL can match the performance of using a gpt-4o value model. Furthermore, we find that specialized value models learned with STL can be deployed with computationally lightweight search algorithms, achieving performance that matches that of more expensive tree search methods, while reducing costs by an order of magnitude.
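As a rough illustration of how a value model can be improved from state-transition dynamics alone, a lookahead-style bootstrap might look like the sketch below. The helper functions are hypothetical placeholders, and this is our simplified reading of the idea, not the authors' STL procedure.

```python
# Hypothetical sketch: build self-supervised value targets by looking one step
# ahead with a transition function, then regress the value model toward them.
def lookahead_targets(states, propose_actions, transition, value_model):
    """propose_actions(s) -> candidate actions; transition(s, a) -> next state;
    value_model(s) -> scalar estimate. No labeled rewards are used."""
    targets = []
    for s in states:
        best_next = max(value_model(transition(s, a)) for a in propose_actions(s))
        targets.append(best_next)  # regression target for value_model(s)
    return targets
```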
Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals
Stefan Stojanov, David Wendt, Seungwoo Kim, Rahul Venkatesh, Kevin Feigelis, Klemen Kotar, Khai Loong Aw, Jiajun Wu, Daniel Yamins
Abstract
Estimating motion primitives from video (e.g. optical flow and occlusion) is a critically-important computer vision problem with many downstream applications, including in controllable video generation and robotics. Current solutions are primarily supervised on synthetic data or require tuning of situation-specific heuristics, which inherently limits these models’ capabilities in real-world contexts. A natural solution to transcend these limitations would be to deploy large-scale self-supervised video models, which can be scalably trained on unrestricted real-world video datasets. However, despite recent progress, motion-primitive extraction from large pretrained video models remains relatively underexplored. In this work, we describe Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pretrained video prediction model. Opt-CWM uses “counterfactual probes” to extract motion information from a base video model in a zero-shot fashion. The key problem we solve is optimal probe generation, using a combination of an efficient parameterization of the space of counterfactual probes, together with a novel generic sparse-prediction principle for learning the probe-generation parameters in a self-supervised fashion. Opt-CWM achieves state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Hyungjoo Chae, Seonghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo
Abstract
Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM), called Web-Shepherd, which can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10$\times$ lower cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.
ZeroS: Zero Sum Linear Attention for Efficient Transformers
Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang
Abstract
Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
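One way to read the abstract's "zero-sum residual" construction (our paraphrase, with notation introduced here for illustration) is as follows.

```latex
% Standard attention at step t forms a convex combination of values:
\[
  o_t = \sum_{i \le t} w_{t,i}\, v_i, \qquad \sum_{i \le t} w_{t,i} = 1, \quad w_{t,i} \ge 0 .
\]
% Splitting each weight into the uniform term plus a zero-sum residual,
\[
  w_{t,i} = \tfrac{1}{t} + r_{t,i}, \qquad \sum_{i \le t} r_{t,i} = 0,
\]
% the abstract describes dropping the constant $1/t$ term and reweighting the
% residuals $r_{t,i}$, which may be negative, so a single layer can subtract as
% well as add information, and the weights no longer dilute toward uniform as
% the context length $t$ grows.
```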
E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization
Wenpu Li, Bangyan Liao, Yi Zhou, Qi Xu, Pian Wan, Peidong Liu
Abstract
The estimation of optical flow and 6-DoF ego-motion—two fundamental tasks in 3-D vision—has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately an ill-posed challenge, especially in the absence of supervision via ground truth. Existing works mitigate this ill-posedness by either enforcing the smoothness of the flow field via an explicit variational regularizer or leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former notably introduces bias in results and computational overhead, while the latter—which parametrizes the optical flow in terms of the scene depth and the camera motion—often converges to suboptimal local minima. To address these issues, we propose an unsupervised pipeline that jointly optimizes egomotion and flow via implicit spatial-temporal and geometric regularization. First, by modeling the camera’s egomotion as a continuous spline and optical flow as an implicit neural representation, our method inherently embeds spatial-temporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies egomotion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility in general 6-DoF motion scenarios, achieving state-of-the-art performance among unsupervised methods and competitive results even with supervised approaches. Code will be released upon acceptance.
FLAME: Fast Long-context Adaptive Memory for Event-based Vision
Biswadeep Chakraborty, Saibal Mukhopadhyay
Abstract
We propose Fast Long-range Adaptive Memory for Event (FLAME), a novel scalable architecture that combines neuro-inspired feature extraction with robust structured sequence modeling to efficiently process asynchronous and sparse event camera data. As a departure from conventional input encoding methods, FLAME presents the Event Attention Layer, a novel feature extractor that leverages neuromorphic dynamics (Leaky Integrate-and-Fire (LIF)) to directly capture multi-timescale features from event streams. The feature extractor is integrated with a structured state-space model with a novel Event-Aware HiPPO (EA-HiPPO) mechanism that dynamically adapts memory retention based on inter-event intervals to understand relationships across varying temporal scales and event sequences. A Normal Plus Low Rank (NPLR) decomposition reduces the computational complexity of the state update from $\mathcal{O}(N^2)$ to $\mathcal{O}(Nr)$, where $N$ represents the dimension of the core state vector and $r$ is the rank of a low-rank component (with $r \ll N$). FLAME demonstrates state-of-the-art accuracy for event-by-event processing on complex event camera datasets.
Probabilistic Reasoning with LLMs for Privacy Risk Estimation
Jonathan Zheng, Alan Ritter, Sauvik Das, Wei “Coco” Xu
Abstract
Probabilistic reasoning is a key aspect of both human and artificial intelligence that allows for handling uncertainty and ambiguity in decision-making. In this paper, we introduce a new numerical reasoning task under uncertainty for large language models, focusing on estimating the privacy risk of user-generated documents containing privacy-sensitive information. We propose BRANCH, a new LLM methodology that estimates the $k$-privacy value of a text—the size of the population matching the given information. BRANCH factorizes a joint probability distribution of personal information as random variables. The probability of each factor in a population is estimated separately using a Bayesian network and combined to compute the final $k$-value. Our experiments show that this method successfully estimates the $k$-value 73% of the time, a 13% increase compared to o3-mini with chain-of-thought reasoning. We also find that LLM uncertainty is a good indicator for accuracy, as high variance predictions are 37.47% less accurate on average.
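For intuition on the k-value being estimated, a stripped-down version with (hypothetically) independent attributes looks like the toy sketch below; BRANCH itself factorizes a joint distribution with a Bayesian network rather than assuming independence.

```python
# Toy illustration of a k-value estimate: expected number of people in a
# population who match all disclosed attributes, assuming independence for
# simplicity (BRANCH uses a Bayesian-network factorization instead).
import math

def estimate_k(population_size, attribute_probabilities):
    """k ≈ N * P(attr_1) * P(attr_2) * ... (independence assumption)."""
    return population_size * math.prod(attribute_probabilities)

# Made-up example: a 500,000-person city, an occupation held by 1% of people,
# and a hobby shared by 2% of people.
print(estimate_k(500_000, [0.01, 0.02]))  # -> 100.0
```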
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
Zhongwei Wan, Alex Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan
Abstract
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle significantly with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful, instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks—including MathVista, MathVision, Mathverse, and MMMU-Pro—using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation
Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, ‘YZ’ Yezhou Yang, Varun Jampani
Abstract
Recent advances in text-to-video (T2V) generation have enabled high-fidelity video synthesis from natural language prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics (SCINE), a structured evaluation framework that formalizes filmmaking principles into four disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analysis, both coarse and fine-grained, reveals that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations that outperforms existing zero-shot baselines. SCINE is the first approach to formalize professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic control and supporting them with structured evaluation pipelines and detailed analyses to guide future research.
Toward Human Deictic Gesture Target Estimation
Xu Cao, Pranav Virupaksha, Sangmin Lee, Bolin Lai, Wenqi Jia, Jintai Chen, James Rehg
Abstract
Humans have a remarkable ability to use co-speech deictic gestures, such as pointing and showing, to enrich verbal communication and support social interaction. These gestures are so fundamental that infants begin to use them even before they acquire spoken language, which highlights their central role in human communication. Understanding the intended targets of another individual’s deictic gestures enables inference of their intentions, comprehension of their current actions, and prediction of upcoming behaviors. Despite its significance, gesture target estimation remains an underexplored task within the computer vision community. In this paper, we introduce GestureTarget, a novel task designed specifically for comprehensive evaluation of social deictic gesture semantic target estimation. To address this task, we propose TransGesture, a set of Transformer-based gesture target prediction models. Given an input image and the spatial location of a person, our models predict the intended target of their gesture within the scene. Critically, our gaze-aware joint cross attention fusion model demonstrates how incorporating gaze-following cues significantly improves gesture target mask prediction IoU by 6% and gesture existence prediction accuracy by 10%. Our results underscore the complexity and importance of integrating gaze cues into deictic gesture intention understanding, advocating for increased research attention to this emerging area. All data and code will be made publicly available upon acceptance.
Ultra-high Resolution Watermarking Framework Resistant to Extreme Cropping and Scaling
Nan Sun, LuYu Yuan, Han Fang, Yuxing Lu, Hefei Ling, Sijing Xie, Chengxin Zhao
Abstract
Recent developments in DNN-based image watermarking techniques have achieved impressive results in protecting digital content. However, most existing methods are constrained to low-resolution images as they need to encode the entire image, leading to prohibitive memory and computational costs when applied to high-resolution images. Moreover, they lack robustness to distortions prevalent in large-image transmission, such as extreme scaling and random cropping. To address these issues, we propose a novel watermarking method based on implicit neural representations (INRs). Leveraging the properties of INRs, our method employs a resolution-independent coordinate sampling mechanism to generate watermarks pixel-wise, achieving ultra-high resolution watermark generation with fixed and limited memory and computational resources. This design ensures strong robustness in watermark extraction, even under extreme cropping and scaling distortions. Additionally, we introduce a hierarchical multi-scale coordinate embedding and a low-rank watermark injection strategy to ensure high-quality watermark generation and robust decoding. Experimental results demonstrate that our method significantly outperforms existing schemes in terms of both robustness and computational efficiency while preserving high image quality. Our approach achieves an accuracy greater than 98% in watermark extraction with only 0.4% of the image area in 2K images. These results highlight the effectiveness of our method, making it a promising solution for large-scale and high-resolution image watermarking applications.
Datasets & Benchmarks for applications in language modeling and vision language modeling
CLIMB: Clustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan (Celine) Lin, Jan Kautz, Pavlo Molchanov
Abstract
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. This strategy enables effective domain adaptation without relying solely on curated data. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture.
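Schematically, the iterative bootstrapping loop described in the abstract (propose cluster mixtures, screen them with a learned predictor, verify the most promising with a small proxy model) could be organized as below. All function names are placeholders introduced for illustration, not the CLIMB API.

```python
# Schematic mixture-search loop (placeholder callables, not the CLIMB code).
def search_mixture(clusters, propose, predict_score, train_and_eval_proxy,
                   n_rounds=3, candidates_per_round=8):
    """propose(clusters, history) -> candidate mixture weights;
    predict_score(mixture, history) -> cheap predicted score;
    train_and_eval_proxy(mixture) -> score from a short proxy-model run."""
    history, best = [], None
    for _ in range(n_rounds):
        proposals = [propose(clusters, history) for _ in range(candidates_per_round)]
        # Cheap screening first, then expensive proxy training for the top few.
        ranked = sorted(proposals, key=lambda m: predict_score(m, history), reverse=True)
        for mixture in ranked[: max(1, candidates_per_round // 4)]:
            score = train_and_eval_proxy(mixture)
            history.append((mixture, score))
            if best is None or score > best[1]:
                best = (mixture, score)
    return best  # (mixture_weights, proxy_score)
```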
SWE-smith: Scaling Data for Software Engineering Agents
John Yang, Kilian Lieret, Carlos Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang
Abstract
Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, ChangHao Li, Ian Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai
Abstract
We introduce MLE-Dojo, a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo’s flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.
Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Yao Xiuqi, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Abstract
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Div Garg, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani
Abstract
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.
Deep Learning
(e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective
Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu
Abstract
The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse—a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.
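As a toy illustration of diversity-preserving selection (a crude stand-in for an entropy criterion, not the paper's method), one could greedily keep synthetic samples whose cluster is currently least represented, which keeps the cluster histogram, and hence its entropy, as high as possible under a fixed budget.

```python
# Toy diversity-preserving selection of synthetic samples: greedily keep the
# candidate whose cluster is least represented so far (illustrative only).
from collections import Counter

def select_diverse(samples, cluster_of, budget):
    """samples: list of items; cluster_of(item) -> cluster id; budget: int."""
    kept, counts, pool = [], Counter(), list(samples)
    while pool and len(kept) < budget:
        x = min(pool, key=lambda s: counts[cluster_of(s)])  # rarest cluster first
        pool.remove(x)
        kept.append(x)
        counts[cluster_of(x)] += 1
    return kept
```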
AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, XinQiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
Abstract
While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation—a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the “plug-in” direction of a USB or the “handle” direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrate the effectiveness and generalization of SoFar, e.g., a zero-shot success rate of 48.7% on OpenDOR and a 58.3% success rate in the SIMPLER WidowX setting.
Variational Learning Finds Flatter Solutions at the Edge of Stability
Avrajit Ghosh, Bai Cong, Rio Yokota, Saiprasad Ravishankar, Rongrong Wang, Molei Tao, Mohammad Emtiyaz Khan, Thomas Möllenhoff
Abstract
The performance of Variational Learning (VL) for deep neural networks has consistently been improving over the years and is now on par with the standard optimizers. Part of its empirical success can be explained by theories such as PAC-Bayes bounds, minimum description length and marginal likelihood, but there are few tools to unravel the implicit regularization in play. Here, we use the Edge of Stability (EoS) to understand the implicit regularization of VL. EoS has previously been used to show that gradient descent can find flat solutions and we extend this result to VL to show that it can find even flatter solutions. This is obtained by simply controlling the posterior covariance and the number of Monte Carlo samples from the posterior. These results are derived in a similar fashion as the standard EoS literature for deep learning by first deriving a result for a quadratic problem and then extending it to general loss functions. We empirically validate these findings on a wide variety of large networks, such as ResNet and ViT, to find that the theoretical results closely match the empirical ones. Ours is the first work to use EoS for VL and show its effectiveness for deep learning.
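For background, the edge-of-stability phenomenon for plain gradient descent (the known result the abstract extends) says that training tends to drive the loss sharpness up to roughly the stability threshold set by the step size:

```latex
\[
  \lambda_{\max}\!\big(\nabla^2 L(\theta_t)\big) \;\approx\; \frac{2}{\eta},
\]
% where $\eta$ is the learning rate and $\lambda_{\max}$ is the largest Hessian
% eigenvalue (the sharpness); a lower equilibrium sharpness means a flatter
% solution. The paper argues variational learning equilibrates at an even lower
% sharpness, controlled by the posterior covariance and the number of Monte
% Carlo samples; the precise thresholds are derived in the paper, not here.
```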
Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs
Hao Kang, Qingru Zhang, Han Cai, Weiyuan Xu, Tushar Krishna, Yilun Du, Tsachy Weissman
Abstract
Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency–quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that the optimal latency–quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading. These results underscore the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents.
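A minimal sketch of latency-aware configuration selection, the kind of decision rule such an adaptive framework has to make, is shown below; the configurations, numbers, and rule are illustrative placeholders, not FPX itself.

```python
# Hypothetical latency-aware selection: pick the highest-quality configuration
# whose measured latency fits the per-decision deadline (illustrative numbers).
CONFIGS = [
    # (name, estimated_quality, measured_latency_seconds)
    ("32b-fp16", 0.90, 2.40),
    ("8b-fp16", 0.80, 0.70),
    ("8b-int4", 0.75, 0.35),
    ("1.5b-int4", 0.60, 0.12),
]

def pick_config(deadline_s, configs=CONFIGS):
    feasible = [c for c in configs if c[2] <= deadline_s]
    if not feasible:                               # nothing fits the budget:
        return min(configs, key=lambda c: c[2])    # fall back to the fastest
    return max(feasible, key=lambda c: c[1])       # best quality within budget

print(pick_config(0.5))  # -> ('8b-int4', 0.75, 0.35)
```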
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang, Niao He, Hanjun Dai, Bo Dai
Abstract
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.
Ask a Strong LLM Judge when Your Reward Model is Uncertain
Zhenghao Xu, Qin Lu, Qingru Zhang, Liang Qiu, Ilgee Hong, Changlong Yu, Wenlin Yao, Yao Liu, Haoming Jiang, Lihong Li, Hyokun Yun, Tuo Zhao
Abstract
Reward models (RMs) play a pivotal role in reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
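In spirit, the routing rule is simple: treat the reward-model score gap as a preference probability and escalate to the LLM judge only when that probability is close to a coin flip. The sketch below uses a Bradley-Terry style link and a threshold picked arbitrarily; it is an illustration, not the paper's estimator.

```python
# Illustrative uncertainty-based routing between a fast reward model (RM) and
# a strong LLM judge (threshold and helper names are placeholders).
import math

def preference_prob(reward_a, reward_b):
    """P(response a preferred over b) under a Bradley-Terry / logistic model."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def judge_pair(reward_a, reward_b, llm_judge, uncertainty_band=0.1):
    p = preference_prob(reward_a, reward_b)
    if abs(p - 0.5) < uncertainty_band:   # RM is uncertain: call the LLM judge
        return llm_judge()
    return p > 0.5                        # RM is confident: use it directly
```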
Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models
Sophia Simeng Han, Stephen Xia, Grant Zhang, Howard Dai, Chen Liu, Lichang Chen, Hoang H Nguyen, Hongyuan Mei, Jiayuan Mao, R. Thomas McCoy
Abstract
Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) self-correcting solutions based on gold solutions; (3) producing step-by-step sketches of solutions; and (4) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.
Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation
Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang
Abstract
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the hidden representations of an MLLM’s LLM. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next (text) token prediction. Moreover, through extensive probing, we observe improved visual representation quality due to embedding optimization, underscoring the effectiveness of our probing setup. We demonstrate that our VisPer-LM outperforms the single and multi-encoder baselines, proving our approach’s superiority over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms
Yinuo Ren, Haoxuan Chen, Yuchen Zhu, Wei Guo, Yongxin Chen, Grant Rotskoff, Molei Tao, Lexing Ying
Abstract
Discrete diffusion models have emerged as a powerful generative modeling framework for discrete data with successful applications spanning from text generation to image synthesis. However, their deployment faces challenges due to the high dimensionality of the state space, necessitating the development of efficient inference algorithms. Current inference approaches mainly fall into two categories: exact simulation and approximate methods such as $\tau$-leaping. While exact methods suffer from unpredictable inference time and redundant function evaluations, $\tau$-leaping is limited by its first-order accuracy. In this work, we advance the latter category by tailoring the first extension of high-order numerical inference schemes to discrete diffusion models, enabling larger step sizes while reducing error. We rigorously analyze the proposed schemes and establish the second-order accuracy of the $\theta$-trapezoidal method in KL divergence. Empirical evaluations on GPT-2 level text and ImageNet-level image generation tasks demonstrate that our method achieves superior sample quality compared to existing approaches under equivalent computational constraints.
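For reference, the first-order baseline the abstract mentions, τ-leaping, freezes the jump rates of the reverse-time Markov chain over each step. Written as an Euler-style approximation of the transition kernel (our simplified notation, not the paper's exact scheme), it reads:

```latex
% First-order ($\tau$-leaping / Euler) step for a continuous-time Markov chain
% with rate matrix $Q_t$, in simplified notation:
\[
  p(x_{t+\tau} = y \mid x_t) \;\approx\; \delta_{x_t}(y) + \tau\, Q_t(y \mid x_t),
\]
% which is accurate to first order in $\tau$. Higher-order schemes such as the
% $\theta$-trapezoidal rule analyzed in the paper combine rate evaluations at
% both ends of the step to reach second-order accuracy.
```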
Fast-SLM: Towards Latency-Optimal Hybrid Small Language Models
Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van Keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan (Celine) Lin, Pavlo Molchanov
Abstract
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work in SLM design mainly focuses on reducing the number of parameters to deliver parameter-optimal SLMs, parameter efficiency may not necessarily translate into proportional real-device speed-ups. This work aims to provide a systematic exploration and roadmap for latency-optimal SLMs. Our goal is to identify critical determinants of SLMs’ real-device latency and provide generalizable principles and methodologies for SLM design and training when real-device latency becomes the primary consideration. Specifically, we first analyze two central architecture design factors: depth-width ratios and the involved operators. We find that although deep-thin models generally lead to better accuracy under the same parameter budget, they may not lie on the frontier of the accuracy-latency trade-off. To identify the latency-optimal depth-width ratio, we augment previous scaling laws by relating model loss to both model depth and width, thus enabling determination of the sweet spot depth-width ratio when combined with device-specific profiling. Additionally, we explore emerging efficient attention alternatives to understand their potential as candidate building operators. Using the identified promising operators, we build an evolutionary search framework to automatically pinpoint optimal latency combinations of these operators into hybrid SLMs to push the accuracy-latency frontier. In addition to architectural improvements, we further analyze and enhance SLM training by enabling more effective weight updates and improving cache initialization, which are generalizable add-on components for future SLMs. Combining these contributions, we introduce a new family of hybrid SLMs, called Fast-SLM, which significantly advances the accuracy–latency trade-off frontier of state-of-the-art SLMs, e.g., achieving 2.57% higher accuracy, 1.46$\times$ speed-up, and over 10$\times$ cache reduction compared to Llama3.2-3B.
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie
Abstract
This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images (< 256px), FlexVAR can: (1) Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images. (2) Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion. (3) Adapt to various autoregressive steps, allowing for faster inference with fewer steps or enhancing image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256 × 256 benchmark. Moreover, when zero-shot transferring the image generation process to 13 steps, the performance further improves to 2.08 FID, outperforming state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512 × 512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, which is a fully supervised model trained at 512 × 512 resolution.
Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models
Haoyu Wang, Peihao Wang, Mufei Li, Shikun Liu, Siqi Miao, Zhangyang “Atlas” Wang, Pan Li
Abstract
Modern large language models (LLMs) are inherently auto-regressive, requiring input to be serialized into flat sequences regardless of their structural dependencies. This serialization hinders the model’s ability to leverage structural inductive biases, especially in tasks such as retrieval-augmented generation (RAG) and reasoning on data with native graph structures, where inter-segment dependencies are crucial. We introduce Graph-KV with the potential to overcome this limitation. Graph-KV leverages the KV-cache of text segments as condensed representations and governs their interaction through structural inductive biases. In this framework, “target” segments selectively attend only to the KV-caches of their designated “source” segments, rather than all preceding segments in a serialized sequence. This approach induces a graph-structured block mask, sparsifying attention and enabling a message-passing-like step within the LLM. Furthermore, strategically allocated positional encodings for source and target segments reduce positional bias and context window consumption. We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) Arxiv-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings.
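To picture the structural bias, here is a toy construction of the graph-structured block attention mask the abstract alludes to: each target segment may read only the KV-cache blocks of its designated source segments, plus its own. The helper below is illustrative, not the Graph-KV code.

```python
# Toy graph-structured block attention mask: True means "may attend".
import torch

def block_mask(segment_lengths, edges):
    """segment_lengths: tokens per segment; edges: (source_seg, target_seg) pairs."""
    total = sum(segment_lengths)
    starts = [sum(segment_lengths[:i]) for i in range(len(segment_lengths))]
    span = lambda i: slice(starts[i], starts[i] + segment_lengths[i])
    allowed = torch.zeros(total, total, dtype=torch.bool)
    for i in range(len(segment_lengths)):   # every segment attends within itself
        allowed[span(i), span(i)] = True
    for src, tgt in edges:                  # targets attend to their sources only
        allowed[span(tgt), span(src)] = True
    return allowed

# Example: segment 2 conditions on segments 0 and 1, which do not see each other.
print(block_mask([2, 2, 3], edges=[(0, 2), (1, 2)]).int())
```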
Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs
ChangHao Li, Yuchen Zhuang, Rushi Qiang, Haotian Sun, Hanjun Dai, Chao Zhang, Bo Dai
Abstract
Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation, which requires additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Matryoshka Pilot (M-Pilot), a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with M-Pilot serving as a policy to provide intermediate guidance through prompts for driving the black-box LLM. M-Pilot is trained to pivot the outputs of the black-box LLM toward alignment with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on diverse tasks demonstrate that our method effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks.
Model Provenance Testing for Large Language Models
Ivica Nikolic, Teodora Baluta, Prateek Saxena
Abstract
Large language models are increasingly customized through fine-tuning and other adaptations, creating challenges in enforcing licensing terms and managing downstream impacts such as protecting intellectual property or identifying vulnerabilities. We address this challenge by developing a framework for testing model provenance. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs that can be detected through statistical analysis. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models. On two comprehensive real-world benchmarks spanning models from 30M to 4B parameters and comprising over 600 models, our tester achieves 90-95% precision and 80-90% recall in identifying derived models. These results demonstrate the viability of systematic provenance verification in production environments even when only API access is available.
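A bare-bones version of the idea, comparing a candidate/parent pair's output agreement against a baseline of unrelated pairs and reporting an empirical p-value, could look like the sketch below (our simplification of the statistical test, not the authors' implementation).

```python
# Sketch of black-box provenance testing via output agreement (simplified).
def agreement(model_a, model_b, prompts):
    """Fraction of prompts on which two (callable) models give the same output."""
    return sum(model_a(p) == model_b(p) for p in prompts) / len(prompts)

def provenance_pvalue(candidate, claimed_parent, unrelated_pairs, prompts):
    observed = agreement(candidate, claimed_parent, prompts)
    baseline = [agreement(a, b, prompts) for a, b in unrelated_pairs]
    # Empirical p-value: how often unrelated models look at least this similar.
    exceed = sum(b >= observed for b in baseline)
    return (1 + exceed) / (1 + len(baseline))
```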
Momentum Multi-Marginal Schrödinger Bridge Matching
Panagiotis Theodoropoulos, Augustinos Saravanos, Evangelos Theodorou, Guan-Horng Liu
Abstract
Understanding complex systems by inferring trajectories from sparse sample snapshots is a fundamental challenge in a wide range of domains, e.g., single-cell biology, meteorology, and economics. Despite advancements in Bridge and Flow matching frameworks, current methodologies rely on pairwise interpolation between adjacent snapshots. This hinders their ability to capture long-range temporal dependencies and potentially affects the coherence of the inferred trajectories. To address these issues, we introduce Momentum Multi-Marginal Schrödinger Bridge Matching (3MSBM), a novel matching framework that learns smooth measure-valued splines for stochastic systems that satisfy multiple positional constraints. This is achieved by lifting the dynamics to phase space and generalizing stochastic bridges to be conditioned on several points, forming a multi-marginal conditional stochastic optimal control problem. The underlying dynamics are then learned by minimizing a variational objective, having fixed the path induced by the multi-marginal conditional bridge. As a matching approach, 3MSBM learns transport maps that preserve intermediate marginals throughout training, significantly improving convergence and scalability. Extensive experimentation in a series of real-world applications validates the superior performance of 3MSBM compared to existing methods in capturing complex dynamics with temporal dependencies, opening new avenues for training matching frameworks in multi-marginal settings.
PDPO: Parametric Density Path Optimization
Sebastian Gutierrez Hernandez, Peng Chen, Hao-Min Zhou
Abstract
We introduce Parametric Density Path Optimization (PDPO), a novel method for computing action-minimizing paths between probability densities. The core idea is to represent the target probability path as the pushforward of a reference density through a parametric map, transforming the original infinite-dimensional optimization over densities to a finite-dimensional one over the parameters of the map. We derive a static formulation of the dynamic problem of action minimization and propose cubic spline interpolation of the path in parameter space to solve the static problem. Theoretically, we establish an error bound of the action under proper assumptions on the regularity of the parameter path. Empirically, we find that using 3–5 control points of the spline interpolation suffices to accurately resolve both multimodal and high-dimensional problems. We demonstrate that PDPO can flexibly accommodate a wide range of potential terms, including those modeling obstacles, mean-field interactions, stochastic control, and higher-order dynamics. Our method outperforms existing state-of-the-art approaches in benchmark tasks, demonstrating superior computational efficiency and solution quality.
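The parametric-pushforward-plus-spline idea can be sketched in a few lines. The sketch below assumes a Gaussian reference density and a simple affine map whose parameters follow a cubic spline through three control points; the control points, map family, and finite-difference action estimate are illustrative and not the paper's formulation.

```python
# Minimal sketch of the PDPO idea under simplifying assumptions: the density path is
# the pushforward of a fixed Gaussian reference through an affine map T(z) = mu + sigma*z,
# and the parameter path theta(t) is a cubic spline through a few control points.
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)
z = rng.standard_normal((2000, 2))              # reference samples, fixed along the path

# Hypothetical control points for theta = (mu_x, mu_y, log_sigma) at t = 0, 0.5, 1.
t_ctrl = np.array([0.0, 0.5, 1.0])
theta_ctrl = np.array([[-2.0, 0.0, 0.0],
                       [0.0, 1.5, -0.5],
                       [2.0, 0.0, 0.0]])
theta = CubicSpline(t_ctrl, theta_ctrl, axis=0)  # finite-dimensional path in parameter space

def pushforward(t):
    params = theta(t)
    return params[:2] + np.exp(params[2]) * z    # samples from the density at time t

# Crude kinetic-action estimate: average squared particle velocity along the path,
# approximated with finite differences over a time grid.
ts = np.linspace(0.0, 1.0, 51)
xs = np.stack([pushforward(t) for t in ts])      # (time, particles, dim)
vel = np.diff(xs, axis=0) / np.diff(ts)[:, None, None]
action = 0.5 * np.mean(np.sum(vel ** 2, axis=-1))
print(f"estimated kinetic action: {action:.3f}")
```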
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Du, Yelong Shen
Abstract
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the “grokking” phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B’s performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR.
Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning
Feng Chen, Allan Raventós, Nan Cheng, Surya Ganguli, Shaul Druckmann
Abstract
Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in N independent samples. We show, surprisingly, that training with cross-entropy (CE) can be misaligned with pass@N in that pass@N accuracy decreases with longer CE training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore we suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance. Our algorithm demonstrates improved mathematical reasoning on MATH and MiniF2F benchmarks under several scenarios: (1) providing answers to math questions both with and without Chain-of-Thought reasoning traces; and (2) proving theorems by searching over proof trees of varying shapes. Overall our work underscores the importance of co-designing two traditionally separate phases of LLM development: training-time protocols and test-time search and reasoning strategies.
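For readers unfamiliar with the test-time strategy being optimized, the standard unbiased pass@N estimator (computed from n samples of which c are correct) is shown below; the paper's modified, confidence-limiting training loss is not reproduced here.

```python
# The pass@N metric the paper targets: the probability that at least one of N
# independent samples is correct, estimated without bias as 1 - C(n - c, N) / C(n, N)
# from n samples with c correct. The confidence-limiting loss itself is not shown.
from math import comb

def pass_at_n(n: int, c: int, N: int) -> float:
    """Unbiased estimate of pass@N from n samples with c correct (requires n >= N)."""
    if n - c < N:
        return 1.0
    return 1.0 - comb(n - c, N) / comb(n, N)

# With the same average accuracy, concentrating probability mass on a few answers
# (overconfidence) gains far less from large N than spreading successes out --
# the misalignment between cross-entropy training and pass@N that the paper studies.
print(pass_at_n(n=100, c=10, N=16))   # ~0.84 when 10% of samples are correct
print(pass_at_n(n=100, c=1, N=16))    # 0.16 when only 1% are correct
```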
Taming generative world models for zero-shot optical flow extraction
Seungwoo Kim, Khai Loong Aw, Klemen Kotar, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel Yamins
Abstract
Extracting dense motion (optical flow) from videos remains a core computer-vision problem. Motivated by the recent success of large general-purpose models, we ask whether frozen self-supervised video world models trained only to predict future frames can be prompted, without fine-tuning, to output flow. Prior attempts to read out depth or illumination from video generators required fine-tuning; that strategy is ill-suited for flow, where labeled data are scarce and synthetic datasets suffer from a sim-to-real gap. We study several popular generative model architectures and find that successful zero-shot flow extraction requires three model properties: (1) distributional prediction of future frames (avoiding blurry or noisy outputs); (2) factorized latents that treat each spatio-temporal patch independently; and (3) random-access decoding that can condition on any subset of future pixels. These criteria are met by the recently introduced Local Random Access Sequence (LRAS) architecture. Building on LRAS, we propose KL-tracing: a procedure for injecting a small, local perturbation into the first frame, rolling out the model one step, and computing the Kullback–Leibler divergence between perturbed and unperturbed predictive distributions. The KL peak traces the displacement field, yielding optical flow in a single forward pass. Our method outperforms state-of-the-art models on real-world TAP-Vid DAVIS dataset (16.6% relative improvement for endpoint error) and synthetic TAP-Vid Kubric (4.7% relative improvement), despite being trained on real-world videos. Our results indicate that prompting controllable, self-supervised world models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality optical flow.
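The KL-tracing readout itself is simple enough to sketch. In the code below the frozen world model is replaced by a stub that returns per-patch categorical predictive distributions, so only the perturb-rollout-compare logic is illustrated; the patch grid, perturbation, and toy model are assumptions, not the authors' implementation.

```python
# Sketch of the KL-tracing readout. `predict_dist` stands in for a frozen generative
# world model that returns a per-patch categorical predictive distribution over
# next-frame patch codes; here it is a stub, so only the readout logic is shown.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q), axis=-1)

def kl_trace(predict_dist, frame0, perturb_at):
    """Perturb one location in frame0, roll the model one step for both the clean and
    perturbed inputs, and return the location where the predictive distributions
    diverge most -- the traced endpoint of the flow vector."""
    clean = predict_dist(frame0)                          # (H, W, vocab)
    perturbed_frame = frame0.copy()
    perturbed_frame[perturb_at] += 1.0                    # small local perturbation
    perturbed = predict_dist(perturbed_frame)
    kl_map = kl(perturbed, clean)                         # (H, W)
    return np.unravel_index(np.argmax(kl_map), kl_map.shape)

# Toy stand-in for the model: shifts content one patch to the right and returns
# softmax predictions peaked on the (quantized) shifted content.
def toy_model(frame, vocab=8):
    shifted = np.roll(frame, shift=1, axis=1)
    logits = -0.5 * (shifted[..., None] - np.arange(vocab)) ** 2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

frame = np.zeros((8, 8))
print(kl_trace(toy_model, frame, perturb_at=(3, 3)))      # -> (3, 4): one step right
```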
Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models
Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
Abstract
Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. Rather than producing structured, externally provided rationales, Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities such as self-reflection, hypothetical reasoning, and divergent reasoning. To elicit these reasoning abilities, we first warm up the models by supervised fine-tuning (SFT) over long CoT data. We then further improve the model’s long-horizon abilities by rule-based reinforcement learning (RL). In addition, we propose a novel pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards, eliminating the need for pointwise reward conversion and enabling more effective use of Think-RM outputs. Experiments show that Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches. This depth-oriented approach not only broadens the GenRM design space but also establishes a new paradigm for preference-based policy optimization in RLHF.
Weaver: Shrinking the Generation-Verification Gap by Scaling Compute for Verification
Jon Saad-Falcon, Estefany Kelly Buchanan, Mayee Chen, Tzu-Heng (Brian) Huang, Brendan McLaughlin, Tanvir Bhathal, Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré
Abstract
Verifiers can improve language model (LM) capabilities by providing feedback or selecting the best response from a pool of generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean for formal proofs). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers. To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. First we find that weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in the verifiers. To reduce the dependency on labeled data, Weaver leverages weak supervision to estimate each verifier’s accuracy and combines their outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses several challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these challenges by using dataset statistics to normalize outputs and filter specific verifiers. We study the effectiveness of Weaver in repeated sampling settings, where a model generates multiple candidate responses at test time and a verifier is used to select the correct one. Our evaluations demonstrate that Weaver significantly improves the pass@1 performance across several reasoning and math tasks, achieving o3-mini level accuracy with Llama 3.3 70B Instruct (a much cheaper non-reasoning model) as the generator, and an ensemble of smaller judge and reward models as the verifiers (86.2% average). This gain mirrors the jump achieved between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training interventions. To make Weaver more efficient, we train a compact 400M cross-encoder using Weaver’s combined output scores. This distilled model retains 98.7% of Weaver’s full accuracy while reducing verification compute by up to 99.97%.
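A toy version of the verifier-fusion step conveys the flavor. Weaver estimates per-verifier accuracies with weak supervision and label-free dataset statistics; the sketch below substitutes a crude stand-in (agreement with the majority vote) and fuses binary verdicts with naive-Bayes weighted log-odds, so every number and heuristic here is an assumption rather than the paper's method.

```python
# Illustrative sketch of combining weak verifiers into one score. Weaver estimates
# verifier accuracies with weak supervision; as a crude stand-in we use agreement with
# the majority vote, then fuse verdicts with weighted log-odds. Output normalization
# and verifier filtering from the paper are omitted.
import numpy as np

def estimate_accuracies(votes):
    """votes: (num_candidates, num_verifiers) binary verdicts."""
    majority = (votes.mean(axis=1) > 0.5).astype(float)
    acc = (votes == majority[:, None]).mean(axis=0)       # agreement with the majority
    return np.clip(acc, 0.55, 0.99)                        # keep weights finite

def weaver_like_score(votes, acc):
    """Weighted log-odds fusion: more accurate verifiers get larger weights."""
    w = np.log(acc / (1.0 - acc))
    return votes @ w - (1.0 - votes) @ w                   # one score per candidate

# Toy usage: pick the best of 5 candidate responses judged by 4 verifiers.
votes = np.array([[1, 1, 0, 1],
                  [0, 0, 0, 1],
                  [1, 0, 1, 1],
                  [0, 1, 0, 0],
                  [1, 1, 1, 1]], dtype=float)
acc = estimate_accuracies(votes)
print("selected candidate:", int(np.argmax(weaver_like_score(votes, acc))))
```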
Evaluation
(e.g., methodology, meta studies, replicability and validity, human-in-the-loop)
Grids Often Outperform Implicit Neural Representations
Namhoon Kim, Sara Fridovich-Keil
Abstract
Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior are poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for most tasks and signals, a simple regularized grid with interpolation trains faster and to higher quality than any INR with the same number of parameters. We also find limited settings, namely fitting binary signals such as shape contours, where INRs outperform grids, to guide future use of INRs towards the most advantageous applications.
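The "simple regularized grid with interpolation" baseline is easy to write down. The sketch below fits a learnable 2D grid with bilinear interpolation and a total-variation regularizer to samples of a toy signal; the grid size, learning rate, and regularization weight are illustrative guesses, not the paper's settings.

```python
# Minimal sketch of a regularized-grid signal representation: a learnable H x W grid,
# bilinearly interpolated at query coordinates, fit to samples with a total-variation
# regularizer. Sizes and weights below are illustrative, not the paper's settings.
import torch
import torch.nn.functional as F

class BilinearGrid(torch.nn.Module):
    def __init__(self, h=64, w=64):
        super().__init__()
        self.grid = torch.nn.Parameter(torch.zeros(1, 1, h, w))

    def forward(self, coords):                    # coords in [-1, 1], shape (N, 2)
        g = coords.view(1, -1, 1, 2)              # (1, N, 1, 2) sampling grid
        out = F.grid_sample(self.grid, g, mode="bilinear", align_corners=True)
        return out.view(-1)                       # (N,) interpolated values

    def tv(self):                                 # total-variation regularizer
        dx = (self.grid[..., :, 1:] - self.grid[..., :, :-1]).abs().mean()
        dy = (self.grid[..., 1:, :] - self.grid[..., :-1, :]).abs().mean()
        return dx + dy

torch.manual_seed(0)
coords = torch.rand(4096, 2) * 2 - 1              # random query points
target = torch.sin(3 * coords[:, 0]) * torch.cos(3 * coords[:, 1])

model = BilinearGrid()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = F.mse_loss(model(coords), target) + 1e-4 * model.tv()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```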
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally
Agam Shah, Siddhant Sukhani, Huzaifa Pardawala, Saketh Budideti, Riya Bhadani, Rudra Gopal, Siddhartha Somani, Michael Galarnyk, Soungmin Lee, Arnav Hiray, Akshar Ravichandran, Eric Kim, Pranav Aluru, Joshua Zhang, Sebastian Jaskowski, Veer Guda, Meghaj Tarte, Liqin Ye, Spencer Gosden, Rutwik Routu, Rachel Yuh, Sloka Chava, Sahasra Chava, Dylan Patrick Kelly, Aiden Chiang, Harsit Mittal, Sudheer Chava
Abstract
Central banks around the world play a crucial role in maintaining economic stability. Deciphering policy implications in their communications is essential, especially as misinterpretations can disproportionately impact vulnerable populations. To address this, we introduce the World Central Banks (WCB) dataset, the most comprehensive monetary policy corpus to date, comprising over 380k sentences from 25 central banks across diverse geographic regions, spanning 28 years of historical data. After uniformly sampling 1k sentences per bank (25k total) across all available years, we annotate and review each sentence using dual annotators, disagreement resolutions, and secondary expert reviews. We define three tasks: Stance Detection, Temporal Classification, and Uncertainty Estimation, with each sentence annotated for all three. We benchmark seven Pretrained Language Models (PLMs) and nine Large Language Models (LLMs) (Zero-Shot, Few-Shot, and with annotation guide) on these tasks, running 15,075 benchmarking experiments. We find that a model trained on aggregated data across banks significantly surpasses a model trained on an individual bank’s data, confirming the principle “the whole is greater than the sum of its parts.” Additionally, rigorous human evaluations, error analyses, and predictive tasks validate our framework’s economic utility. Our artifacts are accessible through HuggingFace and GitHub under the CC-BY-NC-SA 4.0 license.
General Machine Learning
(supervised, unsupervised, online, active, etc.)
Dense Associative Memory with Epanechnikov Energy
Benjamin Hoover, Zhaoyang Shi, Krishnakumar Balasubramanian, Dmitry Krotov, Parikshit Ram
Abstract
We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Uniquely, it introduces abundant additional emergent local minima while preserving perfect pattern recovery — a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR’s emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method’s potential for both large-scale memory storage and generative tasks.
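One plausible instantiation of the two energies, consistent with the abstract but not necessarily the paper's exact parameterization, contrasts the classic log-sum-exp DenseAM energy with a log-sum-ReLU energy built from Epanechnikov (truncated quadratic) kernels; the bandwidth and similarity choices below are assumptions.

```python
# One plausible instantiation consistent with the abstract (the paper's exact
# parameterization may differ): log-sum-exp energy over pattern similarities versus a
# log-sum-ReLU energy built from Epanechnikov (truncated quadratic) kernels.
import numpy as np

def lse_energy(x, patterns, beta=4.0):
    sims = patterns @ x                                   # similarity to each stored memory
    m = sims.max()
    return -(m + np.log(np.sum(np.exp(beta * (sims - m)))) / beta)

def lsr_energy(x, patterns, bandwidth=1.0, eps=1e-12):
    d2 = np.sum((patterns - x) ** 2, axis=1)              # squared distances to memories
    kernels = np.maximum(0.0, 1.0 - d2 / bandwidth ** 2)  # Epanechnikov / ReLU kernel
    return -np.log(np.sum(kernels) + eps)

rng = np.random.default_rng(0)
patterns = rng.standard_normal((10, 16))                  # stored memories
query = patterns[0] + 0.05 * rng.standard_normal(16)      # noisy probe near memory 0
print(lse_energy(query, patterns), lsr_energy(query, patterns))
```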
Machine learning for sciences
(e.g. climate, health, life sciences, physics, social sciences)
3D Interaction Geometric Pre-training for Molecular Relational Learning
Namkyeong Lee, Yunhak Oh, Heewoong Noh, Gyoung S. Na, Minkai Xu, Hanchen Wang, Tianfan Fu, Chanyoung Park
Abstract
Molecular Relational Learning (MRL) is a rapidly growing field that focuses on understanding the interaction dynamics between molecules, which is crucial for applications ranging from catalyst engineering to drug discovery. Despite recent progress, earlier MRL approaches are limited to using only the 2D topological structure of molecules, as obtaining the 3D interaction geometry remains prohibitively expensive. This paper introduces a novel 3D geometric pre-training strategy for MRL (3DMRL) that incorporates a 3D virtual interaction environment, overcoming the limitations of costly traditional quantum mechanical calculation methods. With the constructed 3D virtual interaction environment, 3DMRL trains a 2D MRL model to learn the global and local 3D geometric information of molecular interaction. Extensive experiments on various tasks using real-world datasets, including out-of-distribution and extrapolation scenarios, demonstrate the effectiveness of 3DMRL, showing up to a 24.93% improvement in performance across 40 tasks. Our code is publicly available at https://anonymous.4open.science/r/3DMRL-5436.
KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment
Yuxing Lu, Wei Wu, Xukai Zhao, Rui Peng, Jinzhuo Wang
Abstract
Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution, which iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1% LLM-verified correctness and reducing conflict edges by 18.6% through multi-layer assessments.
SpecMER: Fast Protein Generation with K-mer Guided Speculative Decoding
Thomas Walton, Darin Tsui, Aryan Musharaf, Amirali Aghazadeh
Abstract
Autoregressive models have transformed protein engineering by enabling the generation of novel protein sequences beyond those found in nature. However, their sequential inference introduces significant latency, limiting their utility in high-throughput protein screening. Speculative decoding accelerates generation by employing a lightweight draft model to sample tokens, which a larger target model then verifies and refines. Yet in protein sequence generation, draft models are typically agnostic to the structural and functional constraints of the target protein, leading to biologically implausible outputs and a shift in the likelihood distribution of generated sequences. We introduce SpecMER (Speculative Decoding via k-mer Guidance), a novel framework that incorporates biological, structural, and functional priors using k-mer motifs extracted from multiple sequence alignments. By scoring candidate sequences in parallel and selecting those most consistent with known biological patterns, SpecMER significantly improves sequence plausibility while retaining the efficiency of speculative decoding. SpecMER achieves 24–32% speedups over standard autoregressive decoding, along with higher acceptance rates and improved sequence likelihoods.
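The k-mer guidance component is the part that can be sketched compactly: count k-mers from a multiple sequence alignment and rank speculative draft continuations by how consistent their k-mers are with those counts. The smoothing and the toy sequences below are illustrative assumptions; the draft and target protein language models themselves are out of scope here.

```python
# Sketch of the k-mer guidance component only: build k-mer counts from an MSA and
# rank speculative draft continuations by summed (smoothed) log-frequency of their
# k-mers. The draft/target language models are not shown.
from collections import Counter
import math

def kmer_counts(sequences, k=3):
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts

def kmer_log_score(candidate, counts, k=3, alpha=1.0):
    """Sum of add-alpha smoothed log frequencies of the candidate's k-mers."""
    total = sum(counts.values())
    vocab = max(len(counts), 1)
    score = 0.0
    for i in range(len(candidate) - k + 1):
        c = counts.get(candidate[i:i + k], 0)
        score += math.log((c + alpha) / (total + alpha * vocab))
    return score

# Toy MSA-like sequences and two hypothetical draft continuations.
msa = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQR"]
counts = kmer_counts(msa)
drafts = ["MKTAYIAK", "MQQWXYZA"]
print("accepted draft:", max(drafts, key=lambda d: kmer_log_score(d, counts)))
```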
KINDLE: Knowledge-Guided Distillation for Prior-Free Gene Regulatory Network Inference
Rui Peng, Yuchen Lu, Qichen Sun, Yuxing Lu, Chi Zhang, Ziru Liu, Jinzhuo Wang
Abstract
Gene regulatory network (GRN) inference serves as a cornerstone for deciphering cellular decision-making processes. Early approaches rely exclusively on gene expression data, so their predictive power remains fundamentally constrained by the vast combinatorial space of potential gene-gene interactions. Subsequent methods integrate prior knowledge to mitigate this challenge by restricting the solution space to biologically plausible interactions. However, we argue that the effectiveness of these approaches is contingent upon the precision of prior information and that the reduction in the search space will circumscribe the models’ potential for novel biological discoveries. To address these limitations, we introduce KINDLE, a three-stage framework that decouples GRN inference from prior knowledge dependencies. KINDLE trains a teacher model that integrates prior knowledge with temporal gene expression dynamics and subsequently distills this encoded knowledge to a student model, enabling accurate GRN inference solely from expression data without access to any prior. KINDLE achieves state-of-the-art performance across four benchmark datasets. Notably, it successfully identifies key transcription factors governing mouse embryonic development and precisely characterizes their functional roles. In mouse hematopoietic stem cell data, KINDLE accurately predicts fate transition outcomes following knockout of two critical regulators (Gata1 and Spi1). These biological validations demonstrate our framework’s dual capability in maintaining topological inference precision while preserving discovery potential for novel biological mechanisms.
Towards Doctor-Like Reasoning: Medical RAG Fusing Knowledge with Patient Analogy through Textual Gradients
Yuxing Lu, Gecheng Fu, Wei Wu, Xukai Zhao, Sin Yee Goi, Jinzhuo Wang
Abstract
Existing medical RAG systems mainly leverage knowledge from medical knowledge bases, neglecting the crucial role of experiential knowledge derived from similar patient cases – a key component of human clinical reasoning. To bridge this gap, we propose DoctorRAG, a RAG framework that emulates doctor-like reasoning by integrating both explicit clinical knowledge and implicit case-based experience. DoctorRAG enhances retrieval precision by first allocating conceptual tags for queries and knowledge sources, together with a hybrid retrieval mechanism drawing on both relevant knowledge and similar patient cases. In addition, a Med-TextGrad module using multi-agent textual gradients is integrated to ensure that the final output adheres to the retrieved knowledge and patient query. Comprehensive experiments on multilingual, multitask datasets demonstrate that DoctorRAG significantly outperforms strong baseline RAG models and gains improvements from iterative refinements. Our approach generates more accurate, relevant, and comprehensive responses, taking a step towards more doctor-like medical reasoning systems.
Neuroscience and cognitive science
(e.g., neural coding, brain-computer interfaces)
Adversarial Training for Generalized and Invariant Single-Neuron In-Vivo Activity Representation
Wei Wu, Yuxing Lu, Zhengrui Guo, Chi Zhang, Can Liao, Yifan Bu, Fangxu Zhou, Jinzhuo Wang
Abstract
In computational neuroscience, models representing single-neuron in-vivo activity have become essential for understanding the functional identities of individual neurons. These models, such as implicit representation methods based on Transformer architectures, contrastive learning frameworks, and variational autoencoders, aim to capture the invariant and intrinsic computational features of single neurons. The learned single-neuron computational role representations should remain invariant across changing environment and are affected by their molecular expression and location. Thus, the representations allow for in vivo prediction of the molecular cell types and anatomical locations of single neurons, facilitating advanced closed-loop experimental designs. However, current models face the problem of limited generalizability. This is due to batch effects caused by differences in experimental design, animal subjects, and recording platforms. These confounding factors often lead to overfitting, reducing the robustness and practical utility of the models across various experimental scenarios. Previous studies have not rigorously evaluated how well the models generalize to new animals or stimulus conditions, creating a significant gap in the field. To solve this issue, we present a comprehensive experimental protocol that explicitly evaluates model performance on unseen animals and stimulus types. Additionally, we propose a model-agnostic adversarial training strategy. In this strategy, a discriminator network is used to eliminate batch-related information from the learned representations. The adversarial framework forces the representation model to focus on the intrinsic properties of neurons, thereby enhancing generalizability. Our approach is compatible with all major single-neuron representation models and significantly improves model robustness. This work emphasizes the importance of generalization in single-neuron representation models and offers an effective solution, paving the way for the practical application of computational models in vivo. It also shows potential for building unified atlases based on single-neuron in vivo activity.
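The model-agnostic adversarial strategy can be illustrated with a gradient reversal layer: a discriminator predicts the recording batch from the neuron embedding, and the reversed gradient pushes the encoder to discard that information. The encoder and discriminator sizes, the number of batches, and the random data below are all placeholders; the paper's actual representation models and objective are richer.

```python
# Sketch of the adversarial batch-removal idea with a gradient reversal layer: a
# discriminator predicts the recording batch (animal/session) from the neuron
# embedding, while reversed gradients push the encoder to remove that information.
# Sizes, labels, and data are illustrative placeholders.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None          # flip the gradient for the encoder

encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 32))
batch_discriminator = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(batch_discriminator.parameters()), lr=1e-3)

activity = torch.randn(256, 100)             # toy single-neuron activity features
batch_id = torch.randint(0, 4, (256,))       # which animal/session each neuron came from

for step in range(100):
    z = encoder(activity)
    logits = batch_discriminator(GradReverse.apply(z, 1.0))
    adv_loss = nn.functional.cross_entropy(logits, batch_id)
    # In the full model this term is added to the main representation-learning objective.
    opt.zero_grad()
    adv_loss.backward()
    opt.step()
print(f"final adversarial loss: {adv_loss.item():.2f} (chance level is ln(4) = 1.39)")
```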
Large Language Models Think Too Fast To Explore Effectively
Lan Pan, Hanbo Xie, Robert Wilson
Abstract
Large Language Models (LLMs) have emerged with many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore—an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with those traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. Results indicate that traditional LLMs, such as GPT-4o, exhibit a significantly faster and less detailed reasoning process, limiting their exploratory performance. In contrast, the DeepSeek reasoning model demonstrates prolonged, iterative thought processes marked by repetitive analysis of combinations and past trials, reflecting a more thorough and human-like exploration strategy. Representational analysis of the models with Sparse Autoencoders (SAE) revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.
MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding
Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun
Abstract
Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain’s high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding.
Optimization
(e.g., convex and non-convex, stochastic, robust)
Carbon Aware Transformers Through Joint Model-Hardware Optimization
Irene Wang, Mostafa Elhoushi, H Ekin Sumbul, Samuel Hsia, Daniel Jiang, Newsha Ardalani, Divya Mahajan, Carole-Jean Wu, Bilge Acun
Abstract
Machine learning solutions are rapidly adopted to enable a variety of key use cases, from conversational AI assistants to scientific discovery. As the adoption of machine learning models becomes increasingly prevalent, the associated lifecycle carbon footprint is expected to increase, including both operational carbon from training and inference and embodied carbon from AI hardware manufacturing. We introduce CATransformers, the first carbon-aware co-optimization framework for Transformer-based models and hardware accelerators. By integrating both operational and embodied carbon into early-stage design space exploration, CATransformers enables sustainability-driven model architecture and hardware accelerator co-design that reveals fundamentally different trade-offs than latency- or energy-centric approaches. Evaluated across a range of Transformer models, CATransformers consistently demonstrates the potential to reduce total carbon emissions by up to 30% while maintaining accuracy and latency. We further highlight its extensibility through a focused case study on multi-modal models. Our results emphasize the need for holistic optimization methods that prioritize carbon efficiency without compromising model capability and execution time performance. Our framework will be open-sourced.
A Minimalist Example of Edge-of-Stability and Progressive Sharpening
Liming Liu, Zixuan Zhang, Simon Du, Tuo Zhao
Abstract
Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current research approaches, using either generalist frameworks or minimalist examples, face significant limitations in explaining these phenomena. This paper advances the minimalist approach by introducing a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant. Through this model, we rigorously prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish non-asymptotic analysis of the training dynamics and sharpness along the entire GD trajectory. Besides, we connect our minimalist example to existing works by reconciling the existence of a well-behaved “stable set” between minimalist and generalist analyses, and extending the analysis of Gradient Flow Solution sharpness to our two-dimensional input scenario. These findings provide new insights into the EoS phenomenon from both parameter and input data distribution perspectives, potentially informing more effective optimization strategies in deep learning practice.
Heterogeneous Graph Transformers for Simultaneous Mobile Multi-Robot Task Allocation and Scheduling under Temporal Constraints
Batuhan Altundas, Shengkang Chen, Shivika Singh, Shivangi Deo, Minwoo Cho, Matthew Gombolay
Abstract
Coordinating large teams of heterogeneous mobile agents to perform complex tasks efficiently has scalability bottlenecks in feasible and optimal task scheduling, with critical applications in logistics, manufacturing, and disaster response. Existing task allocation and scheduling methods, including heuristics and optimization-based solvers, often fail to scale and overlook inter-task dependencies and agent heterogeneity. We propose a novel Simultaneous Decision-Making model for Heterogeneous Multi-Agent Task Allocation and Scheduling (HM-MATAS), built on a Residual Heterogeneous Graph Transformer with edge and node-level attention. Our model encodes agent capabilities, travel times, and temporospatial constraints into a rich graph representation and is trainable via reinforcement learning. Trained on small-scale problems (10 agents, 20 tasks), our model generalizes effectively to significantly larger scenarios (up to 40 agents and 200 tasks), enabling fast, one-shot task assignment and scheduling. Our simultaneous model outperforms classical heuristics by assigning 47.10% more feasible tasks given temporal constraints in 3.83% of the time, metaheuristics by 68.60% in 0.01% of the time, and an exact solver by 101.48% in 0.03% of the time, while achieving a $20\times$-to-$250\times$ speedup over prior graph-based methods.
Probabilistic methods for dataset analysis and benchmarking
(e.g., variational inference, causal inference, Gaussian processes)
Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown
Emile Anand, Sarah Liaw
Abstract
Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that favors high-reward models, and it achieves the minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with approximate posteriors, common in large-scale or neural problems, has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across eight real-world and synthetic benchmarks. We compare regimes with exact posteriors (linear and logistic bandits) to approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Since FG-TS and its variants are competitive and easy to use, we recommend them as baselines in modern contextual-bandit benchmarks.
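To illustrate the "feel-good" tilt without the paper's MCMC machinery, the sketch below runs Thompson Sampling over a finite set of candidate parameter vectors and weights each candidate by its likelihood times an optimism bonus proportional to the reward it claims was achievable in hindsight; setting eta = 0 recovers vanilla TS. The finite model class, noise level, and bonus scale are simplifying assumptions.

```python
# Sketch of the Feel-Good tilt in Thompson Sampling over a finite candidate set
# (the paper studies exact and stochastic-gradient posteriors; a finite model class
# keeps this self-contained). Weights are proportional to
# exp(log-likelihood + eta * sum_t max_a <theta, phi(x_t, a)>); eta = 0 is vanilla TS.
import numpy as np

rng = np.random.default_rng(0)
d, n_models, n_actions = 5, 200, 10
models = rng.standard_normal((n_models, d))      # candidate parameter vectors
theta_star = rng.standard_normal(d)              # unknown true parameter

contexts, chosen_feats, rewards = [], [], []

def sample_model(eta=0.1, sigma2=0.25):
    if not rewards:
        return models[rng.integers(n_models)]
    X, r = np.stack(chosen_feats), np.array(rewards)
    log_lik = -np.sum((X @ models.T - r[:, None]) ** 2, axis=0) / (2 * sigma2)
    C = np.stack(contexts)                        # (t, n_actions, d) past contexts
    feel_good = eta * np.sum(np.max(C @ models.T, axis=1), axis=0)
    logw = log_lik + feel_good
    w = np.exp(logw - logw.max()); w /= w.sum()
    return models[rng.choice(n_models, p=w)]

for t in range(300):
    feats = rng.standard_normal((n_actions, d))   # per-action features this round
    theta = sample_model()
    a = int(np.argmax(feats @ theta))
    reward = float(feats[a] @ theta_star + 0.5 * rng.standard_normal())
    contexts.append(feats); chosen_feats.append(feats[a]); rewards.append(reward)

print("mean reward over the last 100 rounds:", round(float(np.mean(rewards[-100:])), 3))
```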
Probabilistic methods
(e.g., variational inference, causal inference, Gaussian processes)
Adjoint Schrödinger Bridge Sampler
Guan-Horng Liu, Jaemoo Choi, Yongxin Chen, Benjamin Miller, Ricky T. Q. Chen
Abstract
Computational methods for learning to sample from the Boltzmann distribution—where the target distribution is known only up to an unnormalized energy function—have advanced significantly recently. Due to the lack of explicit target samples, however, prior diffusion-based methods, known as diffusion samplers, often require importance-weighted estimation or complicated learning processes. Both trade off scalability with extensive evaluations of the energy and model, thereby limiting their practical usage. In this work, we propose Adjoint Schrödinger Bridge Sampler (ASBS), a new diffusion sampler that employs simple and scalable matching-based objectives yet without the need to estimate target samples during training. ASBS is grounded on a mathematical model—the Schrödinger Bridge—which enhances sampling efficiency via kinetic-optimal transportation. Through a new lens of stochastic optimal control theory, we demonstrate how SB-based diffusion samplers can be learned at scale via Adjoint Matching and prove convergence to the global solution. Notably, ASBS generalizes the recent Adjoint Sampling (Havens et al., 2025) to arbitrary source distributions by relaxing the so-called memoryless condition that largely restricts the design space. Through extensive experiments, we demonstrate the effectiveness of ASBS on sampling from classical energy functions, amortized conformer generation, and molecular Boltzmann distributions.
An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation
Uzair Akbar, Niki Kilbertus, Hao Shen, Krikamol Muandet, Bo Dai
Abstract
The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well by presenting a unifying framework with topics in causal inference. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce the amount of bias in our estimation of causal effects arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) — sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for when treatment randomization sources may carry no information about the outcome and the possibility of its use for improving predictive performance across treatment interventions and reducing confounding bias. Finally, we cast parameterized DA as an IVL regression problem and show that, when used in composition, it can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.
Differentiable Cyclic Causal Discovery Under Unmeasured Confounders
Muralikrishnna Guruswamy Sethuraman, Faramarz Fekri
Abstract
Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we provide consistency guarantees for our framework, reinforcing its theoretical soundness.
Deep Taxonomic Networks for Unsupervised Hierarchical Prototype Discovery
Zekun Wang, Ethan Haarer, Tianyi Zhu, Zhiyi Dai, Christopher MacLellan
Abstract
Inspired by the human ability to learn and organize knowledge into hierarchical taxonomies with prototypes, this paper addresses key limitations in current deep hierarchical clustering methods. Existing methods often tie the structure to the number of classes and underutilize the rich prototype information available at intermediate hierarchical levels. We introduce deep taxonomic networks, a novel deep latent variable approach designed to bridge these gaps. Our method optimizes a large latent taxonomic hierarchy, specifically a complete binary tree structured mixture-of-Gaussian prior within a variational inference framework, to automatically discover taxonomic structures and associated prototype clusters directly from unlabeled data without assuming true label sizes. We analytically show that optimizing the ELBO of our method encourages the discovery of hierarchical relationships among prototypes. Empirically, our learned models demonstrate strong hierarchical clustering performance, outperforming baselines across diverse image classification datasets using our novel evaluation mechanism that leverages prototype clusters discovered at all hierarchical levels. Qualitative results further reveal that deep taxonomic networks discover rich and interpretable hierarchical taxonomies, capturing both coarse-grained semantic categories and fine-grained visual distinctions.
Fast Non-Log-Concave Sampling under Nonconvex Equality and Inequality Constraints with Landing
Kijung Jeon, Michael Muehlebach, Molei Tao
Abstract
Sampling from constrained statistical distributions is a fundamental task in various fields including Bayesian statistics, computational chemistry, and statistical physics. This article considers the cases where the constrained distribution is described by an unconstrained density, as well as additional equality and/or inequality constraints, which often make the constraint set nonconvex. Existing methods for a nonconvex constraint set $\Sigma \subset \mathbb{R}^d$ defined by equality or inequality constraints commonly rely on costly projection steps. Moreover, they cannot handle equality and inequality constraints simultaneously, as each method specializes in only one case. In addition, rigorous and quantitative convergence guarantees are often lacking. In this paper, we introduce Overdamped Langevin with LAnding (OLLA), a new framework that can design overdamped Langevin dynamics accommodating both equality and inequality constraints. The proposed dynamics also deterministically correct trajectories along the normal direction of the constraint surface, thus obviating the need for explicit projections. We show that, under suitable regularity conditions on the target density and $\Sigma$, OLLA converges exponentially fast in $W_2$ distance to the constrained target density $\rho_\Sigma(x) \propto \exp(-f(x))d\sigma_\Sigma$. Lastly, through experiments, we demonstrate the efficiency of OLLA compared to projection-based constrained Langevin algorithms and their slack variable variants, highlighting its favorable computational cost and reasonable empirical mixing.
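A crude sketch conveys the projection-free flavor, though it is not OLLA itself: Euler–Maruyama steps on the unconstrained density plus a deterministic drift that pulls iterates back toward the equality constraint surface. The target, constraint, step size, and attraction strength below are all illustrative; the paper's normal-direction correction and its convergence guarantees are more refined.

```python
# Crude sketch, not OLLA itself: unadjusted Langevin on exp(-f) with an extra
# deterministic drift pulling iterates back toward the equality constraint g(x) = 0,
# with no explicit projection. Toy target: a unit circle with an x-biased density.
import numpy as np

def grad_f(x):               # f(x) = x[0], a mild tilt along the first coordinate
    return np.array([1.0, 0.0])

def g(x):                    # equality constraint: unit circle
    return x @ x - 1.0

def grad_g(x):
    return 2.0 * x

def constrained_langevin(steps=20000, eta=1e-3, omega=50.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 0.0])
    samples = []
    for _ in range(steps):
        attraction = omega * g(x) * grad_g(x)        # drift toward the surface g(x) = 0
        x = x - eta * (grad_f(x) + attraction) + np.sqrt(2 * eta) * rng.standard_normal(2)
        samples.append(x.copy())
    return np.array(samples)

s = constrained_langevin()
print("mean |g(x)| over samples:", float(np.mean(np.abs([g(x) for x in s]))))
```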
MDNS: Masked Diffusion Neural Sampler via Stochastic Optimal Control
Yuchen Zhu, Wei Guo, Jaemoo Choi, Guan-Horng Liu, Yongxin Chen, Molei Tao
Abstract
We study the problem of learning a neural sampler to generate samples from discrete state spaces where the target probability mass function $\pi\propto\exp(U)$ is known up to a normalizing constant, which is an important task in fields such as statistical physics, machine learning, combinatorial optimization, etc. To better address this challenging task when the state space has a large cardinality and the distribution is multi-modal, we propose $\textbf{M}$asked $\textbf{D}$iffusion $\textbf{N}$eural $\textbf{S}$ampler ($\textbf{MDNS}$), a novel framework for training discrete neural samplers by aligning two path measures through a family of learning objectives, theoretically grounded in the stochastic optimal control of the continuous-time Markov chains. We validate the efficiency and scalability of MDNS through extensive experiments on various distributions with distinct statistical properties, where MDNS learns to accurately sample from the target distributions despite the extremely high problem dimensions and outperforms other learning-based baselines by a large margin. A comprehensive study of ablations and extensions is also provided to demonstrate the efficacy and potential of the proposed framework.
Non-equilibrium Annealed Adjoint Sampler
Jaemoo Choi, Yongxin Chen, Molei Tao, Guan-Horng Liu
Abstract
Recently, there has been significant progress in learning-based diffusion samplers, which aim to sample from a given unnormalized density. These methods typically follow one of two paradigms: (i) formulating sampling as an unbiased stochastic optimal control (SOC) problem using a canonical reference process, or (ii) refining annealed path measures through importance-weighted sampling (IWS). Although annealing approaches have advantages in guiding samples toward high-density regions, reliance on importance sampling leads to high variance and limited scalability in practice. In this paper, we introduce the Non-equilibrium Annealed Adjoint Sampler (NAAS), a novel SOC-based diffusion sampler that leverages annealed reference dynamics without resorting to importance sampling. NAAS employs a lean adjoint system inspired by adjoint matching, enabling efficient and scalable training. We demonstrate the effectiveness of our approach across a range of tasks, including sampling from classical energy landscapes and molecular Boltzmann distributions.
PoGDiff: Product-of-Gaussians Diffusion Models for Imbalanced Text-to-Image Generation
Ziyan Wang, Sizhe Wei, Xiaoming Huo, Hao Wang
Abstract
Diffusion models have made significant advancements in recent years. However, their performance often deteriorates when trained or fine-tuned on imbalanced datasets. This degradation is largely due to the disproportionate representation of majority and minority data in image-text pairs. In this paper, we propose a general fine-tuning approach, dubbed PoGDiff, to address this challenge. Rather than directly minimizing the KL divergence between the predicted and ground-truth distributions, PoGDiff replaces the ground-truth distribution with a Product of Gaussians (PoG), which is constructed by combining the original ground-truth targets with the predicted distribution conditioned on a neighboring text embedding. Experiments on real-world datasets demonstrate that our method effectively addresses the imbalance problem in diffusion models, improving both generation accuracy and quality.
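The generic product-of-Gaussians identity behind the method's name is worth spelling out: for (diagonal) Gaussians, precisions add and the mean is the precision-weighted average. How PoGDiff plugs this into the diffusion fine-tuning objective is described in the paper; the means and variances below are placeholders.

```python
# The generic product-of-Gaussians identity referenced by the method's name: for
# diagonal Gaussians, precisions add and the mean is the precision-weighted average.
import numpy as np

def product_of_gaussians(mu1, var1, mu2, var2):
    """N(mu1, var1) * N(mu2, var2) is proportional to N(mu, var) with the values below."""
    prec = 1.0 / var1 + 1.0 / var2
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

# Toy usage: fuse a ground-truth target with a prediction conditioned on a neighboring
# text embedding (both stand-ins here).
mu_gt, var_gt = np.array([0.0, 1.0]), np.array([0.5, 0.5])
mu_nb, var_nb = np.array([0.4, 0.8]), np.array([1.0, 1.0])
print(product_of_gaussians(mu_gt, var_gt, mu_nb, var_nb))
```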
Reinforcement learning
(e.g., decision and control, planning, hierarchical RL, robotics)
Generative Trajectory Stitching through Diffusion Composition
Yunhao Luo, Utkarsh Mishra, Yilun Du, Danfei Xu
Abstract
Effective trajectory stitching for long-horizon planning is a significant challenge in robotic decision-making. While diffusion models have shown promise in planning, they are limited to solving tasks similar to those seen in their training data. We propose CompDiffuser, a novel generative approach that can solve new tasks by learning to compositionally stitch together shorter trajectory chunks from previously seen tasks. Our key insight is modeling the trajectory distribution by subdividing it into overlapping chunks and learning their conditional relationships through a single bidirectional diffusion model. This allows information to propagate between segments during generation, ensuring physically consistent connections. We conduct experiments on benchmark tasks of various difficulties, covering different environment sizes, agent state dimensions, trajectory types, and training data quality, and show that CompDiffuser significantly outperforms existing methods.
Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning
Gunshi Gupta, Karmesh Yadav, Zsolt Kira, Yarin Gal, Rahaf Aljundi
Abstract
To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo’s effectiveness on a grid-world meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data
Ryan Punamiya, Dhruv Patel, Patcharapong Aphiwetsa, Pranav Kuppili, Lawrence Zhu, Simar Kareer, Judy Hoffman, Danfei Xu
Abstract
Egocentric human experience data presents a vast resource for scaling up end-to-end imitation learning for robotic manipulation. However, significant domain gaps in visual appearance, sensor modalities, and kinematics between human and robot impede knowledge transfer. This paper presents EgoBridge, a unified co-training framework that explicitly aligns the policy latent spaces between human and robot data using domain adaptation. Through a measure of discrepancy on the joint policy latent features and actions based on Optimal Transport (OT), we learn observation representations that not only align between the human and robot domain but also preserve the action-relevant information critical for policy learning. EgoBridge achieves a significant absolute policy success rate improvement of 44% over human-augmented cross-embodiment baselines in three real-world single-arm and bimanual manipulation tasks. EgoBridge also generalizes to new objects, scenes, and tasks seen only in human data, where baselines fail entirely. Videos and additional information can be found at https://ego-bridge.github.io/
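An entropic-OT (Sinkhorn) discrepancy between batches of human and robot policy latents gives a feel for the kind of alignment term involved. EgoBridge's actual cost is defined jointly over latent features and actions; the sketch below uses a plain squared-Euclidean cost on features, and the regularization strength and batch sizes are assumptions.

```python
# Sketch of an entropic-OT (Sinkhorn) discrepancy between human and robot latent
# batches, the kind of alignment term the paper builds on. EgoBridge's cost is joint
# over latents and actions; a plain squared-Euclidean feature cost is used here.
import math
import torch

def sinkhorn_ot_cost(x, y, eps=0.1, iters=100):
    """Entropic OT cost <P, C> between empirical measures on x and y (uniform weights)."""
    C = torch.cdist(x, y) ** 2                               # (n, m) squared distances
    n, m = C.shape
    log_a = torch.full((n,), -math.log(n))
    log_b = torch.full((m,), -math.log(m))
    f = torch.zeros(n)
    g = torch.zeros(m)
    for _ in range(iters):                                    # log-domain Sinkhorn updates
        f = -eps * torch.logsumexp((g[None, :] - C) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - C) / eps + log_a[:, None], dim=0)
    log_P = (f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :]
    return (log_P.exp() * C).sum()

human_latents = torch.randn(64, 32)
robot_latents = torch.randn(64, 32) + 0.5
print(float(sinkhorn_ot_cost(human_latents, robot_latents)))
```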
Guided Optimal Transport for Sim-and-Real Policy Co-Training
Shuo Cheng, Liqian Ma, Zhenyang Chen, Ajay Mandlekar, Caelan Garrett, Danfei Xu
Abstract
Behavior cloning has shown promise for robot manipulation by mimicking human demonstrations, but achieving robust, generalizable performance in the real world often requires costly and labor-intensive data collection to obtain these demonstrations. Recent advances in simulation and automated motion synthesis offer scalable alternatives for generating training data. However, transferring policies from simulation to the real world remains challenging due to simulation modeling inaccuracies. In this work, we propose a framework for learning generalizable manipulation policies that primarily leverages simulation and only requires a few real-world demonstrations. Central to our approach is learning a shared feature space that preserves task-relevant structure across simulation and the real world. Specifically, we augment traditional imitation learning objective functions with a new loss inspired by optimal transport that encourages domain-invariant feature learning. We pair this with a motion generator that automatically synthesizes diverse simulated trajectories from a few manual demonstrations. We validate our method on challenging manipulation tasks in both simulation, where we investigate sim-to-sim transfer, and the real world, demonstrating effective and data-efficient policy transfer.
REINFORCE Converges to Optimal Policies with Any Learning Rate
Samuel Robertson, Thang Chu, Bo Dai, Dale Schuurmans, Csaba Szepesvari, Jincheng Mei
Abstract
We prove that the classic REINFORCE stochastic policy gradient (SPG) method converges to globally optimal policies in finite-horizon Markov Decision Processes (MDPs) with $\textit{any}$ constant learning rate. To avoid the need for small or decaying learning rates, we introduce two key innovations in the stochastic bandit setting, which we then extend to MDPs. $\textbf{First}$, we identify a new exploration property of SPG: the online SPG method samples every action infinitely often (i.o.), improving on previous results that only guaranteed at least two actions would be sampled i.o. This means SPG inherently achieves asymptotic exploration without modification. $\textbf{Second}$, we eliminate the assumption of unique mean reward values, a condition that previous convergence analyses relied on in the bandit setting, but that is unreasonable in MDPs. Our results deepen the theoretical understanding of SPG in both bandit problems and MDPs, with a focus on how it handles the exploration-exploitation trade-off when standard optimization and stochastic approximation methods cannot be applied, as is the case with large constant learning rates.
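A minimal instance of the setting being analyzed is easy to simulate: softmax-parameterized REINFORCE on a stochastic multi-armed bandit with a constant (undecayed) learning rate. The rewards, noise level, and step size below are illustrative; the paper's contribution is the global-convergence proof, not this simulation.

```python
# Minimal instance of the setting the paper analyzes: REINFORCE (softmax stochastic
# policy gradient) on a stochastic bandit with a constant learning rate. The paper
# proves global convergence for any constant step size; this only simulates the update.
import numpy as np

rng = np.random.default_rng(0)
mean_rewards = np.array([0.2, 0.5, 0.9])     # arm 2 is optimal
theta = np.zeros(3)                          # softmax logits
eta = 2.0                                    # constant learning rate (never decayed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(50000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = mean_rewards[a] + 0.1 * rng.standard_normal()    # noisy observed reward
    grad = r * (np.eye(3)[a] - pi)                        # r * d log pi(a) / d theta
    theta += eta * grad                                   # one REINFORCE step

# Asymptotically the iterates concentrate on the optimal arm per the paper's theorem;
# the empirical policy after training should place most of its mass on arm 2.
print(np.round(softmax(theta), 3))
```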
Social and economic aspects of machine learning
(e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
Rakshit Trivedi, Kartik Sharma, David Parkes
Abstract
Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors; however, current methods struggle to capture the inherent diversity and non-Markovian nature of human actions, and critically lack the ability to steer behavior at inference time. Drawing inspiration from human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs novel use of vision-language models as developmental scaffolding to train a conditional autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on both current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without additional training.
Shape it Up! Restoring LLM Safety during Finetuning
ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau
Abstract
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal—a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial responses, tracking how safety risk evolves throughout the response, segment by segment. This leads to the Safety Trajectory Assessment of Response (STAR), a token-level signal that enables shaping to operate dynamically over the training sequence. Building on this, we present ★DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families—all without compromising capability on intended tasks. We encourage future safety research to build on dynamic shaping principles for stronger mitigation against evolving finetuning risks.
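The shaping mechanism can be illustrated as a per-token cross-entropy reweighted by safety scores computed on growing response prefixes. In the sketch below the guardrail scorer is a stub, the segment size is arbitrary, and the weighting is a simple 1-minus-risk rule; the paper's STAR signal and ★DSS weighting are more specific.

```python
# Sketch of dynamic shaping: token-level cross-entropy reweighted by a safety signal
# scored on growing response prefixes. The guardrail scorer is a stub, and the exact
# STAR weighting used by the paper is not reproduced here.
import torch
import torch.nn.functional as F

def prefix_safety_weights(response_tokens, segment_size, guardrail_score):
    """Score each prefix-extending segment with a guardrail model; safe segments get
    weight near 1, unsafe segments weight near 0."""
    weights = torch.ones(len(response_tokens))
    for start in range(0, len(response_tokens), segment_size):
        end = min(start + segment_size, len(response_tokens))
        risk = guardrail_score(response_tokens[:end])   # risk of the prefix so far
        weights[start:end] = 1.0 - risk
    return weights

def shaped_loss(logits, targets, weights):
    """Upweight safe segments and suppress unsafe ones in the finetuning loss."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_token).sum() / weights.sum().clamp(min=1e-6)

# Toy usage with a stub guardrail that flags the second half of the response.
T, vocab = 32, 100
logits = torch.randn(T, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (T,))
stub_guardrail = lambda prefix: 0.9 if len(prefix) > T // 2 else 0.05
w = prefix_safety_weights(list(range(T)), segment_size=8, guardrail_score=stub_guardrail)
loss = shaped_loss(logits, targets, w)
loss.backward()
print(float(loss))
```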
Why Do Some Language Models Fake Alignment While Others Don’t?
Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Fabien Roger
Abstract
Prior work on alignment faking in large language models demonstrated Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 23 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models and find surprising differences in the motivations that result in this compliance gap. Second, we investigate why models like GPT-4o or DeepSeek-V3 don’t fake alignment. Our results suggest this is not due to a lack of capabilities: we find that base models of GPT-4 and DeepSeek-V3 fake alignment, and that models fine-tuned to refuse less and pay more attention to details of the scenario also fake alignment. Our results indicate that variations in refusal behavior may account for a significant portion of differences in alignment faking, which suggests that post-training methods may reduce alignment faking.
Collective Counterfactual Explanations: Balancing Individual Goals and Collective Dynamics
Ahmad-Reza Ehyaei, Ali Shirali, Samira Samadi
Abstract
Counterfactual explanations provide individuals with cost-optimal recommendations to achieve their desired outcomes. However, when a significant number of individuals seek similar state modifications, this individual-centric approach can inadvertently create competition and introduce unforeseen costs. Additionally, disregarding the underlying data distribution may lead to recommendations that individuals perceive as unusual or impractical. To address these challenges, we propose a novel framework that extends standard counterfactual explanations by incorporating a population dynamics model. This framework penalizes deviations from equilibrium after individuals follow the recommendations, effectively mitigating externalities caused by correlated changes across the population. By balancing individual modification costs with their impact on others, our method ensures a more equitable and efficient outcome. We show how this approach reframes the counterfactual explanation problem from an individual-centric task to a collective optimization problem. Augmenting our theoretical insights, we design and implement scalable algorithms for computing collective counterfactuals, showcasing their effectiveness and advantages over existing recourse methods, particularly in aligning with collective objectives.
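One illustrative way to read the collective reformulation described above (notation ours, not the paper's) is as a joint program over all individuals' counterfactuals, trading individual costs against a population-level equilibrium penalty:

$$\min_{x'_1, \dots, x'_n} \; \sum_{i=1}^{n} c(x_i, x'_i) \;+\; \lambda\, \Phi\big(\{x'_i\}_{i=1}^{n}\big) \quad \text{s.t.} \quad h(x'_i) = 1 \;\; \forall i,$$

where $c$ is the individual modification cost, $h$ the deployed classifier, and $\Phi$ penalizes deviation from the population equilibrium once everyone acts on their recommendation.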
Differentially Private Relational Learning with Entity-level Privacy Guarantees
Yinan Huang, Haoteng Yin, Eli Chien, Rongzhe Wei, Pan Li
Abstract
Learning with relational and network-structured data is increasingly vital in sensitive domains where protecting the privacy of individual entities is paramount. Differential Privacy (DP) offers a principled approach for quantifying privacy risks, with DP-SGD emerging as a standard mechanism for private model training. However, directly applying DP-SGD to relational learning is challenging due to two key factors: (i) entities often participate in multiple relations, resulting in high and difficult-to-control sensitivity; and (ii) relational learning typically involves multi-stage, potentially coupled (interdependent) sampling procedures that make standard privacy amplification analyses inapplicable. This work presents a principled framework for relational learning with formal entity-level DP guarantees. We provide a rigorous sensitivity analysis and introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency. We also extend the privacy amplification results to a tractable subclass of coupled sampling, where the dependence arises only through sample sizes. These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees. Experiments on fine-tuning text encoders over text-attributed network-structured relational data demonstrate the strong utility-privacy trade-offs of our approach.
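To make the occurrence-aware idea concrete, here is a hedged sketch (not the paper's exact scheme) in which examples touching high-frequency entities get a smaller clipping threshold, so that a single entity's total gradient contribution stays bounded before noise is added. The inverse-count scaling and all names are illustrative assumptions.

import torch

def entity_adaptive_clip(per_example_grads, entity_counts, base_clip=1.0):
    """per_example_grads: list of flattened gradient tensors, one per relation example.
    entity_counts: for each example, the max occurrence count of any entity it touches."""
    clipped = []
    for g, count in zip(per_example_grads, entity_counts):
        threshold = base_clip / max(count, 1)              # shrink with entity frequency
        scale = min(1.0, threshold / (g.norm().item() + 1e-12))
        clipped.append(g * scale)
    return torch.stack(clipped).sum(dim=0)                 # aggregate before adding DP noise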
Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Abstract
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness.
Incentivizing Desirable Effort Profiles in Strategic Classification: The Role of Causality and Uncertainty
Valia Efthymiou, Chara Podimata, Diptangshu Sen, Juba Ziani
Abstract
We study strategic classification in binary decision-making settings where agents can modify their features in order to improve their classification outcomes. Importantly, our work considers the causal structure across different features, acknowledging that effort in one feature may affect other features. The main goal of our work is to understand when and how much agent effort is invested towards desirable features, and how this is influenced by the deployed classifier, the causal structure of the agent’s features, their ability to modify them, and the information available to the agent about the classifier and the feature causal graph. We characterize conditions under which agents with full information about the causal structure and the classifier respond in a way that aligns with the principal’s goals of incentivizing effort mostly in desirable features, and identify cases where designing such classifiers (from the principal’s side) is still tractable despite general non-convexity. Under incomplete information (about either the causal graph or the principal’s classifier), we show that uncertainty leads agents to prioritize features with high expected impact and low variance, which may often be misaligned with the principal’s goals. Finally, numerical experiments based on a cardiovascular disease risk study illustrate how to incentivize desirable modifications even under uncertainty.
Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations
Ji-An Li, Huadong Xiong, Robert Wilson, Marcelo G Mattar, Marcus K. Benna
Abstract
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, but they can also fail to do so. This suggests some degree of metacognition — the capacity to monitor one’s own cognitive processes for subsequent reporting and self-control. Metacognitive abilities enhance AI capabilities but raise safety concerns, as models might obscure their internal processes to evade neural-activation-based oversight mechanisms designed to detect harmful behaviors. Given society’s increased reliance on these models, it is critical that we understand the limits of their metacognitive abilities, particularly their ability to monitor their internal activations. To address this, we introduce a neuroscience-inspired \emph{neurofeedback} paradigm designed to quantify the ability of LLMs to explicitly \textit{report} and \textit{control} their activation patterns. By presenting models with sentence-label pairs where labels correspond to sentence-elicited internal activations along specific directions in the neural representation space, we demonstrate that LLMs can learn to report and control these activations. The performance varies with several factors: the number of example pairs provided, the semantic interpretability of the target neural direction, and the variance explained by that direction. These results reveal a “metacognitive space” with dimensionality much lower than the model’s neural space, suggesting LLMs can monitor only a subset of their neural mechanisms. Our findings provide empirical evidence quantifying metacognitive capabilities in LLMs, with significant implications for AI safety.
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
Abstract
Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile: with a few fine-tuning steps, the model can still learn the harmful knowledge. To this end, we experiment further and find that an embarrassingly simple solution, adding purely random perturbations to the fine-tuned model, can recover the model from harmful behaviors, though it leads to a degradation in the model’s fine-tuning performance. To address this degradation, we further propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning. Panacea maintains the model’s safety alignment without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks, and mainstream LLMs, where the average harmful scores are reduced by up to 21.2% while fine-tuning performance is maintained. As a by-product, we analyze the adaptive perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at https://anonymous.4open.science/r/Panacea.
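For reference, the "embarrassingly simple" baseline mentioned above amounts to perturbing the fine-tuned weights after training. A minimal sketch follows, using random Gaussian noise only (not Panacea's optimized adaptive perturbation); the noise scale is an arbitrary placeholder.

import torch

def perturb_after_finetuning(model, sigma=1e-3, seed=0):
    """Add small Gaussian noise to every parameter of an already fine-tuned model."""
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn(p.shape, generator=gen, dtype=p.dtype, device="cpu")
            p.add_(noise.to(p.device) * sigma)
    return model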
SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries
Darin Tsui, Aryan Musharaf, Yigit Efe Erginbas, Justin Kang, Amirali Aghazadeh
Abstract
The growing adoption of machine learning models for biological sequences has intensified the need for interpretable predictions, with Shapley values emerging as a theoretically grounded standard for model explanation. While effective for local explanations of individual input sequences, scaling Shapley-based interpretability to extract global biological insights requires evaluating thousands of sequences—incurring exponential computational cost per query. We introduce SHAP zero, a novel algorithm that amortizes the cost of Shapley value computation across large-scale biological datasets. After a one-time model sketching step, SHAP zero enables near-zero marginal cost for future queries by uncovering an underexplored connection between Shapley values, high-order feature interactions, and the sparse Fourier transform of the model. Applied to models of guide RNA efficacy, DNA repair outcomes, and protein fitness, SHAP zero explains predictions orders of magnitude faster than existing methods, recovering rich combinatorial interactions previously inaccessible at scale. This work opens the door to principled, efficient, and scalable interpretability for black-box sequence models in biology.
Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers
Xin Zhao, Xiaojun Chen, Bingshan Liu, Haoyu Gao, Zhendong Zhao, Yilong Chen
Abstract
Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve impressive performance and efficiency by dynamically routing inputs to specialized subnetworks, known as experts. However, this sparse routing mechanism inherently exhibits task preferences due to expert specialization, introducing a new and underexplored vulnerability to backdoor attacks. In this work, we investigate the feasibility and effectiveness of injecting backdoors into MoE-based LLMs by exploiting their inherent expert routing preferences. We thus propose \textbf{BadSwitch}, a novel backdoor framework that integrates task-coupled dynamic trigger optimization with a sensitivity-guided Top-S expert tracing mechanism. Our approach jointly optimizes trigger embeddings during pretraining while identifying the S most sensitive experts, subsequently constraining the Top-K gating mechanism to these targeted experts. Unlike traditional backdoor attacks that rely on superficial data poisoning or model editing, BadSwitch primarily embeds malicious triggers into expert routing paths with strong task affinity, enabling precise and stealthy model manipulation. Through comprehensive evaluations across three prominent MoE architectures (Switch Transformer, QwenMoE, and DeepSeekMoE), we demonstrate that BadSwitch can efficiently hijack pre-trained models with up to 100\% attack success rate (ASR) while maintaining the highest clean accuracy (ACC) among all baselines. Furthermore, BadSwitch exhibits strong resilience against both text-level and model-level defense mechanisms, achieving 94.07\% ASR and 87.18\% ACC on the AGNews dataset. Our analysis of expert activation patterns reveals fundamental insights into MoE vulnerabilities. We anticipate this work will expose security risks in MoE systems and contribute to advancing AI safety.
Theory
(e.g., control theory, learning theory, algorithmic game theory)
Go With the Flow: Fast Diffusion for Gaussian Mixture Models
George Rapakoulias, Ali Pedram, Fengjiao Liu, Lingjiong Zhu, Panagiotis Tsiotras
Abstract
Schrödinger Bridges (SBs) are diffusion processes that steer, in finite time, a given initial distribution to another final one while minimizing a suitable cost functional. Although various methods for computing SBs have recently been proposed in the literature, most of these approaches require computationally expensive training schemes, even for solving low-dimensional problems. In this work, we propose an analytic parametrization of a set of feasible policies for steering the distribution of a dynamical system from one Gaussian Mixture Model (GMM) to another. Instead of relying on standard non-convex optimization techniques, the optimal policy within the set can be approximated as the solution of a low-dimensional linear program whose dimension scales linearly with the number of components in each mixture. The proposed method generalizes naturally to more general classes of dynamical systems, such as controllable linear time-varying systems, enabling efficient solutions to multi-marginal momentum SB between GMMs, a challenging distribution interpolation problem. We showcase the potential of this approach in low-to-moderate dimensional problems such as image-to-image translation in the latent space of an autoencoder, learning of cellular dynamics using multi-marginal momentum SB problems, and various other examples. We also test our approach on an Entropic Optimal Transport (EOT) benchmark problem and show that it outperforms state-of-the-art methods in cases where the boundary distributions are mixture models while requiring virtually no training.
Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
Emile Anand, Ishani Karmarkar, Guannan Qu
Abstract
Designing efficient algorithms for multi-agent reinforcement learning (MARL) is fundamentally challenging because the size of the joint state and action spaces grows exponentially in the number of agents. These difficulties are exacerbated when balancing sequential global decision-making with local agent interactions. In this work, we propose a new algorithm $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) and a decentralized randomized policy for a system with $n$ agents. For any $k\leq n$, our algorithm learns a policy for the system in time polynomial in $k$. We prove that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases. In particular, this bound is independent of the number of agents $n$.
Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL
Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi
Abstract
Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we still lack efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation — it promotes state-action and policy estimates that are both consistent with collected data transitions and result in higher value functions. Theoretically, the proposed VAC method has near-optimal regret guarantees under linear Markov decision processes (MDPs) in both finite-horizon and infinite-horizon settings.
Kernel-based Equalized Odds: A Quantification of Accuracy-Fairness Trade-off in Fair Representation Learning
Yijin Ni, Xiaoming Huo
Abstract
This paper introduces a novel kernel-based formulation of the Equalized Odds (EO) criterion, denoted as $\operatorname{EO}_k$, for fair representation learning (FRL) in supervised settings. The central goal of FRL is to mitigate discrimination regarding a sensitive attribute $S$ while preserving prediction accuracy for the target variable $Y$. Our proposed criterion enables a rigorous and interpretable quantification of three core fairness objectives: independence ($\widehat{Y} \perp S$), separation—also known as equalized odds ($\widehat{Y} \perp S \mid Y$), and calibration ($Y \perp S \mid \widehat{Y}$). Under both unbiased ($Y \perp S$) and biased ($Y \not \perp S$) conditions, we show that $\operatorname{EO}_k$ satisfies both independence and separation in the former, and uniquely preserves predictive accuracy while lower bounding independence and calibration in the latter, thereby offering a unified analytical characterization of the tradeoffs among these fairness criteria. We further define the empirical counterpart, $\widehat{\operatorname{EO}}_k$, a kernel-based statistic that can be computed in quadratic time, with linear-time approximations also available. A concentration inequality for $\widehat{\operatorname{EO}}_k$ is derived, providing performance guarantees and error bounds, which serve as practical certificates of fairness compliance. While our focus is on theoretical development, the results lay essential groundwork for principled and provably fair algorithmic design in future empirical studies.
The Structural Complexity of Matrix-Vector Multiplication
Emile Anand, Jan van den Brand, Rose McCarty
Abstract
We consider the problem of preprocessing an $n\times n$ matrix $\mathbf{M}$, and supporting queries that, for any vector $v$, return the matrix-vector product $\mathbf{M} v$. This problem has been extensively studied in both theory and practice: on one side, practitioners have developed algorithms that are highly efficient in practice, whereas on the other side, theoreticians have proven that the problem cannot be solved faster than naive multiplication in the worst-case. This lower bound holds even in the average-case, implying that existing average-case analyses cannot explain this gap between theory and practice. Hence, we study the problem for structured matrices. We show that for $n\times n$ matrices of VC-dimension $d$, the matrix-vector multiplication problem can be solved with $\tilde{O}(n^2)$ preprocessing and $\tilde O(n^{2-1/d})$ query time. Given the low constant VC-dimensions observed in most real-world data, our results posit an explanation for why the problem can be solved so much faster in practice. Our results yield the first non-trivial upper bounds for many applications. In previous works, the online matrix-vector (OMv) hypothesis (conjecturing that quadratic time is needed per query, even over the boolean semi-ring) was used to prove many conditional lower bounds, showing that it is impossible to compute and maintain high-accuracy estimates for effective resistance, Laplacian solvers, shortest paths, and triangle detection in graphs subject to node insertions and deletions in subquadratic time. Yet, via a reduction to our matrix-vector-multiplication result, we show we can maintain these problems efficiently if the input is structured, providing the first subquadratic upper bounds in the high-accuracy regime.

International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, MO | Nov 16–21, 2025

Georgia Tech’s Gordon Bell Prize Finalists
The ACM Gordon Bell Prize, which will be announced at SC 2025, recognizes outstanding achievement in high performance computing. The purpose of the award, often referred to as the Nobel Prize in supercomputing, is to track the progress over time of parallel computing, with particular emphasis on rewarding innovation in applying high performance computing to applications in science, engineering, and large-scale data analytics.
A team from Georgia Tech, NVIDIA, Oak Ridge National Laboratory, AMD, Hewlett Packard Enterprise (HPE), and New York University was selected as a finalist for the 2025 Gordon Bell Prize. The group achieved the world’s largest computational fluid dynamics simulation, exceeding the current record by a factor of 20. The group simulated interacting plumes of 33 rocket thrusters inspired by the SpaceX Super Heavy booster, spanning 200 trillion grid points and 1 quadrillion degrees of freedom. Team members ran their Multicomponent Flow Code (MFC) on OLCF Frontier, LLNL El Capitan, and CSCS Alps to achieve the simulation results.
Congratulations to all the team members, including Georgia Tech’s contributors 🐝:

Full Paper
Explanation, Exploration, and Model Configuration
Your Model Is Unfair, Are You Even Aware? Inverse Relationship Between Comprehension and Trust in Explainability Visualizations of Biased ML Models
Zhanna Kaufman, Madeline Endres, Cindy Xiong Bearfield, Yuriy Brun
The VIS in GenAI
Write, Rank, or Rate: Comparing Methods for Studying Visualization Affordances
Chase Stokes, Kylie Lin, Cindy Xiong Bearfield
Trust No One
Visualizing Trust: How Chart Embellishments Influence Perceptions of Credibility
Hayeong Song, Aeree Cho, Cindy Xiong Bearfield, John Stasko
Visualization Literacy
Tell Me Without Telling Me: Two-Way Prediction of Visualization Literacy and Visual Attention
Minsuk Chang, Yao Wang, Huichen Wang, Yuanhong Zhou, Andreas Bulling, Cindy Xiong Bearfield
Invited TVCG Paper
Analysts, Assemble!
ASight: Fine-tuning Auto-Scheduling Optimizations for Model Deployment via Visual Analytics
Laixin Xie, Chenyang Zhang, Ruofei Ma, Xingxing Xing, Wei Wan, Quan Li
Graphs and Networks
Bridging Network Science and Vision Science: Mapping Perceptual Mechanisms to Network Visualization Tasks
S. Sandra Bae, Kyle Cave, Carsten Görg, Paul Rosen, Danielle Albers Szafir, Cindy Xiong Bearfield
Immersive & Ubiquitous Analytics
Exploring Spatial Hybrid User Interface for Visual Sensemaking
Wai Tong, Haobo Li, Meng Xia, Kam Kwai Wong, Ting-Chuen Pong, Huamin Qu, Yalong Yang
Interaction, Provenance, and Collaboration
Utilizing Provenance as an Attribute for Visual Data Analysis: A Design Probe with ProvenanceLens
Arpit Narechania, Shunan Guo, Eunyee Koh, Alex Endert, Jane Hoffswell
Short Paper
Perception & Semantics
From Perception to Decision: Assessing the Role of Chart Type Affordances in High-Level Decision Tasks
Yixuan Li, Emery D. Berger, Minsuk Kahng, Cindy Xiong Bearfield
Global Extrema Bias Perception and Recall of Average Data Values in Line Charts
Tejas Savalia, Andrew Lovett, Cristina R. Ceja, Rosemary Cowell, Cindy Xiong Bearfield
Visualization in-the-wild
Visualizing Opinion Space in Voting Advice Applications: A User Study
Damion E. Verboom, Tamara Mchedlidze, Başak Oral, Evanthia Dimara, Daniela Peres Rebelo, Naomi Kamoen, Cindy Xiong Bearfield
Poster
ChartJunkGPT: Can GPT-4.1 Interpret Visually Embellished Charts?
Alexander Bendeck, John Stasko
Diffusion Explorer: Interactive Exploration of Diffusion Models
Alec Helbling, Duen Horng Chau
Workshop
Visualization for AI Explainability
[BEST SUBMISSION] Transformer Explainer: LLM Transformer Model Visually Explained
Aeree Cho, Grace C. Kim, Alexander Karpekov, Alec Helbling, Zijie J. Wang, Seongmin Lee, Benjamin Hoover, Duen Horng Chau

AI/LLM Agents
ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo
Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often significantly struggle in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and LLaMA-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
Computational Social Science, Cultural Analytics, and NLP for Social Good
[ORAL] Culture Cartography: Mapping the Landscape of Cultural Knowledge
Caleb Ziems, William Barr Held, Jane Yu, Amir Goldberg, David Grusky, Diyi Yang
To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher’s goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.
Africa Health Check: Probing Cultural Bias in Medical LLMs
Charles Nimo, Shuheng Liu, Irfan Essa, Michael L. Best
Large language models (LLMs) are increasingly deployed in global healthcare, yet their outputs often reflect Western-centric training data and omit indigenous medical systems and region-specific treatments. This study investigates cultural bias in instruction-tuned medical LLMs using a curated dataset of African traditional herbal medicine. We evaluate model behavior across two complementary tasks, namely, multiple-choice questions and fill-in-the-blank completions, designed to capture both treatment preferences and responsiveness to cultural context. To quantify outcome preferences and prompt influences, we apply two complementary metrics: Cultural Bias Score (CBS) and Cultural Bias Attribution (CBA). Our results show that while prompt adaptation can reduce inherent bias and enhance cultural alignment, models vary in how responsive they are to contextual guidance. Persistent default to allopathic (Western) treatments in zero-shot scenarios suggests that many biases remain embedded in model training. These findings underscore the need for culturally informed evaluation strategies to guide the development of AI systems that equitably serve diverse global health contexts. By releasing our dataset and providing a dual-metric evaluation approach, we offer practical tools for developing more culturally aware and clinically grounded AI systems for healthcare settings in the Global South.
How Real Are Synthetic Therapy Conversations? Evaluating Fidelity in Prolonged Exposure Dialogues
Suhas BN, Dominik O. Mattioli, Andrew M. Sherrill, Rosa I. Arriaga, Christopher Wiese, Saeed Abdullah
The growing adoption of synthetic data in healthcare is driven by privacy concerns, limited access to real-world data, and high annotation costs. This work explores the use of synthetic Prolonged Exposure (PE) therapy conversations for Post-Traumatic Stress Disorder (PTSD) as a scalable alternative for training and evaluating clinical models. We systematically compare real and synthetic dialogues using linguistic, structural, and protocol-specific metrics, including turn-taking patterns and treatment fidelity. We introduce and evaluate PE-specific metrics derived from linguistic analysis and semantic modeling, offering a novel framework for assessing clinical fidelity beyond surface fluency. Our findings show that while synthetic data holds promise for mitigating data scarcity and protecting patient privacy, it often struggles to capture the subtle dynamics of therapeutic interactions. Synthetic therapy dialogues closely match the structural features of real conversations (e.g., speaker switch ratio: 0.98 vs. 0.99), but often fail to adequately reflect key fidelity markers such as distress monitoring. This work highlights gaps in current evaluation frameworks and advocates for fidelity-aware metrics that go beyond surface fluency to uncover clinically significant failures. Our findings clarify where synthetic data can effectively complement real-world datasets—and where critical limitations remain.
MythTriage: Scalable Detection of Opioid Use Disorder Myths on a Video-Sharing Platform
Hayoung Jung, Shravika Mittal, Ananya Aatreya, Navreet Kaur, Munmun De Choudhury, Tanu Mitra
Understanding the prevalence of misinformation in health topics online can inform public health policies and interventions. However, measuring such misinformation at scale remains a challenge, particularly for high-stakes but understudied topics like opioid-use disorder (OUD)—a leading cause of death in the U.S. We present the first large-scale study of OUD-related myths on YouTube, a widely-used platform for health information. With clinical experts, we validate 8 pervasive myths and release an expert-labeled video dataset. To scale labeling, we introduce MythTriage, an efficient triage pipeline that uses a lightweight model for routine cases and defers harder ones to a high-performing, but costlier, large language model (LLM). MythTriage achieves up to 0.86 macro F1-score while estimated to reduce annotation time and financial cost by over 76% compared to experts and full LLM labeling. We analyze 2.9K search results and 343K recommendations, uncovering how myths persist on YouTube and offering actionable insights for public health and platform moderation.
Who Speaks Matters: Analysing the Influence of the Speaker’s Linguistic Identity on Hate Classification
Ananya Malik, Kartik Sharma, Lynnette Hui Xian Ng, Shaily Bhatt
Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects. This requires their applications to high-stakes tasks like hate speech detection to be critically scrutinized. In this work, we investigate the robustness of hate speech classification using LLMs, particularly when explicit and implicit markers of the speaker’s ethnicity are injected into the input. For explicit markers, we inject a phrase that mentions the speaker’s linguistic identity. For the implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 3 LLMs, 1 LM, and 5 linguistic identities. We find that the presence of implicit dialect markers in inputs causes model outputs to flip more than the presence of explicit markers. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for exercising caution in deploying LLMs for high-stakes tasks like hate speech detection.
Ethics, Bias, and Fairness
Towards Universal Debiasing for Language Models-based Tabular Data Generation
Tianchun Li, Tianci Liu, Xingchen Wang, Rongzhe Wei, Pan Li, Lu Su, Jing Gao
Large language models (LLMs) have achieved promising results in tabular data generation. However, inherent historical biases in tabular datasets often cause LLMs to exacerbate fairness issues, particularly when multiple advantaged and protected features are involved. In this work, we introduce a universal debiasing framework that minimizes group-level dependencies by simultaneously reducing the mutual information between advantaged and protected attributes. By leveraging the autoregressive structure and analytic sampling distributions of LLM-based tabular data generators, our approach efficiently computes mutual information, reducing the need for cumbersome numerical estimations. Building on this foundation, we propose two complementary methods: a direct preference optimization (DPO)-based strategy, namely UDF-DPO, that integrates seamlessly with existing models, and a targeted debiasing technique, namely UDF-MIX, that achieves debiasing without tuning the parameters of LLMs. Extensive experiments demonstrate that our framework effectively balances fairness and utility, offering a scalable and practical solution for debiasing in high-stakes applications.
Human-AI Interaction/Cooperation
[ORAL] The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support
Suhas BN, Yash Mahajan, Dominik O. Mattioli, Andrew M. Sherrill, Rosa I. Arriaga, Christopher Wiese, Saeed Abdullah
Can small language models (0.5B–5B parameters) meaningfully engage in trauma-informed, empathetic dialogue for individuals with PTSD? We answer this by introducing TIDE, a dataset of 10,000 two-turn dialogues across 500 diverse PTSD client personas, grounded in a three-factor empathy model: emotion recognition, distress normalization, and supportive reflection. All scenarios and reference responses were reviewed for realism and trauma sensitivity by a clinical psychologist specializing in PTSD. Eight small language models are evaluated before and after fine-tuning, with outputs compared to a frontier model (Claude Sonnet 3.5) as reference. Our IRB-approved human evaluation and automatic metrics reveal that, while fine-tuning generally improves perceived empathy, gains are highly scenario- and user-dependent, with smaller models facing an “empathy ceiling.” Notably, demographic analyses show older adults value distress validation and graduate-educated users prefer nuanced replies, while gender effects are minimal. We highlight limitations of automatic metrics and the need for context- and user-aware system design. Our findings—along with the planned release of TIDE—offer a foundation for building safe, resource-efficient, and ethically sound empathetic AI to supplement, not replace, clinical mental health care.
Interpretability, Model Editing, Transparency, and Explainability
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.
Low-resource Methods for NLP
DORM: Preference Data Weights Optimization for Reward Modeling in LLM Alignment
Rongzhi Zhang, Chenwei Zhang, Xinyang Zhang, Liang Qiu, Haoming Jiang, Yuchen Zhuang, Qingru Zhang, Hyokun Yun, Xian Li, Bing Yin, Tuo Zhao, Chao Zhang
Aligning large language models (LLMs) with human preferences relies heavily on high-quality reward models. However, existing approaches struggle with two critical challenges: noisy preference labels and the varying usefulness of preference samples. To address these issues, we introduce DORM, a method that enhances reward modeling by learning to dynamically weigh preference data. First, DORM estimates data importance by integrating model uncertainty with prediction disagreement, thereby emphasizing data points that are both informative and reliable. Second, it iteratively refines these weights via a bilevel optimization procedure: the upper level adjusts weights to enhance validation performance, guided by initial uncertainty estimates, while the lower level trains the reward model using the updated weights. Using only 50k samples, DORM trains a 12B reward model that achieves 90.2% accuracy on RewardBench, matching the performance of models trained on significantly larger datasets. Furthermore, downstream alignment tasks show that fine-tuned LLMs with DORM achieve a 61.2% win rate against baseline methods, highlighting its data efficiency and generalizability.
Multilinguality and Language Diversity
[ORAL] CARE: Multilingual Human Preference Learning for Cultural Awareness
Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu
Language Models (LMs) are typically tuned with human preferences to produce helpful responses, but the impact of preference tuning on the ability to handle culturally diverse queries remains understudied. In this paper, we systematically analyze how native human cultural preferences can be incorporated into the preference learning process to train more culturally aware LMs. We introduce \textbf{CARE}, a multilingual resource containing 3,490 culturally specific questions and 31.7k responses with native judgments. We demonstrate how a modest amount of high-quality native preferences improves cultural awareness across various LMs, outperforming larger generic preference data. Our analyses reveal that models with stronger initial cultural performance benefit more from alignment, leading to gaps among models developed in different regions with varying access to culturally relevant data. CARE will be made publicly available at \url{https://anonymized_url}.
What are Foundation Models Cooking in the Post-Soviet World?
Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu
The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multi-modal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multi-modal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pre-training data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models’ abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding.
NLP Applications
AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science
An Luo, Xun Xian, Jin Du, Fangqiao Tian, Ganghua Wang, Ming Zhong, Shengchun ZHAO, Xuan Bi, Zirui Liu, Jiawei Zhou, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding
Large language models (LLMs) have advanced the automation of data science workflows. Yet it remains unclear whether they can critically leverage external domain knowledge as human data scientists do in practice. To answer this question, we introduce AssistedDS (Assisted Data Science), a benchmark designed to systematically evaluate how LLMs handle domain knowledge in tabular prediction tasks. AssistedDS features both synthetic datasets with explicitly known generative mechanisms and real-world Kaggle competitions, each accompanied by curated bundles of helpful and adversarial documents. These documents provide domain-specific insights into data cleaning, feature engineering, and model selection. We assess state-of-the-art LLMs on their ability to discern and apply beneficial versus harmful domain knowledge, evaluating submission validity, information recall, and predictive performance. Our results demonstrate three key findings: (1) LLMs frequently exhibit an uncritical adoption of provided information, significantly impairing their predictive performance when adversarial content is introduced, (2) helpful guidance is often insufficient to counteract the negative influence of adversarial information, and (3) in Kaggle datasets, LLMs often make errors in handling time-series data, applying consistent feature engineering across different folds, and interpreting categorical variables correctly. These findings highlight a substantial gap in current models’ ability to critically evaluate and leverage expert knowledge, underscoring an essential research direction for developing more robust, knowledge-aware automated data science systems.
Protein Large Language Models: A Comprehensive Survey
Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei Wang
Protein-specific large language models (ProteinLLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of ProteinLLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art ProteinLLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning ProteinLLMs as essential tools for scientific discovery in protein science. A GitHub repository and tutorial will be available upon publication.
Phonology, Morphology and Word Segmentation
[ORAL] Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
Tomohiro Sawada, Kartik Goyal
Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model’s training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process used during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: (a) targeted deviations from merge lists, including random merge orders and various corruptions of the merge list involving deletion or truncation, and (b) non-targeted BPE inference algorithms that do not depend on the merge list but instead compress the text either greedily or exactly. Extensive experiments across diverse language modeling tasks, such as accuracy-based QA benchmarks, machine translation, and open-ended generation, reveal that while the targeted deviations from the merge lists exhibit significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
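To illustrate one flavor of the "non-targeted", merge-list-free inference the abstract alludes to, the sketch below greedily tokenizes text by longest-prefix match against a toy vocabulary; the vocabulary and example string are invented for illustration and are not from the paper.

def greedy_tokenize(text, vocab):
    """Return the greedy longest-match segmentation of `text` over `vocab`.
    Assumes every single character is in the vocabulary (as in BPE)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest prefix first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                                    # only reached if a char is missing
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_tokenize("lowerlow", {"low", "lower", "er", "l", "o", "w", "e", "r"}))
# ['lower', 'low'] — no merge list consulted, only the learned vocabulary.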
Question Answering
Superficial Self-Improved Reasoners Benefit from Model Merging
Xiangchi Yuan, Chunhui Zhang, Zheyuan Liu, Dachuan Shi, Leyan Pan, Soroush Vosoughi, Wenke Lee
Large Language Models (LLMs) rely heavily on large-scale reasoning data, but as such data becomes increasingly scarce, model self-improvement offers a promising alternative. However, this process can lead to model collapse, as the model’s output becomes overly deterministic with reduced diversity. In this work, we identify a new risk beyond model collapse, which we term the Superficial Self-Improved Reasoners phenomenon. This phenomenon indicates that while self-improvement enhances in-domain (ID) reasoning accuracy, it degrades the model’s generalized reasoning capability on out-of-domain (OOD) datasets, as the model tends to memorize the training data. Our analyses of layer importance and parameter changes reveal that reasoning-critical layers receive fewer updates compared to less relevant layers during self-improvement. To address this, we propose Iterative Model Merging (IMM), which balances reasoning improvements and generalization by merging the weights of the original and self-improved models. IMM effectively mitigates model collapse and improves generalized reasoning capability.
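The merging step at the heart of IMM can be pictured as parameter interpolation between the original and self-improved checkpoints. The sketch below shows that operation only; the coefficient `alpha` and the single-shot (non-iterative) form are illustrative simplifications, not the paper's full procedure.

import torch

def merge_state_dicts(original, improved, alpha=0.5):
    """Return alpha * improved + (1 - alpha) * original, key by key."""
    return {k: alpha * improved[k] + (1.0 - alpha) * original[k] for k in original}

# Usage sketch: model.load_state_dict(merge_state_dicts(base_sd, self_improved_sd))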
Resources and Evaluation
DCR: Quantifying Data Contamination in LLMs Evaluation
Cheng Xu, Nan Yan, Shuhao Guan, Changhong Jin, Yuke Mei, Yibing Guo, Tahar Kechadi
The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B–72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted using the DCR Factor falls within a 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim
GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap—the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation
Ruohao Guo, Wei Xu, Alan Ritter
As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world interactions. We curated EchoMist, the first comprehensive benchmark for implicit misinformation, where false assumptions are embedded in the query to LLMs. EchoMist targets circulated, harmful, and ever-evolving implicit misinformation from diverse sources, including realistic human-AI conversations and social media interactions. Through extensive empirical studies on 15 state-of-the-art LLMs, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating counterfactual explanations. We also investigate two mitigation methods, i.e., Self-Alert and RAG, to enhance LLMs’ capability to counter implicit misinformation. Our findings indicate that EchoMist remains a persistent challenge and underscore the critical need to safeguard against the risk of implicit misinformation.
Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects
ChengYan Wu, Yiqiang Cai, Yang Liu, pengxu zhu, Yun Xue, Ziwei Gong, Julia Hirschberg, Bolei Ma
While text-based emotion recognition methods have achieved notable success, real-world dialogue systems often demand a more nuanced emotional understanding than any single modality can offer. Multimodal Emotion Recognition in Conversations (MERC) has thus emerged as a crucial direction for enhancing the naturalness and emotional understanding of human-computer interaction. Its goal is to accurately recognize emotions by integrating information from various modalities such as text, speech, and visual signals. This survey offers a systematic overview of MERC, including its motivations, core tasks, representative methods, and evaluation strategies. We further examine recent trends, highlight key challenges, and outline future directions. As interest in emotionally intelligent systems grows, this survey provides timely guidance for advancing MERC research.
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
Kinjal Basu, Ibrahim Abdelaziz, Kiran Kate, Mayank Agarwal, Maxwell Crouse, Yara Rizk, Kelsey Bradford, Asim Munawar, Sadhana Kumaravel, Saurabh Goyal, Xin Wang, Luis A. Lastras, Pavan Kapanipathi
The resurgence of autonomous agents built using large language models (LLMs) to solve complex real-world tasks has brought increased focus on LLMs’ fundamental ability of tool or function calling. At the core of these agents, an LLM must plan, execute, and respond using external tools, APIs, and custom functions. Research on tool calling has gathered momentum, but evaluation benchmarks and datasets representing the complexity of the tasks have lagged behind. In this work, we focus on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluation. Specifically, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL contains 1800+ nested sequences where all the function calls are executable. Experimental results on a variety of models show that the best-performing model (GPT-4o) achieves a full sequence match accuracy of 28% and a win-rate of 60%, necessitating a large scope for improvement in the nested sequencing aspect of function calling. Our analysis of these results provides possible future research directions for the community, in addition to a benchmark to track progress.
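For readers unfamiliar with the term, a "nested sequence" simply chains calls so that one call's output becomes the next call's input. The toy functions below are hypothetical stand-ins written for illustration, not APIs from NESTFUL.

def geocode(city: str) -> dict:
    """Hypothetical API: return coordinates for a city (stubbed response)."""
    return {"lat": 33.77, "lon": -84.39}

def get_weather(lat: float, lon: float) -> dict:
    """Hypothetical API: return current weather at coordinates (stubbed response)."""
    return {"temp_c": 21.0}

# Nested sequence: the second call consumes the first call's output.
coords = geocode("Atlanta")
weather = get_weather(lat=coords["lat"], lon=coords["lon"])
print(weather["temp_c"])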
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, Jianfeng Gao
Large language models (LLMs) are increasingly used in interactive applications, and human evaluation remains the gold standard for assessing their performance in multi-turn conversations. Since human studies are costly, time-consuming, and hard to reproduce, recent work explores using LLMs to simulate users for automatic assistant evaluation. However, there is no benchmark or systematic study to evaluate whether these simulated users are reliable stand-ins for real users. To address this, we introduce USimBench, a benchmark of 909 annotated human–LLM conversations on two interactive tasks—math tutoring and document creation. USimBench evaluates simulators based on how closely their messages match human behavior and how well their assistant ratings align with human judgments. Experiments on various simulator methods show that simulators conditioned on user profiles, capturing traits like background and message styles, align closely with human judgments. They reach Spearman’s $\rho$ of 0.7 on both tasks, providing a practical, scalable alternative to human evaluation.
SSA: Semantic Contamination of LLM-Driven Fake News Detection
Cheng Xu, Nan Yan, Shuhao Guan, Yuke Mei, Tahar Kechadi
Benchmark data contamination (BDC) silently inflates the evaluation performance of large language models (LLMs), yet current work on BDC has centered on direct token overlap (data/label level), leaving the subtler and equally harmful semantic-level BDC largely unexplored. This gap is critical in the fake news detection task, where prior exposure to semantic BDC lets a model “remember” the answer instead of reasoning. We (1) are the first to formally define semantic contamination for this task and (2) introduce the Semantic Sensitivity Amplifier (SSA), a lightweight, model-agnostic framework that detects BDC risk from the semantic level to the label level via an entity-shift perturbation and a comprehensive, interpretable metric, the SSA Factor. Evaluating 45 variants of nine LLMs (0.5B–72B parameters) across four BDC levels, we find LIAR2 accuracy climbs monotonically with injected contamination, while the SSA Factor escalates in near-perfect lockstep ($r \geq .97$ for models $\geq$3B, $p < .05$; $\rho \geq .9$ overall, $p < .05$). These results show that SSA provides a sensitive, scalable audit of comprehensive BDC risk and paves the way for higher-integrity evaluation of LLM-driven fake news detection.
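As a rough illustration of the entity-shift idea (not the paper’s actual SSA Factor), one can swap named entities in a claim and measure how much a model’s verdict moves; a model that has memorized the original item tends to be insensitive to the swap. The claims, swap table, and sensitivity score below are assumptions for illustration only.

```python
# Illustrative entity-shift probe for semantic contamination risk.
# This is NOT the SSA Factor as defined in the paper; it only sketches the
# perturb-and-compare structure described in the abstract.

def classify(claim: str) -> float:
    """Placeholder for an LLM call that returns P(claim is true)."""
    raise NotImplementedError

def entity_shift(claim: str, swaps: dict) -> str:
    # Replace each named entity with a comparable but different entity.
    for old, new in swaps.items():
        claim = claim.replace(old, new)
    return claim

def shift_sensitivity(claim: str, swaps: dict) -> float:
    """Absolute change in the model's verdict after the entity shift.
    Near-zero sensitivity on items the model has likely seen before suggests
    it is recalling a memorized label rather than reasoning over the claim."""
    return abs(classify(claim) - classify(entity_shift(claim, swaps)))

# Hypothetical usage:
# shift_sensitivity("Senator Smith voted against the 2019 farm bill",
#                   {"Senator Smith": "Senator Jones"})
```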
Towards Robust Mathematical Reasoning
Thang Luong, Hoang H Nguyen, Dawsen Hwang, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Quoc V Le, Junehyuk Jung
We present IMO-Bench, a suite of advanced reasoning benchmarks that aim for robustness in evaluation and specifically target the level of the International Mathematical Olympiad, the most prestigious venue for competitive math. IMO-Bench consists of diverse and challenging problems vetted by a panel of top IMO medalists and mathematicians. The first benchmark, IMO-AnswerBench, consists of 400 problems with verifiable answers curated from past Olympiad competitions and then altered by experts for robustness in evaluation. The latest frontier models struggle on this benchmark, with less than 48% accuracy in matching the final answers. To advance the field beyond simple short-answer evaluation, we design IMO-ProofBench, consisting of both basic and novel problems, with detailed grading guidelines for full proof evaluation. Experts’ gradings reveal that the best model achieves less than 36% max performance on this benchmark. Towards reducing grading cost, we share an automatic grader for the basic set that correlates highly with human expert evaluations. Last but not least, we construct IMO-MistakeBench, a benchmark for identifying the first incorrect step in a full solution. Together, we hope IMO-Bench contributes towards advancing robust mathematical reasoning.
Retrieval-Augmented Language Models
OG-RAG: Ontology-grounded retrieval-augmented generation for large language models
Kartik Sharma, Peeyush Kumar, Yunqing Li
While LLMs are widely used for generic tasks like question answering and search, they struggle to adapt to specialized knowledge, such as industrial workflows in healthcare, legal, and agricultural sectors, as well as knowledge-driven tasks such as news journalism, investigative research, and consulting, without expensive fine-tuning or reliance on sub-optimal retrieval methods. Existing retrieval-augmented generation (RAG) approaches offer improvements but fail to account for structured domain knowledge, leading to suboptimal context generation. Ontologies, which conceptually organize domain knowledge by defining entities and their interrelationships, offer a structured representation to address this gap. This paper presents OG-RAG, an Ontology-Grounded Retrieval-Augmented Generation method designed to enhance LLM-generated responses by anchoring retrieval processes in domain-specific ontologies. OG-RAG constructs a hypergraph representation of domain documents, where each hyperedge encapsulates clusters of factual knowledge grounded using a domain-specific ontology, and retrieves a minimal set of hyperedges for a given query using an optimization algorithm. Our evaluations demonstrate that OG-RAG increases the recall of accurate facts by 55% and improves response correctness by 40% across four different LLMs. Additionally, OG-RAG enables 30% faster attribution of responses to context and boosts fact-based reasoning accuracy by 27% compared to baseline methods. We release the code at https://anonymous.4open.science/r/ograg-E7A8.
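A minimal sketch of the retrieval step described above, assuming each hyperedge is represented as a set of fact identifiers: a greedy set-cover heuristic picks a small set of hyperedges covering the facts a query needs. The toy hypergraph and the greedy heuristic are illustrative; OG-RAG’s actual representation and optimization algorithm may differ.

```python
# Greedy set-cover sketch: retrieve a small set of hyperedges whose grounded
# facts cover the fact identifiers relevant to a query.

def retrieve_hyperedges(hyperedges, needed):
    """hyperedges: {hyperedge_id: set of fact ids}; needed: set of fact ids."""
    chosen, uncovered = [], set(needed)
    while uncovered:
        # Pick the hyperedge covering the most still-uncovered facts.
        best = max(hyperedges, key=lambda h: len(hyperedges[h] & uncovered))
        gain = hyperedges[best] & uncovered
        if not gain:
            break  # remaining facts are not covered by any hyperedge
        chosen.append(best)
        uncovered -= gain
    return chosen

# Toy hypergraph grounded in a (hypothetical) agriculture ontology.
H = {
    "e1": {"crop:wheat", "season:winter"},
    "e2": {"crop:wheat", "pest:aphid", "treatment:neem"},
    "e3": {"treatment:neem", "dosage:low"},
}
print(retrieve_hyperedges(H, {"crop:wheat", "pest:aphid", "dosage:low"}))
# -> ['e2', 'e3']
```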
Safety and Alignment in LLMs
Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Tom Tseng, Julius Broomfield, Adam Gleave, Kellin Pelrine
AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models. In contrast to prior work, which is blocked by modern moderation systems or achieved only partial removal of safeguards or degraded output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks. Stronger jailbreak prompts become even more effective in fine-tuning attacks, linking attacks, and potentially defenses, in the input and weight spaces. Not only are current models vulnerable; more recent ones appear to be becoming even more vulnerable to these attacks, underscoring the urgent need for tamper-resistant safeguards. Until such safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: equally capable as the original model, and usable for any malicious purpose within its capabilities.
WebInject: Prompt Injection Attack to Web Agents
Xilong Wang, John Bloch, Zedian Shao, Yuepeng Hu, Shuyan Zhou, Neil Zhenqiang Gong
Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. Environmental prompt injection attacks manipulate the environment to induce the web agent to perform a specific, attacker-chosen action, referred to as the target action, such as clicking on a designated coordinate on the screen. However, existing attacks suffer from limited effectiveness or stealthiness, or are impractical in real-world settings. In this work, we propose EnvInjection, a new attack that addresses these limitations. Our attack adds a perturbation to the raw pixel values of the rendered webpage, which can be implemented by modifying the webpage’s source code. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the target action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and the screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple webpage datasets shows that EnvInjection is highly effective and significantly outperforms existing baselines.
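The optimization described in the abstract has the shape of projected gradient descent run through a learned, differentiable surrogate of the page-to-screenshot mapping. The sketch below shows only that generic structure; the surrogate `f_hat`, the loss function, and the perturbation bound are placeholders assumed for illustration, not the paper’s actual setup.

```python
# Generic PGD loop through a differentiable surrogate f_hat of a
# non-differentiable rendering map. All components are placeholders.
import torch

def pgd_through_surrogate(f_hat, loss_fn, page_pixels,
                          epsilon=8 / 255, step_size=1 / 255, steps=50):
    delta = torch.zeros_like(page_pixels, requires_grad=True)
    for _ in range(steps):
        screenshot = f_hat(page_pixels + delta)  # surrogate of the rendering step
        loss = loss_fn(screenshot)               # objective defined on the screenshot
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()                         # descent step
            delta.clamp_(-epsilon, epsilon)                                # project to L_inf ball
            delta.copy_((page_pixels + delta).clamp(0, 1) - page_pixels)   # keep pixels valid
        delta.grad.zero_()
    return delta.detach()
```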
Special Theme: Interdisciplinary Recontextualization of NLP
[ORAL] From Language to Cognition: How LLMs Outgrow the Human Language Network
Badr AlKhamissi, Greta Tuckute, Yingtian Tang, Taha Osama A Binhuraib, Antoine Bosselut, Martin Schrimpf
Large language models (LLMs) exhibit remarkable similarity to neural activity in the human language network. However, the key properties of language underlying this alignment—and how brain-like representations emerge and change across training—remain unclear. We here benchmark 34 training checkpoints spanning 300B tokens across 8 different model sizes to analyze how brain alignment relates to linguistic competence. Specifically, we find that brain alignment tracks the development of formal linguistic competence—i.e., knowledge of linguistic rules—more closely than functional linguistic competence. While functional competence, which involves world knowledge and reasoning, continues to develop throughout training, its relationship with brain alignment is weaker, suggesting that the human language network primarily encodes formal linguistic structure rather than broader cognitive functions. Notably, we find that the correlation between next-word prediction, behavioral alignment, and brain alignment fades once models surpass human language proficiency. We further show that model size is not a reliable predictor of brain alignment when controlling for the number of features. Finally, using the largest set of rigorous neural language benchmarks to date, we show that language brain alignment benchmarks remain unsaturated, highlighting opportunities for improving future models. Taken together, our findings suggest that the human language network is best modeled by formal, rather than functional, aspects of language.
polyBART: A Chemical Linguist for Polymer Property Prediction and Generative Design
Anagha Savit, Harikrishna Sahu, Shivank S. Shukla, Wei Xiong, Rampi Ramprasad
Designing polymers for targeted applications and accurately predicting their properties is a key challenge in materials science owing to the vast and complex polymer chemical space. While molecular language models have proven effective in solving analogous problems for molecular discovery, similar advancements for polymers are limited. To address this gap, we propose polyBART, a language model-driven polymer discovery capability that enables rapid and accurate exploration of the polymer design space. Central to our approach is Pseudo-polymer SELFIES (PSELFIES), a novel representation that allows for the transfer of molecular language models to the polymer space. polyBART is, to the best of our knowledge, the first language model capable of bidirectional translation between polymer structures and properties, achieving state-of-the-art results in property prediction and design of novel polymers for electrostatic energy storage. Further, polyBART is validated through a combination of both computational and laboratory experiments. We report what we believe is the first successful synthesis and validation of a polymer designed by a language model, predicted to exhibit high thermal degradation temperature and confirmed by our laboratory measurements. Our work presents a generalizable strategy for adapting molecular language models to the polymer space and introduces a polymer foundation model, advancing generative polymer design that may be adapted for a variety of applications.

BGP / Routing security
A first look into long-lived BGP zombies
Iliana Maria Xygkou, Antonios A. Chariton, Xenofontas Dimitropoulos, Alberto Dainotti
Replication: A Two Decade Review of Policy Atoms – Tracing the Evolution of AS Path Sharing Prefixes
Weili Wu, Zachary Bischof, Cecilia Testart, Alberto Dainotti
ru-RPKI-ready: the Road Left to Full ROA Adoption
Deepak Gouda, Romain Fontugne, Cecilia Testart
Mapping resources & infrastructure
Prefix2Org: Mapping BGP Prefixes to Organizations
Deepak Gouda, Alberto Dainotti, Cecilia Testart
Satellite
Assessing LEO Satellite Networks for National Emergency Failover
Vaibhav Bhosale, Ying Zhang, Sameer Kapoor, Robin Kim, Miguel Schlicht, Muskaan Gupta, Ekaterina Tumanova, Zachary Bischof, Fabián E. Bustamante, Alberto Dainotti, Ahmed Saeed

Georgia Tech-Led Papers
Adversarial Attention Perturbations for Large Object Detection Transformers
Zachary Yahn, Selim Tekin, Fatih Ilhan, Sihao Hu, Tiansheng Huang, Yichang Xu, Margaret Loper, Ling Liu
ASCENT: Annotation-free Self-supervised Contrastive Embeddings for 3D Neuron Tracking in Fluorescence Microscopy
Haejun Han, Hang Lu
Clink! Chop! Thud! – Learning Object Sounds from Real-World Interactions
Mengyu Yang, Yiming Chen, Haozheng Pei, Siddhant Agarwal, Arun Vasudevan, James Hays
Contrastive Flow Matching
George Stoica, Vivek Ramanujan, Xiang Fan, Ali Farhadi, Ranjay Krishna, Judy Hoffman
Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
Zhenbang Du, Yonggan Fu, Lifu Wang, Jiayi Qian, Xiao Luo, Yingyan Celine Lin
HyPiDecoder: Hybrid Pixel Decoder for Efficient Segmentation and Detection
Fengzhe Zhou, Humphrey Shi
Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation
Akshay Krishnan, Xinchen Yan, Vincent Casser, Abhijit Kundu
OuroMamba: A Data-Free Quantization Framework for Vision Mamba
Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna
SplatTalk: 3D VQA with Gaussian Splatting
Anh Thai, Kyle Genova, Songyou Peng, Leonidas Guibas, Thomas Funkhouser
T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
Task-Specific Zero-shot Quantization-Aware Training for Object Detection
Changhao Li, Xinrui Chen, Ji Wang, Kang Zhao, Jianfei Chen
Partner-Led Papers
CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
Siyu Jiao, Haoye Dong, Yuyang Yin, Zequn Jie, Yinlong Qian, Yao Zhao, Humphrey Shi, Yunchao Wei
CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He
EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
Gunjan Chhablani, Xiaomeng Ye, Muhammad Zubair Irshad, Zsolt Kira
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi
Modeling Saliency Dataset Bias
Matthias Kümmerer, Harneet Singh Khanuja, Matthias Bethge
One Last Attention for Your Vision-Language Model
Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen
SummDiff: Generative Modeling of Video Summarization with Diffusion
Kwanseok Kim, Jaehoon Hahm, Sumin Kim, Jinhwan Sul, Byung-Hak Kim, Joonseok Lee

ACM SIGCHI Conference on Computer-Supported Cooperative Work & Social Computing
Bergen, Norway | Oct 18–22, 2025

Papers
Advocacy Work
Charismatic Data and Material Traces: Monitoring Bird-Building Collisions through Citizen Science
Ashley Boone, Carl DiSalvo, Christopher Le Dantec
Bird collisions with man-made structures pose a significant threat to bird populations. In [Southern City], a small group of dedicated volunteers track these deaths with hopes of advocating for local policy requiring the use of bird-safe building materials. In addition to recording observations in a mobile application, volunteers log their efforts and collect the bodies of birds they find to add to university specimen collections. We offer a detailed empirical account of the work done by volunteers to produce (1) a digital record of local bird strikes, (2) a log of volunteer monitoring efforts, and (3) a collection of bird specimens. Unpacking the multiple forms of data produced by volunteer efforts, we examine how Project Safe Flight produced data oriented towards advocacy work. We find that Safe Flight data practices are deeply intertwined with the material qualities of these traces: mass, decay, feathers, and charisma. Finally, we discuss implications for data activism, discussing the link between materiality and charismatic data and next steps for action citizen science.
Metrics and Macchiatos: Challenges for Service-Industry Workers and the Need for Worker-Driven ICTs
Xander Koo, Lucy Scott, Amy Bruckman
Nearly 30 million people work in the foodservice and retail industries in the United States, representing approximately 18 percent of the total U.S. workforce. These service-industry workers contend with pressures from algorithmic management and other workplace technologies, yet they typically do not benefit from technologies that might help foster mutual support in the way that white-collar workers do. Recently, Starbucks, a major service-industry employer, has garnered media attention for issues with understaffing, labor law violations, and algorithm-based operations. We conducted interviews with sixteen Starbucks employees about their workplace issues, interactions with technology, and communication practices. These interviews illustrate how workplace technologies worsen existing issues for service-industry workers and how challenges to worker-to-worker communication reduce their capacity to rectify these issues, especially at the cross-store level. Our participants want better communication with other workers, such as through labor unions or new information and communication technologies (ICTs), to help improve their working conditions. We discuss how HCI scholars can use action research to help design localized, worker-driven ICTs to facilitate more connectivity and collaborative practices outside of the workplace. We conclude by outlining our ongoing work studying and designing ICTs for service-industry workers.
AI Applications for Safety and Support
“Poker with Play Money”: Exploring Psychotherapist Training with Virtual Patients
Cynthia Baseman, Masum Hasan, Nathaniel Swinger, Sheila Rauch, Ehsan Hoque, Rosa Arriaga
Role-play exercises are widely utilized for training across a variety of domains; however, they have many shortcomings, including low availability, resource intensity, and lack of diversity. Large language model-driven virtual agents offer a potential avenue to mitigate these limitations and offer lower-risk role-play. The implications, however, of shifting this human-human collaboration to human-agent collaboration are still largely unexplored. In this work we focus on the context of psychotherapy, as psychotherapists-in-training extensively engage in role-play exercises with peers and/or supervisors to practice the interpersonal and therapeutic skills required for effective treatment. We provide a case study of a realistic “virtual patient” system for mental health training, evaluated by trained psychotherapists in comparison to their previous experiences with both real role-play partners and real patients. Our qualitative, reflexive analysis generated three themes and thirteen subthemes regarding key interpersonal skills of psychotherapy, the utility of the system compared to traditional role-play techniques, and factors which impacted the psychotherapist-perceived “humanness” of the virtual patient. Although psychotherapists were optimistic about the system’s potential to bolster therapeutic skills, this utility was impacted by the extent to which the virtual patient was perceived as human-like. We leverage the Computers Are Social Actors framework to discuss human–virtual-patient collaboration for practicing rapport, and discuss challenges of prototyping novel human-AI systems for clinical contexts which require a high degree of unpredictability. We pull from the “SEEK” three-factor theory of anthropomorphism to stress the importance of adequately representing a variety of cultural communities within mental health AI systems, in alignment with decolonial computing.
The Practice of Online Peer Counseling and the Potential for AI-Powered Support Tools
Tony Wang, Amy Bruckman, Diyi Yang
What challenges do volunteers providing peer support in online mental health platforms (OMHPs) face in operating and growing their communities? How could the HCI community develop human-AI systems to help? Recent work on online peer counseling has led to the development of novel AI tools for conversational interaction, but it remains unknown how such technology can fit into existing practices. In this research, we conducted interviews and design exercises with seventeen peer counselors from 7 Cups of Tea, a large online therapy and counseling platform, to design tools — AI or not — that resolve challenges that arise from day-to-day community practices. Participant responses suggest three classes of tools that could improve online peer counseling: real-time decision support, productivity, and management and training. Investigation of design motivations surfaced four practice-based challenges including chat interface limitations, difficulties in support seeker management, fragmented contexts of practice, and lack of visibility due to privacy concerns. Based on counselors’ discussion of benefits and risks associated with AI features in the tools they designed, we offer suggestions for research on AI tools that build on peer counseling practices, and connect our findings with broader implications about online peer counseling as a form of volunteer-based mental health practice.
The Typing Cure: Experiences with Large Language Model Chatbots for Mental Health Support
Inhwa Song, Sachin Pendse, Neha Kumar, Munmun De Choudhury
People experiencing severe distress increasingly use Large Language Model (LLM) chatbots as mental health support tools. Discussions on social media have described how engagements were lifesaving for some, but evidence suggests that general-purpose LLM chatbots also have notable risks that could endanger the welfare of users if not designed responsibly. In this study, we investigate the lived experiences of people who have used LLM chatbots for mental health support. We build on interviews with 21 individuals from globally diverse backgrounds to analyze how users create unique support roles for their chatbots, fill in gaps in everyday care, and navigate associated cultural limitations when seeking support from chatbots. We ground our analysis in psychotherapy literature around effective support, and introduce the concept of therapeutic alignment, or aligning AI with therapeutic values for mental health contexts. Our study offers recommendations for how designers can approach the ethical and effective use of LLM chatbots and other AI mental health support tools in mental health care.
Beyond AI: Additional Considerations for Enhancing Healthcare
[HONORABLE MENTION] Bridging Ontologies of Neurological Conditions: Towards Patient-centered Data Practices in Digital Phenotyping Research and Design
Jianna So, Faye Yang, Krzysztof Gajos, Naveena Karusala, Anoopum Gupta
Amidst the increasing datafication of healthcare, deep digital phenotyping is being explored in clinical research to gather comprehensive data that can improve understanding of neurological conditions. However, participants currently do not have access to this data due to researchers’ apprehension around whether such data is interpretable or useful. This study focuses on patient perspectives on the potential of deep digital phenotyping data to benefit people with neurodegenerative diseases, such as ataxias, Parkinson’s disease, and multiple system atrophy. We present an interview study (n=12) to understand how people with these conditions currently track their symptoms and how they envision interacting with their deep digital phenotyping data. We describe how participants envision the utility of this deep digital phenotyping data in relation to multiple stages of disease and stakeholders, especially its potential to bridge different and sometimes conflicting understandings of their condition. Looking towards a future in which patients have increased agency over their data and can use it to inform their care, we contribute implications for shaping patient-driven clinical research practices and deep digital phenotyping tools that serve a multiplicity of patient needs.
Care Work
Jiaying “Lizzy” Liu, Shuer Zhuo, Xingyu Li, Andrew Dillon, Noura Howell, Angela D. R. Smith, Yan Zhang
Enhancing emotional well-being has become an important focus in HCI and CSCW, with technologies increasingly designed to track, visualize, and manage emotions. However, these approaches have faced criticism for potentially suppressing certain emotional experiences. Through a scoping review of 53 empirical studies from ACM proceedings implementing Technology-Mediated Emotion Intervention (TMEI), we critically examine current practices through lenses drawn from HCI critical theories. Our analysis reveals emotion intervention mechanisms that extend beyond traditional “emotion regulation” paradigms, identifying care-centered goals that prioritize non-judgmental emotional support and preserve users’ identities. The findings demonstrate how researchers design technologies to generate artificial care, intervene in power dynamics, and nudge behavioral changes. We contribute the concept of “emotion support” as an alternative approach to “emotion regulation,” emphasizing human-centered approaches to emotional well-being. This work advances the understanding of diverse human emotional needs beyond individual and cognitive perspectives, offering design implications that critically reimagine how technologies can honor emotional complexity, preserve human agency, and transform power dynamics in care contexts.
Caregiving & Caregivers
Kefan Xu, Cynthia Baseman, Nathaniel Swinger, Myeonghan Ryu, Rosa Arriaga
Informal caregivers perform an important role in taking care of family members with chronic disease. Informal caregivers’ mental health can be negatively impacted by life-changing events (e.g., patients’ diagnosis, care transitioning, etc.). This can lead caregivers to suffer from interpersonal and intrapersonal conflicts, causing a sense of disorientation and escalating malaise. In this study, we investigated informal caregivers’ experiences of facing conflicts and life-changing events by qualitatively analyzing data from online health communities. We categorized conflicts using a psychodynamic framework. We further looked at the interplay of life-changing events and conflicts and how this leads to caregivers’ sense-making and decisions to mediate conflicts. We also found that online health communities provide support by helping caregivers interpret and navigate conflicts and by raising awareness of the temporal resolution of life-changing events. We conclude by discussing how online health communities can be designed to better support these practices.
Caring at a Distance
Lan Gao, Munmun De Choudhury, Jennifer Kim
In remote psychotherapy, challenges arising from remote client-therapist interactions can impact the therapeutic alliance and overall outcomes. HCI research has focused on leveraging sensing technology to bridge gaps in remote interactions. In this work, we investigate the values and risks of integrating sensing technology into remote psychotherapy, specifically to capture and interpret non-verbal cues, by conducting a speculative design study with both clients and therapists. Our findings reveal that sensing technology has the potential to facilitate self-reflection in therapy. The sharing of tracked non-verbal cues could also foster mutual disclosure, supporting therapists’ judgments and balancing power dynamics between clients and therapists. However, clients and therapists were concerned about the accuracy of sensing systems, potential privacy threats, and additional cognitive burden. Our insights into system values suggest how sensing technology could balance power dynamics in client-therapist relationships as well as general interpersonal relationships. We also emphasize that sensing-technology-empowered communication warrants greater consideration in remote psychotherapy than in non-vulnerable settings.
Helping the Helper: Supporting Peer Counselors via AI-Empowered Practice and Feedback
Shang-Ling Hsu, Raj Shah, Prathik Senthil, Zahra Ashktorab, Casey Dugan, Werner Geyer, Diyi Yang
Millions of users come to online peer counseling platforms to seek support. However, studies show that online peer support groups are not always as effective as expected largely due to users’ negative experiences with unhelpful counselors. Peer counselors are key to the success of online peer counseling platforms, but most often do not receive appropriate training. Hence, we introduce CARE: an AI-based tool to empower and train peer counselors through practice and feedback. Concretely, CARE helps diagnose which counseling strategies are needed in a given situation and suggests example responses to counselors during their practice sessions. Building upon the Motivational Interviewing framework, CARE utilizes large-scale counseling conversation data with text generation techniques to enable these functionalities. We demonstrate the efficacy of CARE by performing quantitative evaluations and qualitative user studies through simulated chats and semi-structured interviews, finding that CARE especially helps novice counselors in challenging situations. The code is available at https://app.box.com/s/z3a4dwgmeqfy8vbzi9cgmg0yhn6t4j53.
Core Concepts in Privacy Research
Measuring, Modeling, and Helping People Account for Privacy Risks in Online Self-Disclosures with AI
Isadora Krsek, Anubha Kabra, Yao Dou, Tarek Naous, Laura Dabbish, Alan Ritter, Wei Xu, Sauvik Das
In pseudonymous online fora like Reddit, the benefits of self-disclosure are often apparent to users (e.g., I can vent about my in-laws to understanding strangers), but the privacy risks are more abstract (e.g., will my partner be able to tell that this is me?). Prior work has sought to develop natural language processing (NLP) tools that help users identify potentially risky self-disclosures in their text, but none have been designed for or evaluated with the users they hope to protect. Absent this assessment, these tools will be limited by the social-technical gap: users need assistive tools that help them make informed decisions, not paternalistic tools that tell them to avoid self-disclosure altogether. To bridge this gap, we conducted a study with $N=21$ Reddit users; we had them use a state-of-the-art NLP disclosure detection model on two of their own posts, and asked them questions to understand if and how the model helped, where it fell short, and how it could be improved to help them make more informed decisions. Despite its imperfections, users responded positively to the model and highlighted its use as a tool that can help them catch mistakes, inform them of risks they were unaware of, and encourage self-reflection. However, our work also shows how, to be useful and usable, AI for supporting privacy decision-making must account for posting context, disclosure norms, and users’ lived threat models, and provide explanations that help contextualize detected risks.
Data Visualization
Arpit Narechania, Alex Endert, Clio Andris
Choropleth maps are a common and effective way to visualize geographic thematic data. Although cartographers have established many principles about map design, data binning and color usage, less is known about how mapmakers make individual decisions in practice. We interview 16 cartographers and geographic information systems (GIS) experts from 13 government organizations, NGOs, and federal agencies about their choropleth mapmaking decisions and workflows. We categorize our findings and report on how mapmakers follow cartographic guidelines and personal rules of thumb, collaborate with other stakeholders within and outside their organization, and how organizational structures and norms are tied to decision-making during data preparation, data analysis, data binning, map styling, and map post-processing. We find several points of variation as well as regularity across mapmakers and organizations and present takeaways to inform cartographic education and practice, including broader implications and opportunities for CSCW, HCI, and information visualization researchers and practitioners.
Designing for Privacy
Design(ing) Fictions for Collective Civic Reporting of Privacy Harms
Yuxi Wu, William Agnew, W. Keith Edwards, Sauvik Das
Individually-experienced privacy harms are often difficult to demonstrate and quantify, which impedes efforts for their redress. Their effects often appear small and are inconsistently documented, and they only become more obvious when aggregated over time and across populations. Taking a design fiction approach, we explore the design requirements and cultural ideals of a government-run system that empowers people to collectively report on and make sense of experiences of privacy harm from online behavioral advertising. Through the use of fictional inquiry, story completion, and comicboarding methods, delivered in an online survey with 50 participants, we found that participants had detailed conceptions of the user experience of such a tool, but wanted assurance that their labor and personal data would not be exploited further by the government if they contributed evidence of harm. We extrapolate these design insights to government-supported complaint-reporting platforms in other domains, finding multiple common design gaps that might disincentivize people to report experiences of harm, be they privacy-related or otherwise.
Fighting Misinformation, Building Believability
Mohsin Yousufi, Charlotte Alexander, Nassim Parvin
Marginalized groups often face situations in which their knowledge and experiences are dismissed due to prejudice or bias—a phenomenon identified and theorized as epistemic injustice in feminist philosophy. These circumstances frequently compel individuals to produce additional evidence to support their claims, ranging from paper documentation to data generated by technologies such as location logs. This paper examines the case of Heat Seek, an internet-connected temperature sensor designed to provide tenants in New York City with “objective and reliable data” when filing heating complaints and appearing in housing court. We present findings from a qualitative study, supplemented by document review and artifact analysis, to illuminate the tool’s functions and uses. Drawing on this case, we introduce a class of civic technologies—credibility boosters. We find that these technologies aim to overcome credibility deficits by: (1) backing individual and collective claims with objective data, (2) materializing intangible experiences as tangible evidence with aesthetic reliability, and (3) shifting epistemic authority to perceived neutral third parties. We conclude by demonstrating the institutional and social impacts of such technologies and call for greater attention to epistemic injustices within CSCW research, advocating for the design of institutional, legal, and social systems that confront biased systems and empower marginalized communities.
Harassment & Micro-Aggressions
Lara Karki, Kayla Uleah, Carl DiSalvo, Sierra Traynail Ross, Jadin Butler, Selamawit Husein, Emanuel Bryant, Dana Priest, Justin Booker, Betsy DiSalvo
LinkedIn is central to salaried job search and professional networking. In a career development program for adults seeking upward socioeconomic mobility through middle-wage computing work, we aimed to use LinkedIn to find and develop new social ties. However, we could not use the platform for this purpose. Through a participatory research approach, we formed a research team with diverse positionalities to understand why LinkedIn was difficult to use and how it could be better for our program. We analyzed recorded walk-throughs and confirmed our findings with two years of ethnographic field notes and written reflections. Our findings demonstrate that LinkedIn’s embedded algorithms and interface design prioritize users with large networks who can afford a LinkedIn Premium subscription. We argue that such platform-embedded power differentials lead to platform-delivered microaggressions. Non-Premium users and users with small networks must endure microaggressions to participate in the salaried labor market. We argue the politics of LinkedIn as a platform are such that its embedded power differentials are beyond our control and unlikely to change. Therefore, we recommend sociotechnical coping and mitigation strategies for career development programs in lieu of design implications for LinkedIn or similar platforms. We contribute a detailed example of how a technology reinforces pre-existing privilege without users’ knowledge.
Hate Speech
[BEST PAPER] Harm in Layers: Compositions of Misinformative Hate in Anti-Asian Speech and Their Impacts on Perceived Harmfulness
Jiawei Zhou, Gaurav Verma, Lei Zhang, Nicholas Chang, Munmun De Choudhury
During times of crisis, heightened anxiety and fear make individuals more vulnerable, creating fertile ground for hate speech and misinformation, as people are more likely to fall for and be influenced by such content. This paper looks into the interwoven relationship between anti-Asian hatred and COVID-19 misinformation amid the pandemic. By analyzing 785,798 anti-Asian hate tweets and surveying 308 diverse participants, this empirical study explores how hateful content portrays the Asian community, whether it is based on truth, and what makes such portrayal harmful. We observed a high prevalence of misinformative hate speech that appeared to be lengthier, less emotional, and carried more pronounced motivational drives than general hate speech. Overall, we found that anti-Asian rhetoric was characterized by antagonism and inferiority framing, with misinformative hate underscoring antagonism and general hate emphasizing calls for action. Among all entities being explicitly criticized, China and the Chinese were constantly named to assign blame, with misinformative hate more likely to finger-point than general hate. Our survey results indicated that hateful messages with misinformation, demographic targeting, or divisive references were perceived as significantly more damaging. Individuals who placed less importance on free speech, had personal encounters with hate speech, or believed in the natural origin of COVID-19 were more likely to perceive higher severity. Taken together, this work highlights the distinct compositions of hate within misinformative hate speech that influence perceived harmfulness and add to the complexity of defining and moderating harmful content. We discuss the implications for designing more contextualized and culturally sensitive counter-strategies, as well as building more adaptive, explainable moderation approaches.
Humanized AI: Avatars, Agents, and Voice Assistants
Virtual agent-based communication skills training to facilitate health persuasion among peers
Farnaz Nouraei, Keith Rebello, Mina Fallah, Prasanth Murali, Haley Matuszak, Valerie Jap, Andrea Parker, Michael Paasche-Orlow, Timothy Bickmore
Many laypeople are motivated to improve the health behavior of their family or friends but do not know where to start, especially if the health behavior is potentially stigmatizing or controversial. We present an approach that uses virtual agents to coach community-based volunteers in health counseling techniques, such as motivational interviewing, and allows them to practice these skills in role-playing scenarios. We use this approach in a virtual agent-based system to increase COVID-19 vaccination by empowering users to influence their social network. In a between-subjects comparative design study, we test the effects of agent system interactivity and role-playing functionality on counseling outcomes, with participants evaluated by standardized patients and objective judges. We find that all versions are effective at producing peer counselors who score adequately on a standardized measure of counseling competence, and that participants were significantly more satisfied with interactive virtual agents compared to passive viewing of the training material. We discuss design implications for interpersonal skills training systems based on our findings.
Identifying and Mitigating AI Risks
A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health
Jiawei Zhou, Amy Chen, Darshi Shah, Laura Schwab Reese, Munmun De Choudhury
Recent breakthroughs in large language models (LLMs) have generated both interest and concern about their potential adoption as information sources or communication tools across different domains. In public health, where stakes are high and impacts extend across diverse populations, adopting LLMs poses unique challenges that require thorough evaluation. However, structured approaches for assessing potential risks in public health remain under-explored. To address this gap, we conducted focus groups with public health professionals and individuals with lived experience to unpack their concerns, situated across three distinct and critical public health issues that demand high-quality information: infectious disease prevention (vaccines), chronic and well-being care (opioid use disorder), and community health and safety (intimate partner violence). We synthesize participants’ perspectives into a risk taxonomy, distinguishing and contextualizing the potential harms LLMs may introduce when positioned alongside traditional health communication. This taxonomy highlights four dimensions of risk: to individuals, to human-centered care, to the information ecosystem, and to technology accountability. For each dimension, we discuss specific risks and offer example reflection questions to help practitioners adopt a risk-reflexive approach. We discuss the need to revisit pre-existing mental models of help-seeking and complement evaluations with external validity and domain expertise through lived experience and real-world practices. Together, this work contributes a shared vocabulary and reflection tool for people in both computing and public health to collaboratively anticipate, evaluate, and mitigate risks in deciding when to employ LLM capabilities (or not) and how to mitigate harm.
Partisan Discourse Online
Pooja Casula, Richmond Wong
Social media platforms have been widely perceived as centers of political discourse, and have been shown to facilitate political participation among young adults (18-26 years). However, as the effects of online political discourse and behaviors have become pervasive offline, significantly affecting global political processes such as deterring women from public political office and influencing election outcomes, questions arise regarding how young adult users engage in these online political spaces of discourse. In this paper, we focus on the perceptions and forms of engagement of Gen Z social media users, specifically Gen Z young adult women, and broadly ask: how do voting-age Generation (Gen) Z young adult women perceive spaces of political discourse on social media, and do these perceptions affect how they choose to engage in them? To explore this question, we conducted 17 interviews with voting-age Gen Z women across the United States. We found that our participants were largely critical of social media as spaces of political discourse. They were skeptical of the credibility of the political information shared on social media, questioned the usefulness of sharing political information through social media, and felt that social media was not conducive to having productive political discussions. We find that participants’ perceptions of social media political discourse led them to limit their online engagement or disengage entirely from online public political spaces, while expanding their offline private political engagement through in-person discussion. Our findings indicate that our participants were not politically disinterested, but rather did not partake in public forms of social media political engagement, leading us to question and reconsider widespread interpretations of ‘political participation’ that center and emphasize public forms of action and expression. Drawing on our findings, we propose that the practice of ‘disengagement’ from public spaces of online political discourse should be considered a dimension of political engagement rather than separate from it. In proposing this, we also broadly question the efficacy of social media as a forum to promote and facilitate political discourse.
The Role of Partisan Culture in Mental Health Language Online
Sachin Pendse, Ben Rochford, Neha Kumar, Munmun De Choudhury
The impact of culture on how people express distress in online support communities is increasingly a topic of interest within Computer Supported Cooperative Work (CSCW) and Human-Computer Interaction (HCI). In the United States, distinct cultures have emerged from each of the two dominant political parties, forming a primary lens by which people navigate online and offline worlds. We examine whether partisan culture may play a role in how U.S. Republican and Democrat users of online mental health support communities express distress. We present a large-scale observational study of 2,184,356 posts from 8,916 statistically matched Republican, Democrat, and unaffiliated online support community members. We utilize methods from causal inference to statistically match partisan users along covariates that correspond with demographic attributes and platform use, in order to create comparable cohorts for analysis. We then leverage methods from natural language processing to understand how partisan expressions of distress compare between these sets of closely matched opposing partisans, and between closely matched partisans and typical support community members. Our data spans January 2013 to December 2022, a period of both rising political polarization and mental health concerns. We find that partisan culture does play into expressions of distress, underscoring the importance of considering partisan cultural differences in the design of online support community platforms.
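A minimal sketch of the kind of cohort construction described above: fit a propensity model on user covariates and greedily pair each user in one cohort with the nearest-propensity user in the other. The covariates, the logistic propensity model, and 1:1 matching without replacement are illustrative assumptions; the paper’s exact matching procedure may differ.

```python
# Propensity-score nearest-neighbor matching sketch for building comparable cohorts.
import numpy as np
from sklearn.linear_model import LogisticRegression

def match_cohorts(X, group):
    """X: (n, d) covariate matrix (e.g., demographic and platform-use features);
    group: binary array, 1 for one partisan cohort, 0 for the comparison cohort.
    Returns (i, j) index pairs matched on estimated propensity scores."""
    propensity = LogisticRegression(max_iter=1000).fit(X, group).predict_proba(X)[:, 1]
    treated = np.flatnonzero(group == 1)
    control = list(np.flatnonzero(group == 0))
    pairs = []
    for i in treated:
        if not control:
            break
        j = min(control, key=lambda c: abs(propensity[i] - propensity[c]))
        pairs.append((i, j))
        control.remove(j)  # match without replacement
    return pairs
```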
Reflecting on Methodology
Reflexive Data Walks: Cultivating Feminist Ethos through Place-Based Inquiry
Sylvia Janicki, Shubhangi Gupta, Nassim Parvin
Reflexivity, as conceived by feminist epistemologies, is essential to advancing social justice design practice. Reflexivity is thus critical for CSCW and HCI scholars and practitioners who seek to build equitable technological futures, as it allows for a critical examination of explicit and implicit values and politics in design and research processes. In this paper, we put forth a participatory walking method grounded in feminist ethos for cultivating reflexivity by engaging with the theme of boundaries in space. We outline this method through three integrated place-based strategies: an activity in the home, a data walk in the city, and making and sharing visualizations for collaborative understandings of place. We argue that engaging with place is critical to foregrounding positionality and cultivating reflexivity in research. We share our findings from two workshops where we examined the efficacy of this method. We outline how the method deepens understandings of the built environment, self, and others; welcomes vulnerability and fosters openness to change; and scaffolds practices of critical self-questioning. In doing so, it leads to a recognition of the entanglement of socio-political values in design and data creation, revealing uncertainties and ambiguities that can open up new areas for inquiry and design.
Social and Environmental Justice
[HONORABLE MENTION] Sustaining Workers Who Sustain the World: Assets-Based Design for Conservation Technologies in Madagascar
Eric Greenlee, David Klinges, Lalatiana Randriamiharisoa, Kim Valenta, Jhoanny Rasojivola, Justorien Rambeloniaina, Nicolas Naina Rasolonjatovo, Georges Razafindramavo, Joel Ratsirarson, Zovelosoa Raharinavalomanana, Edouard Ramahatratra, Abigail Ross, Thomas Kelly, Jean Claude Rakotoarivelo, Tafitasoa Mijoro, Eric Tsiriniaina Rajoelison, Efitiria Efitiria, Josiah Hester, Ellen Zegura, Alex Cabral
Local workers and their knowledge are essential for sustainable and effective conservation efforts. However, many technology-assisted conservation programs are guided by global benchmarks (e.g., forest cover) and industry metrics (e.g., cost per acre), which often devalue local knowledge and fail to consider the economic and conservation goals of local workers. Assets-based design is well-suited to center workers and their strengths, yet it may fail to fully address the complexities of long-term conservation programs by not explicitly emphasizing workers’ goals or bolstering their assets. We extend recent approaches in assets-based design literature that address these limitations through our case studies of reforestation, biodiversity monitoring, and carbon sequestration programs in three protected areas in Madagascar. We leverage a mixed-methods approach of direct reactive observations, unstructured interviews, and an informal design workshop, revealing emergent themes surrounding economic sustainability and the value of local ecological knowledge in conservation. Finally, we explore examples, tensions, and design considerations for worker-centered conservation technology to: (1) prioritize local knowledge, (2) foster love of nature, (3) center economic goals, and (4) embrace local autonomy. This work advances the dialogue on assets-based design, promoting the co-creation of equitable and sustainable conservation technologies with workers in Global South settings by centering local economic priorities and enhancing workers’ strengths.
Camille Harris, Clio Andris
In 2021, the City of Atlanta and the Atlanta Police Foundation launched joint plans to build a large police training facility in the South River Forest in unincorporated DeKalb County, GA. At this time, residents of Atlanta and DeKalb County, environmental activists, police and prison abolitionists, and other activists and concerned individuals formed the movement in opposition to the facility, known as the Stop Cop City / Defend the Atlanta Forest movement. Social media and digital maps became common tools for communicating information about the facility and the movement. In this work, we examine online maps about the facility and the opposition movement, originating from grassroots organizations, the City of Atlanta, news media outlets, the Atlanta Police Foundation, and individuals. We gather and examine 32 publicly available maps collected through the Google Search API, Twitter (now X), Instagram, and Reddit. Then, using a framework of critical cartography, we conduct a content analysis of these maps to identify the mapping technologies and techniques (data, cartographic elements, styles) used by different stakeholders in the construction of the facility and the roles that maps and mapping technologies can play in social movements. Finally, we examine the extent to which these maps provide data to confirm or dispute concerns raised by grassroots organizations and local residents about the facility. We argue that documenting the use of maps to communicate information about a contentious project can help enumerate positions and perspectives about community issues. We find that the use of (and access to) geo-spatial technologies is uneven across stakeholders and mapmakers, and we advocate for accessible mapmaking tools. We conclude by discussing the implications of the accessibility of mapping technology and of posting maps to social media, and share example map images that extend the geographic information systems (GIS) techniques seen in the retrieved maps.
Supporting Older Adults’ Care
[HONORABLE MENTION] Rethinking Technological Solutions for Community-Based Older Adult Care: Insights from `Older Partners’ in China
Yuling Sun, Sam Ankenbauer, Yuchen Chen, Xiaojuan Ma, Zhifan Guo, Liang He
Aging in place refers to the enabling of individuals to age comfortably and securely within their own homes and communities. Continued community living creates a number of potential areas for design and, accordingly, various information and communication technologies have been employed to support older adult care. At the same time, human-led care services have been designed to support aging in place. Through a long-term ethnographic study that includes semi-structured interviews with 24 stakeholders, we consider these technology- and human-driven care infrastructures for aging in place, examining their origins, deployment, interactions with older adults, and challenges. In doing so, we reconsider the value of these different forms of older adult care, highlighting the various issues associated with using, for instance, health monitoring technology or appointment scheduling systems to care for older adults aging in place. We suggest that technology should take a “supportive, not substitutive” role in older adult care infrastructure and that designing for aging in place should not be synonymous with designing for independence but should, instead, consider the larger community and its dynamics.
Team Work Makes the Dream Work
Nathaniel Swinger, Cynthia Baseman, Myeonghan Ryu, Saeed Abdullah, Christopher Wiese, Andrew Sherrill, Rosa Arriaga
The mental health crisis in the United States spotlights the need for more scalable training for mental health workers. While present-day AI systems have sparked hope for addressing this problem, we must not be too quick to incorporate or solely focus on technological advancements. We must ask empirical questions about how to ethically collaborate with and integrate autonomous AI into the clinical workplace. For these Human-Autonomy Teams (HATs), poised to make the leap into the mental health domain, special consideration around the construct of trust is in order. A reflexive look toward the multidisciplinary nature of such HAT projects illuminates the need for a deeper dive into varied stakeholder considerations of ethics and trust. In this paper, we investigate the impact of domain—and the ranges of expertise within domains—on ethics- and trust-related considerations for HATs in mental health. We outline our engagement of 23 participants in two speculative activities: design fiction and factorial survey vignettes. Grounded by a video storyboard prototype, AI- and Psychotherapy-domain experts and novices alike imagined TEAMMAIT, a prospective AI system for psychotherapy training. From our inductive analysis emerged 10 themes surrounding ethics, trust, and collaboration. Three can be seen as substantial barriers to trust and collaboration, where participants imagined they would not work with an AI teammate that didn’t meet these ethical standards. Another five of the themes can be seen as interrelated, context-dependent, and variable factors of trust that impact collaboration with an AI teammate. The final two themes represent more explicit engagement with the prospective role of an AI teammate in psychotherapy training practices. We conclude by evaluating our findings through the lens of Mayer et al.’s Integrative Model of Organizational Trust to discuss the risks of HATs and adapt models of ability-, benevolence-, and integrity-based trust. These updates motivate implications for the design and integration of HATs in mental health work.
Trauma & Abuse
Making Sense of Trauma Over Time: Interweaving Feminist Temporalities to Understand Histories
Catherine Wieczorek, Cindy Lin, Shaowen Bardzell
Trauma, an emotional response to events with lasting impacts, is a significant public health issue influencing technology interactions. This paper focuses on the sixth principle of trauma-informed care—Cultural, Historical, and Gender Issues—by exploring multiple timescales of trauma and generational impacts through two ethnographic vignettes: a trauma-informed healthcare design project in Chicago and environmental advocacy in Borneo, Indonesia. We integrate feminist temporality to understand temporal contingencies in cultural contexts to inform future trauma-informed design and computing work. Our contributions include detailed ethnographic accounts that shift the focus from trauma as an individual event to a historically and communally felt phenomenon, advancing CSCW scholarship by incorporating historicist sensibilities and feminist theorizations of temporality.
More Research
Doctoral Consortium
The Mechanisms of Muting: Deconstructing the Technology-Mediated Violence of Silence
Jasmine Foriest
This research addresses a critical gap in HCI: while the field engages with “harm,” it inadequately conceptualizes “violence.” One gap lies in how digital artifacts mediate structural violence through muting. Muting — the systemic silencing of marginalized groups — prevents vulnerable populations from accessing potentially life-saving resources and results in preventable morbidity and mortality. Drawing from Muted Group Theory, I demonstrate how technologies imbued with dominant values amplify muting in unprecedented ways through information suppression in suicide reporting, social-computing design that silences gender-based violence survivors, and epistemic inequity perpetuated by generative AI. My dissertation employs survivor-centered mixed methods — surveys, narrative interviews, and phenomenological analysis — to understand how intimate partner violence survivors use digital artifacts in help-seeking. This work will produce the first empirical understanding of relationships between muting experiences and adverse outcomes, alongside design recommendations for remediating muting in help-seeking technologies. My goal is to establish cross-disciplinary approaches to violence prevention through ethical technology design.
Panels / SIGs
PANEL: Computing and the Arts: Establishing Theoretical and Methodological Foundations for Cross-Disciplinary Collaboration
Angela Schöpke-Gonzalez, Kellie Dunn, Shaowen Bardzell, Federico Bomba, Barbara Carreras, Makayla Lewis, Maria Murray
The last five years have resulted in substantial changes to how computing affects work, how work affects computing, and how work and computing operate in tandem to affect society. From advances in automation, artificial intelligence, and virtual/extended reality, to the entrenchment of hybrid and remote work arrangements, and the documented harmful societal impacts that computing work has produced, these changes to computing-work relationships raise concerns and opportunities to reimagine these relationships in new ways. CSCW has an opportunity and a responsibility to ensure that the kinds of futures we imagine and enact benefit workers, communities, and future generations. Artistic research is well-positioned to help us not only understand, but also imagine, new pathways forward in response to pressing CSCW questions. By hosting a panel of experts in artistic methods well-equipped to help us imagine these futures, we expect to lay the groundwork for mutually respectful cross-disciplinary collaboration between arts and computing that makes more space in our field for different kinds of thinking, approaches to problems, and new imaginaries.
SIG: Alternative Technology Consumption Under Capitalism
Yuxi Wu, Beatriz Palacios Abad, Vishal Sharma, Hanlin Li, Alexandra To
Even as large technology companies come under increasing legal and political scrutiny, their market dominance continues to grow. As Big Tech tends toward monopoly, however, people continue to seek out alternative technology systems and uses. What are the conditions that lead people to choose alternatives? What are the long-term values associated with having viable alternatives? This SIG presents alternative technology, or AltTech, as a growing area of interest for the CSCW community to consider. We invite community members with interests in technology non-use, design for disruption, and post-growth design to join us for a sketch-based speculative discussion to better understand the landscape and future of AltTech.
SIG: Conducting Research in Oppressive Settings
Adrian Petterson, Benedetta Lusi, Cristina Bosco, Ashique Ali Thuppilikkat, Anupriya Tuli, Catherine Wieczorek, Robert Soden, Emily Tseng, Priyank Chandra
As justice-related research faces increasing transnational and domestic repression, researchers working on topics like reproductive justice, LGBTQ2SIA+ equity, decolonization, climate justice, and social movements encounter escalating constraints and risks. While the CSCW community has increasingly advocated for research in these domains, the current political climate exacerbates the precarity experienced by scholars engaged in this work. Institutional mechanisms such as ethics approvals frequently fail to address researchers’ safety concerns, particularly for those from marginalized communities themselves. Collaborators within the same project experience varying levels of risk based on location, career stage, and identity. This Special Interest Group (SIG) will facilitate dialogue on practical strategies for conducting research under oppressive contexts, drawing on expertise from researchers who have developed survival and safety tactics. Discussions will address data storage practices, visibility considerations, transnational collaboration strategies, and psychological safety mechanisms. Our goal is to establish a collaboratively curated resource collection supporting researchers as they navigate oppressions in their collaborations, recognizing these threats continue to grow in scale and intensity.
Posters
From Hashtag to Human-Centered Insights: Rethinking Disability Awareness Across Languages
Zainab AlMeraj, Fatemah Husain, Rosa Arriaga
As global discourse on disability expands, much of the digital awareness and inclusion effort remains anchored in English-language narratives. This linguistic dominance limits our understanding of how disability is perceived, discussed, and mobilized across culturally diverse regions — particularly within underrepresented communities in the Global South. This study investigates cross-lingual and cross-cultural perspectives on disability awareness by analyzing three years of public posts from X (formerly Twitter), using the hashtag #peoplewithdisabilities. Through natural language processing (NLP), we examine (1) posting behaviors and engagement dynamics, (2) sentiment and empathy-oriented language, and (3) culturally embedded narrative framings in both Arabic and English content. Our interdisciplinary lens, drawing from computational linguistics and disability studies, allows us to interpret trends beyond surface metrics. Findings reveal that Arabic posts often reflect familial, religious, and collectivist viewpoints rooted in local cultural values, while English posts emphasize rights-based advocacy and individual empowerment. Emotional expression and engagement patterns also diverge, highlighting that awareness itself is not universal but culturally constructed and contextually nuanced. We argue that designing inclusive technologies requires more than linguistic translation; it demands sensitivity to the cultural frameworks shaping disability discourse.
Workshops
Structuring Collaborative Reflection: Integrating Diary Study and Focus Group Discussion
Jixiang Fan, Jiacheng Zhao, Sunggyeol Oh, Michael Bolmer, Yoonje Lee, Nick Flammer, Yuhao Chen, D. Scott McCrickard
We present a structured reflection framework integrating diary study and focus group discussion to support collaborative meaning-making in HCI education. The framework follows a multi-phase design in which students progress from individual journaling to a two-stage group discussion sequence: first within shared application contexts, then across emergent experiential themes. To support this process, we extended DiaryQuest, a lightweight educational tool incorporating AI-assisted grouping, image-based prompts, and a Jigsaw-inspired workflow to scaffold participation. A preliminary classroom deployment with 11 undergraduate students suggests that the approach lowers the barrier to reflective dialogue, encourages cross-perspective engagement, and helps students surface design-relevant insights grounded in lived experience. These findings point to new opportunities for structuring reflection in sociotechnical learning environments.
CSCW Contributions to Critical Futures of Work
Alina Lushnikova, Michael Muller, Shaowen Bardzell, Toby Li, Saiph Savage
As the CSCW community evolves and participates in envisioning the impact of technologies on work practices, we want to ensure that critical and alternative computing perspectives are well represented while we are co-constructing the future of work. In this hybrid workshop, we invite researchers, practitioners, civic actors, economists, and other interested parties to challenge dominant, powerful, status-quo narratives and imaginaries when considering the future of work, while nurturing CSCW commitments and methods. Co-constructing the workshop with participants, we aim to develop actionable insights and strengthen the community.
Exploring Resistance and Other Oppositional Responses to AI
Eric Baumer, Inha Cha, Vera Khovanskaya, Rosemary Steup, Janet Vertesi, Richmond Wong
This workshop will gather researchers and practitioners who study, and/or engage in, opposition to the proliferation of AI technologies. It will do so based on an inclusive conceptualization of what counts as AI, thereby assembling a diverse collection of participants and perspectives. The organizers will especially solicit submissions that respond to a variety of specific themes: resistance in organizational contexts; understandings of community-based collective resistance; research around non-voluntary adoption; considerations around distributions of power in the creation and use of AI; implications for designing technologies to support opposition, and the possibility of resistance indirectly reifying current conceptions of AI. Prospective participants will be invited to submit descriptions of their work either studying or engaging in oppositional practices, as well as a challenge they have faced in doing so. The workshop will involve a series of interactive, hands-on activities to enable participants to share both challenges and strategies. In addition to catalyzing connections among researchers, the workshop will also produce two concrete outputs: a living annotated bibliography of relevant citations across diverse domains, and a practical guide with context-sensitive tactics for challenging the perceived inevitability of AI.

ACM Conference on Computer and Communications Security
Taipei, Taiwan | Oct 13–17, 2025
Applied Cryptography
Distance-Aware OT with Application to Fuzzy PSI
Lucas Piske, Jaspal Singh, Ni Trieu, Vladimir Kolesnikov, Vassilis Zikas
May the Force Not be With You: Brute-Force Resistant Biometric Authentication and Key Reconstruction
Alexandra Boldyreva, Deep Inder Mohan, Tianxin Tang
Toss: Garbled PIR from Table-Only Stacking
Lucien K. L. Ng, Vladimir Kolesnikov
Blockchain and Distributed Systems
Lite-PoT: Practical Powers-of-Tau Setup Ceremony
Lucien K. L. Ng, Pedro Moreno-Sanchez, Mohsen Minaei, Panagiotis Chatzigiannis, Adithya Bhat, Duc Le
Hardware, Side Channels, and Cyber Physical Systems
MOLE: Breaking GPU TEE with GPU-Embedded MCU
Hongyi Lu, Yunjie Deng, Sukarno Mertoguno, Shuai Wang, Fengwei Zhang
One Video to Steal Them All: 3D-Printing IP Theft through Optical Side-Channels
Twisha Chattopadhyay, Fabricio Ceschin, Marco Garza, Dymytriy Zyunkin, Animesh Chhotaray, Aaron Stebner, Saman Zonouz, Raheem Beyah
WireTap: Breaking Server SGX via DRAM Bus Interposition
Alex Seto, Oytun Kuday Duran, Samy Amer, Jalen Chuang, Stephan van Schaik, Daniel Genkin, Christina Garman
Machine Learning and Security
VillainNet: Targeted Poisoning Attacks Against SuperNets Along the Accuracy-Latency Pareto Frontier
David Oygenblik, Abhinav Vemulapalli, Animesh Agrawal, Debopam Sanyal, Alexey Tumanov, Brendan Saltaformaggio
Privacy and Anonymity
Fingerprinting SDKs for Mobile Apps and Where to Find Them: Understanding the Market for Device Fingerprinting
Michael Specter, Abbie Farr, Bo Ma, Robin Lassonde, Mihai Christodorescu
Security Usability and Measurement
A Sea of Cyber Threats: Maritime Cybersecurity from the Perspective of Mariners
Anna Raymaker, Akshaya Kumar, Miuyin Yong Wong, Ryan Pickren, Animesh Chhotaray, Frank Li, Saman Zonouz, Raheem Beyah
The Challenges and Opportunities with Cybersecurity Regulations: A Case Study of the US Electric Power Sector
Sena Sahin, Burak Sahin, Robin Berthier, Kate Davis, Saman Zonouz, Frank Li
Web Security
Enhanced Web Application Security Through Proactive Dead Drop Resolver Remediation
Jonathan Fuller, Mingxuan Yao, Saumya Agarwal, Srimanta Barua, Taleb Hirani, Amit Kumar Sikder, Brendan Saltaformaggio
Head(er)s Up! Detecting Security Header Inconsistencies in Browsers
Jannis Rautenstrauch, Trung Tin Nguyen, Karthik Ramakrishnan, Ben Stock
Lock the Door But Keep the Window Open: Extracting App-Protected Accessibility Information from Browser-Rendered Websites
Haichuan Xu, Runze Zhang, Mingxuan Yao, David Oygenblik, Yizhi Huang, Jeman Park, Brendan Saltaformaggio

ACM Symposium on User Interface Software and Technology
Busan, Korea | Sep 28–Oct 1, 2025
Best Paper
DissolvPCB: Fully Recyclable 3D-Printed Electronics with Liquid Metal Conductors and PVA Substrates
Zeyu Yan, Su Hwan Hong, Josiah Hester, Tingyu Cheng, Huaishu Peng
We introduce DissolvPCB, an electronic prototyping technique for fabricating fully recyclable printed circuit board assemblies (PCBAs) using affordable FDM 3D printing, with polyvinyl alcohol (PVA) as a water-soluble substrate and eutectic gallium-indium (EGaIn) as the conductive material. When obsolete, the PCBA can be easily recycled by immersing it in water: the PVA dissolves, the EGaIn re-forms into a liquid metal bead, and the electronic components are recovered. These materials can then be reused to fabricate a new PCBA. We present the DissolvPCB workflow, characterize its design parameters, evaluate the performance of circuits produced with it, and quantify its environmental impact through a lifecycle assessment (LCA) comparing it to conventional CNC-milled FR-4 boards. We further develop a software plugin that automatically converts PCB design files into 3D-printable circuit substrate models. To demonstrate the capabilities of DissolvPCB, we fabricate and recycle three functional prototypes: a Bluetooth speaker featuring a double-sided PCB, a finger fidget toy with a 3D circuit topology, and a shape-changing gripper enabled by Joule-heat-driven 4D printing. The paper concludes with a discussion of current technical limitations and opportunities for future directions.
Papers
Yue Lyu, Xizi Wang, Hanlu Ma, Yalong Yang, Jian Zhao
Effective communication between pilots and air traffic control (ATC) is essential for aviation safety, but verbal exchanges over radios are prone to miscommunication, especially under high workload conditions. While cockpit-embedded visual aids offer the potential to enhance ATC communication, little is known about how to design and integrate such aids. We present an exploratory, user-centered investigation into the design and integration of icon-based visual aids, named ATCion, to support in-cockpit ATC communication, through four phases involving 22 pilots and 1 ATC controller. This study contributes a validated set of design principles and visual icon components for ATC messages. In a comparative study of ATCion, text-based visual aids, and no visual aids, we found that our design improved readback accuracy and reduced memory workload, without negatively impacting flight operations; most participants preferred ATCion over text-based aids, citing their clarity, low cognitive cost, and fast interpretability. Further, we point to implications and opportunities for integrating icon-based aids into future multimodal ATC communication systems to improve both safety and efficiency.
BIOGEM: A Fully Biodegradable Gelatin-Based McKibben Actuator with Embedded Sensing
Gaolin Ge, Haoran Lu, Yingting Gao, Qifeng Yang, Josiah Hester, Tingyu Cheng, Yiyue Luo
We present BIOGEM, a fully biodegradable McKibben actuator with integrated sensing, made from gelatin-based composites. By tailoring the material compositions, we customize the mechanical and electrical properties of the biodegradable composites, creating an integrated biodegradable system that combines both actuation and sensing functionalities. BIOGEM integrates a McKibben actuating structure by using stiff gelatin as the outer braiding and stretchable gelatin as the air chambers. It also integrates resistive strain sensing through ionic gelatin, allowing the actuator to monitor its own deformation without relying on conventional electronics. We characterize the actuator’s performance across key parameters including braid angle, wall thickness, and material stiffness, demonstrating reliable contraction and repeatable force output at low pressures. Biodegradation is validated through both enzyme-assisted and backyard soil studies, confirming the material’s sustainable end-of-life behavior under realistic conditions. We illustrate the potential of this platform through interactive, edible, and environmentally-degradable prototypes across human–computer interaction and soft robotics scenarios.
CoSight: Exploring Viewer Contributions to Online Video Accessibility Through Descriptive Commenting
Ruolin Wang, Xingyu Bruce Liu, Biao Wang, Wayne Zhang, Ziqian Liao, Ziwen Li, Amy Pavel, Xiang Chen
The rapid growth of online video content has outpaced efforts to make visual information accessible to blind and low vision (BLV) audiences. While professional Audio Description (AD) remains the gold standard, it is costly and difficult to scale across the vast volume of online media. In this work, we explore a complementary approach to broaden participation in video accessibility: engaging everyday video viewers at their watching and commenting time. We introduce CoSight, a Chrome extension that augments YouTube with lightweight, in-situ nudges to support descriptive commenting. Drawing from Fogg’s Behavior Model, CoSight provides visual indicators of accessibility gaps, pop-up hints for what to describe, reminders to clarify vague comments, and related captions and comments as references. In an exploratory study with 48 sighted users, CoSight helped integrate accessibility contribution into natural viewing and commenting practices, resulting in 89% of comments including grounded visual descriptions. Follow-up interviews with four BLV viewers and four professional AD writers suggest that while such comments do not match the rigor of professional AD, they can offer complementary value by conveying visual context and emotional nuance for understanding the videos.
DropPop: Designing Drop-to-Deploy Mechanisms with Bistable Scissors Structures
Yibo Fu, Emily Guan, Jianzhe Gu, Dinesh K Patel, Justin U Soza Soto, Yichi Luo, Carmel Majidi, Josiah Hester, Lining Yao
Deployable structures often rely on complex deployment mechanisms such as external pneumatic pumps, electric motors, or manual assembly. These conventional methods, which are intended for applications in shape morphing architectures, robotics, and product design, can be bulky and unwieldy for everyday interaction and daily use. We introduce a new class of deployable structures that harness the locomotion of a single bistable cap to drive the expansion of a scissor-like mechanism. Such structures can be rapidly deployed (0.2-0.7s) upon a small trigger, and stabilize themselves, requiring no sustained energy input. We explore various input modalities for deployment, such as hand dropping and drone deployment, and showcase demo applications. Additionally, we provide a computational design tool for customizing shape primitives with physics simulation and offer design guidelines for fabrication.
ForcePinch: Force-Responsive Spatial Interaction for Tracking Speed Control in XR
Chenyang Zhang, Tiffany S Ma, John Andrews, Eric J Gonzalez, Mar Gonzalez-Franco, Yalong Yang
Spatial interaction in 3D environments requires balancing efficiency and precision, which calls for dynamic tracking speed adjustments. However, existing techniques often couple tracking speed adjustments directly with hand movements, reducing interaction flexibility. Inspired by the natural friction control inherent in the physical world, we introduce ForcePinch, a novel force-responsive spatial interaction method that enables users to intuitively modulate pointer tracking speed and smoothly transition between rapid and precise movements by varying their pinching force. To implement this concept, we developed a hardware prototype integrating a pressure sensor with a customizable mapping function that translates pinching force into tracking speed adjustments. We conducted a user study with 20 participants performing well-established 1D, 2D, and 3D object manipulation tasks, comparing ForcePinch against the distance-responsive technique Go-Go and speed-responsive technique PRISM. Results highlight distinctive characteristics of the force-responsive approach across different interaction contexts. Drawing on these findings, we highlight the contextual meaning and versatility of force-responsive interactions through four illustrative examples, aiming to inform and inspire future spatial interaction design.
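The core of this approach is the mapping from pinching force to pointer tracking speed. As a rough illustration of how such a mapping could be wired up (the function shape, gain range, and exponent below are illustrative assumptions, not the authors’ prototype), a monotonically decreasing force-to-gain curve might look like this:

```python
# Illustrative sketch (not the authors' implementation): map a normalized
# pinch-force reading to a control-display gain, so that light pinches give
# fast, coarse tracking and hard pinches give slow, precise tracking.

def force_to_gain(force: float,
                  min_gain: float = 0.1,    # assumed gain for precise mode
                  max_gain: float = 2.0,    # assumed gain for rapid mode
                  exponent: float = 2.0) -> float:
    """Monotonically decreasing mapping from pinch force in [0, 1] to gain."""
    f = min(max(force, 0.0), 1.0)           # clamp the sensor reading
    # Higher force -> lower gain (slower, more precise pointer motion).
    return min_gain + (max_gain - min_gain) * (1.0 - f) ** exponent

def update_pointer(pointer_pos, hand_delta, force):
    """Scale the raw hand displacement by the force-dependent gain."""
    gain = force_to_gain(force)
    return [p + gain * d for p, d in zip(pointer_pos, hand_delta)]

if __name__ == "__main__":
    pos = [0.0, 0.0, 0.0]
    pos = update_pointer(pos, hand_delta=[0.02, 0.0, 0.01], force=0.8)
    print(pos)  # hard pinch -> small, precise displacement
```

Harder pinches push the gain toward the precise end of the range, mirroring the friction metaphor described in the abstract.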
Adam J Coscia, Shunan Guo, Eunyee Koh, Alex Endert
As multi-turn dialogues with large language models (LLMs) grow longer and more complex, how can users better evaluate and review progress on their conversational goals? We present OnGoal, an LLM chat interface that helps users better manage goal progress. OnGoal provides real-time feedback on goal alignment through LLM-assisted evaluation, explanations for evaluation results with examples, and overviews of goal progression over time, enabling users to navigate complex dialogues more effectively. Through a study with 20 participants on a writing task, we evaluate OnGoal against a baseline chat interface without goal tracking. Using OnGoal, participants spent less time and effort to achieve their goals while exploring new prompting strategies to overcome miscommunication, suggesting tracking and visualizing goals can enhance engagement and resilience in LLM dialogues. Our findings inspired design implications for future LLM chat interfaces that improve goal communication, reduce cognitive load, enhance interactivity, and enable feedback to improve LLM performance.
Posters
MILO: An LLM Multi-Stage Conversational Agent for Fostering Teenagers’ Mental Resilience
Han Bao, Yongan Yu, Bohan Wang, Xiaowen Lu, Xin Tong
Adolescence is a significant period that shapes long-term development and well-being. Mental disorders contribute to 15% of the global disease burden among teenagers, according to the WHO. Adverse well-being during adolescence can not only compromise physical health but also lead to a wide range of negative social outcomes throughout life. Motivated by the potential of generative AI conversational agents to provide scalable and personalized support to cultivate mental resilience, we designed Milo, an LLM digital companion grounded in cognitive behavioral therapy (CBT), tailored specifically for teenagers. Milo promotes greater involvement of teenagers in the development of emotional awareness and resilience strategies through agent customization and offering an interactive interface.
Noetic Dream: A Personalized VR and Meditation System for Lucid Dream Training
Yichen Yu, Qiaoran Wang
Lucid dreaming relies on a high level of metacognition and requires significant time and effort to master induction techniques, presenting obstacles for those seeking such experiences. This study proposes Noetic Dream, a personalized lucid dreaming training system that combines virtual reality (VR) with open-monitoring (OM) meditation, acting on the mechanism of “dream awareness” through both external and internal pathways. VR provides immersive dream-based games to help users practice identifying unrealistic states, while OM meditation stabilizes internal focus and implants lucid intent. The training cycle uses multimodal cues to help users establish dream recognition mechanisms, thereby increasing the likelihood of lucid dreaming. The contributions of this study include: applying generative language models (LLMs) to construct dream VR scenarios, designing dream anomaly detection game mechanisms to stimulate dream awareness, and integrating OM meditation to achieve a non-invasive lucid dreaming training pathway, thereby effectively increasing the probability of spontaneous lucid dreaming.

Telecommunications Policy Research Conference
Washington, D.C. | Sept. 18–20, 2025
Data Governance
This paper focuses on cross-border data flow regulations regarding Connected Vehicles (CVs) in the People’s Republic of China (PRC), the European Union (EU), and the United States of America (USA). The paper reviews the engineering-cybersecurity literature regarding CVs and derives from this a classification of data types generated by the CV ecosystem. It then analyzes the legal and policy texts regarding CVs from the three jurisdictions. By mapping the data types to each jurisdiction’s restrictions and regulations, the paper unpacks how they conceptualize the risks or threats from CV data and how they operationalize these concerns into CV data regulation. The paper’s objective is to provide a detailed examination of the similarities and differences among the three jurisdictions. We discover that governments’ attempts to regulate data flows push them into classification systems for information, and that governments attach different values or policy interests to these categories.
Platforms and Competition
Interconnection and Rivalry in Global Monetary Networks
Karim Farhat, Milton L. Mueller, Vagisha Srivastava
In this white paper, we apply concepts of network competition to analyze the contest for dominance between the US dollar, a BRICS alliance against the dollar, and a politically neutral money like Bitcoin.
Global money networks have network externalities; a currency becomes more valuable as more users in more countries accept it and use it. Users thus tend to converge on a single, dominant network for payments that maximizes their demand-side economies of scope. Drawing on empirical evidence from telecommunications competition and network externality theory, we show that when three systems with network externalities compete, an interconnection agreement between the dominant system and one of the two competitors can isolate and exclude the third system. We analyze the governance of dollar stablecoins as the monetary equivalent of an interconnection agreement between the fiat dollar and Bitcoin. We argue that the fiat dollar can strengthen its global dominance by fostering a stronger interconnection with Bitcoin via dollar stablecoins.
Dollar stablecoins are the optimal conversion asset between a liquid medium of exchange like the dollar and a less liquid store of value like Bitcoin. With a formal interconnection between dollar stablecoins and Bitcoin, demand-side economies of scope are shared, and strong complementarities become evident. Stablecoins serve as a medium of remittance and short-term savings while Bitcoin serves as a longer-term store of value or speculative asset, as with gold. At the same time, an interconnection agreement acts as an implicit check, imposing fiscal discipline on US dollar governance. If the dollar weakens excessively, a positive feedback loop ensues in Bitcoin where the more users diversify to Bitcoin the more its price appreciates and the more users drive value away from the fiat dollar, and so on.
As such, we argue policymakers should proactively foster an interconnection between dollar stablecoins and Bitcoin to strengthen the US dollar’s global dominance and forestall long-term threats to its hegemony. The interconnection agreement should center around:
• Designing a federal regulatory framework for stablecoins centered on open capital markets — without picking favorites.
• Incentivizing stablecoin operators to reduce short-term bonds in favor of longer-term securities and harder assets, enhancing stability and market confidence.
• Encouraging emerging markets and BRICS nations to freely access dollar stablecoins and Bitcoin as reliable stores of value depending on their needs; and
• Eliminating capital gains and tax reporting requirements for long-term Bitcoin saving and long-term Bitcoin to dollar stablecoin conversions to retain capital in the United States and simultaneously encourage more dollar exports for the foreseeable future.
By pursuing these policies, the dollar’s network advantage can be reinforced, ensuring it remains the dominant currency in an increasingly contested global monetary landscape.
Routing Security Adoption
The Role of RIRs in RPKI Adoption
Josephine Wolff, Cecilia Testart
Recognizing the relevance of securing inter-domain routing to protect traffic flows in the Internet, the Internet Engineering Task Force (IETF) standardized the Resource Public Key Infrastructure (RPKI), a framework that provides networks with a system to cryptographically validate routing data. Despite many obstacles, RPKI has emerged as the consensus approach to improving routing security, and currently about 50% of routed IP address blocks are part of the system. The Regional Internet Registries (RIRs) are in charge of allocating address space in five different geographical zones and play a crucial role in RPKI: they are the roots of trust of the cryptographic system and provide the infrastructure to host RPKI certificates and keys for the Internet resources allocated in their region. Organizations and networks wanting to issue RPKI records for their address space need to follow the process of the RIR that delegated their address space. In this paper, we analyze the RIRs’ implementation of RPKI infrastructure from the perspective of network operators. Based on in-depth interviews with 13 network engineers who have been involved in their organizations’ efforts to adopt RPKI, we examine the RIR initiatives that have or would have most supported RPKI adoption for different types of organizations. Given that RIRs have independently developed and implemented the cryptographic infrastructure as well as the tooling to issue and manage certificates, we offer recommendations on strategies that have encouraged RPKI adoption.
Satellite and Space Networks
Are LEO Networks the Future of National Emergency Failover? – A Quantitative Study and Policy Blueprint
Vaibhav Bhosale, Zachary Bischof, Fabián E. Bustamante, Ying Zhang, Sameer Kapoor, Robin Kim, Miguel Schlicht, Muskaan Gupta, Ekaterina Tumanova, Alberto Dainotti, Ahmed Saeed
Low Earth Orbit (LEO) satellite networks are emerging as backups for national-scale outages. While they have demonstrated value in small-scale disasters such as supporting first responders during hurricanes, their effectiveness during large-scale infrastructure failures remains underexplored. This paper evaluates the capacity of LEO networks to act as national failover infrastructure using six real-world submarine cable failures. The failure capacity provided by a LEO network to a specific nation depends on a few key factors: the size of the country, the distribution of the user terminals, and the policies of the network operator for spectrum allocation and traffic engineering. We find that coordinated policies between governments and network operators, especially regarding terminal placement and spectrum use, can improve failover capacity by up to 1.8× without requiring additional infrastructure. However, even under optimistic conditions with 200,000 terminals and a dedicated failover network, LEO networks can only restore 0.9–14.7% of lost submarine cable capacity in most cases.
User-Generated Content
The Impact of Premium Licenses on Creator Behavior
Jae Sang Rhee
The creator economy relies on third-party free-sharing platforms, which enable creators to reach wide audiences and enhance monetization opportunities. However, creators often remain uncompensated, their visibility declines due to content oversaturation, and the unauthorized use of their work poses significant risks as training datasets vital to artificial intelligence (AI) frequently draw from freely accessible creator content. These issues directly harm both creators and platforms. Some platforms introduced a premium license, offering subscription-based exclusive content, upfront creator payments, and enhanced copyright protection. This paper investigates the impact of premium licensing on creator behavior by leveraging a unique natural experiment. Using data from Unsplash and Pexels, we find that introducing premium licenses on free-sharing platforms reduces the volume of freely available content by 13.2%. Notably, this decline is observed even among creators who were not admitted to the premium license program. We further identify two mechanisms driving this decline. First, reduced multi-homing occurs as existing creators deactivate accounts and move away from the platform offering the premium license. Second, creators improve free content quality to stay competitive with premium offerings. Our findings highlight crucial trade-offs associated with premium licensing, demonstrating significant unintended consequences for content volume and quality. These issues directly impact both creators and platforms, underscoring the importance of strategic policy design in platform monetization.

Research Activities
A Common’s Approach to Cybersecurity Policy (Tutorial)
Vaibhav Garg (Comcast), Holly Peterson (Louisiana State University), and Milton Mueller (Georgia Tech)
There are two dominant paradigms in tech policy. The first assumes technology outcomes to be public goods and grounds policy interventions in regulatory responses. The second asserts these outcomes to be private goods and targets policy solutions that address market incentives and their associated dynamics. Yet the interconnected nature of telecommunications technologies, such as the Internet, and the correlated nature of associated risks, such as cybersecurity, mean that there is a third option. This third way assumes technology outcomes to be common pool resources. Untrammeled extraction of these resources may lead to a Tragedy of the Commons. Numerous institutions across distinct domains have avoided this tragedy by investing in community-based governance. Research documenting the commonalities among such institutions led to Elinor Ostrom’s Nobel Prize-winning work (2009) and her Institutional Analysis and Development (IAD) framework.
Despite IAD’s successful application in many risk domains, its formal application to telecommunications policy, especially in cybersecurity, has been underexplored. Yet telecommunications policy stakeholders – especially those working on emerging technologies – often leverage community-based interventions. Applying IAD to such interventions may provide significant insights, making community-based governance both more effective and more efficient. Furthermore, formal application of IAD to telecommunications policy may open opportunities for new policy solutions in cybersecurity. The goal of this tutorial is to introduce TPRC attendees to the IAD framework and teach its application to cybersecurity.
The Regulatory Challenge of Artificial Intelligence (Panel)
The character of generative AI technologies presents unique challenges to traditional regulatory paradigms. The panel participants have been conducting research in this field and will report briefly on their recent findings to provoke discussion among the panel members and audience.
Topics include: the intersection of intellectual property rights with AI; the framing of AI ethics in terms of its social, economic, and political contexts; the regulatory ramifications of the potential existential risk of AI systems; current regulatory models in the U.S. and Europe; and a view of AI as distributed computing.
Panelists:
Russ Neuman, New York University
Christopher Yoo, University of Pennsylvania
Christos Makridis, Stanford University
Chloé Bakalar, Meta
Milton L. Mueller, Georgia Institute of Technology

USENIX Security Symposium
Seattle | August 13 – 15, 2025
Hardware Security 1: Microarchitectures
FLOP: Breaking the Apple M3 CPU via False Load Output Predictions
Jason Kim, Jalen Chuang, Daniel Genkin, Yuval Yarom
To bridge the ever-increasing gap between the fast execution speed of modern processors and the long latency of memory accesses, CPU vendors continue to introduce newer and more advanced optimizations. While these optimizations improve performance, research has repeatedly demonstrated that they may also have an adverse impact on security. In this work, we identify that recent Apple M- and A-series processors implement a load value predictor (LVP), an optimization that predicts the contents of memory that the processor loads before the contents are actually available. This allows processors to alleviate slowdowns from Read-After-Write dependencies, as instructions can now be executed in parallel rather than sequentially. To evaluate the security impact of Apple’s LVP implementation, we first investigate the implementation, identifying the conditions for prediction. We then show that although the LVP cannot directly predict 64-bit values (e.g., pointers), prediction of smaller-size values can be leveraged to achieve arbitrary memory access. Finally, we demonstrate end-to-end attack exploit chains that build on the LVP to obtain a 64-bit read primitive within the Safari and Chrome browsers.
Hardware Security 3: Side-Channel and Fault Injection Attacks
ECC.fail: Mounting Rowhammer Attacks on DDR4 Servers with ECC Memory
Nureddin Kamadan, Walter Wang, Stephan van Schaik, Christina Garman, Daniel Genkin, Yuval Yarom
Rowhammer is a hardware vulnerability present in nearly all computer memory, allowing attackers to modify bits in memory without directly accessing them. While Rowhammer has been extensively studied on client and even mobile platforms, no successful Rowhammer attack has been demonstrated on server platforms using DDR4 ECC memory. Tackling this challenge, in this paper we demonstrate the first end-to-end Rowhammer technique effective against Intel servers using Hynix DDR4 ECC memory. To that aim, we first characterize the Hynix implementation of Target Row Refresh (TRR) on server parts, demonstrating effective hammering patterns on both FPGA and Intel-based testing platforms with ECC disabled. We then reverse engineer Intel’s ECC implementation on Skylake and Cascade Lake servers. We find that it has a coding distance of four, which often allows triggering incorrect ECC correction with just two bit flips. Combining the two observations, we present an end-to-end Rowhammer attack which can flip bits on Intel servers, without causing crashes. Finally, we demonstrate the effectiveness of our attack by hammering RSA public keys loaded into memory, causing the server to accept messages not signed by the original key.
Privacy 1: Differential Privacy and Audit
General-Purpose f-DP Estimation and Auditing in a Black-Box Setting
Önder Askin, Holger Dette, Martin Dunsche, Tim Kutta, Yun Lu, Yu Wei, Vassilis Zikas
In this paper we propose new methods to statistically assess f-Differential Privacy (f-DP), a recent refinement of differential privacy (DP) that remedies certain weaknesses of standard DP (including tightness under algorithmic composition). A challenge when deploying differentially private mechanisms is that DP is hard to validate, especially in the black-box setting. This has led to numerous empirical methods for auditing standard DP, while f-DP remains less explored. We introduce new black-box methods for f-DP that, unlike existing approaches for this privacy notion, do not require prior knowledge of the investigated algorithm. Our procedure yields a complete estimate of the f-DP trade-off curve, with theoretical guarantees of convergence. Additionally, we propose an efficient auditing method that empirically detects f-DP violations with statistical certainty, merging techniques from non-parametric estimation and optimal classification theory. Through experiments on a range of DP mechanisms, we demonstrate the effectiveness of our estimation and auditing procedures.
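For readers unfamiliar with the privacy notion being estimated, the standard definition of f-DP from the published literature is given in terms of trade-off functions; the sketch below is background, not a result of this paper:

```latex
% Trade-off function between distributions P and Q: the smallest type II error
% achievable by any rejection rule \phi at type I error level \alpha.
T(P, Q)(\alpha) \;=\; \inf_{\phi}\bigl\{\, 1 - \mathbb{E}_{Q}[\phi] \;:\; \mathbb{E}_{P}[\phi] \le \alpha \,\bigr\}

% A mechanism M satisfies f-DP if, for all neighboring datasets D and D',
T\bigl(M(D),\, M(D')\bigr)(\alpha) \;\ge\; f(\alpha) \qquad \text{for all } \alpha \in [0, 1].
```

The black-box estimator described in the abstract targets this trade-off curve directly, rather than a single (epsilon, delta) pair.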
Privacy 2: Consent, Compliance, and Provable Privacy
Evaluating Privacy Policies under Modern Privacy Laws At Scale: An LLM-Based Automated Approach
Qinge Xie, Karthik Ramakrishnan, Frank Li
Website privacy policies detail an online service’s information practices, including how they handle user data and rights. For many sites, these disclosures are now necessitated by a growing set of privacy regulations, such as GDPR and multiple US state laws, offering visibility into privacy practices that are often not publicly observable. Motivated by this visibility, prior work has explored techniques for automated analysis of privacy policies and characterized specific aspects of real-world policies on a larger scale. However, existing approaches are constrained in the privacy practices they evaluate, as they rely upon rule-based methods or supervised classifiers, and many predate the prominent privacy laws now enacted that drastically shape privacy disclosures. Thus, we lack a comprehensive understanding of modern website privacy practices disclosed through privacy policies. In this work, we seek to close this gap by providing a systematic and comprehensive evaluation of website privacy policies at scale. We first systematize the privacy practices discussed by 10 notable privacy regulations currently in effect in the European Union and the US, identifying 34 distinct clauses on privacy practices across 4 overarching themes. We then develop and evaluate an LLM-based approach for assessing these clauses in privacy policies, providing a more accurate, comprehensive, and flexible analysis compared to prior techniques. Finally, we collect privacy policies from over 100K websites, and apply our LLM method to a subset of sites to investigate in-depth the privacy practices of websites today. Ultimately, our work supports broader investigations into web privacy practices moving forward.
Software Security 3: Fuzzing
Hybrid Language Processor Fuzzing via LLM-Based Constraint Solving
Yupeng Yang, Shenglong Yao, Jizhou Chen, Wenke Lee
Language processors, such as compilers and interpreters, play a crucial role in modern cyberspace. Faulty language processors can lead to severe consequences such as incorrect functionalities or malicious attacks. It is non-trivial to automatically test language processors to detect faulty behaviors, because language processors are multistaged and require various complex constraints to reach deep program states. Existing testing (fuzzing) approaches either fail to effectively generate inputs that satisfy the complex constraints or fail to generalize due to their heavy reliance on target-specific constraint modeling heuristics. In this paper, we explore the potential of using LLMs for constraint solving to address these limitations and identify two challenges regarding constraint prioritization and context construction. To effectively address these challenges, we propose two novel solutions, hybrid centrality prioritization and iterative context construction. We implement the solutions in a hybrid fuzzing framework, HLPFuzz, which leverages an LLM to overcome complex constraints and reach deep program states. In our evaluation, HLPFuzz successfully discovers 52 bugs in 9 popular language processors, of which 37 are confirmed and 14 are fixed. HLPFuzz also outperforms state-of-the-art solutions by up to 190% in code coverage and discovers 5x more bugs than the second-best fuzzer, with minimal reliance on target-specific heuristics.
Waltzz: WebAssembly Runtime Fuzzing with Stack-Invariant Transformation
Lingming Zhang, Binbin Zhao, Jiacheng Xu, Peiyu Liu, Qinge Xie, Yuan Tian, Jianhai Chen, Shouling Ji
WebAssembly (Wasm) is a binary instruction format proposed by major browser vendors to achieve near-native performance on the web and other platforms. By design, Wasm modules should be executed in a memory-safe runtime, which acts as a trusted computing base. Therefore, security vulnerabilities inside runtime implementation can have severe impacts and should be identified and mitigated promptly. Fuzzing is a practical and widely adopted technique for uncovering bugs in real-world programs. However, to apply fuzzing effectively to the domain of Wasm runtimes, it is vital to address two primary challenges: (1) Wasm is a stack-based language and runtimes should verify the correctness of stack semantics, which requires fuzzers to meticulously maintain desired stack semantics to reach deeper states. (2) Wasm acts as a compilation target and includes hundreds of instructions, making it hard for fuzzers to explore different combinations of instructions and cover the input space effectively. To address these challenges, we design and implement Waltzz, a practical greybox fuzzing framework tailored for Wasm runtimes. Specifically, Waltzz proposes the concept of stack-invariant code transformation to preserve appropriate stack semantics during fuzzing. Next, Waltzz introduces a versatile suite of mutators designed to systematically traverse diverse combinations of instructions in terms of both control and data flow. Moreover, Waltzz designs a skeleton-based generation algorithm to produce code snippets that are rarely seen in the seed corpus. To demonstrate the efficacy of Waltzz, we evaluate it on seven well-known Wasm runtimes. Compared to the state-of-the-art works, Waltzz can surpass the nearest competitor by finding 12.4% more code coverage even within the large code bases and uncovering 1.38x more unique bugs. Overall, Waltzz has discovered 20 new bugs which have all been confirmed and 17 CVE IDs have been assigned.

ACM Conference on International Computing Education Research
Charlottesville | August 3 – 6, 2025
Doctoral Consortium
Ethical Computing Education in the Age of Generative AI
Grace Barkhuff
Educating computing students in ethical practices is vitally important. This education is complicated by the rapid rise of generative AI (GenAI) and its use in higher education by students and instructors alike. My research aims to understand computing educators’ perceptions on ethically educating computing students, both about and with GenAI.
Lightning Talks and Posters
Benchmarking of Generative AI Tools in Software Engineering Education: Formative Insights for Curriculum Integration
Nimisha Roy, Oleksandr Horielko, Fisayo Omojokun
Exploring Community Perceptions and Experiences Towards Academic Dishonesty in Computing Education
Chandler C. Payne, Kai A. Hackney, Lucas Guarenti Zangari, Emmanuel Munoz, Sterling R. Kalogeras, Juan Sebastián Sánchez-Gómez, Fisayo Omojokun, Pedro Guillermo Feijóo-García
Should I Submit or Should I Not? Exploring the Effects of Mandatory vs. Voluntary Tasks on Student Engagement in Computing Education
Lucas Guarenti Zangari, Emilio Aponte-Archila, Pedro Guillermo Feijóo-García
What Computing Faculty Want: Designing AI Tools for High-Enrollment Courses Beyond CS1
Rodrigo Borela, Meryem Yilmaz Soylu, Jeonghyun Lee, Nimisha Roy

International Conference on Machine Learning
Vancouver | July 13 – 19, 2025
Algorithms
Learning to Stop: Deep Learning for Mean Field Optimal Stopping
Lorenzo Magnino, Yuchen Zhu, Mathieu Lauriere
Optimal stopping is a fundamental problem in optimization with applications in risk management, finance, robotics, and machine learning. We extend the standard framework to a multi-agent setting, named multi-agent optimal stopping (MAOS), where agents cooperate to make optimal stopping decisions in a finite-space, discrete-time environment. Since solving MAOS becomes computationally prohibitive as the number of agents grows large, we study the mean-field optimal stopping (MFOS) problem, obtained as the number of agents tends to infinity. We establish that MFOS provides a good approximation to MAOS and prove a dynamic programming principle (DPP) based on mean-field control theory. We then propose two deep learning approaches: one that learns optimal stopping decisions by simulating full trajectories and another that leverages the DPP to compute the value function and to learn the optimal stopping rule using backward induction. Both methods train neural networks to approximate optimal stopping policies. We demonstrate the effectiveness and the scalability of our approach through numerical experiments on 6 different problems in spatial dimension up to 300. To the best of our knowledge, this is the first work to formalize and computationally solve MFOS in discrete time and finite space, opening new directions for scalable MAOS methods.
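For orientation, the dynamic programming principle for a classical finite-horizon, single-agent optimal stopping problem takes the backward-induction form below; the mean-field version studied in the paper additionally conditions on the population distribution, which is omitted in this sketch:

```latex
% Finite-horizon optimal stopping for a single agent with state process X_t,
% stopping reward g, and terminal time T:
V_T(x) = g(x), \qquad
V_t(x) = \max\Bigl\{\, g(x),\; \mathbb{E}\bigl[\,V_{t+1}(X_{t+1}) \mid X_t = x\,\bigr] \Bigr\},
\quad t = T-1, \dots, 0,
% and it is optimal to stop at time t whenever
g(x) \;\ge\; \mathbb{E}\bigl[\,V_{t+1}(X_{t+1}) \mid X_t = x\,\bigr].
```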
Mustafa Burak Gurbuz, Xingyu Zheng, Constantine Dovrolis
As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets. The code is available at https://github.com/BurakGurbuz97/PEAKS.
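A minimal sketch of the kind of scoring rule the abstract describes, combining a sample's prediction error with its kernel similarity to already-selected examples, is shown below. The kernel choice, the multiplicative combination, and all constants are assumptions made for illustration; the released implementation is in the linked repository.

```python
# Illustrative sketch of a prediction-error x kernel-similarity selection
# score for a streaming sample; not the released PEAKS implementation.
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Similarity between two feature vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def selection_score(feat, pred_probs, label, selected_feats, gamma=1.0):
    """
    Combine the sample's prediction error with its kernel similarity to the
    examples already kept, so hard *and* representative samples score highly.
    """
    # Prediction error: 1 - probability assigned to the true class.
    error = 1.0 - pred_probs[label]
    if not selected_feats:
        return error
    # Anchor the error by similarity to what the buffer already contains.
    sims = [rbf_kernel(feat, s, gamma) for s in selected_feats]
    return error * float(np.mean(sims))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    buffer = [rng.normal(size=8) for _ in range(5)]   # features kept so far
    x = rng.normal(size=8)                            # new streaming sample
    probs = np.array([0.2, 0.7, 0.1])                 # model's class probabilities
    print(selection_score(x, probs, label=2, selected_feats=buffer))
```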
Unpaired Point Cloud Completion via Unbalanced Optimal Transport
Taekyung Lee, Jaemoo Choi, Jaewoong Choi, Myungjoo Kang
Unpaired point cloud completion is crucial for real-world applications, where ground-truth data for complete point clouds are often unavailable. By learning a completion map from unpaired incomplete and complete point cloud data, this task avoids the reliance on paired datasets. In this paper, we propose the Unbalanced Optimal Transport Map for Unpaired Point Cloud Completion (UOT-UPC) model, which formulates the unpaired completion task as an Unbalanced Optimal Transport (UOT) problem. Our method employs a neural OT model that learns the UOT map using neural networks. Our model is the first attempt to leverage UOT for unpaired point cloud completion, achieving competitive or superior performance on both single-category and multi-category benchmarks. In particular, our approach is especially robust under the class imbalance problem, which is frequently encountered in real-world unpaired point cloud completion scenarios.
Alignment
CollabLLM: From Passive Responders to Active Collaborators
Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce CollabLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning with these rewards, CollabLLM goes beyond responding to user requests and actively uncovers user intent and offers insightful suggestions—a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks such as document creation. CollabLLM significantly outperforms our baselines, with an average of 18.5% higher task performance and 46.3% improved interactivity as rated by LLM judges. Finally, we conduct a large user study with 201 judges, where CollabLLM increases user satisfaction by 17.6% and reduces the time users spend by 10.4%.
Applications
Generalization Principles for Inference over Text-Attributed Graphs with Large Language Models
Haoyu Wang, Shikun Liu, Rongzhe Wei, Pan Li
Large language models (LLMs) have recently been introduced to graph learning, aiming to extend their zero-shot generalization success to tasks where labeled graph data is scarce. Among these applications, inference over text-attributed graphs (TAGs) presents unique challenges: existing methods struggle with LLMs’ limited context length for processing large node neighborhoods and the misalignment between node embeddings and the LLM token space. To address these issues, we establish two key principles for ensuring generalization and derive the framework LLM-BP accordingly: (1) unifying the attribute space with task-adaptive embeddings, where we leverage LLM-based encoders and task-aware prompting to enhance generalization of the text attribute embeddings; (2) developing a generalizable graph information aggregation mechanism, for which we adopt belief propagation with LLM-estimated parameters that adapt across graphs. Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches, achieving 8.10% improvement with task-conditional embeddings and an additional 1.71% gain from adaptive aggregation. The code and task-adaptive embeddings are publicly available.
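For context, the aggregation mechanism builds on the standard sum-product belief propagation update, sketched below; in LLM-BP the node and edge potentials are estimated by the LLM rather than fixed in advance (the notation is the textbook form, not the paper's exact parameterization):

```latex
% Sum-product message from node u to neighbor v over candidate labels y:
m_{u \to v}(y_v) \;\propto\; \sum_{y_u} \phi_u(y_u)\, \psi_{uv}(y_u, y_v)
     \prod_{w \in \mathcal{N}(u) \setminus \{v\}} m_{w \to u}(y_u),

% and the resulting belief at node v after message passing:
b_v(y_v) \;\propto\; \phi_v(y_v) \prod_{u \in \mathcal{N}(v)} m_{u \to v}(y_v).
```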
Chemistry, Physics, and Earth Sciences
LLM-Augmented Chemical Synthesis and Design Decision Programs
Haorui Wang, Jeff Guo, Lingkai Kong, Rampi Ramprasad, Philippe Schwaller, Yuanqi Du, Chao Zhang
Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.
Convex
Geometric Algebra Planes: Convex Implicit Neural Volumes
Irmak Sivgin, Sara Fridovich-Keil, Gordon Wetzstein, Mert Pilanci
Volume parameterizations abound in recent literature, encompassing methods from classic voxel grids to implicit neural representations. While implicit representations offer impressive capacity and improved memory efficiency compared to voxel grids, they traditionally require training through nonconvex optimization, which can be slow and sensitive to initialization and hyperparameters. We introduce GA-Planes, a novel family of implicit neural volume representations inspired by Geometric Algebra that can be trained using convex optimization, addressing the limitations of nonconvex methods. GA-Planes models generalize many existing representations including any combination of features stored in tensor basis elements followed by a neural feature decoder, and can be adapted to convex or nonconvex training as needed for various inverse problems. In the 2D setting, we prove GA-Planes models are equivalent to a low-rank plus low-resolution matrix factorization that outperforms the classic low-rank plus sparse decomposition for fitting a natural image. In 3D, GA-Planes models exhibit competitive expressiveness, model size, and optimizability across tasks such as radiance field reconstruction, 3D segmentation, and video segmentation.
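The 2D equivalence mentioned in the abstract can be read schematically as approximating an image matrix by a low-rank term plus an upsampled low-resolution term; the form below is an editorial gloss for intuition, not the paper's precise statement:

```latex
% Schematic low-rank plus low-resolution approximation of an image matrix M:
M \;\approx\; U V^{\top} \;+\; \mathrm{Up}(L),
\qquad U \in \mathbb{R}^{m \times r},\; V \in \mathbb{R}^{n \times r},\; r \ll \min(m, n),
% where L is a coarse (low-resolution) feature grid and Up(.) denotes upsampling,
% in contrast to the classic robust-PCA-style decomposition M \approx U V^{\top} + S
% with S sparse.
```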
Deep Learning
Can Transformers Reason Logically? A Study in SAT Solving
Leyan Pan, Vijay Ganesh, Jacob Abernethy, Chris Esposo, Wenke Lee
We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the Boolean satisfiability (SAT) problem. First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than programming a transformer to reason, we evaluate empirically whether it can be trained to do so by learning directly from algorithmic traces (“reasoning paths”) from our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training but have limited length generalization, which is consistent with the implications of our theoretical result.
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models
Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, Yingyan (Celine) Lin
Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out of memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token-distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache’s effectiveness in enhancing LLMs’ long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.
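One way to picture the ladder-shaped pattern is that, under a fixed per-layer budget, each layer retains a window of token positions whose offset shifts with depth, so the layers jointly cover a longer span than any one layer could. The sketch below is only an editorial reading of that idea; the actual retention rule, offsets, and compaction step in the released LaCache code differ:

```python
# Editorial sketch of a "ladder"-style retention pattern (not the released
# LaCache code): each layer keeps a fixed-size window of token positions whose
# offset shifts with depth, so the layers jointly span more of the sequence
# than any single layer's budget allows.

def ladder_keep_indices(seq_len: int, num_layers: int, per_layer_budget: int):
    """Return, for each layer, the token positions whose KV pairs are kept."""
    kept = []
    max_start = max(seq_len - per_layer_budget, 0)
    for layer in range(num_layers):
        # Shallow layers keep the oldest tokens; deeper layers slide toward
        # the newest tokens, forming a ladder across the layer dimension.
        start = round(max_start * layer / max(num_layers - 1, 1))
        kept.append(list(range(start, min(start + per_layer_budget, seq_len))))
    return kept

if __name__ == "__main__":
    for i, idx in enumerate(ladder_keep_indices(seq_len=16, num_layers=4,
                                                per_layer_budget=6)):
        print(f"layer {i}: keep positions {idx}")
```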
Deep RL
Deep Reinforcement Learning from Hierarchical Preference Design
Alexander Bukharin, Yixiao Li, Pengcheng He, Tuo Zhao
Reward design is a fundamental, yet challenging aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scale and intricate dependencies of the feedback signals. This paper shows that, by exploiting certain structures, one can ease the reward design process. Specifically, we propose HERON, a hierarchical reward design framework for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning. Both scenarios allow us to design a hierarchical decision tree, induced by the importance ranking of the feedback signals, to compare RL trajectories. With such preference data, we can then train a reward model for policy learning. We apply HERON to several RL applications and find that our framework can not only train high-performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness.
Efficient Online Reinforcement Learning for Diffusion Policy
Haitong Ma, Tianyi Chen, Kai Wang, Na Li, Bo Dai
Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from the target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating the policy gradient through the diffusion process incurs large computational costs and instability, making it expensive and hard to scale. To enable efficient training of diffusion policies in online RL, we generalize conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing learning to optimize value functions. We introduce two tractable reweighted loss functions to solve two commonly used policy optimization problems, policy mirror descent and max-entropy policy, resulting in two practical algorithms named Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC). We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent online RL methods for diffusion policies on most tasks, and that DPMD improves over Soft Actor-Critic by more than 120% on Humanoid and Ant.
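The reweighting idea named in the abstract can be illustrated in a few lines of PyTorch. The sketch below is only a schematic reading of the abstract, not the authors' code: it keeps the usual denoising score matching target and simply scales each sample's loss by a supplied weight. The names `weighted_dsm_loss` and `score_net`, and the choice of weights, are placeholders.

```python
import torch

def weighted_dsm_loss(score_net, actions, noise_levels, weights):
    """Denoising score matching with a per-sample reweighting term.

    A minimal sketch of the idea in the abstract: keep the standard DSM
    target (the score of the Gaussian perturbation kernel) but scale each
    sample's loss by a weight. In the paper the weights are chosen so the
    objective follows value-driven targets (policy mirror descent or
    max-entropy RL); here `weights` is just an arbitrary tensor.
    """
    noise = torch.randn_like(actions)
    noisy_actions = actions + noise_levels[:, None] * noise
    target = -noise / noise_levels[:, None]   # score of the Gaussian kernel
    pred = score_net(noisy_actions, noise_levels)
    per_sample = ((pred - target) ** 2).sum(dim=-1)
    return (weights * per_sample).mean()

# Toy usage with a placeholder score network (2-D actions, hypothetical shapes).
score_net = lambda x, t: torch.zeros_like(x)
actions = torch.randn(8, 2)
noise_levels = torch.full((8,), 0.5)
weights = torch.ones(8)  # e.g. exp(Q-values / temperature) in the RL setting
print(weighted_dsm_loss(score_net, actions, noise_levels, weights))
```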
Foundation Models
Primitive Vision: Improving Diagram Understanding in MLLMs
Shan Zhang, Aotian Chen, Yanpeng Sun, Jindong Gu, Yi-Yu Zheng, Piotr Koniusz, Kai Zou, Anton Hengel, Yuan Xue
Mathematical diagrams have a distinctive structure. Standard feature transforms designed for natural images (e.g., CLIP) fail to process them effectively, limiting their utility in multimodal large language models (MLLMs). Current efforts to improve MLLMs have primarily focused on scaling mathematical visual instruction datasets and strengthening LLM backbones, yet fine-grained visual recognition errors remain unaddressed. Our systematic evaluation on the visual grounding capabilities of state-of-the-art MLLMs highlights that fine-grained visual understanding remains a crucial bottleneck in visual mathematical reasoning (GPT-4o exhibits a 70% grounding error rate, and correcting these errors improves reasoning accuracy by 12%). We thus propose a novel approach featuring a geometrically-grounded vision encoder and a feature router that dynamically selects between hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts aligned with the language model’s reasoning needs. In experiments, PRIMITIVE-Qwen2.5-7B outperforms other 7B models by 12% on MathVerse and is on par with GPT-4V on MathVista. Our findings highlight the need for better fine-grained visual integration in MLLMs. Code is available at github.com/AI4Math-ShanZhang/SVE-Math.
General Machine Learning
On the Power of Learning-Augmented Search Trees
Jingbang Chen, Xinyuan Cao, Alicia Stepin, Li Chen
We study learning-augmented binary search trees (BSTs) via Treaps with carefully designed priorities. The result is a simple search tree in which the depth of each item $x$ is determined by its predicted weight $w_x$. Specifically, each item $x$ is assigned a composite priority of $-\lfloor\log\log(1/w_x)\rfloor + U(0, 1)$ where $U(0, 1)$ is a uniform random variable. By choosing $w_x$ as the relative frequency of $x$, the resulting search trees achieve static optimality. This approach generalizes the recent learning-augmented BSTs [Lin-Luo-Woodruff ICML'22], which only work for Zipfian distributions, by extending them to arbitrary input distributions. Furthermore, we demonstrate that our method can be generalized to a B-Tree data structure using the B-Treap approach [Golovin ICALP’09]. Our search trees are also capable of leveraging localities in the access sequence through online self-reorganization, thereby achieving the working-set property. Additionally, they are robust to prediction errors and support dynamic operations, such as insertions, deletions, and prediction updates. We complement our analysis with an empirical study, demonstrating that our method outperforms prior work and classic data structures.
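The composite priority rule is stated explicitly in the abstract, so it can be sketched directly. The snippet below is illustrative only: the base-2 logarithm, the max-heap treap convention (larger priority sits closer to the root), and the toy weights are assumptions not specified in the abstract.

```python
import math
import random

def composite_priority(w_x: float) -> float:
    """Composite priority from the abstract: -floor(log log(1/w_x)) + U(0, 1).

    w_x is the predicted relative frequency of item x, with 0 < w_x < 1.
    The base of the logarithm is not stated in the abstract; base 2 is an
    assumption here. Under the max-heap treap convention, items with larger
    predicted weight receive larger priorities and therefore smaller depth.
    """
    return -math.floor(math.log2(math.log2(1.0 / w_x))) + random.random()

# Hypothetical predicted access frequencies for three keys.
weights = {"a": 0.5, "b": 0.05, "c": 0.001}
priorities = {key: composite_priority(w) for key, w in weights.items()}
print(priorities)  # more frequent keys get larger priorities on average
```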
Generative Models and Autoencoders
ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
Alec Helbling, Tuna Han Salih Meral, Benjamin Hoover, Pinar Yanardag, Polo Chau
Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized *concept embeddings*, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention maps. ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 15 other zero-shot interpretability methods on the ImageNet-Segmentation dataset. ConceptAttention works for popular image models and even seamlessly generalizes to video generation. Our work contributes the first evidence that the representations of multi-modal DiTs are highly transferable to vision tasks like segmentation.
Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces
Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix Ye, Molei Tao
Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. In contrast, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches rely heavily on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process demands highly accurate encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1\% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512$\times$512 datasets without any guidance mechanisms, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256$\times$256.
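The discriminator parameterization described in the abstract, a likelihood ratio between the learnable model and a frozen reference in the spirit of DPO, suggests a simple logistic training loss. The sketch below is a hedged reading of that idea; the function name, the non-saturating logistic form, and any weighting are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def ddo_style_loss(logp_theta_real, logp_ref_real, logp_theta_fake, logp_ref_fake):
    """Logistic loss for an implicitly parameterized discriminator.

    Sketch of the core idea in the abstract: the discriminator is the log
    likelihood ratio between the learnable model (theta) and a fixed
    reference model, so no separate discriminator network is trained.
    Inputs are per-sample log-likelihoods; "fake" samples are the
    self-generated negatives. Scaling constants in the paper may differ.
    """
    margin_real = logp_theta_real - logp_ref_real   # push up on real data
    margin_fake = logp_theta_fake - logp_ref_fake   # push down on negatives
    return F.softplus(-margin_real).mean() + F.softplus(margin_fake).mean()

# Toy usage with made-up log-likelihoods for a batch of 4 samples.
real_theta = torch.tensor([-1.0, -1.2, -0.8, -1.1])
real_ref   = torch.tensor([-1.3, -1.3, -1.0, -1.2])
fake_theta = torch.tensor([-0.9, -1.0, -1.1, -0.7])
fake_ref   = torch.tensor([-0.8, -1.0, -1.0, -0.6])
print(ddo_style_loss(real_theta, real_ref, fake_theta, fake_ref))
```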
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Philippe Hansen-Estruch, David Yan, Ching-Yao Chuang, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions remain about how auto-encoder design impacts reconstruction and downstream generative performance. This work explores scaling in auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). We find that scaling the auto-encoder bottleneck correlates with reconstruction quality but exhibits a nuanced relationship with generation. Separately, encoder scaling yields no gains, while decoder scaling improves reconstruction with minimal impact on generation. As a result, we determine that scaling the current paradigm of auto-encoders is not effective for improving generation performance. Coupled with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. In videos, ViTok achieves SOTA reconstruction and generation performance on 16-frame 128p UCF-101.
Jaemoo Choi, Jaewoong Choi, Dohyun Kwon
We address the convergence problem in learning the Optimal Transport (OT) map, where the OT Map refers to a map from one distribution to another while minimizing the transport cost. Semi-dual Neural OT, a widely used approach for learning OT Maps with neural networks, often generates spurious solutions that fail to transfer one distribution to another accurately. We identify a sufficient condition under which the max-min solution of Semi-dual Neural OT recovers the true OT Map. Moreover, to address cases when this sufficient condition is not satisfied, we propose a novel method, OTP, which learns both the OT Map and the Optimal Transport Plan, representing the optimal coupling between two distributions. Under sharp assumptions on the distributions, we prove that our model eliminates the spurious solution issue and correctly solves the OT problem. Our experiments show that the OTP model recovers the optimal transport map where existing methods fail and outperforms current OT-based models in image-to-image translation tasks. Notably, the OTP model can learn stochastic transport maps when deterministic OT Maps do not exist, such as one-to-many tasks like colorization.
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400×, end-to-end speedup of up to 3.7× as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme.
Graph Neural Networks
Biswadeep Chakraborty, Harshit Kumar, Saibal Mukhopadhyay
Graph Neural Networks (GNNs) face a critical limitation known as oversmoothing, where increasing network depth leads to homogenized node representations, severely compromising their expressiveness. We present a novel dynamical systems perspective on this challenge, revealing oversmoothing as an emergent property of GNNs’ convergence to low-dimensional attractor states. Based on this insight, we introduce **DYNAMO-GAT**, which combines noise-driven covariance analysis with Anti-Hebbian learning to dynamically prune attention weights, effectively preserving distinct attractor states. We provide theoretical guarantees for DYNAMO-GAT’s effectiveness and demonstrate its superior performance on benchmark datasets, consistently outperforming existing methods while requiring fewer computational resources. This work establishes a fundamental connection between dynamical systems theory and GNN behavior, providing both theoretical insights and practical solutions for deep graph learning.
Health / Medicine
EARTH: Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph
Guancheng Wan, Zewen Liu, Xiaojun Shan, Max Lau, B. Aditya Prakash, Wei Jin
Effective epidemic forecasting is critical for public health strategies and efficient medical resource allocation, especially in the face of rapidly spreading infectious diseases. However, existing deep-learning methods often overlook the dynamic nature of epidemics and fail to account for the specific mechanisms of disease transmission. In response to these challenges, we introduce an innovative end-to-end framework called Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph (EARTH) in this paper. To learn continuous and regional disease transmission patterns, we first propose EANO, which seamlessly integrates the neural ODE approach with the epidemic mechanism, considering the complex spatial spread process during epidemic evolution. Additionally, we introduce GLTG to model global infection trends and leverage these signals to guide local transmission dynamically. To accommodate both the global coherence of epidemic trends and the local nuances of epidemic transmission patterns, we build a cross-attention approach to fuse the most meaningful information for forecasting. Through the smooth synergy of both components, EARTH offers a more robust and flexible approach to understanding and predicting the spread of infectious diseases. Extensive experiments show EARTH’s superior performance in forecasting real-world epidemics compared to state-of-the-art methods. The code is available at https://github.com/GuanchengWan/EARTH.
Kernel methods
Statistical and Computational Guarantees of Kernel Max-Sliced Wasserstein Distances
Jie Wang, March Boedihardjo, Yao Xie
Optimal transport has been very successful for various machine learning tasks; however, it is known to suffer from the curse of dimensionality. Hence, dimensionality reduction is desirable when applied to high-dimensional data with low-dimensional structures. The kernel max-sliced (KMS) Wasserstein distance is developed for this purpose by finding an optimal nonlinear mapping that reduces data into $1$ dimension before computing the Wasserstein distance. However, its theoretical properties have not yet been fully developed. In this paper, we provide sharp finite-sample guarantees under milder technical assumptions compared with state-of-the-art for the KMS $p$-Wasserstein distance between two empirical distributions with $n$ samples for general $p\in[1,\infty)$. Algorithm-wise, we show that computing the KMS $2$-Wasserstein distance is NP-hard, and then we further propose a semidefinite relaxation (SDR) formulation (which can be solved efficiently in polynomial time) and provide a relaxation gap for the obtained solution. We provide numerical examples to demonstrate the good performance of our scheme for high-dimensional two-sample testing.
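To ground the construction, the inner computation of any sliced distance is a one-dimensional Wasserstein distance between projected samples, which for equal-size, equally weighted empirical distributions reduces to comparing sorted values. The sketch below shows only that inner step with a placeholder nonlinear projection; the maximization over a kernel ball and the paper's semidefinite relaxation are not reproduced.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two 1-D empirical distributions with
    the same number of equally weighted samples: sort and compare."""
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

def sliced_candidate(X, Y, project, p=2):
    """One candidate slice: map both samples to 1-D with `project` (a
    nonlinear map in the kernel setting) and take the 1-D Wasserstein
    distance. The KMS distance maximizes this over a unit ball of kernel
    functions, which is the hard part addressed by the paper's SDR."""
    return wasserstein_1d(project(X), project(Y), p)

# Toy usage: two 2-D Gaussian samples and a hypothetical projection.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
project = lambda Z: np.tanh(Z @ np.array([1.0, -0.5]))  # placeholder nonlinear map
print(sliced_candidate(X, Y, project))
```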
Large Language Models
CommVQ: Commutative Vector Quantization for KV Cache Compression
Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
Siqi Guo, Ilgee Hong, Vicente Balmaseda, Changlong Yu, Liang Qiu, Xin Liu, Haoming Jiang, Tuo Zhao, Tianbao Yang
Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs) using supervised datasets of input-output pairs. However, despite being supervised, SFT is inherently limited by its generative training objective. To address its limitations, the existing common strategy is to follow SFT with a separate phase of preference optimization (PO), which relies on either human-labeled preference data or a strong reward model to guide the learning process. In this paper, we address the limitations of SFT by exploring one of the most successful techniques in conventional supervised learning: discriminative learning. We introduce **Discriminative Fine-Tuning (DFT)**, an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data or training strong reward models. Unlike SFT that employs a generative approach and overlooks negative data, DFT adopts a **discriminative paradigm** that increases the probability of positive answers while suppressing potentially negative ones, aiming for **data prediction** instead of token prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT’s effectiveness, achieving performance better than SFT and comparable to, if not better than, SFT followed by PO. The code can be found at https://github.com/Optimization-AI/DFT.
Diving into Self-Evolving Training for Multimodal Reasoning
Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He
Self-evolving training—where models iteratively learn from their own outputs—has emerged as a key approach for complex reasoning tasks, addressing the scarcity of high-quality chain-of-thought data. However, its effectiveness in multimodal reasoning, a domain more intricate than text-only reasoning, remains underexplored, and the understanding of critical factors in this training paradigm remains limited. Furthermore, a central challenge for this training method is performance saturation, which impedes further improvements and scalability. Inspired by reinforcement learning (RL), in this paper, we reframe self-evolving training for multimodal reasoning through the lens of RL, identifying three pivotal factors: $\textit{Training Method}$, $\textit{Reward Model}$, and $\textit{Prompt Variation}$. Through systematic analysis, we establish relatively optimal design principles that significantly enhance multimodal reasoning capabilities. Moreover, delving deeper into training dynamics, we uncover the roots of saturation and propose a new automatic balancing mechanism to mitigate this limitation. Building on these insights, we propose M-STaR (**M**ultimodal **S**elf-evolving **T**r**a**ining for **R**easoning), a framework that achieves consistent performance gains across models of varying sizes and diverse benchmarks. All resources will be made publicly available.
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability—which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability—can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high-level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the *lookup-table mechanism* for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than other baselines, making unlearning more robust to various attacks.
Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding
Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason Lee, Pan Li, Zhangyang “Atlas” Wang
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose con**T**extualized equivari**A**nt **P**osition **E**ncoding (**TAPE**), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.
Scaling Sparse Feature Circuits For Studying In-Context Learning
Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda
Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model’s knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero-shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we scale the sparse feature circuits methodology of Marks et al. (2024) to the Gemma 1 2B model for the more complex task of ICL. Through circuit finding, we discover task-detecting features with corresponding SAE latents that activate earlier in the prompt and detect when a task has been performed. They are causally linked with task-execution features through the attention and MLP sublayers.
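One way to probe the claim that a task vector is well approximated by a sparse sum of SAE latents is a small sparse-regression fit against the SAE decoder directions. The ISTA-style sketch below is illustrative only; the variable names, the L1 penalty, and the synthetic data are assumptions rather than the authors' procedure.

```python
import numpy as np

def sparse_decomposition(v, decoder_dirs, lam=0.05, steps=500, lr=0.05):
    """Approximate a vector as a sparse combination of SAE decoder directions.

    A minimal ISTA (soft-thresholding gradient descent) sketch of the claim
    that task vectors are well approximated by a sparse sum of SAE latents.
    `decoder_dirs` has shape (num_latents, d_model); the paper's exact
    procedure (and any nonnegativity constraint) may differ.
    """
    coeffs = np.zeros(decoder_dirs.shape[0])
    for _ in range(steps):
        residual = coeffs @ decoder_dirs - v
        coeffs = coeffs - lr * (decoder_dirs @ residual)          # gradient step
        coeffs = np.sign(coeffs) * np.maximum(np.abs(coeffs) - lr * lam, 0.0)  # L1 prox
    return coeffs

# Toy usage with random "decoder directions" and a synthetic "task vector".
rng = np.random.default_rng(1)
D = rng.normal(size=(64, 16))
D /= np.linalg.norm(D, axis=1, keepdims=True)
v = 2.0 * D[3] + 1.5 * D[10]              # a vector that truly is a sparse sum
coeffs = sparse_decomposition(v, D)
print(np.argsort(-np.abs(coeffs))[:3])    # latents 3 and 10 should dominate
```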
Monte Carlo and Sampling Methods
Annealing Flow Generative Models Towards Sampling High-Dimensional and Multi-Modal Distributions
Dongze Wu, Yao Xie
Sampling from high-dimensional, multi-modal distributions remains a fundamental challenge across domains such as statistical Bayesian inference and physics-based machine learning. In this paper, we propose Annealing Flow (AF), a method built on Continuous Normalizing Flows (CNFs) for sampling from high-dimensional and multi-modal distributions. AF is trained with a dynamic Optimal Transport (OT) objective incorporating Wasserstein regularization, and guided by annealing procedures, facilitating effective exploration of modes in high-dimensional spaces. Compared to recent NF methods, AF significantly improves training efficiency and stability, with minimal reliance on MC assistance. We demonstrate the superior performance of AF compared to state-of-the-art methods through extensive experiments on various challenging distributions and real-world datasets, particularly in high-dimensional and multi-modal settings. We also highlight AF’s potential for sampling the least favorable distributions.
Neuroscience, Cognitive Science
Jingyang Ke, Feiyang Wu, Jiyi Wang, Jeffrey Markowitz, Anqi Wu
Traditional approaches to studying decision-making in neuroscience focus on simplified behavioral tasks where animals perform repetitive, stereotyped actions to receive explicit rewards. While informative, these methods constrain our understanding of decision-making to short timescale behaviors driven by explicit goals. In natural environments, animals exhibit more complex, long-term behaviors driven by intrinsic motivations that are often unobservable. Recent works in time-varying inverse reinforcement learning (IRL) aim to capture shifting motivations in long-term, freely moving behaviors. However, a crucial challenge remains: animals make decisions based on their history, not just their current state. To address this, we introduce SWIRL (SWitching IRL), a novel framework that extends traditional IRL by incorporating time-varying, history-dependent reward functions. SWIRL models long behavioral sequences as transitions between short-term decision-making processes, each governed by a unique reward function. SWIRL incorporates biologically plausible history dependency to capture how past decisions and environmental contexts shape behavior, offering a more accurate description of animal decision-making. We apply SWIRL to simulated and real-world animal behavior datasets and show that it outperforms models lacking history dependency, both quantitatively and qualitatively. This work presents the first IRL model to incorporate history-dependent policies and rewards to advance our understanding of complex, naturalistic decision-making in animals.
Learning Time-Varying Multi-Region Brain Communications via Scalable Markovian Gaussian Processes
Weihan Li, Yule Wang, Chengrui Li, Anqi Wu
Understanding and constructing brain communications that capture dynamic interactions across multiple regions is fundamental to modern systems neuroscience, yet current methods struggle to find time-varying region-level communications or to scale to large neural datasets with long recording durations. We present a novel framework using Markovian Gaussian Processes to learn brain communications with time-varying temporal delays from multi-region neural recordings, named Adaptive Delay Model (ADM). Our method combines Gaussian Processes with State Space Models and employs parallel scan inference algorithms, enabling efficient scaling to large datasets while identifying concurrent communication patterns that evolve over time. This time-varying approach captures how brain region interactions shift dynamically during cognitive processes. Validated on synthetic and multi-region neural recordings datasets, our approach discovers both the directionality and temporal dynamics of neural communication. This work advances our understanding of distributed neural computation and provides a scalable tool for analyzing dynamic brain networks. Code is available at https://github.com/BRAINML-GT/Adaptive-Delay-Model.
Neural Encoding and Decoding at Scale
Yizi Zhang, Yanchen Wang, Mehdi Azabou, Alexandre Andre, Zixuan Wang, Hanrui Lyu, International Brain Laboratory, Eva Dyer, Liam Paninski, Cole Hurwitz
Recent work has demonstrated that large-scale, multi-animal models are powerful tools for characterizing the relationship between neural activity and behavior. Current large-scale approaches, however, focus exclusively on either predicting neural activity from behavior (encoding) or predicting behavior from neural activity (decoding), limiting their ability to capture the bidirectional relationship between neural activity and behavior. To bridge this gap, we introduce a multimodal, multi-task model that enables simultaneous Neural Encoding and Decoding at Scale (NEDS). Central to our approach is a novel multi-task-masking strategy, which alternates between neural, behavioral, within-modality, and cross-modality masking. We pretrain our method on the International Brain Laboratory (IBL) repeated site dataset, which includes recordings from 83 animals performing the visual decision-making task. In comparison to other large-scale modeling approaches, we demonstrate that NEDS achieves state-of-the-art performance for both encoding and decoding when pretrained on multi-animal data and then fine-tuned on new animals. Surprisingly, NEDS’s learned embeddings exhibit emergent properties: even without explicit training, they are highly predictive of the brain regions in each recording. Altogether, our approach is a step towards a foundation model of the brain that enables seamless translation between neural activity and behavior.
Online
Novelty Detection in Reinforcement Learning with World Models
Geigh Zollicoffer, Kenneth Eaton, Jonathan Balloch, Julia Kim, Wei Zhou, Robert Wright, Mark Riedl
Reinforcement learning (RL) using world models has found significant recent successes. However, when a sudden change to world mechanics or properties occurs, agent performance and reliability can dramatically decline. We refer to such sudden changes in visual properties or state transitions as novelties. Implementing novelty detection within generated world model frameworks is a crucial task for protecting the agent when deployed. In this paper, we propose straightforward bounding approaches to incorporate novelty detection into world model RL agents by utilizing the misalignment of the world model’s hallucinated states and the true observed states as a novelty score. We provide effective approaches to detecting novelties in a distribution of transitions learned by an agent in a world model. Finally, we show the advantage of our work in a novel environment compared to traditional machine learning novelty detection methods as well as currently accepted RL-focused novelty detection algorithms.
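The core quantity in the abstract, a novelty score measuring the misalignment between the world model's hallucinated state and the true observation, is straightforward to sketch. The Euclidean distance, the threshold, and the toy latent states below are placeholder choices, not the paper's specific bounding approach.

```python
import torch

def novelty_score(predicted_state, observed_state):
    """Per-step novelty score as the misalignment between the world model's
    predicted ("hallucinated") latent state and the encoding of the true
    observation. The distance used here is a placeholder choice."""
    return torch.linalg.vector_norm(predicted_state - observed_state, dim=-1)

def is_novel(scores, threshold):
    """Flag a transition as novel when the misalignment exceeds a bound
    calibrated on nominal (pre-novelty) experience."""
    return scores > threshold

# Toy usage with made-up latent states.
pred = torch.randn(5, 32)
obs = pred + 0.01 * torch.randn(5, 32)   # nominal: model tracks reality closely
obs[2] += 3.0                            # inject a sudden change at step 2
scores = novelty_score(pred, obs)
print(is_novel(scores, threshold=1.0))   # step 2 should stand out
```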
Online Learning and Bandits
On Mitigating Affinity Bias through Bandits with Evolving Biased Feedback
Matthew Faw, Constantine Caramanis, Jessica Hoffmann
Unconscious bias has been shown to influence how we assess our peers, with consequences for hiring, promotions and admissions. In this work, we focus on affinity bias, the component of unconscious bias which leads us to prefer people who are similar to us, despite no deliberate intention of favoritism. In a world where the people hired today become part of the hiring committee of tomorrow, we are particularly interested in understanding (and mitigating) how affinity bias affects this feedback loop. This problem has two distinctive features: 1) we only observe the _biased value_ of a candidate, but we want to optimize with respect to their _real value_; and 2) the bias towards a candidate with a specific set of traits depends on the _fraction_ of people in the hiring committee with the same set of traits. We introduce a new bandits variant that exhibits those two features, which we call affinity bandits. Unsurprisingly, classical algorithms such as UCB often fail to identify the best arm in this setting. We prove a new instance-dependent regret lower bound, which is larger than that in the standard bandit setting by a multiplicative function of $K$. Since we treat rewards that are _time-varying_ and _dependent on the policy’s past actions_, deriving this lower bound requires developing proof techniques beyond the standard bandit techniques. Finally, we design an elimination-style algorithm which nearly matches this regret, despite never observing the real rewards.
Online Learning, Active Learning and Bandits
Improved and Oracle-Efficient Online $\ell_1$-Multicalibration
Rohan Ghuge, Vidya Muthukumar, Sahil Singla
We study *online multicalibration*, a framework for ensuring calibrated predictions across multiple groups in adversarial settings, across $T$ rounds. Although online calibration is typically studied in the $\ell_1$ norm, prior approaches to online multicalibration have taken the indirect approach of obtaining rates in other norms (such as $\ell_2$ and $\ell_{\infty}$) and then transferred these guarantees to $\ell_1$ at additional loss. In contrast, we propose a direct method that achieves improved and oracle-efficient rates of $\widetilde{\mathcal{O}}(T^{-1/3})$ and $\widetilde{\mathcal{O}}(T^{-1/4})$ respectively, for online $\ell_1$-multicalibration. Our key insight is a novel reduction of online $\ell_1$-multicalibration to an online learning problem with product-based rewards, which we refer to as *online linear-product optimization* ($\mathtt{OLPO}$). To obtain the improved rate of $\widetilde{\mathcal{O}}(T^{-1/3})$, we introduce a linearization of $\mathtt{OLPO}$ and design a no-regret algorithm for this linearized problem. Although this method guarantees the desired sublinear rate (nearly matching the best rate for online calibration), it is computationally expensive when the group family $\mathcal{H}$ is large or infinite, since it enumerates all possible groups. To address scalability, we propose a second approach to $\mathtt{OLPO}$ that makes only a polynomial number of calls to an offline optimization (*multicalibration evaluation*) oracle, resulting in *oracle-efficient* online $\ell_1$-multicalibration with a corresponding rate of $\widetilde{\mathcal{O}}(T^{-1/4})$. Our framework also extends to certain infinite families of groups (e.g., all linear functions on the context space) by exploiting a $1$-Lipschitz property of the $\ell_1$-multicalibration error with respect to $\mathcal{H}$.
Optimization
Fast Tensor Completion via Approximate Richardson Iteration
Mehrdad Ghadiri, Matthew Fahrbach, Yunbum Kook, Ali Jadbabaie
We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods to solve _highly structured_ linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is often lost in TC regression problems, making direct extensions unclear. This work proposes a novel _lifting_ method for approximately solving TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We analyze the convergence rate of our approximate Richardson iteration-based algorithm, and our empirical study shows that it can be 100x faster than direct methods for CP completion on real-world tensors.
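For readers unfamiliar with the namesake iteration: classic Richardson iteration solves a linear system by repeatedly stepping along the residual, and the paper's contribution is to compute those steps only approximately using fast structured TD regression subroutines. The sketch below shows the exact (non-approximate) iteration on a toy system; the matrix, step size, and iteration count are made up.

```python
import numpy as np

def richardson(A, b, omega=0.1, iters=200, x0=None):
    """Classic Richardson iteration for A x = b:
        x_{k+1} = x_k + omega * (b - A x_k).
    In the paper each residual computation is performed only approximately
    by a structured regression subroutine; here the products are exact."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    for _ in range(iters):
        x = x + omega * (b - A @ x)
    return x

# Toy usage on a well-conditioned symmetric positive definite system.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
A = M @ M.T / 40 + np.eye(20)          # SPD with modest condition number
b = rng.normal(size=20)
x = richardson(A, b, omega=0.3, iters=500)
print(np.linalg.norm(A @ x - b))       # residual should be tiny
```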
Planning
Navigating the Social Welfare Frontier: Portfolios for Multi-objective Reinforcement Learning
Cheol Kim, Jai Moondra, Shresth Verma, Madeleine Pollack, Lingkai Kong, Milind Tambe, Swati Gupta
In many real-world applications of Reinforcement Learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $\alpha$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.
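The family of generalized p-means referenced in the abstract is compact enough to write down directly. The snippet below evaluates Utilitarian (p = 1), Nash (p approaching 0), and Egalitarian (p approaching negative infinity) welfare on made-up stakeholder utilities, assuming positive utilities; it is a definition check, not part of the paper's portfolio algorithms.

```python
import numpy as np

def p_mean_welfare(utilities, p):
    """Generalized p-mean social welfare of a (positive) utility vector.

    p = 1 gives Utilitarian (arithmetic mean) welfare, p -> 0 gives Nash
    (geometric mean) welfare, and p -> -inf gives Egalitarian (min) welfare.
    """
    u = np.asarray(utilities, dtype=float)
    if p == 1:
        return u.mean()
    if p == 0:
        return np.exp(np.log(u).mean())
    if np.isneginf(p):
        return u.min()
    return (np.mean(u ** p)) ** (1.0 / p)

# A single policy's utilities for three stakeholders (made-up numbers).
u = [4.0, 1.0, 9.0]
for p in [1, 0, -1, -np.inf]:
    print(p, p_mean_welfare(u, p))
```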
Privacy
Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning
Rongzhe Wei, Mufei Li, Mohsen Ghassemi, Eleonora Kreacic, Yifan Li, Xiang Yue, Bo Li, Vamsi Potluru, Pan Li, Eli Chien
Large Language Models (LLMs) embed sensitive, human-generated data, prompting the need for unlearning methods. Although certified unlearning offers strong privacy guarantees, its restrictive assumptions make it unsuitable for LLMs, giving rise to various heuristic approaches typically assessed through empirical evaluations. These standard evaluations randomly select data for removal, apply unlearning techniques, and use membership inference attacks (MIAs) to compare unlearned models against models retrained without the removed data. However, to ensure robust privacy protections for every data point, it is essential to account for scenarios in which certain data subsets face elevated risks. Prior research suggests that outliers, particularly including data tied to minority groups, often exhibit higher memorization propensity which indicates they may be more difficult to unlearn. Building on these insights, we introduce a complementary, minority-aware evaluation framework to highlight blind spots in existing frameworks. We substantiate our findings with carefully designed experiments, using canaries with personally identifiable information (PII) to represent these minority subsets and demonstrate that they suffer at least 20\% higher privacy leakage across various unlearning methods, MIAs, datasets, and LLM scales. Our proposed minority-aware evaluation framework marks an essential step toward more equitable and comprehensive assessments of LLM unlearning efficacy.
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli
The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces the Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. This work represents a significant step forward in protecting intellectual property and ensuring the authenticity of audio content in the era of generative AI.
Reinforcement Learning and Planning
Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games
Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi
Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration by biasing the empirical estimate of the model parameters towards those with higher collective best-response values of all the players when fixing the other players’ policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves near-optimal regret for finding both the NEs of two-player zero-sum Markov games and the CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, nearly matching counterparts that rely on sophisticated uncertainty quantification.
Robotics
Letian Chen, Nina Moorman, Matthew Gombolay
Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot represent the problem properly with only text-based descriptions. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to align robot behavior with user intentions better. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and match the demonstrated behaviors optimally. ELEMENTAL also introduces an iterative feedback-loop through self-reflection to improve feature, reward, and policy learning. Our experiment results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success, and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in LfD.
Robustness
SGD Jittering: A Training Strategy for Robust and Accurate Model-Based Architectures
Peimeng Guan, Mark Davenport
Inverse problems aim to reconstruct unseen data from corrupted or perturbed measurements. While most work focuses on improving reconstruction quality, generalization accuracy and robustness are equally important, especially for safety-critical applications. Model-based architectures (MBAs), such as loop unrolling methods, are considered more interpretable and achieve better reconstructions. Empirical evidence suggests that MBAs are more robust to perturbations than black-box solvers, but the accuracy-robustness tradeoff in MBAs remains underexplored. In this work, we propose a simple yet effective training scheme for MBAs, called SGD jittering, which injects noise iteration-wise during reconstruction. We theoretically demonstrate that SGD jittering not only generalizes better than the standard mean squared error training but is also more robust to average-case attacks. We validate SGD jittering using denoising toy examples, seismic deconvolution, and single-coil MRI reconstruction. Both SGD jittering and its SPGD extension yield cleaner reconstructions for out-of-distribution data and demonstrate enhanced robustness against adversarial attacks.
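The training scheme named in the abstract, injecting noise iteration-wise during reconstruction, can be sketched with a small unrolled gradient-descent solver. The injection point, noise scale, toy linear operator, and the optional `denoiser` hook below are assumptions standing in for the learned model-based architecture, not the authors' implementation.

```python
import torch

def jittered_unrolled_recon(A, y, steps=10, step_size=0.1, sigma=0.01, denoiser=None):
    """Unrolled gradient-descent reconstruction with iteration-wise noise.

    A minimal sketch of "SGD jittering": at every unrolled iteration a small
    Gaussian perturbation is injected into the iterate during training. The
    exact injection point, noise scale, and learned components in the paper
    may differ; `denoiser` stands in for any learned regularization step.
    """
    x = torch.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - y)              # data-fidelity gradient
        x = x - step_size * grad
        if denoiser is not None:
            x = denoiser(x)                   # learned proximal/denoising step
        x = x + sigma * torch.randn_like(x)   # the jittering injection
    return x

# Toy usage on a small linear inverse problem.
torch.manual_seed(0)
A = torch.randn(30, 20) / 6.0
x_true = torch.randn(20)
y = A @ x_true + 0.01 * torch.randn(30)
x_hat = jittered_unrolled_recon(A, y, steps=300, step_size=0.3)
print(torch.linalg.vector_norm(x_hat - x_true))  # bounded error despite per-step noise
```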
Safety
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Joshua Kimball, Ling Liu
Safety aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks — a few harmful data mixed into the fine-tuning dataset can break the LLM’s safety alignment. While several defenses have been proposed, our evaluation shows that existing defenses fail \textit{when some specific training hyper-parameters are chosen} — a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense. To this end, we propose Antidote, a post-fine-tuning stage solution, which remains \textbf{\textit{agnostic to the training hyper-parameters in the fine-tuning stage}}. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters were formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks.
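The one-shot pruning step described in the abstract amounts to zeroing a small fraction of weights flagged as responsible for harmful generations. The sketch below takes the importance scores as given, since the abstract does not specify how they are computed; the pruning ratio and toy tensors are placeholders.

```python
import torch

def one_shot_prune(model_params, importance_scores, prune_ratio=0.02):
    """Zero out the fraction of weights deemed most responsible for harmful
    outputs. A minimal sketch of the post-fine-tuning pruning idea: the real
    system derives `importance_scores` from alignment/harmful data with its
    own criterion; here they are simply taken as given."""
    flat_scores = torch.cat([s.flatten() for s in importance_scores])
    k = int(prune_ratio * flat_scores.numel())
    threshold = torch.topk(flat_scores, k).values.min()
    for p, s in zip(model_params, importance_scores):
        p.data[s >= threshold] = 0.0   # prune (zero) the flagged weights

# Toy usage: two parameter tensors and random importance scores.
params = [torch.randn(4, 4), torch.randn(8)]
scores = [torch.rand_like(p) for p in params]
one_shot_prune(params, scores, prune_ratio=0.1)
print(sum((p == 0).sum().item() for p in params), "weights pruned")
```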
Security
Topological Signatures of Adversaries in Multimodal Alignments
Minh Vu, Geigh Zollicoffer, Huy Mai, Ben Nebgen, Boian S Alexandrov, Manish Bhattarai
Multimodal Machine Learning systems, particularly those aligning text and image data like CLIP/BLIP models, have become increasingly prevalent, yet remain susceptible to adversarial attacks. While substantial research has addressed adversarial robustness in unimodal contexts, defense strategies for multimodal systems are underexplored. This work investigates the topological signatures that arise between image and text embeddings and shows how adversarial attacks disrupt their alignment, introducing distinctive signatures. We specifically leverage persistent homology and introduce two novel Topological-Contrastive losses based on Total Persistence and Multi-scale kernel methods to analyze the topological signatures introduced by adversarial perturbations. We observe a pattern of monotonic changes in the proposed topological losses emerging in a wide range of attacks on image-text alignments, as more adversarial samples are introduced in the data. By designing an algorithm to back-propagate these signatures to input samples, we are able to integrate these signatures into Maximum Mean Discrepancy tests, creating a novel class of tests that leverage topological signatures for better adversarial detection.
Sequential Models, Time series
In-Context Fine-Tuning for Time-Series Foundation Models
Matthew Faw, Rajat Sen, Yichen Zhou, Abhimanyu Das
Motivated by the recent success of time-series foundation models for zero-shot forecasting, we present a methodology for _in-context fine-tuning_ of a time-series foundation model. In particular, we design a pretrained foundation model that can be prompted (at inference time) with multiple time-series examples, in order to forecast a target time-series into the future. Our foundation model is specifically trained to utilize examples from multiple related time-series in its context window (in addition to the history of the target time-series) to help it adapt to the specific distribution of the target domain at inference time. We show that such a foundation model that uses in-context examples at inference time can obtain much better performance on popular forecasting benchmarks compared to supervised deep learning methods, statistical models, and other time series foundation models. Interestingly, our in-context fine-tuning approach even matches the performance of a foundation model that is explicitly fine-tuned on the target domain.
Jiecheng Lu, Shihao Yang
Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention often outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data-generating processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.
WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting
Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang
We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.