Modeling and Active Learning for Experiments with Quantitative-Sequence Factors

Abhyuday Mandal (University of Georgia)

Abstract: A new type of experiment that aims to determine the optimal quantities of a sequence of factors is eliciting considerable attention in medical science, bioengineering, and many other disciplines. Such studies require the simultaneous optimization of both the quantities and the sequence orders of several components, which are called quantitative-sequence (QS) factors. Given the large and semi-discrete solution spaces in such experiments, efficiently identifying optimal or near-optimal solutions using a small number of experimental trials is a nontrivial task. To address this challenge, we propose a novel active learning approach, called QS-learning, to enable effective modeling and efficient optimization for experiments with QS factors. QS-learning consists of three parts: a novel mapping-based additive Gaussian process (MaGP) model, an efficient global optimization scheme (QS-EGO), and a new class of optimal designs (QS-design). The theoretical properties of the proposed method are investigated, and optimization techniques using analytical gradients are developed. The performance of the proposed method is demonstrated via a real drug experiment on lymphoma treatment and several simulation studies.
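To give a flavor of the expected-improvement loop that EGO-style schemes such as QS-EGO build on, here is a minimal sketch for a purely quantitative input, assuming scikit-learn and SciPy; the toy objective, candidate pool, and budget are illustrative, and this is not the MaGP model or the QS-EGO scheme itself.

    # Minimal expected-improvement (EI) Bayesian optimization loop; a toy
    # sketch with quantitative inputs only, NOT the MaGP model or QS-EGO.
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(5 * x[:, 0]) + 0.5 * x[:, 0] ** 2   # toy objective

    X = rng.uniform(0, 2, size=(5, 1))                       # initial design
    y = f(X)
    for _ in range(15):
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
        cand = rng.uniform(0, 2, size=(500, 1))              # candidate pool
        mu, sd = gp.predict(cand, return_std=True)
        best = y.min()
        z = (best - mu) / np.maximum(sd, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)    # expected improvement
        x_new = cand[np.argmax(ei)].reshape(1, -1)           # next trial
        X = np.vstack([X, x_new])
        y = np.append(y, f(x_new))
    print("best x:", X[np.argmin(y)], "best f:", y.min())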


Tree Aggregated Factor Regression Model with application to microbiome data analysis

Aditya Mishra (University of Georgia)

Abstract: Although the human microbiome plays a key role in health and disease, the biological mechanisms underlying the interaction between the microbiome and its host are incompletely understood. Integration with other molecular profiling data offers an opportunity to characterize the role of the microbiome and elucidate therapeutic targets. However, this remains challenging due to the high dimensionality, compositionality, and rare features found in microbiome profiling data. These challenges necessitate methods that can achieve structured sparsity when learning cross-platform association patterns. We propose Tree-Aggregated factor RegressiOn (TARO) for the integration of microbiome and metabolomic data, leveraging information on the phylogenetic tree structure to flexibly aggregate rare features. We demonstrate through simulation studies that TARO accurately recovers a low-rank coefficient matrix and identifies relevant features. We applied TARO to microbiome and metabolomic profiles gathered from subjects being screened for colorectal cancer to understand how gut microorganisms shape intestinal metabolite abundances. The R package TARO implementing the proposed methods is available online at https://github.com/amishra-stats/taro-package.
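As a rough illustration of the tree-guided aggregation idea, the sketch below pools rare leaf-level taxa into their parent clades; the toy tree, counts, and rarity threshold are assumptions, and this is not the TARO estimator itself.

    # Toy tree-guided feature aggregation: rare leaf-level counts are
    # summed into their parent clade. Illustrative only, not TARO.
    import numpy as np

    # hypothetical tree: map each internal node to its leaf columns
    tree = {"cladeA": [0, 1], "cladeB": [2, 3, 4]}
    counts = np.array([[5, 0, 1, 0, 0],
                       [0, 2, 0, 0, 1],
                       [3, 1, 0, 2, 0]])          # samples x taxa

    prevalence = (counts > 0).mean(axis=0)        # fraction of nonzero samples
    rare = prevalence < 0.5                       # assumed rarity threshold

    agg = {}
    for clade, leaves in tree.items():
        idx = [j for j in leaves if rare[j]]
        if idx:                                   # pool rare leaves within a clade
            agg[clade] = counts[:, idx].sum(axis=1)
    print({k: v.tolist() for k, v in agg.items()})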


Classification versus regression in overparameterized regimes: Does the loss function matter?

Vidya Muthukumar (Georgia Institute of Technology)

Abstract: Recent years have seen substantial interest in a first-principles theoretical understanding of the behavior of overparameterized models that interpolate noisy training data, motivated by their surprising empirical success. In this talk, I compare classification and regression tasks in the overparameterized linear model. On the one hand, we show that with sufficient overparameterization, solutions obtained by training on the squared loss (i.e., minimum-norm interpolation), typically used for regression, are identical to those produced by training on exponentially and polynomially tailed losses (e.g., the max-margin support vector machine), typically used for classification. On the other hand, we show that there exist regimes where these solutions are consistent when evaluated by the 0-1 test loss, but inconsistent when evaluated by the mean-squared-error test loss. Our results demonstrate that: a) different loss functions at the training (optimization) phase can yield similar solutions, and b) a significantly higher level of effective overparameterization admits good generalization in classification tasks than in regression tasks.
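A minimal numerical sketch of the first claim, assuming scikit-learn: with heavy overparameterization, the minimum-norm interpolator of ±1 labels and the hard-margin SVM solution point in nearly the same direction, and every training point becomes a support vector. The dimensions and the large-C approximation of the hard margin are illustrative choices.

    # Support-vector proliferation sketch: compare the min-norm
    # interpolator with the (approximately) hard-margin SVM.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    n, d = 30, 3000                               # heavily overparameterized
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)

    theta_ls = np.linalg.pinv(X) @ y              # min-norm interpolator of y
    svm = SVC(kernel="linear", C=1e8).fit(X, y)   # large C ~ hard margin
    theta_svm = svm.coef_.ravel()

    cos = theta_ls @ theta_svm / (np.linalg.norm(theta_ls) * np.linalg.norm(theta_svm))
    print("support vectors:", len(svm.support_), "of", n)
    print("cosine(min-norm, SVM):", round(float(cos), 4))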


Statistics meets optimization: Sharp convergence predictions for iterative algorithms with random data

Ashwin Pananjady (Georgia Institute of Technology)

Abstract: Iterative algorithms are the workhorses of modern signal processing and statistical learning, and are widely used to fit large-scale, complex models to random data. While the choice of an algorithm and its hyperparameters determines both the speed and fidelity of the learning pipeline, it is common for this choice to be made heuristically, either by expensive trial-and-error or by comparing upper bounds on convergence rates of various candidate algorithms. Motivated by these issues, we develop a principled framework that produces sharp, iterate-by-iterate characterizations of solution quality for a wide variety of iterative algorithms on several nonconvex model-fitting problems with random data. Such sharp predictions can provide precise separations between families of algorithms while also revealing some nonstandard convergence phenomena.
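As a toy example of an iterate-by-iterate characterization (far simpler than the nonconvex settings treated in the talk), gradient descent on noiseless least squares has an error recursion that can be predicted exactly in the eigenbasis of the sample covariance; the sketch below compares observed and predicted error norms. All problem sizes are illustrative.

    # Iterate-by-iterate tracking for GD on least squares: the error
    # theta_t - theta* equals (I - eta * X^T X / n)^t (theta_0 - theta*),
    # so per-iterate error norms are predictable in closed form.
    import numpy as np

    rng = np.random.default_rng(2)
    n, d, eta = 200, 20, 0.1
    X = rng.standard_normal((n, d))
    theta_star = rng.standard_normal(d)
    y = X @ theta_star                            # noiseless responses

    S = X.T @ X / n
    theta = np.zeros(d)
    w, V = np.linalg.eigh(S)
    c = V.T @ (theta - theta_star)                # initial error, eigenbasis

    for t in range(1, 11):
        theta -= eta * X.T @ (X @ theta - y) / n  # one GD step
        pred = np.linalg.norm((1 - eta * w) ** t * c)
        print(t, round(np.linalg.norm(theta - theta_star), 6), round(pred, 6))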


Partial Quantile Tensor Regression with Applications to Neuroimaging Data

Limin Peng (Emory)

Abstract: Tensor data, characterized as multidimensional arrays, have become increasingly prevalent in biomedical research. To handle a tensor predictor in the regression setting, most existing methods address its effect on the mean response, thereby failing to capture the practical interest in the predictor’s effect on non-average or unusual outcomes. In this work, we propose a partial quantile tensor regression (PQTR) framework, which applies the core principle of the partial least squares technique in a novel way to achieve effective dimension reduction for quantile regression with a tensor predictor. The proposed PQTR algorithm is computationally efficient and scalable to large tensor predictors. Moreover, we uncover an appealing latent variable model representation for the new PQTR algorithm, justifying a simple population interpretation of the resulting estimator. We further investigate the connection of the PQTR procedure with an envelope quantile tensor regression (EQTR) model, which defines a general set of sparsity conditions tailored to quantile tensor regression. We prove the root-n consistency of the PQTR estimator under the EQTR model and demonstrate its superior finite-sample performance compared to benchmark methods through simulation studies. Finally, we illustrate the practical utility of the proposed method via an application to a neuroimaging study of post-traumatic stress disorder (PTSD).
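A crude two-stage stand-in for the "reduce dimension, then fit a quantile regression" idea, assuming scikit-learn and statsmodels: vectorize the tensor, extract a few PLS scores, and run quantile regression on the scores. The actual PQTR algorithm is tailored to quantile loss and tensor structure; this sketch only conveys the flavor.

    # Two-stage toy: PLS dimension reduction, then quantile regression
    # on the scores. Illustrative stand-in, not the PQTR algorithm.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(3)
    n, p1, p2 = 300, 8, 8
    T = rng.standard_normal((n, p1, p2))          # tensor predictor
    B = np.outer(rng.standard_normal(p1), rng.standard_normal(p2))
    y = np.einsum("ijk,jk->i", T, B) + rng.standard_normal(n)

    Xv = T.reshape(n, -1)                         # vectorized tensor
    scores = PLSRegression(n_components=3).fit(Xv, y).transform(Xv)

    fit = sm.QuantReg(y, sm.add_constant(scores)).fit(q=0.9)
    print(fit.params)                             # 0.9-quantile fit on scores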


Constrained Sampling and Constrained Diffusion Generative Modeling via Mirror Map

Molei Tao (Georgia Institute of Technology)

Abstract: Mirror descent is a popular method that leverages a mirror map to enable optimization over convex constraint sets. I will spend most of my time explaining how to turn mirror descent into an MCMC method that generates samples from constrained (possibly unnormalized) probability distributions. The underlying dynamics, geometry, and optimal transport perspectives will be discussed. Using a general tool for analyzing samplers based on SDE discretizations, which we term mean-square analysis for sampling, we will also establish quantitative error bounds and elucidate how the performance scales with dimension (a critical factor for modern machine learning!). Then, in part II, I will briefly introduce a state-of-the-art constrained generative model based on fusing the mirror map with denoising diffusion. A new application to watermarking will be showcased: it enables anyone holding your private key, and no one else, to identify which images you generated.
Parts I and II are joint work with {Andre Wibisono, Ruilin Li, Santosh Vempala} and {Guan-Horng Liu, Tianrong Chen, Evangelos Theodorou}, respectively.
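For concreteness, here is a minimal Euler-type discretization of mirror Langevin on the positive orthant with the entropic mirror map phi(x) = sum(x log x - x); the product-exponential target and step size are assumptions for illustration, not the exact scheme or analysis from the talk.

    # Mirror-Langevin sketch on the positive orthant, entropic mirror map.
    # Target is an assumed product-exponential density exp(-sum(x)), x > 0.
    # Simple Euler discretization; some discretization bias is expected.
    import numpy as np

    rng = np.random.default_rng(4)
    d, eta, n_steps = 2, 1e-3, 20000
    grad_V = lambda x: np.ones_like(x)            # V(x) = sum(x)

    x = np.ones(d)
    samples = []
    for k in range(n_steps):
        xi = rng.standard_normal(d)
        # dual update: grad phi(x) = log x, (Hess phi)^{1/2} = diag(1/sqrt(x))
        y = np.log(x) - eta * grad_V(x) + np.sqrt(2 * eta) * xi / np.sqrt(x)
        x = np.exp(y)                             # map back: (grad phi)^{-1} = exp
        if k > 5000:                              # discard burn-in
            samples.append(x.copy())
    print("empirical mean:", np.mean(samples, axis=0))  # exponential(1) mean is 1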


An empirical study on imbalanced data impact and treatment

Ke Wang (Wells Fargo)

Abstract: Imbalanced data are widespread in many applications, and data imbalance is often viewed as a severe issue in classification tasks. Our study focuses on two areas of investigation. First, we examine the impact of data imbalance on model performance; our main finding is that the sample size, rather than the imbalance ratio, is the key factor influencing classification performance. Second, we compare various methods for handling imbalanced data, including under-sampling, over-sampling, and weighting.
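The sketch below, assuming scikit-learn, compares a baseline fit against the three treatments named above on synthetic imbalanced data; the sample sizes, imbalance ratio, and classifier choice are illustrative.

    # Compare under-sampling, over-sampling, and class weighting against
    # a baseline logistic regression on synthetic imbalanced data.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=20000, weights=[0.98], random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
    rng = np.random.default_rng(0)
    pos, neg = np.where(ytr == 1)[0], np.where(ytr == 0)[0]

    def fit_eval(idx=None, **kw):
        i = np.arange(len(ytr)) if idx is None else idx
        m = LogisticRegression(max_iter=1000, **kw).fit(Xtr[i], ytr[i])
        return average_precision_score(yte, m.predict_proba(Xte)[:, 1])

    under = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    over = np.concatenate([neg, rng.choice(pos, size=len(neg), replace=True)])
    print("baseline :", fit_eval())
    print("under    :", fit_eval(under))
    print("over     :", fit_eval(over))
    print("weighted :", fit_eval(class_weight="balanced"))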


Analysis of wearable device data using functional data models

Julia Wrobel (Emory)

Abstract: The ability of individuals to engage in physical activity is a critical component of overall health and quality of life. Establishing normative trends of physical activity is essential to developing public health guidelines and informing clinical perspectives regarding individuals’ levels of physical activity. Beyond the overall quantity of physical activity, patterns in the timing of activity provide additional insight into latent health status. Wearable accelerometers, paired with statistical methods from functional data analysis, provide the means to estimate diurnal patterns in physical activity. Using methods we developed for separating amplitude and phase variability in exponential family functional data, we uncover the distinct phenotypes, or chronotypes, that give rise to differences in these patterns, as well as how daily patterns of physical activity change with age.
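As a toy illustration of phase variability, the sketch below simulates two "chronotypes" whose activity curves differ mainly in peak time; real analyses use functional data methods that separate amplitude and phase, while this sketch only reads off each smoothed curve's peak. All simulation settings are assumptions.

    # Two simulated chronotypes whose diurnal activity curves differ in
    # phase (peak time); the phase shift shows up in the smoothed peaks.
    import numpy as np

    rng = np.random.default_rng(5)
    minutes = np.arange(1440)                      # one day at 1-min resolution

    def person(peak_hour):
        base = np.exp(-0.5 * ((minutes / 60 - peak_hour) / 3.0) ** 2)
        return rng.poisson(50 * base)              # noisy activity counts

    early = np.mean([person(10) for _ in range(20)], axis=0)   # morning types
    late = np.mean([person(16) for _ in range(20)], axis=0)    # evening types

    kernel = np.ones(61) / 61                      # 1-hour moving average
    smooth = lambda c: np.convolve(c, kernel, mode="same")
    for name, curve in [("early", early), ("late", late)]:
        print(name, "peak at hour", round(np.argmax(smooth(curve)) / 60, 1))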


Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing

Yanbo Xu (Microsoft)

Abstract: Contrastive pretraining on parallel image-text data has attained great success in vision-language processing (VLP), as exemplified by CLIP and related methods. However, prior work has focused largely on general web domains. Biomedical images and text are rather different, and publicly available datasets are small and skewed toward chest X-rays, severely limiting progress. We conducted by far the largest study on biomedical VLP, using 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. Our dataset (PMC-15M) is two orders of magnitude larger than existing biomedical image-text datasets such as MIMIC-CXR, and spans a diverse range of biomedical images. Because the standard CLIP method is suboptimal for the biomedical domain, we propose BiomedCLIP, with domain-specific adaptations tailored to biomedical VLP. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks ranging from retrieval to classification to visual question answering (VQA). BiomedCLIP establishes a new state of the art on a wide range of standard datasets, substantially outperforming prior VLP approaches. Surprisingly, BiomedCLIP even outperforms radiology-specific state-of-the-art models such as BioViL on radiology-specific tasks such as RSNA pneumonia detection, highlighting the utility of large-scale pretraining across all biomedical image types.
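For reference, the symmetric InfoNCE objective used in CLIP-style contrastive pretraining can be written in a few lines of PyTorch; the embedding dimension and temperature below are placeholders, and this is the generic loss rather than the BiomedCLIP training code.

    # Symmetric InfoNCE loss for CLIP-style contrastive pretraining.
    import torch
    import torch.nn.functional as F

    def clip_loss(img_emb, txt_emb, temperature=0.07):
        img = F.normalize(img_emb, dim=-1)        # unit-norm embeddings
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature      # pairwise similarities
        labels = torch.arange(len(img))           # i-th image matches i-th text
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))

    print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))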


LLMs as Autonomous Agents: Decision-Making through Adaptive Closed-Loop Planning

Chao Zhang (Georgia Institute of Technology)

Abstract: Large Language Models (LLMs) have shown great promise as autonomous agents for sequential decision-making tasks in interactive environments. Existing LLM agents often rely on greedy action-taking or static planning, causing their performance to degrade as problem complexity and planning horizons increase. In this talk, I will introduce our recent work AdaPlanner, which allows an LLM agent to adaptively refine its self-generated plan in response to environmental feedback, using both in-plan and out-of-plan refinement strategies. Furthermore, AdaPlanner features a code-generation interface that reduces hallucination, as well as a skill-discovery mechanism that uses successful plans as few-shot exemplars to improve sample efficiency. AdaPlanner achieves state-of-the-art performance in environments such as ALFWorld and MiniWoB++ and showcases versatility across tasks, environments, and agent capabilities. I will conclude the talk by sharing potential extensions of AdaPlanner and future directions for using LLMs as agents in decision-making tasks.
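A bare-bones skeleton of closed-loop plan refinement is sketched below; the llm() and env.step() calls are hypothetical stand-ins, and AdaPlanner's actual in-plan and out-of-plan refinement strategies are more structured than this.

    # Skeleton of an adaptive closed-loop planning agent. llm() and
    # env.step() are hypothetical stand-ins, not AdaPlanner's interface.
    def llm(prompt: str) -> str:
        raise NotImplementedError("stand-in for a language-model call")

    def run_agent(env, task, max_revisions=3):
        plan = llm(f"Write a step-by-step plan for: {task}")
        for _ in range(max_revisions):
            obs, done = env.reset(), False
            for step in plan.splitlines():        # execute the plan step by step
                obs, done, error = env.step(step)
                if error:                         # environmental feedback: revise
                    plan = llm(f"Plan:\n{plan}\nFailed at: {step}\n"
                               f"Observation: {obs}\nWrite a corrected plan.")
                    break
            if done:
                return plan                       # task solved
        return None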


Novel Empirical Likelihood Inference for the Mean Difference with Right-Censored Data

Yichuan Zhao (Georgia State University)

Abstract: This paper focuses on comparing two means and constructing a confidence interval for the difference of two means with right-censored data, using the empirical likelihood method combined with the i.i.d. random functions representation. Earlier work proposed empirical likelihood-based confidence intervals for the mean difference with right-censored data using the synthetic data approach; however, the resulting empirical log-likelihood ratio statistic has a scaled chi-squared limiting distribution. To avoid estimating the scale parameter when constructing confidence intervals, we propose an empirical likelihood method based on the i.i.d. representation of the Kaplan–Meier weights involved in the empirical likelihood ratio, which yields the standard chi-squared limiting distribution. We also apply the adjusted empirical likelihood to improve coverage accuracy for small samples, and we investigate a new empirical likelihood method, the mean empirical likelihood, within the framework of our study. Extensive simulations show that the proposed empirical likelihood confidence interval has better coverage accuracy than existing methods. Finally, our findings are illustrated with a real data set.
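To show the chi-squared calibration concretely in the classical uncensored case, the sketch below computes Owen's empirical likelihood ratio for a mean and inverts it into a confidence interval, assuming SciPy; the paper's method replaces these terms with Kaplan–Meier-weighted, censoring-adjusted versions.

    # Owen's empirical likelihood for an (uncensored) mean, with the
    # chi-squared calibration used to invert the ratio into an interval.
    import numpy as np
    from scipy.optimize import brentq
    from scipy.stats import chi2

    def el_log_ratio(x, mu):
        z = x - mu                                # requires min(x) < mu < max(x)
        g = lambda lam: np.sum(z / (1 + lam * z))
        eps = 1e-8
        lo = (1 / len(x) - 1) / z.max() + eps     # keep all weights positive
        hi = (1 / len(x) - 1) / z.min() - eps
        lam = brentq(g, lo, hi)                   # Lagrange multiplier
        return 2 * np.sum(np.log(1 + lam * z))    # -2 log R(mu) ~ chi2(1)

    rng = np.random.default_rng(6)
    x = rng.exponential(2.0, size=100)
    cutoff = chi2.ppf(0.95, df=1)
    grid = np.linspace(x.min() + 0.1, x.max() - 0.1, 400)
    ci = [m for m in grid if el_log_ratio(x, m) <= cutoff]
    print("95% EL interval for the mean:", round(min(ci), 3), round(max(ci), 3))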


Steering the Attention of Large Language Models

Tuo Zhao (Georgia Institute of Technology)

Abstract: Large language models (LLMs) like ChatGPT and GPT-4 are transforming our daily lives through their ability to generate human-like text. The superior performance of LLMs stems from large transformer architectures, self-attention mechanisms, pre-training on massive datasets, and fine-tuning with human feedback.

In this talk, I will first provide an overview of LLMs with a focus on self-attention mechanisms, and then share some recent advances on steering the self-attention of LLMs to better align them with a user’s prompts and instructions. Specifically, through simple and efficient post-hoc attention reweighting, we can highlight critical content specified by the user, enhancing the instruction-following capabilities of LLMs. This allows us to exert more control over how LLMs allocate attention, leading to more controllable and beneficial LLMs.
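A toy version of post-hoc attention reweighting, assuming PyTorch: bias the attention logits toward user-highlighted key positions before the softmax, which is equivalent to multiplicatively upweighting those attention weights and renormalizing. The scaling factor and the single-head setup are illustrative assumptions, not the exact method from the talk.

    # Toy post-hoc attention reweighting: upweight highlighted key
    # positions by adding log(alpha) to their logits before softmax.
    import torch
    import torch.nn.functional as F

    def steered_attention(q, k, v, highlight, alpha=5.0):
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        bias = torch.zeros(k.shape[-2])
        bias[highlight] = torch.log(torch.tensor(alpha))   # upweight chosen keys
        weights = F.softmax(scores + bias, dim=-1)         # renormalized attention
        return weights @ v

    q, k, v = (torch.randn(4, 8) for _ in range(3))
    out = steered_attention(q, k, v, highlight=[1, 2])     # emphasize tokens 1-2
    print(out.shape)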