## Transforming E-Commerce Recommender Systems with Data Science

**Speaker:** Khalifeh Al Jadda (The Home Depot)

**Abstract:**

Data science, as an emerging field, has impacted almost every aspect of e-commerce. In this talk we will look into how data science has transformed e-commerce recommender systems and what impact that has had on customers. We will discuss use cases from the recommender system of The Home Depot, which has been on a journey of transforming its recommender system into an AI-based engine.

**Biography:**

Khalifeh AlJadda holds a Ph.D. in computer science from the University of Georgia (UGA), with a specialization in machine learning. He has experience implementing large-scale, distributed machine learning algorithms to solve challenging problems in domains ranging from bioinformatics to search and recommendation engines. He is a Senior Manager of Data Science at The Home Depot, the largest home improvement company in the world, where he leads the recommendation data science team in charge of building the new generation of The Home Depot's recommendation engine. Before joining The Home Depot, he was the lead data scientist at CareerBuilder, where he and his team built a semantic search engine using novel NLP and machine learning approaches. After building and deploying the semantic search engine, Khalifeh led the data science team to build and deploy a hybrid recommendation engine that serves millions of job seekers who rely on CareerBuilder's website to find jobs. Khalifeh is the founder and organizer of the Southern Data Science Conference (https://www.southerndatascience.com), the major data science conference in Atlanta, which aims to promote data science in the Southeast region. He also co-founded the non-profit organization ATLytiCS (www.atlytics.org).

## Collaborative Inference for Causal Effect Estimation and General Missing Data Problems

**Speaker:** David Benkeser (Emory University)

**Abstract:**

Doubly robust estimators are a popular means of estimating causal effects. Such estimators combine an estimate of the conditional mean of the study outcome given treatment and confounders (the so-called outcome regression) with an estimate of the conditional probability of treatment given confounders (the propensity score) to generate an estimate of the effect of interest. This estimate is consistent so long as at least one of these two regressions is consistently estimated. It turns out that doubly robust estimators are often statistically efficient, achieving the lower bound on the variance of regular, asymptotically linear estimators. However, in spite of their asymptotic optimality, in problems where estimands are weakly identified, doubly robust estimators may behave erratically. We propose a new framework for inference in these challenging settings.
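As an illustration of the construction described above, here is a minimal sketch of the standard augmented inverse-probability-weighted (AIPW) doubly robust estimator of a treatment-specific mean. The function name and inputs are hypothetical; the plug-in estimates `mu1_hat` and `pi_hat` would come from any outcome-regression and propensity-score fits.

```python
import numpy as np

def aipw_treatment_mean(y, a, mu1_hat, pi_hat):
    """Doubly robust (AIPW) estimate of E[Y(1)].

    y       -- observed outcomes
    a       -- binary treatment indicators
    mu1_hat -- outcome-regression predictions E[Y | A=1, W] per unit
    pi_hat  -- propensity-score estimates P(A=1 | W) per unit

    Consistent if either mu1_hat or pi_hat is consistently estimated.
    """
    y, a, mu1_hat, pi_hat = map(np.asarray, (y, a, mu1_hat, pi_hat))
    return np.mean(mu1_hat + a / pi_hat * (y - mu1_hat))
```

Consistency holds if either set of plug-in estimates is consistent, which is exactly the double robustness the abstract refers to; the erratic behavior under weak identification arises when `pi_hat` gets close to zero.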

**Biography:**

Machine learning promises to identify patterns and correlations with unprecedented accuracy and speed; however, actionable health research generally requires a higher standard of proof than can be provided by a correlative analysis. In order to enact health policies that are safe and effective for patients, we must be able to confidently state that an observed association is in fact indicative of a causal relationship. My research focuses on understanding whether and how machine learning methodology can be used to draw such causal inferences. My methodology has been applied in the analysis of preventive vaccines to better understand causal mechanisms of protection, in studies of HIV prevention using social media-based mobile phone applications, and in other emerging areas of public health need. I received my PhD in Biostatistics from the University of Washington, completed a post-doctoral fellowship at the University of California, Berkeley, and am currently an Assistant Professor in the Department of Biostatistics and Bioinformatics at Emory University.

## Practice Makes Perfect

**Speaker:** William Brenneman (Procter and Gamble Company)

**Abstract:**

Does practice make perfect, or does perfect make practice? There has always been a connection between academics (perfect) and industry (practice), in that most industrial statisticians go through formal statistical training at the undergraduate or graduate level prior to practicing statistics. In the statistics field, there is a strong symbiotic relationship between academics and industry. I will discuss that relationship, primarily from the perspective of a practicing statistician in industry, through lessons learned on how to foster and grow it and how to bridge the gap between statistical practice and academic research.

**Biography:**

William Brenneman is a Research Fellow and Global Statistics Discipline Leader at Procter & Gamble in the Data and Modeling Sciences Department and an Adjunct Professor of Practice at Georgia Tech in the Stewart School of Industrial and Systems Engineering. Since joining P&G, he has worked on a wide range of projects involving statistics applications in his areas of expertise: design and analysis of experiments, robust parameter design, reliability engineering, statistical process control, computer experiments, machine learning, and general statistical thinking. He was also instrumental in the development of an in-house statistics curriculum. He received a Ph.D. in Statistics from the University of Michigan, an MS in Mathematics from the University of Iowa, and a BA in Mathematics and Secondary Education from Tabor College. William is a Fellow of the American Statistical Association (ASA), a Fellow of the American Society for Quality (ASQ), and a member of the Institute of Mathematical Statistics and the Institute for Operations Research and the Management Sciences (INFORMS). He has served as ASQ Statistics Division Chair and ASA Quality and Productivity Section Chair, and is currently serving as an Associate Editor for Technometrics. William also has seven years of experience as an educator at the high school and college levels.

## Two-Sample High Dimensional Mean Test Using Prepivot

**Speaker:** Santu Ghosh (Augusta University)

**Abstract:**

Due to advances in technology, many fields such as genomics, astrometry, and finance often encounter the analysis of massive data sets to extract useful information for discovery. Such high-dimension, low-sample-size data present a substantial challenge, known as the “large p, small n” problem, to the statistics community. In many cases, researchers are interested in making inferences about the mean structure of a population. However, the revered Hotelling's T² statistic cannot be used to make inferences about mean structures for large p, small n data. Several two-sample tests for equality of means have already been suggested for the setting where the dimension of the data exceeds the combined sample size of the two populations. Some of these tests have difficulty maintaining type-I error control, while others are less powerful. We propose a test using both prepivoting and Edgeworth expansion that maintains type-I error and achieves high power in this high-dimensional scenario. Our test's finite-sample performance is compared with other recently proposed methods. The usefulness of the proposed method is further illustrated through a gene expression data example.

**Biography:**

Santu Ghosh has been an assistant professor in the Division of Biostatistics and Data Science, Department of Population Health Sciences at Augusta University since 2015. He obtained his Ph.D. in Mathematical Sciences from Northern Illinois University in 2013, after which he was a post-doctoral fellow in the School of Medicine at Wayne State University for two years. His research focuses on large-scale simultaneous inference, post-selection inference, non-inferiority clinical trials, and methylation data analysis.

## Interface of Statistics, Computing, and Data Science

**Speaker:** Xiaoming Huo (Georgia Tech)

**Abstract:**

Inference (a.k.a. predictive modeling) is at the core of many data science problems. Traditional approaches may be statistically or computationally efficient, but not necessarily both. The existing principles behind these models – such as the maximum likelihood estimation principle – were developed decades ago and do not take into account new aspects of the data, such as their large volume, variety, velocity, and veracity. On the other hand, many empirical algorithms, such as the deep learning framework, perform extremely well across a wide spectrum of applications, yet they lack the theoretical guarantees of the classical methods. We aim to develop new algorithms that are both computationally efficient and statistically optimal. Such work is fundamental in nature, yet it will have significant impact on virtually all data science problems encountered in society. In this spirit, I will describe a set of my past and current projects, including L1-based relaxation, fast nonlinear correlation, optimality of detectability, and nonconvex regularization. All of them integrate statistical and computational considerations to develop data analysis tools.

**Biography:**

Xiaoming Huo is an A. Russell Chandler III Professor in the Stewart School of Industrial & Systems Engineering at Georgia Tech. Huo received a Ph.D. in statistics from Stanford University, Stanford, CA, in 1999. In August 1999, he joined the School of Industrial and Systems Engineering at the Georgia Institute of Technology, Atlanta, GA, USA, where he advanced through the ranks to become a chaired professor. From 2013 to 2015, he was a program director at the National Science Foundation, managing data science related programs. He is the director of the Transdisciplinary Research Institute for Advancing Data Science at the Georgia Institute of Technology (https://triad.gatech.edu/). He is a fellow of the American Statistical Association and a senior member of the IEEE.

Huo's research interests include statistics and data science. He has made numerous contributions on topics such as sparse representation, wavelets, and statistical detectability. His papers have appeared in top journals, and some of them are highly cited.

Huo won Georgia Tech's Sigma Xi Young Faculty Award in 2005. His work led to an interview by Emerging Research Fronts in the field of mathematics in June 2006, a distinction for which one paper is selected every two months. He participated in the 30th International Mathematical Olympiad (IMO), held in Braunschweig, Germany, in 1989, and received a gold medal.

## New Model Diagnostics for Epidemiological and Ecological Models

**Speaker:** Max Lau (Emory University)

**Abstract:**

A cardinal challenge in epidemiological and ecological modeling is to develop effective and easily deployed tools for model assessment. The availability of such methods would greatly improve understanding, prediction, and management of disease and ecosystems. Conventional model assessment tools have some key limitations. I will present a novel approach for diagnosing misspecifications of a general spatio-temporal transmission model. Specifically, by proposing suitably designed non-centered parametrization schemes, we construct latent residuals whose sampling properties are known given the model specification, and which can be used to measure overall fit and to elicit evidence of the nature of misspecifications of the spatial and temporal processes included in the model. I will illustrate the approach by presenting a specific parametrization scheme that aims to detect misspecifications of the evolutionary component in a joint epidemiological-evolutionary model.

**Biography:**

Max Lau is an assistant professor in the Department of Biostatistics and Bioinformatics, RSPH, Emory. He is broadly interested in developing statistical methodology for inferring spatial and temporal transmission dynamics of infectious diseases. Recent research focuses on integrative approaches for cross-scale problems (e.g., integrated analysis of epidemiological and ecological models).

## An Age-Dependent Birth-Death Model for Gene Family Evolution

**Speaker:** Liang Liu (University of Georgia)

**Abstract:**

In this study, we describe a generalized birth-death process for modeling the evolution of gene families. Use of mechanistic models in a phylogenetic framework requires an age-dependent birth-death process. Starting with a single population corresponding to a lineage of a phylogenetic tree, and assuming a clock that starts ticking for each duplicate at its birth, we develop an age-dependent birth-death process by extending results for the time-dependent birth-death process. The implementation of such models in a full phylogenetic framework is expected to enable large-scale probabilistic analysis of duplicates in comparative genomic studies.
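To make the mechanism concrete, here is a hedged, simplified simulation of an age-dependent birth-death process (not the speaker's model or code): each duplicate carries its own age clock started at birth, and per-individual birth and death rates are arbitrary functions of age. Time discretization is used purely to keep the sketch short; an exact event-driven scheme would be preferable in practice.

```python
import random

def simulate_age_dependent_bd(b, d, t_max, dt=0.01, seed=0):
    """Simulate an age-dependent birth-death process on a small time grid.

    b(age), d(age) -- per-individual birth and death rates as functions
                      of the individual's age
    Starts from a single lineage of age 0; each new duplicate starts its
    own age clock at birth. Returns the ages of survivors at time t_max.
    """
    rng = random.Random(seed)
    ages = [0.0]
    t = 0.0
    while t < t_max and ages:
        newborns = 0
        survivors = []
        for age in ages:
            if rng.random() < d(age) * dt:       # death of this copy
                continue
            if rng.random() < b(age) * dt:       # duplication event
                newborns += 1
            survivors.append(age + dt)
        ages = survivors + [0.0] * newborns      # duplicates born at age 0
        t += dt
    return ages
```

Setting `b` and `d` to constants recovers the ordinary time-homogeneous birth-death process; age-dependent choices, e.g. rates that decay with the age of a duplicate, give the generalized process motivated above.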

**Biography:**

Liang Liu is currently an associate professor in the Department of Statistics at UGA. He obtained his PhD at the Ohio State University in 2006, after which he joined the Edwards lab at Harvard as a postdoc. His research interests include statistical phylogenetics, coalescent theory, and modeling biological data. Evolutionary processes are fundamental to understanding heritable changes over time, and many non-biological processes, such as the history of languages, can also be viewed as evolutionary processes. The evolutionary process is often modeled as a tree-like branching process – a phylogenetic tree. He has developed several statistical tools for estimating phylogenetic trees from multilocus sequence data.

## Spectral Graph Matching and Regularized Quadratic Relaxations

**Speaker:** Cheng Mao (Georgia Tech)

**Abstract:**

Given two unlabeled, edge-correlated graphs on the same set of vertices, we study the “graph matching” or “network alignment” problem of matching the vertices of the two graphs. We propose a new spectral method for this problem, which obtains the matching from a similarity matrix as a sum of outer products of eigenvectors weighted by a Cauchy kernel applied to differences of eigenvalues. The similarity matrix can also be interpreted as the solution to a regularized quadratic programming relaxation of the quadratic assignment problem. We show that for a correlated Erdős–Rényi model, this method returns the exact matching with high probability if the graphs differ by at most a 1/polylog(n) fraction of edges.
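The similarity-matrix construction can be sketched as follows. This is an illustrative reimplementation from the abstract's description, not the authors' code; the bandwidth `eta`, the use of the all-ones matrix, and the greedy rounding are assumptions following one common form of such regularized spectral methods.

```python
import numpy as np

def spectral_graph_match(A, B, eta=1.0):
    """Match vertices of adjacency matrices A and B (same size).

    Builds a similarity matrix as a sum of outer products of eigenvector
    pairs, weighted by a Cauchy kernel applied to eigenvalue differences,
    then rounds the similarity matrix to a permutation greedily.
    """
    lam, U = np.linalg.eigh(A)   # A = sum_i lam_i u_i u_i^T
    mu, V = np.linalg.eigh(B)    # B = sum_j mu_j v_j v_j^T
    # Cauchy-kernel weights w_ij = 1 / ((lam_i - mu_j)^2 + eta^2)
    W = 1.0 / ((lam[:, None] - mu[None, :]) ** 2 + eta ** 2)
    J = np.ones_like(A)
    # X = sum_ij w_ij * u_i (u_i^T J v_j) v_j^T, computed in matrix form
    X = U @ (W * (U.T @ J @ V)) @ V.T
    # Greedy rounding: repeatedly take the largest remaining similarity
    n = A.shape[0]
    match = -np.ones(n, dtype=int)
    used_rows, used_cols = set(), set()
    for i, j in sorted(((i, j) for i in range(n) for j in range(n)),
                       key=lambda ij: -X[ij]):
        if i not in used_rows and j not in used_cols:
            match[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return match   # match[k] = vertex of B matched to vertex k of A
```

In practice the rounding step is usually done with an optimal linear assignment solver rather than greedily; the similarity matrix itself is the object the abstract's quadratic-programming interpretation refers to.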

**Biography:**

Cheng Mao is a postdoctoral researcher in the Department of Statistics and Data Science at Yale University. His research interests include high-dimensional statistics, nonparametric statistics, and statistical inference on networks. His recent work focuses on permutation estimation problems such as ranking and graph matching. Cheng obtained his Ph.D. degree in Mathematics and Statistics from MIT. He will join the School of Mathematics at Georgia Tech in January 2020 as an assistant professor.

## Stratified Micro-randomized Trials with Applications in Mobile Health

**Speaker:** Susan Murphy (Harvard University)

**Abstract:**

Technological advancements in mobile devices and wearable sensors make it possible to deliver treatments anytime and anywhere to users like you and me. Increasingly, the delivery of these treatments is triggered by detections or predictions of vulnerability and receptivity. These observations are likely to have been impacted by prior treatments. Furthermore, the treatments are often designed to have an impact on users over a span of time during which subsequent treatments may be provided. Here we discuss our work on the design of a mobile health smoking cessation study in which these two challenges arose. This work involves the use of multiple online data analysis algorithms. Online algorithms are used in the detection, for example, of physiological stress. Other algorithms are used to forecast, at each vulnerable time, the remaining number of vulnerable times in the day. These algorithms are then inputs into a randomization algorithm that ensures that each user is randomized to each treatment an appropriate number of times per day. We develop the stratified micro-randomized trial, which involves not only the randomization algorithm but also a precise statement of the meaning of the treatment effects and the primary scientific hypotheses, along with primary analyses and sample size calculations. Considerations of causal inference and potential causal bias incurred by inappropriate data analyses play a large role throughout.
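One way to picture the randomization step: at each decision time, the probability of treatment can be set from a per-day treatment budget and the forecast number of remaining vulnerable times. The following is a hedged toy sketch of that idea only; the function, its inputs, and the clipping bounds are hypothetical, not the study's actual algorithm.

```python
def randomization_prob(remaining_budget, forecast_remaining,
                       p_min=0.05, p_max=0.95):
    """Toy treatment-randomization probability for one decision time.

    remaining_budget   -- treatments still to be delivered today
    forecast_remaining -- forecast number of remaining vulnerable times
    The probability is clipped away from 0 and 1 so every user retains a
    chance of either treatment arm at every decision time.
    """
    if forecast_remaining <= 0:
        return p_min
    p = remaining_budget / forecast_remaining
    return min(max(p, p_min), p_max)
```

Spreading a fixed expected number of treatments over the forecast remaining opportunities is what ties the forecasting algorithms to the randomization algorithm in the design described above.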

**Biography:**

Susan Murphy is Professor of Statistics at Harvard University, Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences, and Radcliffe Alumnae Professor at the Radcliffe Institute, Harvard University. Dr. Murphy's research concerns the development of experimental designs and online algorithms for informing sequential decision making, with applications in mobile health. She is a 2013 MacArthur Fellow and a member of the National Academy of Sciences and the National Academy of Medicine, both of the US National Academies. She is a past editor of the Annals of Statistics, past president of the Bernoulli Society, and current president of the Institute of Mathematical Statistics.

## Quantile Regression Modeling of Latent Trajectory Features with Longitudinal Data

**Speaker:** Limin Peng (Emory University)

**Abstract:**

Quantile regression has demonstrated promising utility in longitudinal data analysis. Existing work primarily focuses on modeling cross-sectional outcomes, while outcome trajectories often carry more substantive information in practice. In this work, we develop a trajectory quantile regression framework designed to robustly and flexibly investigate how latent individual trajectory features are related to observed subject characteristics. The proposed models are built under multilevel modeling, with the usual parametric assumptions lifted or relaxed. We derive our estimation procedure by transforming the problem at hand into quantile regression with perturbed responses and adapting the bias-correction technique for handling covariate measurement errors. We establish desirable asymptotic properties of the proposed estimator, including uniform consistency and weak convergence. Extensive simulation studies confirm the validity of the proposed method as well as its robustness. An application to the DURABLE trial uncovers sensible scientific findings and illustrates the practical value of our proposals.

**Biography:**

Dr. Limin Peng is a Professor in the Department of Biostatistics and Bioinformatics at the Rollins School of Public Health, Emory University. Dr. Peng joined the Emory biostatistics faculty in 2005 after receiving her PhD in Statistics from the University of Wisconsin-Madison. Dr. Peng's research interests include statistical method development in the areas of survival analysis, quantile regression, high-dimensional inference, and nonparametric and semiparametric statistics. In addition, Dr. Peng has conducted applied collaborations with investigators in a variety of scientific contexts, including cancer, cystic fibrosis, diabetes, and neurological disorders, and has published extensively in the biomedical literature. Dr. Peng is also deeply devoted to professional service, including serving on journal editorial boards and on review panels for the National Institutes of Health (NIH) and other funding agencies. Dr. Peng was named an American Statistical Association Fellow in 2016 and received the APHA Spiegelman Award in 2017.

## Predicting AC Optimal Power Flows: Combining Deep Learning and Lagrangian Dual Methods

**Speaker:** Pascal Van Hentenryck (Georgia Tech)

**Abstract:**

The Optimal Power Flow (OPF) problem is a fundamental building block for the optimization of electrical power systems. It is nonlinear and nonconvex and computes the generator setpoints for power and voltage, given a set of load demands. It often needs to be solved repeatedly under various conditions, either in real time or in large-scale studies. This need is further exacerbated by the increasing stochasticity of power systems due to renewable energy sources in front of and behind the meter. To address these challenges, this talk presents a deep learning approach to the OPF. The learning model exploits the information available in the prior states of the system (which is commonly available in practical applications), as well as a Lagrangian dual method to satisfy the physical and engineering constraints present in the OPF. The proposed model is evaluated on a large collection of realistic power systems. The experimental results show that its predictions are highly accurate, with average errors as low as 0.2%. Additionally, the proposed approach is shown to improve the accuracy of the widely adopted linear DC approximation of the OPF by at least two orders of magnitude.
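The Lagrangian dual idea can be sketched in miniature: train a model on a loss augmented with multiplier-weighted constraint-violation terms, and periodically increase each multiplier in proportion to its observed violation. The following toy (a linear model with one equality constraint, plain gradient descent, a numerically differentiated penalty) is an assumption-laden illustration of the scheme, not the paper's implementation.

```python
import numpy as np

def train_with_lagrangian_dual(X, y, g, n_outer=50, n_inner=200,
                               lr=0.05, lr_dual=1.0):
    """Fit min ||Xw - y||^2 / n while softly enforcing g(w) = 0.

    Inner loop: gradient descent on  mse(w) + lam * g(w)**2.
    Outer loop: grow the multiplier lam by the observed violation,
    the Lagrangian-dual-style update.
    """
    n, d = X.shape
    w = np.zeros(d)
    lam = 0.0

    def num_grad(f, v, eps=1e-6):
        out = np.zeros_like(v)
        for i in range(len(v)):
            e = np.zeros_like(v)
            e[i] = eps
            out[i] = (f(v + e) - f(v - e)) / (2 * eps)
        return out

    for _ in range(n_outer):
        for _ in range(n_inner):
            grad_mse = 2 * X.T @ (X @ w - y) / n
            grad_pen = num_grad(lambda v: lam * g(v) ** 2, w)
            w -= lr * (grad_mse + grad_pen)
        lam += lr_dual * abs(g(w))   # multiplier grows with the violation
    return w, lam
```

As the multiplier grows, the fitted model trades prediction error for constraint satisfaction; in the OPF setting the constraints would be the physical and engineering feasibility conditions rather than this toy equality.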

**Biography:**

Pascal Van Hentenryck is the A. Russell Chandler III Chair and Professor in the H. Milton Stewart School of Industrial and Systems Engineering at the Georgia Institute of Technology. Prior to this appointment, Van Hentenryck was a Professor of Computer Science at Brown University for 20 years, the leader of the Optimization Research Group at National ICT Australia (about 70 people), and the Seth Bonder Collegiate Professor at the University of Michigan. Van Hentenryck is an INFORMS Fellow and a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI). Several of his optimization systems, including OPL and CHIP, have been in commercial use for more than 20 years. Van Hentenryck's current research focuses on artificial intelligence and operations research with applications in energy systems, transportation, resilience, and privacy.

## Online Decentralized Leverage Score Sampling for Streaming Multidimensional Time Series

**Speaker:** Rui Xie (University of Central Florida)

**Abstract:**

Estimating the dependence structure of multidimensional time series data in real-time is challenging. With large volumes of streaming data, the problem becomes more difficult when the multidimensional data are collected asynchronously across distributed nodes, which motivates us to sample representative data points from streams. We propose a leverage score sampling (LSS) method for efficient online inference of the streaming vector autoregressive (VAR) model. We define the leverage score for the streaming VAR model so that the LSS method selects informative data points in real-time with statistical guarantees of parameter estimation efficiency. Moreover, our LSS method can be directly deployed in an asynchronous decentralized environment, e.g., a sensor network without a fusion center, and produce asynchronous consensus online parameter estimation over time. By exploiting the temporal dependence structure of the VAR model, the LSS method selects samples independently on each dimension and thus is able to update the estimation asynchronously. We illustrate the effectiveness of the LSS method in synthetic, gas sensor and seismic datasets.
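For intuition, here is a hedged sketch of leverage-score sampling in the simplest possible relative of the setting above: a one-dimensional autoregressive fit on a fixed batch (the streaming, decentralized VAR machinery of the talk goes far beyond this toy). Rows of the lagged design matrix are sampled with probability proportional to their exact leverage scores, then reweighted for unbiasedness.

```python
import numpy as np

def leverage_score_sample_fit(y_series, p=1, frac=0.3, seed=0):
    """Leverage-score sampling for an order-p autoregressive fit.

    Builds the lagged design matrix, computes exact leverage scores
    h_i = x_i^T (X^T X)^{-1} x_i via thin QR (h_i = ||Q_i||^2), samples
    rows with probability proportional to h_i, and solves a reweighted
    least squares on the sampled rows.
    """
    y = np.asarray(y_series, dtype=float)
    X = np.column_stack([y[p - k - 1 : len(y) - k - 1] for k in range(p)])
    t = y[p:]
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)            # exact leverage scores
    prob = h / h.sum()
    rng = np.random.default_rng(seed)
    m = max(p + 1, int(frac * len(t)))
    idx = rng.choice(len(t), size=m, replace=True, p=prob)
    w = 1.0 / np.sqrt(m * prob[idx])      # reweight for unbiasedness
    coef, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * t[idx], rcond=None)
    return coef
```

High-leverage rows carry the most information about the autoregressive coefficients, so sampling by leverage preserves estimation efficiency with only a fraction of the data, which is the intuition behind the streaming, per-dimension sampling described above.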

**Biography:**

Rui Xie received his Ph.D. in statistics from the University of Georgia in 2019 and a master's degree in statistics from the Georgia Institute of Technology in 2013.

He is currently an Assistant Professor in the Department of Statistics and Data Science at the University of Central Florida. His research interests include the development of statistical sketching and sampling methods for large-scale streaming dependent data, with applications ranging from streaming online learning and sampling to spatial pattern reconstruction with sketching and decentralized computing.

## Robust Hypothesis Testing Using Wasserstein Uncertainty Sets

**Speaker:** Yao Xie (Georgia Tech)

**Abstract:**

We develop a novel, computationally efficient, and general framework for robust hypothesis testing. The new framework features a new way to construct uncertainty sets under the null and alternative distributions: sets centered around the empirical distribution and defined via the Wasserstein metric, so our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that such an approximation renders a nearly optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of the approximation, whose complexity is independent of the dimension of the observation space and can be nearly independent of the sample size in general. A real-data example using human activity data demonstrates the excellent performance of the new robust detector.

**Biography:**

Yao Xie is an Associate Professor and Harold R. and Mary Anne Nash Early Career Professor in the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. She received her Ph.D. in Electrical Engineering (minor in Mathematics) from Stanford University in 2011. Before joining Georgia Tech in 2013, she worked as a Research Scientist at Duke University. Her research interests include statistics, signal processing, and machine learning. She received the National Science Foundation (NSF) CAREER Award in 2017 and multiple best paper awards at ICASSP, Allerton, and INFORMS conferences. She serves as an Associate Editor for IEEE Transactions on Signal Processing.

## Explainable Neural Network for Mortgage Analytics

**Speaker:** Huan Yan (Wells Fargo)

**Abstract:**

Neural network techniques have become popular in recent years due to their superior predictive performance and flexibility in model fitting. However, most neural networks remain black-box models whose inner decision-making processes cannot be easily understood. Without sufficient explainability, their applications in specialized domains such as finance can be largely limited. Recent research papers propose to enhance the explainability of neural networks through several architecture constraints, including orthogonal projection and additive decomposition. This work aims to further pursue the orthogonal and additive architecture constraints to build an explainable neural network and apply it to fixed-rate mortgage loan data.

**Biography:**

Huan Yan obtained his PhD from the School of Industrial and Systems Engineering (ISyE) at the Georgia Institute of Technology in 2014 and joined Wells Fargo Bank in the same year. He has worked in different lines of business at Wells Fargo for more than five years. He is now a VP in the Corporate Model Risk department, where he manages the mortgage analytics development team, building and maintaining an end-to-end platform for the valuation of mortgage servicing rights (MSR) and various mortgage-backed securities (MBS).

## Optimal Nonparametric Regression on Low Dimensional Manifolds using Deep Neural Networks

**Speaker:** Tuo Zhao (Georgia Tech)

**Abstract:**

Existing statistical theory has established information-theoretic lower bounds for nonparametric regression problems: to attain estimation consistency, the sample size n has to scale at least exponentially with the dimension d. The empirical performance of deep learning, however, suggests quite the opposite: although the dimension of the input data is large, e.g., high-resolution images with hundreds of thousands of pixels, deep neural networks need only a relatively small sample size to achieve high prediction accuracy.

To bridge this gap, we propose to investigate the statistical properties of deep neural networks for nonparametric regression on low-dimensional manifolds. Specifically, we assume that the input data lie near a compact smooth q-dimensional manifold embedded in the d-dimensional Euclidean space, where q ≪ d. Such an assumption is motivated by the practical observation that in real-world applications where deep learning demonstrates superior performance, such as computer vision and speech recognition, the data (e.g., images and acoustic signals) often exhibit low-dimensional structures. To estimate a target regression function f* with bounded s-th order derivatives, we adopt a deep ReLU neural network with O(s log n/(2s + q)) layers and O(n^(q/(2s+q))) neurons and weight parameters, where n is the sample size. We further show that, by minimizing the sum of squared errors over the data, the deep neural network attains the prediction error bound O(n^(−2s/(2s+q))).

We highlight that this prediction error bound depends only on the intrinsic dimension q, a significant improvement over the existing O(n^(−2s/(2s+d))) result for deep neural networks in nonparametric regression. To the best of our knowledge, this is the first theoretical result justifying that, when learning from data with low-dimensional structures, deep neural networks have an advantage over other methods, such as smoothing splines or basis splines, in avoiding the curse of the ambient dimensionality.
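To see the magnitude of the improvement, one can plug numbers into the two rates (constants omitted; the chosen values of n, s, q, and d are arbitrary illustrations):

```python
def rate(n, s, dim):
    """Prediction error scaling n^(-2s/(2s+dim)), constants omitted."""
    return n ** (-2 * s / (2 * s + dim))

n, s, q, d = 10**6, 2, 3, 10**4
manifold_rate = rate(n, s, q)   # exponent set by the intrinsic dimension q
ambient_rate = rate(n, s, d)    # exponent set by the ambient dimension d
print(manifold_rate, ambient_rate)
```

With a million samples, the intrinsic-dimension rate is already below 10^-3 while the ambient-dimension rate is barely below 1: the curse of dimensionality lives entirely in the exponent.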

The talk is based on joint work with Minshuo Chen, Haoming Jiang and Wenjing Liao.

**Biography:**

Tuo Zhao is an assistant professor in the School of Industrial and Systems Engineering and the School of Computational Science and Engineering at Georgia Tech. He received his Ph.D. in Computer Science from Johns Hopkins University. His research focuses on developing principled methodologies and nonconvex optimization algorithms for machine learning (especially deep learning), as well as open-source software development for scientific computing.

## New Non-Asymptotic Results about Accuracy of Bootstrap in a High-Dimensional Setting

**Speaker:** Mayya Zhilova (Georgia Tech)

**Abstract:**

In this talk, I will discuss the problem of establishing higher-order accuracy of bootstrapping procedures and (non-)normal approximations in a multivariate central limit problem with independent summands, in several different settings. The established approximation bounds are non-asymptotic and optimal in terms of the ratio between the dimension and the sample size (under the imposed moment assumptions). I will present statistical applications of the proposed finite-sample bounds and discuss their connections with the asymptotic Edgeworth expansion.
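As generic background (not the speaker's construction), here is a minimal empirical-bootstrap sketch for one statistic where the dimension-to-sample-size ratio governs accuracy: the max-coordinate deviation of a multivariate mean.

```python
import numpy as np

def bootstrap_max_quantile(X, n_boot=2000, alpha=0.1, seed=0):
    """Empirical bootstrap for T = sqrt(n) * max_k |mean_k - mu_k|.

    Resamples rows of X with replacement, recomputes the max-coordinate
    deviation of the sample mean, and returns the bootstrap (1 - alpha)
    quantile, which calibrates simultaneous confidence intervals for a
    mean vector.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mean = X.mean(axis=0)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = np.sqrt(n) * np.max(np.abs(X[idx].mean(axis=0) - mean))
    return np.quantile(stats, 1 - alpha)
```

How accurately this bootstrap quantile approximates the true distribution of T as d grows relative to n is exactly the kind of question the non-asymptotic bounds above address.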

**Biography:**

Mayya Zhilova is an Assistant Professor in the School of Mathematics at Georgia Tech. Her primary research interests lie in the areas of mathematical statistics and applied probability theory; she specializes in statistical inference for complex high-dimensional data. Dr. Zhilova received her PhD in Mathematical Statistics from Humboldt University of Berlin in 2015. She did her undergraduate studies in Mathematics and Statistics at Lomonosov Moscow State University.

## A Decentralized Data Fusion Approach for Heterogeneous Scattered Data

**Speaker:** Wenxuan Zhong (University of Georgia)

**Abstract:**

We study the problem of data fusion for scattered data, i.e., data collected and stored in local data centers. This problem is known to be challenging owing to two distinguishing features of scattered data: (1) each data center can only communicate with neighboring data centers; and (2) data distributions are heterogeneous across local centers. Most existing methods for scattered data do not take the heterogeneity of the data into account. In addition, the performance of these methods relies heavily on the assumption that the models across all data centers are identical. Empirical studies demonstrate that these methods perform unsatisfactorily when this assumption is invalid and/or data heterogeneity exists. In this talk, I present a general statistical model that accommodates across-center heterogeneity through center-specific models and integrates the center-specific models through common parameters.
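Feature (1), communication restricted to neighboring centers, is the classic setting of decentralized consensus. As a generic illustration (not the speaker's estimator), local estimates of a common parameter can be driven to agreement by repeated neighbor-to-neighbor averaging:

```python
import numpy as np

def gossip_average(values, neighbors, n_rounds=200, step=0.2):
    """Drive local estimates to consensus without a fusion center.

    values    -- initial estimate held at each data center
    neighbors -- dict: center index -> list of neighboring center indices
    Each round, every center moves toward its neighbors' values; only
    neighbor-to-neighbor communication is ever used.
    """
    x = np.asarray(values, dtype=float).copy()
    for _ in range(n_rounds):
        new = x.copy()
        for i, nbrs in neighbors.items():
            new[i] = x[i] + step * sum(x[j] - x[i] for j in nbrs)
        x = new
    return x
```

In the heterogeneous model described above, only the common parameters would be exchanged this way, while each center keeps its own center-specific components local.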

**Biography:**

Dr. Zhong is a professor of statistics at UGA, where she directs the Big Data Analytics Lab. Dr. Zhong's research focuses on developing statistical methodology and theory to address the striking new phenomena that have emerged under the big data regime. Over the past few years, Dr. Zhong has established diverse extramurally funded research programs to overcome the computational and theoretical challenges that arise from big data analysis. This foundational statistical research has been successfully applied in modern genomics, epigenetics, metagenomics, text mining, chemical sensing, and imaging research.