# Machine Learning and Artificial Intelligence Cluster at INFORMS 2020 Annual Meeting

Recently Georgia Tech ISyE faculty and ML@GT Associate Directors, George Lan and Yao Xie, were invited to organize a “committee’s choice” Cluster on Machine Learning and Artificial Intelligence for the first time at the 2020 INFORMS Annual Meeting.

Significant overlap exists between the emerging field of machine learning (ML), artificial intelligence (AI), and operations research (OR.) Both fields share the common cores in optimization, statistics, and applied probability. Many pioneers from the INFORMS community worked on ML/AI decades ago’ foundation and applications. Meanwhile, the last few years have seen an increasing interest in applying ML/AI to traditional OR fields, including supply chain, healthcare, finance, manufacturing, energy, resiliency, and sustainability. The new INFORMS Cluster aims to highlight and enhance such connections.

Below are all of the talks at INFORMS that discuss machine learning. Bold and starred talks are from Georgia Tech faculty or students.

#### Bandit & Reinforcement Learning Practices in Academia & Industry

Session Chair: Sareh Nabi

• Real World Reinforcement Learning
• By John Langford | Microsoft
• Much of the reinforcement learning in the popular press has been about agents playing games, but there is a somewhat more quiet revolution underway around deploying reinforcement learning agents to solve real-world problems in logistics, personalization, and control. I will discuss one of these that I’ve worked on (http://aka.ms/personalizer), the theory and methods behind it, the difficulties with more general forms of reinforcement learning, and some new research around discovering causal structure to get around it.
• Seeker: Real-Time Interactive Search
• By Houssam Nassif, Ari Biswas, Thai Pham, Michael Vogelsong, and Benjamin Snyder | Amazon
• This paper introduces Seeker, a system that allows users to adaptively refine search rankings in real-time, through a series of feedbacks in the form of likes and dislikes. When searching online, users may not know how to accurately describe their product of choice in words. But often users have a mental picture of the desired item and are able to answer ordinal questions of the form: “Is this item similar to what you have in mind?” With this assumption, our algorithm allows users to provide sequential feedback on search results to adapt the search feed. We show that our proposed approach works well both qualitatively and quantitatively. Unlike most previous representation-based search systems, we can quantify the quality of our algorithm by evaluating humans-in-the-loop experiments.
• Empirical Bayes Priors: Field Experiment With Theoretical Guarantees
• By Sareh Nabi
• In this work, we propose a general framework to estimate meta-prior distributions by imposing a Bayesian hierarchical model and applying empirical Bayes. Our prior estimation framework allows for incorporating any method that can provide estimates of feature effects. We provide empirical and theoretical results for our prior estimator, when the underline model aiming at estimating the feature effects is a Generalized Linear Bandit. Our experiments are performed in an Amazon production system and on a public data set. Both show marked improvements compare to using a non-informative prior scenario. Furthermore, we report theoretical properties of our empirical Bayes prior estimator.

#### Uncertainty Representations in Bandits and Reinforcement Learning

Session Chair: Alec Koppel

• Efficient Gaussian Process Bandits By Believing Only Informative Actions
• By Alec Koppel | Army Research Laboratory | Computational and Information Sciences Directorate | Purdue University
• We focus on multi-armed bandit problems, where the payoff function is sampled from a Gaussian process (GP), and action selections take the form of upper confidence bound (UCB) or expected improvement (EI). Prior works cannot allow the iteration horizon to be large, as the GP posterior scales cubically with the horizon. To circumvent this issue, we propose only incorporating an action into the posterior when it is sufficiently informative. We derive sublinear regret bounds of GP bandit algorithms for both discrete and continuous action sets, and that the GP’s posterior complexity remains finite. Experimentally, we observe state of the art tradeoffs in global optimization.
• Presenter by Tor Lattimore
• Restless Bandits: Indexability And Computation Of Whittle Index
• Restless bandits are a class of resource allocation problems where a scheduler has to select a subset of available arms at each time. Each arm is a controlled Makrov chain whose evolution depends on whether the arm is selected by the scheduler. The optimal solution of such problems is PSPACE hard. A commonly used low-complexity solution is Whittle index, which is applicable when the model satisfies a technical condition called indexability. We consider restless bandits with perfectly or partially observed states, provide multiple sufficient conditions to verify indexability, and present efficient algorithms to compute Whittle index. Numerical experiments show that Whittle index policy is close-to-optimal for small-scale systems and significantly better than myopic policies for large-scale systems
• Presenter by Ambuj Tewai
• Presenter by Andrew Wilson
• Sample Efficient Reinforcement Learning Via Representation Learning
• By Akshay Krishnamurthy | Microsoft Research
• I will discuss new provably efficient algorithms for reinforcement in rich observation settings where the state space may be arbitrarily large. Both algorithms operate by learning succinct representations of the environment, which they use in an exploration module to acquire new information. The first algorithm, called HOMER, operates in a block MDP model and uses a contrastive learning objective for representation learning. On the other hand, the second algorithm, called FLAMBE, operates in a much richer class of low-rank MDPs and is more model-based. Both algorithms can accommodate non-linear function approximation and enjoy the provable sample and computational efficiency guarantees.
• On Estimating Epistemic Uncertainty In Deep Learning
• By Yingzhen Li | Microsoft Research
• As we start to deploy deep learning models in high-risk decision making, we naturally question the confidence of the model on predicting important quantities. Bayesian deep learning (BDL) is an emerging research field which shows great promise in this regard; progress has been made on scaling up approximate Bayesian inference techniques to giant deep neural networks. This talk will discuss the quality of epistemic uncertainty estimates using BDL methods. Specifically, we identified an issue of variational mean-field approximations for ReLU networks: it struggles to represent increased uncertainty in predicting outcomes for inputs in between the training data clusters. This limitation is shown theoretically for shallow networks and empirically for deep networks, especially for active learning and Bayesian optimization applications.

#### Cooperative Multi-Agent Deep Reinforcement Learning

Session Chairs: Afshin Oroojlooy and Davood Hajinezhad

• Deal With Non-stationarity In Multi Agent Reinforcement Learning: A Review Of Fully Observable Critic Models
• By Afshin Oroojlooy | SAS Institute
• Non-stationarity of the environment is the main issue in multi-agent reinforcement learning problems. One of the common approaches to address this issue is using a fully observable critic. The fully observable critic involves the observation and actions of all agents and as a result the environment is stationary even though the policy of other agents changes. However, there have been many challenges for using this approach in practice. In this talk we present some of these challenges and cover recent works to address those.
• Consensus Optimization In Multi-agent Reinforcement Learning: A Review
• By Davood Hajinezhad | SAS Institute
• We consider a setting in which a number of identical agents receive partial pose-dependent observations of an environment. The agents are allowed to update their local viewpoints from the environment through actions. Moreover, they are able to send and receive messages from their neighbors on a directed communication graph. Based on these settings, in this work, we develop a framework that lets us solve the multi-agent image classification task using the reinforcement learning in a decentralized manner. This enables each agent to process the observations, communicate with its neighbors, decide her set of actions, and finally conduct the predictions using the information available to her. Our experiments on MNIST handwritten digit dataset demonstrates the effectiveness of the proposed framework.
• Distributed Image Classification Via Reinforcement Learning
• By Reza Nazari | SAS Institute
• We consider a setting in which a number of identical agents receive partial pose-dependent observations of an environment. The agents are allowed to update their local viewpoints from the environment through actions. Moreover, they are able to send and receive messages from their neighbors on a directed communication graph. Based on these settings, in this work, we develop a framework that lets us solve the multi-agent image classification task using the reinforcement learning in a decentralized manner. This enables each agent to process the observations, communicate with its neighbors, decide her set of actions, and finally conduct the predictions using the information available to her. Our experiments on MNIST handwritten digit dataset demonstrates the effectiveness of the proposed framework.
• Distributed Off-policy Actor-critic Reinforcement Learning With Policy Consensus
• By Yan Zhang | Duke University
• In this paper, we propose a distributed off-policy actor critic method to solve multi-agent reinforcement learning problems. Specifically, we assume that all agents keep local estimates of the global optimal policy parameter and update their local value function estimates independently. Then, we introduce an additional consensus step to let all the agents asymptotically achieve agreement on the global optimal policy function. The convergence analysis of the proposed algorithm is provided and the effectiveness of the proposed algorithm is validated using a distributed resource allocation example. Compared to relevant distributed actor critic methods, here the agents do not share information about their local tasks, but instead they coordinate to estimate the global policy function.

#### Distributionally robust optimization and machine learning

Session Chair: Rui Gao

• *Data-driven Distributional Robust Hypothesis Testing With Applications In Adversarial Learning*
• By Liyan Xie | Georgia Institute of Technology
• We develop a general framework for data-driven robust classifier, which is motivated by applications with limited samples arising from applications such as healthcare and anomaly detection. We formulate the problem as multiple hypothesis tests and construct the minimax test that is robust to the worst-case distributions within Wasserstein uncertainty sets centered around the empirical distributions of training samples. To address the computational challenge, we develop a tractable reformulation which also leads to a structured solution. Generalization properties are studied theoretically by characterizing the size of the Wasserstein uncertainty sets asymptotically. We also shed light on how the proposed minimax test can be incorporated into adversarial learning and generative models. Synthetic and real data show the good performance of our approach.
• Distributionally Robust Policy Evaluation And Learning In Offline Contextual Bandits
• By Nian Si | Stanford University
• Policy learning using historical observational data is an important problem that has found widespread applications. Examples include selecting offers, prices and advertisements. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data–an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete (bandit) observational data. We propose a novel learning algorithm that is able to learn a robust policy to adversarial perturbations and unknown covariate shifts. We first present a policy evaluation procedure in the ambiguous environment and then give a performance guarantee based on the theory of uniform convergence.
• The Dao Of Robustness
• By Minglong Zhou | University of Singapore
• We propose a framework for optimization under uncertainty called robustness optimization, which is similar in purpose to, but philosophically different from, robust optimization. Unlike robust optimization approaches, we do not restrict nature to an uncertainty set but allow her to take its cause and even render solutions infeasible. We axiomatize the decision criterion associated with the robustness optimization, termed as the adversarial impact measure, which relates to the maximum level of model infeasibility that may occur relative to the magnitude of deviation from the baseline uncertainty. We also provide a representation theorem of the decision criterion and uncover different types of adversarial impact measures. Similar to robust optimization, we show that robustness optimization via minimizing the adversarial impact can also be done in a tractable way.
• Data-driven Optimistic Optimization And Bandit Learning
• By Rui Gao | University of Texas at Austin
• Optimistic optimization is the optimistic counterpart of robust optimization. We establish a connection between data-driven optimistic optimization and UCB-type algorithm in online bandit learning. We show that data-driven optimistic optimization recovers the classical UCB algorithm for iid bandit and linear bandit, but also inspire new algorithms with better performance for other bandit problems.

#### First-order methods for Machine Learning

Session Chair: Digvijay Boob

• *Stochastic First-order Methods For Nonconvex Function Constrained Optimization*
• By Digvijay Boob | Georgia Tech
• We consider the nonconvex function constrained optimization problem and introduce a new proximal point method which transforms the initial nonconvex problem into a sequence of convex function constrained subproblems. We establish the convergence and rate of convergence of this algorithm to KKT points under different constraint qualifications. For practical use, we present inexact variants of this algorithm, in which approximate solutions of the subproblems are computed using the aforementioned ConEx method and establish their associated rate of convergence. To the best of our knowledge, most of these convergence and complexity results of the proximal point method for nonconvex problems also seem to be new in the literature.
• *A Simple Algorithm For Variational Inequalities With Application To Policy Evaluation For Reinforcement Learning*
• By Georgios Kotsalis | Georgia Tech
• In this work we develop a new method for solving variational inequalities with Lipschitz continuous operator. We address both the monotone as well as the strongly monotone case. The resulting numerical scheme, is demonstrated by means of examples from the reinforcement learning literature.
• A Doubly-randomized Block-coordinate Primal-dual Method For Large-scale Saddle Point Problems
• By Erfan Yazdandoost | The Pennsylvania State University
• We consider a large-scale convex-concave saddle point problem in a finite-sum form that arises in machine learning problems involving empirical risk minimization. To contend with the challenges in computing full gradients, we employ a block-coordinate primal-dual scheme. Moreover, when the number of component functions in the finite sum is massive, we can derive further efficiency by utilizing an increasing batch-size of gradients. We proved near-optimal convergence rates for the proposed method using both variance reduction and single sampling gradient approaches. Finally, for both single and increasing sample size scenarios, almost sure convergence of the iterate sequence to a saddle point is shown.
• *Stochastic Sequential Dual Algorithm: An Optimal Method for Minimizing Convex Stochastic Multilevel Composition Functions*
• By Zhe Zhang and Guanghui Lan | Georgia Tech
• Consider minimizing $F(x):=f_k f_{k-1} … f_1(x)$ using only unbiased stochastic oracles for the level functions, $f_i$. Unbiased estimators for F’s subgradient are not available because the exact values of the arguments to outer level functions cannot be obtained. In contrast to previous approaches that focused on estimating these inner arguments, we propose a saddle point reformulation that removes the need to estimate those inner arguments and a simple stochastic sequential dual (SSD) algorithm. The SSD algorithm can achieve the optimal iteration complexities for both strongly convex and convex non-smooth problems.
• Analysis Of Q-learning: Switching System Approach
• By Donghwan Lee | Korea Advanced Institute of Science and Technology
• In this paper, we introduce a new asymptotic convergence analysis of the asynchronous Q-learning based on switching system perspectives. We show that the nonlinear ODE model associated with the asynchronous Q-learning can be formulated as a switching linear system. We further analyze its asymptotic stability by leveraging particular structure of Q-learning and existing switching system theories in the control literature. Our approach provides a simple ODE analysis for the convergence of asynchronous Q-learning under relatively weak assumptions. Compared to the previous analysis on asynchronous Q-learning, the proposed approach allows step-sizes which do not depend on the state-action visited at each iteration. The proposed approach can potentially be generalized to analyze other variants of Q-learning.

#### Frontiers of Privacy-preserving Data Analysis

Session Chair: Weijie Su

• Valid Statistical Inference Under Formal Privacy With Complex Public Information
• By Aleksandra Slavković, Jeremy Seeman and Matthew Reimherr | Pennsylvania State University
• Methods for generating differentially-private (DP) synthetic data are gaining prominence as large agencies such as the U.S. Census aim to release DP synthetic data for public usage. In the synthetic data generation process, it is common to post-process the privatized results so that the final data agrees with what the data curator considers public information. We show empirically that using post-processing to incorporate public information in contingency tables can lead to sub-optimal inference. We propose an alternative Bayesian sampling framework that directly incorporates both noise due to DP and public information constraints, leading to improved inference, and demonstrate the proposed methodology on a study of the relationship between mortality rate and race in small areas given privatized data from the CDC and U.S. Census.
• Gaussian Differential Privacy
• By Weijie Su | The Wharton School, University of Pennsylvania
• Privacy-preserving data analysis has been put on a firm mathematical foundation since the introduction of differential privacy (DP) in 2006. We propose an alternative relaxation we term “f-DP”, which has a number of nice properties and avoids some of the difficulties associated with divergence based relaxations. First, f-DP preserves the hypothesis testing interpretation of differential privacy, which makes its guarantees easily interpretable. We prove that this family is focal to f-DP by introducing a central limit theorem, which shows that the privacy guarantees of any hypothesis-testing based definition of privacy (including differential privacy) converge to Gaussian differential privacy in the limit under composition. This is joint work with Jinshuo Dong and Aaron Roth.
• Presenter by Kamalika Chaudhuri, University of California, San Diego
• Presenter by Linjun Zhang, Rutgers University
• New Methods For Differentially Private Statistical Estimation And Testing
• By Jonathan Ullman | Northeastern University
• I will present an update on the state-of-the-art in the theory of differentially private statistical estimation and hypothesis testing.

#### Interface between Machine Learning and Optimization: Methods and Applications

Session Chair: Aly Megahed

• Cognitive Analysis for Request for Proposals Documents in B2B Tendering Processes
• By Aly Megahed Aly Megahed, Paul R. Messinger, Pravar Mahajan, Juan Cappi, Ahmed Nazeem, Hamid Reza Motahari Nezhad | IBM Research – Almaden and University of Alberta
• In tendering processes, clients submit requests for proposals (RFPs) that providers need to respond to. Typically, experts at the providers manually read these RFPs to extract client requirements and map them to provider offerings so that they can prepare a solution proposal and present their bid to the client. This process is slow and error-prone. In this work, we build a novel text-mining based method to cognitively read the RFPs and extract client requirements. We also develop an optimization model and an efficient solution approach to map these requirements to provider offerings. We apply our method to real-world data from the IT services industry and show its efficiency.
• Stochastic Optimization Forests
• By Nathan Kallus and Xiaojie Mao | Cornell University
• We study conditional stochastic optimization problems, where we leverage rich auxiliary observations (e.g., customer characteristics) to improve decision-making with uncertain variables (e.g., demand). We adapt random forests to this setting by searching tree splits to directly optimize the downstream decision quality, rather than prediction accuracy. We realize this by developing approximate splitting criteria that utilize optimization perturbation analysis to eschew burdensome re-optimization for every candidate split, so that our method scales to large-scale problems. Our method can accommodate both deterministic and stochastic constraints. We prove that our method achieves asymptotic optimality and we validate its efficacy empirically, demonstrating the value of optimization-aware construction of forests and the success of our approximations.
• Histogram-Based Gradient Boosted Trees For Federated Learning
• By Yuya Jeremy Ong | IBM Research – Almaden
• Federated Learning is a novel approach to machine learning without the need to collect data in a central place. Issues of privacy concerns and regulatory restrictions make it impossible or expensive to bring all data to a centralized machine learning cluster. Federated learning helps to overcome these issues by collaboratively training a machine learning model without transmitting any raw data. In this talk, we present a novel implementation of a Histogram-Based Gradient Boosting Tree algorithm for Federated Learning and its advantages over various other Federated Learning approaches.
• Asymptotically Optimal Category-priority Policy For Dynamic Resource Allocation Problems
• By Xiangyu Zhang, Peter Frazier | Cornell University
• We consider resource allocation problems, where a decision-maker allocates resources to a subset of candidates at each period. Allocating resources to candidates generates rewards, and decision maker aims to maximize the total reward. We are interested in the asymptotic regime that the total amount of resources grows proportionally with the number of candidates. This problem can be modeled using stochastic control but is hard to solve due to the exponential time-complexity with respect to the number of candidates. Using a linear programming relaxation, we construct the so-called Category-Priority policy under polynomial time-complexity, which also achieves asymptotical optimality.
• Elasticity Management For Capacity Planning In Software As A Service Cloud Computing
• By Jon M. Stauffer, Aly Megahed, Chelliah Sriskandarajah | Texas A&M University and IBM’S Almaden Research Center
• Many companies are shifting from on-premise IT environments to cloud computing. The computing application resources (instances) need to be adjusted according to the demand (queries) over time. Determining the optimal number of instances needed in a given planning horizon is challenging due to the combinatorial nature of the optimization problem involved. Deploying too many instances results in unnecessary deployment/capacity costs, while deploying too few instances results in penalties for not being able to process all incoming queries on-time. We also develop a stochastic framework and using two different randomly generated data sets (representing problem instances in practice), we demonstrate that robust solutions can be obtained.

#### Interface of Statistics, Optimization and Learning

Session Chair: Yuejie Chi and Yuxin Chen

• Training Binarized Neural Networks: Solving Discrete Optimization Using Gradient Information

• By Meisam Razaviyayn, Tianjian Huang | University of Southern California
• Quantization of the parameters of machine learning models, such as deep neural networks, requires solving optimization problems with constraints formed by the Cartesian product of many simple discrete sets. For such optimization problems, we study Alternating Direction Method of Multipliers for Quantization (ADMM-Q), which is a variant of the widely-used ADMM method designed for discrete optimization problems. We establish convergence of the iterates to certain stationary points. Our theoretical insights leads to several variants of ADMM-Q that can handle inexact update rules, and have improved performance via the use of “soft projection” and “injecting randomness to the algorithm”. We empirically evaluate the efficacy of our proposed approaches.
• Understanding The Learning Rate In Deep Learning
• By Weijie Su | University of Pennsylvania
• We present a general theoretical analysis of the effect of the learning rate in stochastic gradient descent (SGD). Our analysis is based on the use of a learning-rate-dependent stochastic differential equation (lr-dependent SDE) that serves as a surrogate for SGD. We establish a linear rate of convergence for this continuous-time formulation of SGD, highlighting the fundamental importance of the learning rate in SGD. Strikingly, this expression clearly reveals the dependence of the linear convergence rate on the learning rate–the linear rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions. Based on this sharp distinction between nonconvex and convex problems, we provide a mathematical interpretation of the benefits of using learning rate decay for nonconvex optimization.
• Presenter by Song Mei, Stanford University
• Deep learning methods operate in regimes that defy the traditional statistical mindset. Despite the non-convexity of empirical risks and the huge complexity of neural network architectures, stochastic gradient algorithms can often find the global minimizer of the training loss and achieve small generalization error on test data. As one possible explanation to the training efficiency of neural networks, tangent kernel theory shows that a multi-layers neural network — in a proper large width limit — can be well approximated by its linearization. In this talk, I will discuss the generalization error of linearized neural networks, which reveals two interesting phenomena: the staircase decay and the double-descent curve. Through the lens of these phenomena, I will also address the benefits and limitations of the linearization approach for neural networks.
• Convex Relaxation Of Low-rank Operators By Tensor Products Of Banach Spaces And Its Application To Inverse Problems
• By Kiryung Lee, Rakshith S. Srinivasa, Marius Junge, Justin Romberg | The Ohio State University, Georgia Institute of Technology, University of Illinois at Urbana-Champaign
• Low-rank matrix models have been popularly used in classical and modern applications in signal processing and statistics. In geometric functional analysis, a class of matrix norms, known as tensor product norms, was introduced to study the action of matrices as a linear operator between Banach spaces. Certain pairs of tensor product norms provide a convex relaxation of low-rank matrix models that induces a low-rank solution to inverse problems. We present estimates of the entropy number for selected pairs of Banach spaces and its utility to analyze convex programs to distributed subspace learning and blind deconvolution.
• Analysis Of The Optimization Landscapes For Overcomplete Representation Learning
• By Qing Qu, New York University
• We study nonconvex optimization landscapes for learning overcomplete representations, including learning (i) sparsely used overcomplete dictionaries and (ii) convolutional dictionaries, where these unsupervised learning problems find many applications in high-dimensional data analysis. In this work, we show these problems can be formulated as ℓ4-norm optimization problems with spherical constraint, and study the geometric properties of their nonconvex optimization landscapes. For both problems, we show the nonconvex objectives have benign (global) geometric structures, in the sense that every local minimizer is close to one of the target solutions and every saddle point exhibits negative curvature. This discovery enables the development of guaranteed global optimization methods using simple initializations.

#### Intersection between Optimization and Statistical Learning

Session Chair: Hongcheng Liu, Sam Davanloo Tajbakhsh

• Estimation And Inference Of A High-dimensional Risk Score Through A Concord Learning
• By Jaeyoung Park, Muxuan Liang, Xiang Zhong | University of Florida, Fred Hutchinson Cancer Research Center
• We propose a novel framework for estimation and inference of a high-dimensional risk score through a concord learning approach. It provides a model-free decision rule and one application is to stratify patients based on readmission risks. When the outcome Y is ordinal or continuous, such as the count of rehospitalizations or combined endpoints including death, the information beyond a binary indicator can be utilized to obtain a better risk score. Our proposed method is compatible with a variety of loss functions including huberized hinge loss and is not susceptible to model mis-specification. Our preliminary analysis suggests that it outperforms logistic regression and Poisson regression with LASSO penalty in mis-specified cases.
• Towards Better Global Landscape Of GANs And Understanding Mode Collapse
• By Ruoyu Sun | University of Illinois at Urbana-Champaign
• Our understanding of GAN training is still very limited since it is a non-convex-concave min-max optimization. Most recent studies focused on the analysis in the local region around the equilibrium. In this talk, we discuss how to perform a global analysis of GANs. We prove that the original GAN has exponentially many bad strict local minima which are perceived as mode-collapse. We also prove that a simple modification to the original GAN called RSGAN enjoys a better global landscape in the sense that it has no bad basins. Our experiments on standard datasets show that RSGAN outperforms the original GAN and WGAN-GP, with only two lines of code changed in PyTorch. Based on theory, we predict that RSGAN has a bigger advantage for smaller neural nets, and our experiments verify that with 20% parameters RSGAN can still achieve similar performance, while other GANs perform badly.
• Generalizability Of Deep Neural Nets At Tractable And Quantized Solutions​
• By Bijan Taslimi, Charles David Hernandez, Hung Yi Lee, Yuanbo Wang, Panayote (Panos) M. Pardalos, Hongcheng Liu | University of Florida
• While many theoretical frameworks have been proposed for analyzing DL in the literature, two critical gaps between theory and practice have not been filled. First, empirical evidences often suggest a high tolerance of DL architectures to variations of training procedures. Second, although DL models have been utilized in several applications, a limited generalization analysis on those models is available. Therefore, we present an almost-algorithm-independent theory to ensure that any solution algorithm, with a proper and tractable initialization, can be generalizable. We next derive characterizations of computable, quantized local solutions that ensure generalization performance and derive algorithms that provably satisfy those conditions.
• Presenter by Rolinek Michal

#### Inverse Reinforcement Learning

Session Chair: Yanling Chang

• Inverse Reinforcement Learning with Hidden States and Dynamics
• By Yanling Chang, Lu Sun, Zhide Wang, ALfredo L. Garcia | Texas A&M University
• Inverse reinforcement learning has focused on training a model of dynamic decisions by a single human agent based upon the history of implemented actions and observable states. However, in many realistic scenarios, the system state is not directly observable. In this research, we develop a new methodology for jointly estimating the dynamics of hidden states and structure of preferences underlying the observable dynamic decisions employed by a human operator engaged in a given task. We also use a zero-order algorithm to tackle the scalability issue that has not been thoroughly examined in literature.
• Heuristic Guided Deep Reinforcement Learning For Sensor Tasking
• By Ashton E. Harvey | George Mason University, United States Air Force
• Existing approaches for tasking sensors to track space objects employ heuristic methods. The current heuristic method, special perturbations tasker, uses artificial constraints to obtain feasible solutions. This work frames the sensor allocation problem as a partially observed Markov decision process using an unscented Kalman filter model. We employ off-policy learning to transfer knowledge from a heuristic based policy to a Deep Reinforcement Learning (DRL) based policy. We compare the results from heuristic, naive DRL, and Heuristic Guided DRL policies. We demonstrate a path for transitioning systems using existing heuristic policies to DRL policies with fewer artificial constraints.
• Modeling Mental Fatigue In Human Control Tasks
• By Ran Wei, Anthony McDonald, Alfredo Garcia | Texas A&M University
• Mental fatigue is a complex construct contributing to numerous accidents in control tasks such as driving. Recent theories suggest that mental fatigue is a bodily sensation of the perceived value of the current task outweighed by other candidate tasks. We cast the human agent as solving a Markov Decision Problem with hidden states, which represent different mental states, such as fatigued vs. not fatigue. Using environments from OpenAI Gym, we test a reinforcement learning methodology to fit a model of how fatigue sets in human control. Findings uncovered by this model can shed light on the cost-benefit models of mental fatigue and provide design cues for human-aware automations.
• Sometimes Simple Is Better For Reinforcement Learning
• By Horia Mania | University of California-Berkeley
• With many potential applications, recent years witnessed a staggering interest in building autonomous agents that interact with the world. Despite impressive successes in games and robotic demonstrations, known reinforcement learning algorithms remain data hungry. In the quest to find methods that require less data, the general trend has been to develop increasingly complicated methods. Is this complexity necessary? I will discuss the benefits and limitations of a simple derivative free optimization method, showcasing its performance on popular simulated locomotion tasks and pointing out some problems with the common evaluation methodology used in RL. A more thorough evaluation reveals that local optima are a major problem for our method on the simulated locomotion tasks, leading to interesting questions about reward shaping and algorithm design.

#### Theory of Neural Networks and Causal Inference

Session Chair: Anirbit Mukherjee

• The Benefits Of Jointly Training Autoencoders: An NTK Viewpoint
• By Thanh Nguyen, Raymond Wong, Chinmay Hegde | Iowa State University, Texas A&M, New York University
• A remarkable recent discovery in machine learning has been that deep neural networks can achieve impressive performance in the regime where they are massively over-parameterized. However, the majority of existing work only applies to supervised learning scenarios, and the unsupervised setting has gained far less attention. In this talk, we analyze the gradient dynamics of two-layer over-parameterized autoencoders with ReLU activation. We rigorously prove the linear convergence of gradient descent in two learning regimes: (i) the weakly-trained regime where only the encoder is trained, and (ii) the jointly-trained regime where both the encoder and the decoder are trained. We also analyze the case of weight-tied autoencoders (which is a common architectural choice in practical settings) and prove that training such networks can lead to unexpected degeneracies.
• Over-parameterization eliminates bad basins in neural networks
• By Ruoyu Sun, University of Illinois Urbana-Champaign
• Wide neural networks are often believed to have nice optimization landscape, but what rigorous results can we prove? To understand the benefit of width, it is important to identify the difference between wide and narrow networks. We prove that for any continuous activation functions, the loss surface of wide networks has no sub-optimal basin, where “basin” is defined as the set-wise strict local minimum. Furthermore, for a class of networks with width below a threshold, we construct sub-optimal strict local minima. These two results together suggest that one benefit of width is the disappearance of bad basins. We also prove that under the same conditions, sub-optimal local minima can exist for a dense set of activation functions, for arbitrarily wide networks, thus it is unlikely that over-parameterization can remove all bad local minima.
• Alpha-loss: A Tunable Class of Loss Functions for Robust Learning
• By Lalitha Sankar, Arizona State University
• We introduce alpha-loss as a parameterized class of loss functions that resulted from operationally motivating information-theoretic measures. Tuning the parameter alpha from 0 to infinity yields a class of loss functions that admit continuous interpolation between log-loss (alpha=1), exponential loss (alpha=1/2), and 0-1 loss (alpha=infinity). We discuss how different regimes of the parameter alpha enables the practitioner to tune the sensitivity of their algorithm to address both robustness and class imbalance. We study the optimization landscape of the average loss for the logistic regression model and quantify the merits and challenges of approaching 0-1 loss by tuning alpha.
• Inference In Observational Studies Using Machine Learning
• By Rajarshi Mukherjee, Harvard University
• It is widely believed that tools from machine learning and artificial intelligence are natural choices to alleviate the burden of assumptions that exist across the spectrum of causal inference in observational studies. Although true at heart, most machine learning methods are geared to attain low prediction errors in regression type problems — whereas estimation of quantities like causal effects might require a somewhat different understanding. In this talk, I will discuss one unified framework to obtain “optimal” inferential procedures for quantities of interest in common observational studies. This framework will allow us to explore properties that machine-learning algorithms need to satisfy for successful application in observational studies.
• Dynamics Of Deep Neural Networks And Neural Tangent Hierarchy
• By Jiaoyang Huang, Institute for Advanced Study
• The evolution of a deep neural network trained by the gradient descent in the overparametrization regime can be described by its neural tangent kernel (NTK) (Jacot et al.,2018; Du et al., 2018b;a; Arora et al., 2019b). It was observed (Arora et al., 2019b) that there is a performance gap between the kernel regression using the limiting NTK and the deep neural networks. We study the dynamic of neural networks of finite width and derive an infinite hierarchy of differential equations, the neural tangent hierarchy (NTH). We prove that the NTH hierarchy truncated at the level $p\geq 2$ approximates the dynamic of the NTK up to arbitrary precision under certain conditions on the neural network width and the data set dimension. The assumptions needed for these approximations become weaker as $p$ increases. Finally, NTH can be viewed as higher order extensions of NTK.

#### Stochastic Bandits

Session Chair: Nathan Kallus and Yichun Hu

• Smooth Contextual Bandits: Bridging The Parametric And Nonparametric Regret Regimes
• By Yichun Hu, Nathan Kallus, Xiaojie Mao | Cornell University
• We study a non-parametric contextual bandit problem where the mean reward functions are β-smooth. We show how this interpolates between two extremes that were previously studied in separation: non-differentiable bandits (β≤1) and linear-response bandits (β=∞). We develop a novel algorithm that carefully adjusts to any smoothness setting and achieves rate-optimal regret on the whole spectrum of β. In this sense, our results bridge the gap between the divergent regret rates in the literature and between bandit algorithms that leverage global and local reward information and rigorously formalize the intuitive relationship between regret and the hardness of learning response functions.
• Dtr Bandit: Learning To Make Response-adaptive Decisions With Low Regret
• By Yichun Hu, Nathan Kallus | Cornell University
• Dynamic treatment regimes (DTRs) are personalized, adaptive, multi-stage treatment plans that adapt treatment decisions both to an individual’s initial features and to intermediate outcomes and features at each subsequent stage, which are affected by decisions in prior stages. While existing literature mostly focuses on estimating the optimal DTR from offline data, we study the problem of developing the optimal DTR in an online manner, where the interaction with each individual affect both our cumulative reward and our data collection for future learning. We propose a novel algorithm that is guaranteed to achieve rate-optimal regret when the transition and reward models are linear. We demonstrate our algorithm and its benefits both in synthetic experiments and in a case study of adaptive treatment of major depressive disorder using real-world data.
• Bypassing The Monster: A Faster And Simpler Optimal Algorithm For Contextual Bandits Under Realizability
• By David Simchi-Levi, Yunzong Xu | Massachusetts Institute of Technology
• We consider the general (stochastic) contextual bandit problem under the realizability assumption, i.e., the expected reward, as a function of contexts and actions, belongs to a general function class F. We design a fast and simple algorithm that achieves the statistically optimal regret with only O(log T) calls to an offline least-squares regression oracle across all T rounds. The number of oracle calls can be further reduced to O(log log T) if T is known in advance. Our results provide the first universal and optimal reduction from contextual bandits to offline regression, solving an important open problem in contextual bandits. A direct consequence of our results is that any advances in offline regression immediately translate to contextual bandits, statistically and computationally.
• By Ahmadreza Momeni, Yonatan Gu, Stefan Wager | Stanford University, Stanford GSB
• We study a non-parametric multi-armed bandit problem with stochastic covariates, where a key complexity driver is the smoothness of payoff functions with respect to covariates. In this work, we consider a framework where the smoothness of payoff functions is not known, and study when and how algorithms may adapt to unknown smoothness. First, we establish that designing algorithms that adapt to unknown smoothness of payoff functions is, in general, impossible. However, under a self-similarity condition (which does not reduce the minimax complexity of the dynamic optimization problem at hand), we establish that adapting to unknown smoothness is possible, and further devise a general policy for achieving smoothness-adaptive performance.
• TS-UCB: Improving On Thompson Sampling With Little To No Additional Computation
• By Jackie W. Baek, Vivek Farias | Massachusetts Institute of Technology
• Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. Given one or more samples from the posterior, we propose an alternative arm selection rule we dub TS-UCB, that requires negligible additional computational effort. We apply TS-UCB to the K-armed bandit and the linear bandit, and we establish optimal regret bounds in both settings. More importantly, we show that TS-UCB achieves materially lower regret on all problem instances in a deep bandit suite proposed in Riquelme et al. (2018).

#### Learning from multimodal data: Promises and challenges

Session Chair: Mostafa Reisi Gahrooei and Kamran Paynabar

• Neural Granger Causality For Nonlinear Time Series
• By Ali Shojaie
• While most classical approaches to Granger causality detection assume linear dynamics, many interactions in applied domains are inherently nonlinear. In these cases, using linear models may lead to inconsistent estimation of Granger causal interactions. We propose a class of nonlinear methods by applying structured multilayer perceptrons (MLPs) or recurrent neural networks (RNNs) combined with sparsity-inducing penalties on the weights. By encouraging specific sets of weights to be zero–in particular through the use of convex group-lasso penalties–we extract the Granger causal structure. Our neural Granger causality methods naturally capture long-range dependencies between series. They also outperform state-of-the-art methods when applied to challenging benchmark data and provide plausible interactions in a human motion capture dataset.
• Presenter by Fatemeh Ghoreishi
• Design problems are pervasive in scientific and industrial endeavors: scientists design experiments to gain insights into physical and social phenomena, engineers design machines to execute tasks more efficiently, and pharmaceutical researchers design new drugs to fight disease. All these design problems are fraught with choices, choices that are often complex and high-dimensional, with constraints and uncertainties that make them difficult for individuals to reason about. Despite several advances made in design and decision making in recent years, lack of reliability and lack of scalability have prevented their applications to a wide range of practical problems. This talk will focus on large-scale and reliable design and decision-making from the machine learning and Bayesian statistical perspective.
• Discriminant Subgraph Learning From Functional Brain Sensory Data
• By Lujia Wang, Jing Li, Todd J. Schwedt, Catherine D. Chong, Teresa Wu | Arizona State University, Mayo Clinic
• We propose a novel Discriminant Subgraph Learner (DSL) to identify a sub-network within the functional connectivity network (FCN) based on functional brain sensory data. The technical novelties of DSL lie in aspects such as using a Bayesian hierarchical model to capture the natural and unique two-level variability of FCN, proposing an integrated optimization framework to simultaneously learn the FCN of each class and identify the discriminant subgraph, and developing tractable and converging algorithms to solve the optimization. The identified sub-network could be used as a novel biomarker to improve diagnostic accuracy of neurological diseases and injuries such as migraine and concussion.
• A Game-theoretic Approach To Sequential Detection In Adversarial Environments
• By Ruizhi Zhang, Shaofeng Zou | University of Nebraska-Lincoln, University at Buffalo
• The problem of sequential binary hypothesis testing in an adversarial environment is investigated. Specifically, if there is no adversary, the samples are generated independently by a distribution p; and if the adversary is present, the samples are generated independently by another distribution q. The adversary picks a distribution $q\in Q$ with cost c(q). The goal of the defender is to decide whether there is an adversary using samples as few as possible; and the goal of the adversary is to fool the defender. The problem is formulated as a non-zero-sum game between the adversary and the defender. A pair of strategies (attack strategy from the adversary and the sequential hypothesis testing scheme from the detector) is proposed and proved to be a Nash equilibrium pair for the non-zero-sum game asymptotically. Numerical experiments are provided to validate our results.

#### Learning to Make Decisions with Missing Data

Session Chair: Zhengyuan Zhou

• Optimal No-regret Learning In Repeated First-price Auctions
• By Yanjun Han, Zhengyuan Zhou, Tsachy Weissman | Stanford University, Stern School of Business
• We study online learning in repeated first-price auctions with censored feedback, where a bidder, only observing the winning bid at the end of each auction, learns to adaptively bid to maximize her cumulative payoff. We develop the first learning algorithm that achieves the near-optimal $\tilde{O}(\sqrt{T})$ regret when the bidder’s private values are stochastically generated, by providing an algorithm on a general class of problems called monotone group contextual bandits. However, we also provide an $\Omega(T^{2/3})$ lower bound for the general problem with adversarial contexts. Despite this, we further exploit the structure of first-price auctions and develop a learning algorithm achieving an $\tilde{O}(\sqrt{T})$ regret even under adversarially generated private values, hence completely characterize the optimal learning guarantees for this problem.
• Breaking The Sample Size Barrier In Model-based Reinforcement Learning With A Generative Model
• By Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, Yuxin Chen | Tsinghua University, Carnegie Mellon University, Princeton University
• We investigate the sample efficiency of reinforcement learning in a γ-discounted Markov decision process (MDP) with state space S and action space A, assuming access to a generative model. Despite a number of prior work tackling this problem, a complete picture of the trade-offs between sample complexities and accuracy is yet to be determined. In particular, all prior results impose a sample size barrier that requires the sample size to exceed at least |S||A|/(1-γ)^2, which however, does not meet the information-theoretic limit. We break this barrier by demonstrating the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of |S||A|/(1-γ). To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes.
• Random Graph Asymptotics For Treatment Effect Estimation Under Network Interference
• By Shuangning Li, Stefan Wager | Stanford University
• Most work on treatment effect estimation is developed under a “no-interference” assumption, whereby treatment assigned to one person does not affect the outcomes of others. This no-interference assumption, however, is clearly not applicable in many settings of interest: For example, if my friend gets vaccinated, it may also help me avoid getting sick. In this talk, I’ll present some early results on large-sample asymptotics for treatment effect estimation under interference, given an assumption that the exposure graph (i.e., the pairs of people whose treatments may affect each others outcomes) is drawn from a graphon. Interestingly, some settings arise where natural IPW-style estimators are not consistent, but appropriately adjusted estimators are.
• Inference And Uncertainty Quantification For Noisy Matrix Completion
• By Cong Ma, Yuxin Chen, Jianqing Fan, Yuling Yan | Princeton University
• Noisy matrix completion aims at estimating a low-rank matrix given only partial and corrupted entries. Despite substantial progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained estimates and how to perform statistical inference on the unknown matrix (e.g. constructing a valid and short confidence interval for an unseen entry). This talk takes a step towards inference and uncertainty quantification for noisy matrix completion. We develop a simple procedure to compensate for the bias of the widely used convex and nonconvex estimators. The resulting de-biased estimators admit nearly precise non-asymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals/regions for, say, the missing entries and the low-rank factors.

#### Machine Learning and Causal Inference for Effective Decision Making from Data

Session Chair: Angela Zhou

• A Semiparametric Instrumental Variable Approach To Optimal Treatment Regimes Under Endogeneity
• By Yifan Cui | Wharton School of the University of Pennsylvania
• In this paper, we propose a general instrumental variable approach to learning optimal treatment regimes under endogeneity. Specifically, we establish identification of both value function for a given regime and optimal regime with the aid of a binary instrumental variable, when no unmeasured confounding fails to hold. We also construct novel multiply robust classification-based estimators. Furthermore, we propose to identify and estimate the optimal treatment regime among those who would comply to the assigned treatment under a standard monotonicity assumption. In this latter case, we establish the somewhat surprising result that the complier optimal regime can be consistently estimated without directly collecting compliance information and therefore without the complier average treatment effect itself being identified.
• Confounding-robust Policy Evaluation In Infinite-horizon Reinforcement Learning
• By Angela Zhou, Nathan Kallus | Cornell University ORIE, Cornell University
• Off-policy evaluation of sequential decision policies from observational data is necessary in applications of batch reinforcement learning such as education and healthcare. We develop a robust approach that estimates sharp bounds on the (unidentifiable) value of a given policy in an infinite-horizon problem given data from another policy with unobserved confounding subject to a sensitivity model. We show how to express this set using a new partially identified estimating equation and prove convergence to the sharp bounds, as we collect more confounded data. We prove that membership in the set can be checked by solving a linear program, while the support function is given by a difficult nonconvex optimization problem. We demonstrate the resulting bounds empirically.
• Large Dimensional Latent Factor Modeling with Missing Observations and Applications to Causal Inference
• By Ruoxuan Xiong, Markus Pelger | Stanford University
• This paper develops the inferential theory for latent factor models estimated from large dimensional panel data with missing observations. We estimate a latent factor model by applying principal component analysis to an adjusted covariance matrix estimated from partially observed panel data. We derive the asymptotic distribution for the estimated factors, loadings and imputed values under a general approximate factor model. The key application is to estimate counterfactual outcomes in causal inference from panel data. The counterfactual outcomes are modeled as missing values that are inferred from the latent factor model. The inferential theory for the imputed values allows us to test for individual treatment effects at any time. We apply our method to investment strategies and find that 15% strategies’ returns are significantly reduced by the academic publication.
• Stateful Off-Policy Evaluation and Learning (Placeholder)
• By Angela Zhou, Cornell University ORIE

#### Machine Learning in Markets

Session Chair: Rachel Cummings

• Presenter by Kira Goldner
• Causal Feature Discovery Through Strategic Modification
• Yahav Bechavod, Katrina Ligett, Z Steven Wu, Juba Ziani | Hebrew University, University of Minnesota, University of Pennsylvania
• We consider an online regression setting in which individuals adapt to the regression model: arriving individuals may access the model throughout the process, and invest strategically in modifying their own features to improve their assigned score. We find that this strategic manipulation may help a learner recover causal variables, in settings where an agent can invest in improving impactful features that also improve his true label. We show that even simple behavior on the learner’s part (i.e., periodically updating her model based on the observed data so far, via least-square regression) allows her to simultaneously i) accurately recover which features have an impact on an agent’s true label, provided they have been invested in significantly, and ii) incentivize agents to invest in these impactful features, rather than in features that have no effect on their true label.
• How Fair Is The Mortgage Market? Adapting Fair Machine Learning For Real-world Systems
• Hadi Elzayn, Simon Freyaldenhoven, Minchul Shin | University of Pennsylvania, Federal Reserve Bank of Philadelphia
• Using publicly available data from the Home Mortgage Disclosure Act, as well as a proprietary dataset covering outcomes for a large fraction of all mortgages in the United States, we examine lending disparities in the mortgage market, and how these disparities vary by time, place, and lender type, with a particular focus on differences between traditional and so-called “fintech” institutions. To do so, we adapt a variety of metrics defined for individual machine-learned models to the system as a whole. This presents challenges due to the strategic nature of the market and the selective labels problem. We overcome these challenges by using techniques from both causal inference and structural modeling.
• Auction Design Underinterdependentvalue
• By Alon Eden | Harvard University
• We study combinatorial auctions with interdependent valuations. In such settings, every agent has a private signal, and every agent has a valuation function that depends on the private signals of all the agents. Such valuations capture settings where agents lack information to determine their own valuations such as auctions for artwork or oil drilling rights. Most works in the about interdependent values assume restrictive conditions in order to achieve full welfare; Otherwise , there are strong impossibility results. In contrast, we study welfare maximization for interdependent valuations through the lens of approximation. We study properties that enable positive results: a relaxed, parameterized version of single crossing, and a submodularity condition over the signals. For various scenarios, We obtain a host of approximation guarantees under these two notions.
• Attribute Privacy: Framework And Mechanisms
• By Wanrong Zhang | Georgia Institute of Technology
• Ensuring the privacy of training data is a growing concern given that many machine learning models are trained on confidential and potentially sensitive data. Much attention has been devoted to methods for protecting individual privacy during analyses of large datasets. However in many settings, global properties of the dataset may also be sensitive. In this work, we depart from individual privacy to initiate the study of attribute privacy, where a data owner is concerned about revealing sensitive properties of a whole dataset during analysis. We propose definitions to capture attribute privacy in two relevant cases where global attributes may need to be protected: (1) properties of a specific dataset and (2) parameters of the underlying distribution from which dataset is sampled. We also provide two efficient mechanisms that satisfy attribute privacy for these settings.

#### Machine Learning in OR

Session Chair: Xin Guo

• Generating Financial Time Series With Guaranteed Tail Behaviors
• By Renyuan Xu, University of Oxford
• Generating financial time series and scenarios across multiple instruments with realistic tail behaviors are important for backtesting trading strategies and measuring risks. Modeling the joint dynamics of financial instruments from different categories, for example, VIX, S&P500, and interest rate, by stochastic processes are difficult. An alternative way is to use data-driven methods such as generative adversarial networks (GANs). However, how to validate financial time series data generated by GANs remains open and challenging. In this talk, we propose a new framework for GAN, where the validation criterion is directly incorporated in the loss function. This guarantees the generated time series enjoy consistent expected shortfall and value at risk as to the input data. This is based on joint work with Rama Cont, Mihai Cucuringu, and Chao Zhang (Oxford).
• Perturbed Gradient Descent With Occupation Time
• By Wenpin Tang | UC Berkeley
• This work develops further the idea of perturbed gradient descent, by adapting perturbation with the history of state via the notation of occupation time for saddle points. The proposed algorithm PGDOT is shown to converge at least as fast as perturbed gradient descent (PGD) algorithm, and is guaranteed to avoid getting stuck at saddle points. The analysis is corroborated by experimental results.
• Provably Efficient Exploration in Policy Optimization
• By Cai Qi, Zhuoran Yang, Jin Chi, Zhaoran Wang | Northwestern University, Princeton University
• While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an “optimistic version” of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with unknown transition and adversarial reward with full-information feedback, OPPO achieves a sublinear regret with polynomial dependency on the problem size. To the best of our knowledge, OPPO is the first provably efficient~policy optimization algorithm that explores.
• Regularity And Stability Of Feedback Relaxed Controls
• By Christoph Reisinger, Yufei Zhang | University of Oxford
• This talk proposes a relaxed control regularization with general exploration rewards to design robust feedback controls for multi-dimensional continuous-time stochastic exit time problems. We establish that the regularized control problem admits a Holder continuous feedback control, which is Lipschitz stable with respect to parameter perturbations. Moreover, we show that a pre-computed feedback relaxed control has a robust performance in a perturbed system, and derive a first-order sensitivity equation for both the value function and optimal feedback relaxed control. We finally prove first-order monotone convergence of the value functions for relaxed control problems with vanishing exploration parameters, and construct the pure exploitation strategy of the original control problem based on the feedback relaxed controls.

#### Statistical Reinforcement Learning from Batch Data

Session Chair: Yaqi Duan and Botao Hao

• Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings
• By Sheng Zhang | North Carolina State University
• The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy’s value in infinite horizon settings. We propose to model the action-value state function associated with a policy based on series/sieve method to derive its confidence interval. When the target policy depends on the observed data as well, we propose a SequentiAl Value Evaluation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique. We apply the proposed method to a dataset from mobile health studies and find that RL algorithms could help improve patient’s health status.
• Presenter by Masatoshi Uehara
• Off-policy evaluation is notoriously difficult due to the curse of the horizon, wherein the overlap between any policy and the observed data diminishes exponentially with trajectory length. To tackle this issue, we consider for the first time the semi-parametric efficiency limits of both policy evaluation in a Markov decision process. Since these bounds describe the best-achievable error, they exactly characterize when the curse of horizon makes the problem intractable. In particular, our results show there is hope when one leverage Markovian and/or time-invariant structure. To capitalize on this, we propose a new off-policy evaluation estimator, which we show is efficient under weak conditions on the estimation of nuisances using flexible machine learning methods. We demonstrate the significant benefits of our approaches in various aspects.
• Minimax-optimal Off-policy Evaluation With Linear Function Approximation
• By Yaqi Duan, Mengdi Wang | Princeton University
• We study the statistical theory of off-policy evaluation in batch data reinforcement learning with function approximation. In particular, we consider a regression-based fitted Q iteration method, and prove that this method is information-theoretically optimal. We establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted χ<sup>2</sup>-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted χ<sup>2</sup>-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation.
• Presenter by Zhiyuan Li
• Statistically Efficient Off-Policy Policy Gradients
• By Nathan Kallus, Masatoshi Uehara | Cornell University
• Policy gradient methods in reinforcement learning update policy parameters by taking steps in the direction of an estimated gradient of policy value. We consider the statistically efficient estimation of policy gradients from off-policy data, where the estimation is particularly non-trivial. We derive the asymptotic lower bound on the feasible mean-squared error and show that existing estimators fail to achieve it in general settings. We propose a meta-algorithm that achieves the lower bound without any parametric assumptions and exhibits a unique 3-way double robustness property, and we establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.

#### Modern analysis of spatio-temporal data

Session Chair: Hao Yan and Yajun Mei

• A Multi-armed Bandit Approach For Rapid Change Detection Of High-dimensional Data With Multiple Failure Patterns
• By Hao Yan, Xinyu Zhao | Arizona State University
• We investigate the problem of online monitoring of high-dimensional streaming data in resource-constrained environments where multiple correlated failure patterns may present. It is assumed that any of the global or local failure patterns may occur in the system with distinct distributions. The objective is to decide how to choose the observed local data components locally and how to raise a correct global-level alarm to detect the failure pattern. We propose to combine multi-armed bandit approaches with sequential change-point detection to develop an efficient algorithm to balance the exploration and exploitation to search the appropriate failure pattern. Extensive numerical simulations and case studies demonstrate the statistical and computational efficiency of our proposed algorithm.
• Asymptotically Optimal Control Of FDR And Other Metrics For Sequential Multiple Testing
• By Jay Bartroff | University of Southern California
• I will discuss asymptotically optimal multiple testing procedures for sequential data in the context of prior information on the number of signals, for controlling FDR/FNR, pFDR/pFNR, and other metrics. We show that by appropriately adjusting the critical values of the Song & Fellouris (2017) procedures for controlling type 1 and 2 familywise error rates (FWEs), they can be made asymptotically optimal for controlling <i>any</i> multiple testing error metric that is bounded between multiples of FWE in a certain sense. These include the per-comparison and per-family error rates, and the false positive rate. Our asymptotic setup allows the number of null hypotheses to approach infinity. I will also mention recent results on extending these methods to control “generalized” metrics including k-FWE and gamma-FDP.
• A Spatio-temporal Solution To Online Maritime Vessel Track Association Problem
• By Imtiaz Ahmed | Texas A & M University
• We have an automatic identification system (AIS) vessel data which is timestamped information on maritime vessel movement. The AIS data consist of many records, where each record includes information about the state of a maritime vessel at a given time, including its course of direction, speed of the vessel, current position, timestamp, and a vessel identification number (VID). The problem is if the VID information is removed from each data record along with the total number of vessels present, is there a way to identify how many vessels are there along with which records correspond to which vessel? So, it is basically a track association challenge. In this work, we develop a Spatio-temporal online track association model to solve the problem.
• Quick Isolation Of A Gaussian Correlation Structure
• By Anamitra Chaudhuri | UIUC
• The problem of identifying the correlation structure in a Gaussian random vector is considered on the basis of sequentially acquired observations. The goal is to stop sampling as quickly as possible and identify upon stopping the correlated pairs, if any, to infer about the underlying correlation structure. We propose a procedure that controls the probabilities of at least one false positive and at least one false negative below user-specified tolerance levels. Moreover, the proposed procedure is shown to achieve the smallest possible average sample size to a first-order approximation as the error probability rates go to 0 at certain rates. Finally, a simulation study is presented that compares the proposed rule with an alternative testing procedure that controls the same error metrics.
• Passenger Clustering Of Metro System Based On Tensor Probabilistic Topic Model
• By Ziyue Li, Hao Yan, Chen Zhang, Fugee Tsung | The Hong Kong University of Science and Technology, Arizona State University, Tsinghua University, KUST
• Passenger Clustering is essential for the individualized service of the public metro system. Knowing the hidden but innate mobility pattern of one certain type of passenger traveling from certain origin to a certain destination at a certain time is instructive for operation. This paper structures passenger mobility behavior as a tensor, and introduces the topic model, combining the tensor probabilistic model, to cluster passenger, origin, destination, and time.

#### Nonconvex Optimization for Machine Learning

• Bridging Convex And Nonconvex Optimization In Robust PCA: Noise, Outliers, And Missing Data
• By Yuxin Chen, Jianqing Fan, Cong Ma, Yuling Yan | Informs Account, Princeton University
• This paper delivers improved theoretical guarantees for the convex programming approach in low-rank matrix estimation, in the presence of random noise, gross sparse outliers, and missing data. When the unknown matrix is well-conditioned, incoherent, and of constant rank, we demonstrate that a principled convex program achieves near-optimal statistical accuracy, in terms of both the Euclidean loss and the entrywise loss. All of this happens even when nearly a constant fraction of observations are corrupted by outliers with arbitrary magnitudes. The key analysis idea lies in bridging the convex program in use and an auxiliary nonconvex optimization algorithm, and hence the title of this paper.
• Global Convergence and Variance-reduced Optimization for a Class of Nonconvex-nonconcave Minimax Problems
• By Junchi Yang | University of Illinois at Urbana-Champaign
• Nonconvex minimax problems appear frequently in emerging machine learning applications. Simple algorithms such as the gradient descent ascent (GDA) receive lots of empirical success, however it is known that these vanilla GDA algorithms with constant step size can potentially diverge even in the convex setting. We show that for a subclass of nonconvex-nonconcave objectives satisfying a so-called two-sided Polyak- Lojasiewicz inequality, the alternating gradient descent ascent (AGDA) algorithm converges globally at a linear rate. We further develop a variance reduced algorithm that attains a provably faster rate than AGDA when the problem has the finite-sum structure.
• Multi-agent Reinforcement Learning With Communication
• By Ai Kagawa | Brookhaven National Laboratory
• The recent advance in deep single-agent reinforcement learning has shown superhuman level performance for many decision making problems such as navigating robots and playing games. For more complex systems involving cooperative multi-path decision making, multi-agent reinforcement learning (MARL) is more suitable. This talk shows that our decentralized parallel MARL implementation with communication among agents improved the performance and scalability.
• By Zebang Shen, Hamed Hassani, Alejandro Ribeiro | University of Pennsylvania
• Reducing the variance of estimators for policy gradient has long been the focus of reinforcement learning research. While classic algorithms like REINFORCE find an \epsilon-approximate first-order stationary point in O(1/epsilon^4) random trajectory simulations, no provable improvement on the complexity has been made so far. This paper presents a Hessian aided policy gradient method with the first improved sample complexity of O(1/epsilon^3). While our method exploits information from the policy Hessian, it can be implemented in linear time with respect to the parameter dimension and is hence applicable to sophisticated DNN parameterization. Simulations on standard tasks validate the efficiency of our method.

#### On prediction and optimization in data-driven decision-making systems

Session Chair: Meng Qi

• Generalization Bounds in the Predict-then-optimize Framework
• By Othman El Balghiti, Adam Elmachtoub, Paul Grigas, Ambuj Tewari | Columbia University, UC Berkeley, University of Michigan
• The predict-then-optimize framework is fundamental in many settings: predict the unknown parameters of an optimization problem, and solve the problem using the predicted values of the parameters. A loss function is to consider the cost of the decisions induced by the predicted parameters. We provide generalization bounds for the SPO loss function. We derive bounds based on the Natarajan dimension that scale at most logarithmically in the number of extreme points. By exploiting the structure of the SPO loss and a strong convexity assumption on the feasible region, we can improve the dependence on the dimension and corresponding bounds that are akin to the margin guarantees in classification problems.
• An Integrated Estimation Framework For Contextual Stochastic Optimization Problems
• By Paul Grigas, Meng Qi, Zuo-Jun Max Shen | University of California, Berkeley
• We consider a class of stochastic optimization problems where the uncertainty defining the objective function depends on a contextual feature vector. In contrast to the standard way of first estimating the uncertainty then optimizing the objective based on estimation, we propose an estimation framework that leverages the structure of the underlying optimization problem. We show that our framework is consistent and tractable under certain conditions as well as providing generalization bounds. Numerical experiments are also conducted to show the success of our approach in practice.
• Decision-focused Learning: From Data To Decisions With Differentiable Combinatorial Optimization
• By Bryan Wilder, Harvard University
• Many applications of operations research and artificial intelligence span the pipeline from data, to predictive models, to decisions. These components are typically approached separately: a machine learning model is first trained via a loss function measuring accuracy, and then its predictions input into an optimization algorithm which produces a decision. However, for difficult learning problems, all predictive models are imperfect and the choice of loss implicitly trades off different errors. Standard losses are often misaligned with the end goal: to make the best decisions possible. We introduce a framework for decision-focused learning, where a predictive model is directly trained to induce good decisions via the optimization algorithm, and demonstrate large improvements over traditional approaches.
• An Operator Splitting Algorithm For Sparse And Low-rank Optimization
• By Shuvomoy Das Gupta, Bartolomeo Stellato, Bart Paul Gerard Van Parys | MIT, Princeton University
• Sparse and low-rank optimization is core to many machine learning algorithms because it facilitates prediction and prescription based on modern high-dimensional data sets which count many irrelevant features. However, the non-convexity of sparse and low-rank constraints often renders these problems intractable in real-world settings. We propose a novel linearly-convergent operator splitting algorithm tailored for optimization problems with sparse and low-rank constraints. By exploiting the geometry of the constraint set, we construct an iterative first-order method that finds a local optimal point by solving a sequence of exact penalizations of the original problem. Benchmarks on a wide variety of synthetic and real-world examples show that our method is able to compute high-quality solutions with significant speedups over state-of-the-art tools.
• On Data-Driven Prescriptive Analytics With Side
• By Prateek Raj Srivastava, Grani Adiwena Hanasusanto, Chin Pang Ho | University of Texas at Austin, City University of Hong Kong
• We consider generic stochastic optimization problems in the presence of side information which enables a more insightful decision. The side information constitutes observable exogenous covariates that alter the conditional probability distribution of the random problem parameters. A decision maker who adapts her decisions according to the observed side information solves an optimization problem where the objective function is specified by the conditional expectation of the random cost. If the joint probability distribution is unknown then the conditional expectation can be approximated in a data-driven manner using the Nadaraya-Watson (NW) kernel regression. In this paper, we establish guarantees for this approach by leveraging techniques from moderate deviations theory.

#### On the Intersection of Machine Learning, Control and Games

Session Chair: Ruimeng Hu

• Double Deep Q-learning For Optimal Execution
• By Sebastian Jaimungal | University of Toronto
• Optimal trade execution is an important problem faced by essentially all traders. Much optimal execution research uses stringent model assumptions and applies continuous time stochastic control to solve them. Here, we instead take a model free approach and develop a variation of Deep Q-Learning to estimate the optimal actions. The model is a fully connected Neural Network trained using Experience Replay and Double DQN with input features given by the current state of the limit order book, volatility measures, and other trading signals, while the output is the Q-value function estimating the future rewards under an arbitrary action. We apply our model to nine different stocks and find that it outperforms the standard benchmark approach on most stocks using the measures of (i) mean and median out-performance, (ii) probability of out-performance, and (iii) gain-loss ratios.
• Deep Fictitious Play For Solving Multi-agent Games
• By Jiequn Han | Princeton University
• We propose a deep neural network-based algorithm to solve the Nash equilibrium of general large N-player stochastic games. Following the idea of fictitious play, we decouple the N-player game into N individual decision problems and solve them repeatedly with deep learning techniques. The resulted algorithm can solve large N-player games for which conventional numerical methods would suffer from the curse of dimensionality. The algorithm also directly handles common noise that is numerically challenging for the mean-field game approach. We present multiple numerical examples showing the performance of the proposed algorithm in large group games, as well as some convergence analysis.
• On Reinforcement Learning Methods For MFG And MFC
• By Mathieu Lauriere | Princeton University
• Mean Field Games (MFG) have been introduced to tackle games with a large number of competing players. In this setting, Nash equilibria are studied by considering the interaction of a typical player with the population distribution. The situation in which the players cooperate corresponds to Mean Field Control (MFC), which can also be viewed as optimal control driven by a McKean-Vlasov dynamics. These two types of problems have found a wide range of applications, in particular in finance and economics. Numerical methods are a key step towards real world applications. Most methods developed thus far assume a perfect knowledge of the dynamics and the cost function. However, in many situations the model is not known. In this talk, we present several methods to learn MFG and MFC solutions in a model-free fashion. Numerical results including applications to finance will be provided.
• Learning Mean-Field Controls
• By Haotian Gu, Xin Guo, Xiaoli WEI, Renyuan Xu | University of California-Berkeley
• Multi-agent reinforcement learning (MARL) suffers from the curse of dimensionality caused by the exponential growth of the joint state-action space as the number of agents increases. Mean-field controls (MFC) with infinitely many agents provide good approximations to N-agent collaborative games. In this talk, I will first present the dynamic programming principle for learning MFC, then propose a model-free kernel-based Q-learning algorithm (CDD-Q), and show that its convergence rate and sample complexity are independent of the number of agents. Our empirical studies on the network traffic congestion model demonstrate strong performances of CDD-Q compared to existing state-of-the-art algorithms. Moreover, the CDD-Q algorithm can be applied to a general class of Markov decision problems (MDPs) with deterministic dynamics and continuous state-action space.

#### Optimization algorithms for machine learning applications

Session Chair: Yi Zhou

• First-order Methods For Conditional Stochastic Optimization–biased Oracles, Acceleration, And Lower Bounds
• By Yifan Hu | University of Illinois at Urbana-Champaign
• Conditional Stochastic Optimization (CSO) covers a variety of applications ranging from meta-learning and causal inference to invariant learning. However, constructing unbiased gradient estimates in CSO is challenging due to the composition structure. As an alternative, we propose a biased stochastic gradient descent (BSGD) algorithm and study the bias-variance tradeoff under different structural assumptions. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives, under smooth and non-smooth conditions. We also provide matching lower bounds of BSGD for convex CSO objectives. Extensive numerical experiments are conducted to illustrate the performance of BSGD on robust logistic regression, model-agnostic meta-learning, and instrumental variable regression.
• Enhancing Optimality-feasibility Tradeoff For Data-driven Optimization Under Uncertain Constraints
• By Henry Lam, Huajie Qian | Columbia University
• Optimization formulations to handle data-driven decision-making under uncertain constraints, such as robust optimization, often encounter a statistical trade-off between feasibility and optimality that potentially leads to over-conservative solutions. We show how we can exploit the intrinsic low dimensionality of the solution sets possibly output from these formulations to enhance this trade-off. We demonstrate the dimension-free performance of our strategy in obtaining solutions that are, in a sense, both feasible and asymptotically optimal among the considered class of formulations. We apply our approach to data-driven optimization under uncertainty and prediction interval construction in regression analysis.
• Gradient Sliding Framework For Decentralized Optimization
• By Yuyuan Ouyang | Clemson University
• We proposed a gradient sliding algorithm that can be applied to decentralized optimization. To this end, we analyze a class of convex optimization problems with linear constraints. We prove that the linear operator computations can be skipped from time to time, while maintaining the overall iteration complexity for solving the constrained problem. By choosing the linear operator to be the graph Laplacian, we obtain an algorithm for decentralized optimization that is able to skip communication from time to time, while maintaining the overall complexity.
• A Unified Convergence Analysis For Shuffling-type Gradient Methods
• Lam M. Nguyen, Quoc Tran Dinh, Dzung Phan, Phuong Ha Nguyen, Marten van Dijk | IBM Research, Department of Statistics and Operations Research at UNC, University of Connecticut
• We provide a unified convergence analysis for a class of shuffling-type gradient methods for solving a well-known finite-sum minimization problem commonly used in machine learning. This algorithm covers various variants such as randomized reshuffling, single shuffling, and cyclic/incremental gradient schemes. We consider two different settings: strongly convex and non-convex problems. Our main contribution consists of new non-asymptotic and asymptotic convergence rates for a general class of shuffling-type gradient methods to solve both non-convex and strongly convex problems. While our rate in the non-convex problem is new, the rate on the strongly convex case matches (up to a constant) the best-known results. However, unlike existing works in this direction, we only use standard assumptions such as smoothness and strong convexity.

#### Optimization and Machine Learning Interface

Session Chair: Mert Gurbuzbalaban

• Stochastic Subgradient Methods With Averaging For Nonsmooth And Nonconvex Stochastic Optimization
• By Andrzej Ruszczynski | Rutgers University
• We prove convergence of a single time-scale stochastic subgradient method with subgradient averaging for constrained problems with a nonsmooth and nonconvex objective function having the property of generalized differentiability. We illustrate its application to training a ReLU neural network. Then we extend our approach to multi-level problems involving compositions of non-smooth and non-convex functions and prove convergence under fairly general conditions.
• A Primal-dual Algorithm With Linesearch For Nonbilinear Convex-concave Saddle Point Problems
• By Erfan Yazdandoost Hamedani, Necdet Serhat Aybat | Penn State University
• We propose a first-order primal-dual method, APDB, to solve convex-concave saddle point problems of the form L(x,y) = f(x) + Phi(x,y) – h(y) with a nonbilinear coupling Phi. Assuming the partial gradients of Phi satisfy certain continuity property, we show the iterates converge to a saddle point; and for any (x,y), we derive error bounds for L(x_k,x)-L(y,y_k) for the ergodic sequence {x_k, y_k}. APDB employs a line search to determine the primal and dual stepsizes; hence, it does not require Lipschitz constants as input. We show O(1/k) rate when the problem is merely convex in x; furthermore, assuming Phi is affine in y and f is strongly convex, we can improve the rate to O(1/k^2), where k denotes the total oracle calls. To best of our knowledge, for the strongly convex-affine setting, a line-search ensuring O(1/k^2) rate is proposed for the ﬁrst time for when Phi is not bilinear.
• Stochastic Approximation Algorithms For Multi-level Stochastic Optimization Problems
• By Saeed Ghadimi, Krishnakumar Balasubramanian | University of Waterloo, University of California
• In this talk, we present two weighted averaged stochastic approximation algorithms for solving multi-level stochastic nested problems. We show that the first algorithm achieves a sample complexity of $O(1/ε^6)$ for finding a $ε$-stationary point of the problem by taking mini-batch of samples per iterations. We then improve this complexity bound to $O(1/ε^4)$ by modifying the algorithm and making slightly stronger assumptions. This bound seems to be unimprovable even for single-level general smooth stochastic optimization.

#### Optimization in unsupervised learning

Session Chair: Kaizheng Wang

• Presenter by Purnamrita Sarkar
• SDP Relaxation For Clustering Under Gaussian Mixture Model: Hidden Integrality, Statistical Optimality And Semirandom Robustness
• By Yingjie Fei | Cornell University
• We will introduce the clustering problem under Gaussian Mixture Model. After discussing several popular algorithms, we will present an algorithm based on semidefinite programming (SDP). We show that despite being a relaxation, this algorithm achieves a nearly optimal error rate in terms of distance to the target solution, and that this result is enabled by a surprising connection with an Oracle integer program. Moreover, this algorithm is robust under the so-called semirandom model, a property many algorithms lack.
• Dual Principal Component Pursuit: A Non-convex Framework For Learning A Union Of Subspaces
• By Rene Vidal | Johns Hopkins University
• We consider the problem of learning a union of subspaces from data corrupted by outliers. Classical approaches based on convex optimization are designed for subspaces whose dimension is small relative to the ambient dimension. Here we propose a non-convex approach called Dual Principal Component Pursuit (DPCP), which is specifically designed for subspaces whose dimension is close to the ambient dimension. We provide theoretical guarantees under which every global solution to DPCP is a vector in the orthogonal complement to one of the subspaces. We also study various optimization algorithms for solving the DPCP problem and analyze their convergence to a global minimum. Experiments show that the proposed methods are able to handle more outliers and higher relative dimensions than state-of-the-art methods. Joint work with Daniel Robinson, Manolis Tsakiris and Zhihui Zhu.
• Sketching Semidefinite Programs For Faster Clustering
• By Dustin Mixon | The Ohio State University
• Many clustering problems enjoy solutions by semidefinite programming. Theoretical results in this vein frequently consider data with a planted clustering and a notion of signal strength such that the semidefinite program exactly recovers the planted clustering when the signal strength is sufficiently large. In practice, semidefinite programs are notoriously slow, and so speedups are welcome. In this talk, we show how to sketch a popular semidefinite relaxation of a graph clustering problem known as minimum bisection, and our analysis supports a meta-claim that the clustering task is less computationally burdensome when there is more signal.
• Biconvex Clustering
• By Jason Xu | Duke University
• We introduce feature weights to the convex clustering objective. The resulting problem becomes biconvex, and as such remains well-behaved statistically and algorithmically. In particular, we derive a fast algorithm with convergence guarantees and establish finite-sample bounds on prediction error. Our biconvex clustering method performs feature selection adaptively throughout the clustering task. As the learned weights change the effective feature space, pairwise affinities based can be updated across show that our approach effectively addresses the challenges of dimensionality while largely removing the strong dependence of existing approaches on carefully tuned heuristics.
• Clustering Via Uncoupled Regression
• By Kaizheng Wang, Yuling Yan, Mateo Diaz | Columbia University, Princeton University, Cornell University
• Consider a clustering problem where one receives unlabeled samples drawn from a balanced mixture of two elliptical distributions and aims for a classifier to estimate the labels. Many popular methods require individual components of the mixture to be spherical and perform poorly when they are stretched. We propose a non-convex program seeking for an affine transform to turn the data into a one-dimensional point cloud concentrating around -1 and 1, which makes clustering easy. We show that the non-convex loss function exhibits desirable landscape properties as long as the sample size is large. We leverage this to prove that an efficient first-order algorithm achieves near-optimal statistical precision without good initialization. We also propose a general methodology for multi-class clustering tasks with flexible choices of feature transforms and loss objectives.

#### Optimization methods for Machine Learning

Session Chair: Haihao Lu and Azam Asl

• Concentration Results For Heavy-tailed Random Distributions
• By Arian Maleki | Columbia University
• The concentration of measure inequalities have recently received substantial attention in high-dimensional statistics and machine learning. While concentration inequalities are well-understood for subGaussian and subexponential random variables, in many application areas, such as signal processing, machine learning and optimization we need concentration results for sums of random variables with heavier tails. The standard technique, i.e. finding upper bounds for the moment generating function (MGF), fails for heavy-tailed distributions. In this talk, we obtain concentration and large deviation for the sums of independent and identically distributed random variables with heavy-tailed distributions. We further discuss a few applications of our concentration inequalities in studying non-convex optimization problems in a few data science applications.
• Can graph neural networks count substructures?
• By Soledad Villar | New York University
• The ability to detect and count certain substructures in graphs is important for solving many tasks on graph-structured data, especially in the contexts of computational chemistry and biology as well as social network analysis. Inspired by this, we propose to study the expressive power of graph neural networks (GNNs) via their ability to count attributed graph substructures, extending recent works that examine their power in graph isomorphism testing and function approximation. Joint work with Z. Chen, L. Chen and J. Bruna.
• Early-Learning Regularization Prevents Memorization of Noisy Labels
• By Carlos Fernandez-Granda | New York University
• In this talk we will present a novel framework to perform classification via deep learning in the presence of noisy annotations. When trained on noisy labels, deep neural networks have been observed to first fit the training data with clean labels during an “early learning” phase, before eventually memorizing the examples with false labels. Our method exploits the early learning phase to achieve robustness to noisy labels. There are two key elements to our approach. First, we leverage semi-supervised learning techniques to produce target probabilities based on the model outputs. Second, we design a regularization term that steers the model towards these targets, implicitly preventing memorization of the false labels.
• Behavior of L-BFGS on Semidefinite Programming and their Nesterov Smoothings
• By Azam Asl, Michael Overton | University of Chicago, New York University
• We empirically investigating the behavior of L-BFGS on nonsmooth optimization, focusing on a difficult problem of Nesterov, as well as eigenvalue optimization problems arising in semidefinite programming applications. We find that although L-BFGS is often a reliable method for minimizing ill-conditioned smooth problems, when the condition number is so large that the function is effectively nonsmooth, L-BFGS frequently fails. This behavior is in sharp contrast to the behavior of full BFGS, which is consistently reliable for nonsmooth optimization problems. We arrive at the conclusion that, for large-scale nonsmooth optimization problems for which full BFGS and other methods for nonsmooth optimization are not practical, it is likely to be preferable to apply L-BFGS to a smooth approximation of a nonsmooth problem than to apply it directly to the nonsmooth problem.

#### Pascal Van Hentenryck

Session Chair: Pascal Van Hentenryck

• Lagrangian Duality For Constrained Deep Learning
• By Ferdinando Fioretto, Pascal Van Hentenryck, Cuong Tran | Syracuse University, Georgia Institute of Technology
• This talk discusses the potential of Lagrangian duality for deep learning tasks that feature complex constraints. Such constraints arise in many science and engineering domains, where the task amounts to learning optimization problems which must be solved repeatedly and include hard physical and operational constraints. The talk focuses on applications where the learning task must enforce constraints on the predictor itself, either because they are natural properties of the function to learn or because it is desirable from a societal standpoint to impose them. Finally, it demonstrate experimentally the benefits that Lagrangian duality brings for these applications.
• Presenter by Alexandre Velloso, Ferdinando Fioretto, Pascal Van Hentenryck
• Security-constrained optimal power flow (SCOPF) is a central problem in power systems that connects the primary response (PR) of generators with the short-term schedule. This problem is repetitively solved to determine robust schedules against component failures. Unfortunately, the consideration of the PR within the SCOPF problem results in complex instances. To address this challenge, a novel approach combining deep learning and robust optimization techniques is presented. The learning model predicts directly the SCOPF implementable solution with feasibility being enforced in two steps. First, a Lagrangian scheme penalizes violations of physical constraints, which are iteratively added as necessary to the learning model. Second, a column-and-constraint-generation algorithm restores feasibility by finding the closest feasible solution to the prediction.
• Presenter by Minas Chatzos
• The Optimal Power Flow (OPF) problem is one of the most important problems in optimization of electrical power systems. It is a non-convex optimization problem which computes the generator set points for power and voltage, given a set of load demands. We develop a Deep Learning-based approximation algorithm to the OPF. The proposed model is evaluated on a large collection of realistic large-sized real topologies. The experimental results show that a mapping between the loads and the generator set points can be computed efficiently and employed as an extremely fast, high-quality approximation to the OPF problem. Moreover, we develop a decomposition-based learning scheme that exploits the graph properties of the power network that significantly reduces the time and memory required for learning the mapping in very large-scale power systems with over 10^4 buses.
• Real-time Dispatching Of Large-scale Ride-sharing Systems: Integrating Optimization, Machine Learning, And Model Predictive Control
• By Connor Riley, Enpeng Yuan, Pascal Van Hentenryck | Georgia Institute of Technology
• This talk presents an end-to-end approach for large-scale, real-time ride-sharing that tightly integrates a state-of-the-art dispatching algorithm, a machine-learning model to predict zone-to-zone demand over time, and a model predictive control optimization to relocate idle vehicles. Experiments using historic taxi trips in New York City indicate that this integration decreases average waiting times from previous work by about 30% over all test cases and reaches close to 55% on the largest instances for high-demand zones.
• Learning Optimal Power Flow With Data Encoding
• By Terrence W K Mak | Georgia Institute of Technology
• AC Optimal Power Flow (AC-OPF) is a fundamental building block in power system optimization. It is often solved repeatedly, especially in regions with large penetration of renewable generations to avoid violating operational limits. To address these challenges, deep learning approaches to predict OPF are proposed. By the curse of dimensionality, learning models often suffer from scalability issues when applied in real world grids. This paper proposes an interesting encoding scheme to reduce the dimensionality of the learning process. A bi-level optimization problem, solved by successive Lagrangian relaxation, is proposed and various learning models are also tested to show the potential in dimension reduction. Experimental results on realistic systems show that learning speed is significantly improved and predictions are still highly accurate.

#### Semantic Search and Recommendation in E-Commerce

Session Chair: Chu Wang

• The Role Of Knowledge (system) In E-commerce Machine Learning
• By Da Xu | Walmart Labs
• Past progress in e-commerce machine learning has focused mostly on population-level performances for the individual search and recommendation tasks. In recent years, growing interests on personalized responses seek a unified schema that leverages knowledge across tasks and domains. We explore the role of knowledge system by first demonstrate effective learning, reasoning and transferring techniques. Joining the knowledge system with state-of-the-art machine learning models is then accomplished by either using the transferrable representations or building deep Bayesian networks that generate prior distributions. Knowledge system creates possibilities for machine learning objectives challenged by the problem complexity and data sparsity.
• Two Overlooked Facts In How Users Interact With Products In E-commerce Search And Recommendation
• By Ruocheng Guo | Arizona State University
• For large-scale online marketplaces with over millions of products at Etsy, users come to rely on search and recommendations to find relevant products from massive inventory. There are two overlooked facts in user-product interaction that could significantly impact ranking algorithms: (1) products are often displayed in 2D grids, and (2) a user can interact with products in a multitude of ways (i.e, click, favorite, purchase). In this work, we discuss how we incorporate them to debiase grid-based search ranking and learn interaction-specific embeddings for recommendations.
• Presenter by Parth Gupta
• Presenter by Xiangyu Zhao

#### Prediction (Forecasting) using artificial neural networks

Session Chair: Nahid Jafari

• Market Share Forecasting As A Multiple Inputs And Multiple Outputs Deep Neural Networks
• By Nahid Jafari | SUNY
• In this study, we address the problem of time series forecasting in which multiple inputs are correlated with one another. We develop deep neural network architectures to handle multiple inputs and multiple outputs forecast. We examine our models using a case study of air passenger market demand of the U.S. commercial aviation industry in 21st century. We first develop several Convolution Neural Networks and Recurrent Neural Networks models on a single input network and examine those on our extracted dataset. We select the gated recurrent unit (GRU) model which performed best overall. We also compare the performance of our fitted model with two well-known statistical methods.

#### Quantum Inspired Optimization and Machine Learning

• Classification With Hybrid Quantum Classical Deep Neural Networks
• By Mohammad Pirhooshyaran, Tamás Terlaky | Lehigh University
• We introduce new advances in hybrid quantum classical deep neural networks that can represent labeled data, and be trained by supervised learning using parameterized quantum gates using the parameter-shift rule and gate decomposition. Results demonstrate the ability of the proposed framework in fitting nonlinear behaviors and classify not-linearly separable sets of data.
• Combining Classical And Quantum Approaches For Max-cut
• By Majid Farhadi, Swati Gupta, Creston Herold, Greg Mohler, Reuben Tate | Georgia Institute of Technology, Georgia Tech Research Institute
• Although the max-cut problem is NP-Complete, there are several heuristics for Max-Cut. Using MQLib and BiqCrunch, we obtain data from a large variety of graphs and investigate how the relationship between the heuristics, exact solution, allowed runtime, and graph properties.<br />We also look at the QAOA algorithm. The QAOA parameters γ and β need to be tuned appropriately; with circuit depth p, this means doing non-convex optimization over a 2p-dimensional space, a non-trivial task. We discuss properties of the parameter landscape and the impact on this using classical warmstarts.
• Template-based Minor Embedding For Adiabatic Quantum Optimization
• By Thiago Serra, Teng Huang, Arvind Raghunathan, David Bergman | Bucknell University, University of Connecticut, Mitsubishi Electric Research Laboratories
• Quantum Annealing (QA) can be used to quickly obtain near-optimal solutions for Quadratic Unconstrained Binary Optimization (QUBO) problems. In QA hardware, each decision variable should be mapped to one or more adjacent qubits in such a way that pairs of variables defining a quadratic term in the objective function are mapped to some pair of adjacent qubits. However, qubits have limited connectivity in existing QA hardware. In this paper, we use integer linear programming to find an embedding of the problem graph into certain classes of minors of the Chimera graph, which we call template embeddings. On an extensive test set consisting of random graphs from five different classes of varying size and sparsity, we can embed more graphs than a state-of-the-art heuristic, our approach scales better with the hardware size, and the runtime is generally orders of magnitude smaller.

#### Recent advances in Gaussian Processes

Session Chair: Raed Al Kontar, Xubo Yue, and Seokhyun Chung

• Weakly-supervised Multi-output Regression Via Correlated Gaussian Processes
• By Seokhyun Chung, Raed Al Kontar, Zhenke Wu | University of Michigan
• In this paper, we consider multi-output regression under a weakly-supervised setting where a subset of data points from multiple groups are unlabeled. We use dependent Gaussian processes for multiple outputs constructed by convolutions with shared latent processes. We introduce hyperpriors for the multinomial probabilities of the unobserved labels and optimize the hyperparameters which we show improves estimation. We derive two variational bounds: (i) a modified variational bound for fast and stable convergence in model inference, (ii) a scalable variational bound that is amenable to stochastic optimization. We use experiments on synthetic and real-world data to show that the proposed model outperforms state-of-the-art models with more accurate estimation of multiple latent functions and unobserved labels.
• Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes
• By Raed Al Kontar | University of Michigan
• On Negative Transfer And Structure Of Latent Functions In Multi-output Gaussian Processes
• By Moyan Li, Raed Kontar | University of Michigan
• We tackle the main problem in MGP regarding how to construct an MGP model that can borrow strength across outputs without forcing correlation. We proposed arrowhead model and pairwise model that can scale to arbitrarily large datasets, can avoid negative transfer and allow any kernel or sparse approximations to be used within. These structures also allow regularization which can provide consistent and automatic selection of related outputs.
• The Renyi Gaussian Process: Toward Improved Generalization
• By Xubo Yue | University of Michigan
• We introduce a new bound on Gaussian process ($\mathcal{GP}$) likelihood using R\’enyi divergence. This bound is a convex combination of Nystr\”om approximation and exact $\mathcal{GP}$. It can tune enforced regularization on model and is a generalization of variational $\mathcal{GP}$. Experiments show that our model delivers improvement over several state-of-the-art methods.

#### Reinforcement Learning and Bandit Algorithms

Session Chair: Botao Hao, Yaqi Duan, and Zhaoran Wang

• Presenter by Wen Sun
• Uncertainty Estimation In Deep Reinforcement Learning
• By Ruiyi Zhang | Duke University
• Most existing deep reinforcement learning (RL) methods directly optimize parameters of a policy/value network by maximizing the expected total reward, or its surrogate. Though often achieving encouraging empirical success, these methods cannot handle the uncertainty in sequential decision-making. To tackle this issue, we place RL problems into the space of probability measures, and interpret it as Wasserstein gradient flows. On the probability-measure space, under specified circumstances, reinforcement learning becomes a convex problem in terms of distribution optimization. To make optimization feasible, we develop efficient approximate algorithms by numerically solving the corresponding discrete gradient flows. Extensive experiments on both synthetic and real large-scale data demonstrate the superior performance of the proposed methods.
• Low-rank Tensor Bandit
• By Botao Hao, Jie Zhou, Will Wei Sun, Zheng Wen | Princeton University, University of Miami, Purdue University, Deepmind
• We introduce a tensor bandit framework to solve the interactive recommendation with tensor action space and bandit feedbacks. Our tensor bandit utilizes online tensor completion to fully exploit the low-rank tensor structure and balances the trade-off between the exploration and exploitation to dynamically update the recommendation for maximal total rewards. Comparing to the competitive method which applies upper confidence bound (UCB) on the vectorized action, our procedure has an improved regret bound.
• Reinforcement Learning With Feedback Graphs
• By Christoph Dann | Google
• We study reinforcement learning in MDPs when the agent receives additional feedback per step in the form of transition observations. Such side observations are available in many tasks through extended sensors or prior knowledge (e.g., when certain actions yield similar outcome). We formalize this setting using a feedback graph and show that model-based algorithms can use side observations for more efficient learning. We present a regret bound that, ignoring logarithmic factors and lower-order terms, depends only on the size of the largest acyclic subgraph of the feedback graph, in contrast with a polynomial dependency on the number of states and actions in the absence of side observations.
• Continuous-time Linear-quadratic Episodic Reinforcement Learning With Sub-linear Regret
• By Matteo Basei, Xin Guo, Anran Hu | EDF R&D, University of California-Berkeley, UC Berkeley
• This paper studies the continuous-time linear-quadratic reinforcement learning problem in the episodic settings. We first show that directly applying discrete-time reinforcement learning algorithms to the continuous-time setting results in a linear regret. Then, we propose an algorithm with continuous-time controls based on continuous-time regularized least-squares estimation. We establish a sub-linear regret bound by combining the analysis of the finite sample parameter estimation errors with the perturbation errors. A stronger result is obtained when the parameter set is compact and when projection is applied. Finally, we discuss the one-dimensional special case and establish an improved O(√N) regret bound, where N is the number of episodes.
• Calibrating Sales Forecast Under Pandemic: Random Design Online Isotonic Regression
• By David Simchi-Levi, Michelle Wu, Ruihao Zhu | Massachusetts Institute of Technology
• We study the random-design online isotonic regression setting, where a decision-maker iteratively predicts the label of the covariate sampled sequentially from a given distribution. The goal is to attain small regret bound against the loss-minimized isotonic/monotone function in the hindsight. We first characterize the minimiax-optimal regret bound for this problem with the (non-constructive) relaxation framework derived from the sequential complexity theory of general online non-parametric regression settings. More importantly, we design a novel sampling exponential weights (SEW) policy that incorporates the “random playout” technique into the (dynamic programming accelerated) exponential weights algorithm to achieve the minimax-optimal regret bound in a computationally-efficient manner.
• Presenter by Haoyang Cao
• Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff In Regret
• By Yingjie Fei | Cornell University
• We study risk-sensitive reinforcement learning with the exponential utility. We propose two model-free algorithms, which implement a form of risk-sensitive optimism in the face of uncertainty and adapt to both risk-seeking and risk-averse modes of exploration. We prove that both algorithms achieve regret that is $\tilde{O}(\sqrt{T})$ and scales exponentially in some other model parameters. On the flip side, we establish a regret lower bound showing that the exponential dependence is unavoidable for any algorithm with an $\tilde{O}(\sqrt{T})$ regret, thus certifying the near-optimality of the proposed algorithms. Our results demonstrate that incorporating risk awareness into reinforcement learning necessitates an exponential cost, which quantifies the fundamental tradeoff between risk sensitivity and sample efficiency.