Instructor: Ashwin Pananjady
TA: Zheyi Tang
(Tentative) Schedule:
Lectures: Tu Th 3.30-4.45pm, Groseclose 119
Instructor OH: 4.45-5.30pm Tu and 12-1pm Th
Problem solving session: 12-1pm F
Description: This course for advanced undergraduates develops the (mathematical) foundations of data science, introducing students to the life-cycle of data-driven decision-making based on principles of probabilistic modeling, optimization-based model-fitting, and statistical inference. Topics include: the frequentist and Bayesian paradigms of data science, low-dimensional parameter estimation and confidence intervals, inference and uncertainty quantification in (linear and nonlinear) regression, high-dimensional models and fundamental limits of inference, high-dimensional stochastic optimization problems, and (multiple) hypothesis testing. Time permitting, we will also cover some basic causal inference, robustness, and stochastic control/reinforcement learning. Our focus will be on looking under the hood of modern data science methodology: while there will be some exposure to implementing these methods on real data, the primary emphasis will be on understanding how the methods work and developing sound judgment about when to use them.
Upon successful completion of the course, you will have learned:
(a) How to recognize and justify a reasonable data generating process for a problem at hand and design the appropriate methodology—by formulating and solving optimization problems—to make useful inferences from the data.
(b) How to think carefully and critically about the modeling assumptions under which a particular method is expected to work.
(c) How to develop a qualitative and quantitative understanding of which method is most suitable given the various constraints (data/computation/flexibility) at hand.
(d) How to evaluate, in a principled manner, the performance of complex methods (if you do a simulation/theory project), or how to apply some of these methods to a concrete real-world problem involving data (if you do a real-data project).
Prerequisites: probability and statistics (ISYE 3030 at a minimum), linear algebra and multivariable calculus (MATH 1553 and 2551 or equivalent), basic optimization (at the level of ISYE 3133), and proficiency with Python programming (CS 1301 at a minimum). If you know MATLAB or R instead, that is fine, but please note that support from the course staff will be limited in languages other than Python. The most important prerequisite is mathematical maturity and familiarity with proof-based arguments. In particular, you should have taken a proof-based course at the level of MATH 2106 or CS 3510.
We will try our best to bring you up to speed on some linear algebra, probability, and basic Python coding via auxiliary handouts, a review lecture, and an optional homework (HW0). Given the breadth of modern data science and the fact that we will try to cover topics with rigor, the class will proceed at a fast pace. Students should expect to do extra work in proportion to the amount of background they are missing.
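To calibrate the expected level of Python fluency, here is a minimal sketch (purely illustrative; the numbers and variable names are our own and not from any assignment) that simulates data from a linear model and fits it by ordinary least squares using NumPy. You should be comfortable reading and modifying code like this:

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate n observations from a linear model y = X beta + noise
    n, d = 100, 3
    X = rng.normal(size=(n, d))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    # Fit beta by ordinary least squares, i.e., minimize ||y - X beta||^2
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)  # should be close to beta_true

If a snippet like this feels unfamiliar, HW0 and the review materials mentioned above are designed to help you get there.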
Note: If you are not sure if you have the prerequisites for this class but are interested enough to want to take it, please reach out to the instructor to discuss your case.
Below is a list of candidate topics:
Part I: Prediction
- Basic statistical principles: generative modeling, Bayesian and frequentist thinking, maximum likelihood and maximum a posteriori (MAP) estimation, and the assumptions on which confident decision-making relies
- Linear regression (reviews of multivariable calculus and linear algebra)
- Nonlinear (in particular polynomial) regression
- Regularization, bias-variance trade-off, and validation on holdout data
- Predicting probabilities: classification and logistic regression
- Nonparametric methods: Nearest neighbors
- Classification with neural networks
- Fundamental understanding of optimization algorithms for prediction problems
- Quantifying model flexibility through statistical analysis
Part II: Inference and confident decision-making
- Cross-validation and resampling
- Bootstrapping
- Binary decisions, p-values, and confidence intervals
- A/B testing with parametric and nonparametric methods
- Multiple testing
- Sequential testing using multi-armed bandits
- Observational data and causal inference
Bonus topics in data science (covered only if there is time and interest)
- Robustness
- Sequential decision-making via reinforcement learning