Instructor: Ashwin Pananjady
TA: Zheyi Tang
(Tentative) Schedule:
Lectures: Tu Th 3.30-4.45pm, Groseclose 119
Instructor OH: 4.45-5.30pm Tu and 12-1pm Th
Problem solving session: 12-1pm F
Description: This course for advanced undergraduates develops the (mathematical) foundations of data science, introducing students to the life-cycle of data-driven decision-making based on principles of probabilistic modeling, optimization-based model-fitting, and statistical inference. Topics include: the frequentist and Bayesian paradigms of data science, low-dimensional parameter estimation and confidence intervals, inference and uncertainty quantification in (linear and nonlinear) regression, high-dimensional models and fundamental limits of inference, high-dimensional stochastic optimization problems, and (multiple) hypothesis testing. Time permitting, we will also cover some basic causal inference, robustness, and stochastic control/reinforcement learning. Our focus will be on looking under the hood of modern data science methodology: while there will be some exposure to implementing these methods on real data, the primary emphasis will be on understanding how the methods work and developing sound judgment about when to use them.
Upon successful completion of the course, you will have learned:
(a) How to recognize and justify a reasonable data generating process for a problem at hand and design the appropriate methodology—by formulating and solving optimization problems—to make useful inferences from the data.
(b) How to think carefully and critically about the modeling assumptions under which a particular method is expected to work.
(c) How to develop a qualitative and quantitative understanding of which method is most suitable given the various constraints (data/computation/flexibility) at hand.
(d) How to evaluate, in a principled manner, the performance of complex methods (if you do a simulation/theory project), or how to apply some of these methods to a concrete real-world problem involving data (if you do a real-data project).
Prerequisites: probability and statistics (ISYE 3030 at a minimum), linear algebra and multivariable calculus (MATH 1553 and 2551 or equivalent), basic optimization (at the level of ISYE 3133), and proficiency with Python programming (CS 1301 at a minimum). If you know MATLAB or R instead, that is fine, but please note that support from the course staff will be limited in languages other than Python. The most important prerequisite is mathematical maturity and familiarity with proof-based arguments. In particular, you should have taken a proof-based course at the level of MATH 2106 or CS 3510.
We will try our best to bring you up to speed on some linear algebra, probability, and basic Python coding via auxiliary handouts, a review lecture, and an optional homework (HW0). Given the breadth of modern data science and the fact that we will try to cover topics with rigor, the class will proceed at a fast pace. Students should expect to do extra work in proportion to the amount of background they are missing.
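To calibrate the expected level of Python fluency, here is a minimal sketch (purely illustrative; the numbers and variable names are our own and not from any assignment) that simulates data from a linear model and fits it by ordinary least squares using NumPy. You should be comfortable reading and modifying code like this:

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate n observations from a linear model y = X beta + noise
    n, d = 100, 3
    X = rng.normal(size=(n, d))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    # Fit beta by ordinary least squares, i.e., minimize ||y - X beta||^2
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)  # should be close to beta_true

If a snippet like this feels unfamiliar, HW0 and the review materials mentioned above are designed to help you get there.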
Note: If you are not sure if you have the prerequisites for this class but are interested enough to want to take it, please reach out to the instructor to discuss your case.
Below is a list of candidate topics:
Part I: Prediction
- Basic statistical principles: generative modeling, Bayesian and frequentist thinking, maximum likelihood and maximum a posteriori (MAP) estimation, and the assumptions on which confident decision-making relies
- Linear regression (reviews of multivariable calculus and linear algebra)
- Nonlinear (in particular polynomial) regression
- Regularization, bias-variance trade-off, and validation on holdout data
- Predicting probabilities: classification and logistic regression
- Nonparametric methods: Nearest neighbors
- Classification with neural networks
- Fundamental understanding of optimization algorithms for prediction problems
- Quantifying model flexibility through statistical analysis
Part II: Inference and confident decision-making
- Cross-validation and resampling
- Bootstrapping
- Binary decisions, p-values, and confidence intervals
- A/B testing with parametric and nonparametric methods
- Multiple testing
- Sequential testing using multi-armed bandits
- Observational data and causal inference
Bonus topics in data science (covered only if there is time and interest)
- Robustness
- Sequential decision-making via reinforcement learning