Course Objective
Achieving the full potential of modern computing platforms at every scale, from cell phones and laptops up to supercomputers, is imperative for meeting the ever-growing demand for computational power. Unfortunately, that power comes with a great deal of complexity: exploiting even a fraction of a platform’s potential requires a deep understanding of every layer of the underlying system as well as of the application at hand. Mastering modern machines is therefore essential for developing high-performance code that processes large volumes of data orders of magnitude more efficiently, a skill whose importance will only increase as data and problem sizes continue to grow.
This course teaches hands-on, practical performance engineering on high-performance computer systems, starting with single-processor multicore platforms and scaling up to large-scale distributed systems. Topics include cache-friendly algorithm analysis and implementation, parallel programming, concurrent data structures, tools for effective performance engineering, and modern high-performance computing (HPC) applications.
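As a small taste of the material, consider the classic locality example from the matrix-multiplication lectures. The C sketch below is illustrative only, not course code: both functions compute the same product, but the ikj loop order accesses memory with unit stride and is typically several times faster on cache-based machines.

```c
/* Illustrative sketch only -- not official course material.
 * Both functions compute C = A * B for n-by-n row-major matrices.
 * matmul_ijk walks B column-wise (stride-n, cache-unfriendly);
 * matmul_ikj streams through B and C row-wise (unit stride). */
#include <stdio.h>
#include <stdlib.h>

void matmul_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j]; /* stride-n access to B */
            C[i * n + j] = sum;
        }
}

void matmul_ikj(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j]; /* unit-stride access */
        }
}

int main(void) {
    int n = 512;
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = malloc(n * n * sizeof *C);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    matmul_ikj(n, A, B, C);      /* swap in matmul_ijk to compare timings */
    printf("C[0] = %f\n", C[0]); /* expect n * 1.0 * 2.0 = 1024.0 */
    free(A); free(B); free(C);
    return 0;
}
```

Explaining and measuring exactly this kind of gap is what the performance models and tools covered in the course are for.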
Course Goals and Learning Outcomes
After completing this course, students should be able to:
- Design high-performance code. Good design requires not only good algorithms for the problem at hand but also a good understanding of the underlying computer system, captured in performance models. This course will discuss practical performance modeling of current architectures, including multicore, coprocessor (e.g., GPU), and many-core designs.
- Implement fast code for shared-memory machines, distributed-memory machines, and GPUs.
- Become familiar with a variety of use cases for high-performance computing from different domains.
- Present and write up their proposed work and results clearly.
Prerequisites
- Proficiency in C/C++ programming.
- Algorithms and data structures. Preferred courses:
  - Algorithms (one of CS 6550 / CSE 6140 / CSE 6220)
  - Data structures (e.g., CS 1332)
- Operating systems basics (optional but preferred).
The amount of time students spend on this class will vary greatly depending on their background. Ideally, students should be comfortable with sequential programming in “mainstream” general-purpose compiled languages (e.g., Fortran, C, C++) and familiar with algorithm analysis (i.e., big-O notation).
Grading scale
Grades will be earned according to the standard scale:
| Grade | Score |
| --- | --- |
| A | 90-100% |
| B | 80-89% |
| C | 70-79% |
| D | 60-69% |
| F | 0-59% |
- A passing grade for Pass/Fail students is C or better.
- Auditing students are welcome to come to office hours to discuss exercises and projects; they may also submit code components of homework for automatic evaluation (e.g., we will put their submissions into the queue for benchmarking scripts). They will not, however, receive graded feedback.
Detailed schedule
Here is a link to a detailed schedule.
Here is a link to a calendar view of the schedule.
Syllabus
Here is a link to the full syllabus.
Lecture materials
| Lecture topic | Presenter | Slides | Video (all on the course YouTube channel) |
| --- | --- | --- | --- |
| Lecture 1 – Course logistics & Intro to HPC | Helen | Lecture 1 slides | link |
| Review session – Intro to cluster computing & C | Kaan and Abhishek | Review session slides | link |
| Lecture 2 – Locality & Matrix Multiplication | Helen | Lecture 2 slides | link |
| Lecture 3 – More Matrix Multiplication & Roofline Model | Helen | Lecture 3 slides | link |
| Lecture 4 – I/O-friendly Data Structures I (Trees, skip lists) | Helen | Lecture 4 slides | link |
| Lecture 5 – I/O-friendly Data Structures II (LSM trees, Filters, PMA) | Helen | Lecture 5 slides | link |
| Lecture 6 – Shared-Memory Programming | Helen | Lecture 6 slides | link |
| Lecture 7 – Locking and Nondeterminism | Helen | Lecture 7 slides | link |
| Lecture 8 – Analysis of Multithreaded Algorithms | Helen | Lecture 8 slides | link |
| Lecture 9 – Theory of Data-Parallel Algorithms | Helen | Lecture 9 slides | link |
| Lecture 10 – Practical Vectorization | Brian Wheatman | Vectorization notes | link |
| Lecture 11 – MPI I | Helen | Lecture 11 slides | link |
| Lecture 12 – MPI II | Helen | Lecture 12 slides | link |
| Lecture 13 – GPU Programming | Helen | Lecture 13 slides | link |
| Lecture 14 – Kokkos | Damien Lebrun-Grandie | Lecture 14 slides | link |
| Lecture 15 – Sparse Matrix-Vector Multiplication | Helen | Lecture 15 slides | link |
| Lecture 16 – Graph Optimization | Helen | Lecture 16 slides | link |
| Lecture 17 – Accelerating GNN Dataloading on Multi-GPU Systems: dgl.graphbolt | Fatih Balin | Lecture 17 slides | link |
| Lecture 18 – Graph Representations | Helen | Lecture 18 slides | link |
| Lecture 19 – GPU Provisioning | Michael Pellauer | Lecture 19 slides | link |
| Lecture 20 – Computational Journeys in a Sparse Universe | Aydin Buluc | Lecture 20 slides | link |
| Lecture 21 – BP-tree | Helen | Lecture 21 slides | link |
| Lecture 22 – Exploiting Data Sparsity in Scientific Applications | Yuxi Hong | Lecture 22 slides | link |
| Lecture 23 – Synchronization Without Locks | Helen | Lecture 23 slides | link |
| Lecture 24 – Filters | Helen | Lecture 24 slides | link |
Assignments