Course Objective

Achieving the full potential of modern computing platforms at all scales from cell phones and laptops up to supercomputers is imperative for meeting the ever-growing demand for computational power. Unfortunately, the computational power of modern machines comes with a great deal of complexity, such that exploiting even a fraction of a computing platform’s potential requires a deep understanding of all layers of the underlying platform as well as the application at hand. Learning and taking complete advantage of modern machines is necessary to develop high-performance codes that can process greater volumes of data orders of magnitude more efficiently, which will become increasingly important as data and problem sizes continue to grow.

This course teaches hands-on practical performance engineering on high-performance computer systems, starting with single-processor multicore platforms up to large-scale distributed systems. Topics include cache-friendly algorithm analysis and implementation, parallel programming, concurrent data structures, tools for effective performance engineering, and modern high-performance computing (HPC) applications.

Course Goals and Learning Outcomes

After completing this course, students should have the following capabilities:

  • Design high-performance code. Good design requires not only good algorithms for the problem at hand but also good understanding of the underlying computer system via performance models. This course will discuss practical performance modeling of current architectures, including multi-core, coprocessor (e.g., GPUs) and many-core designs.
  • Implement fast code for shared-memory machines, distributed-memory machines, and GPUs.
  • Become familiar with a variety of use cases for high-performance computing from different domains.
  • Present and write up their proposed work and results clearly.


  • Proficiency in C/C++ programming.
  • Algorithms and data structures.
    • Preferred courses:
      • Algorithms (one of CS 6550 / CSE 6140 / CSE 6220)
      • Data structures (e.g., CS 1332)
  • Operating systems basics (optional but preferred)

The amount of time students spend in this class will greatly vary depending on their background. Ideally, students should be comfortable with sequential programming in “mainstream” general-purpose compiled languages (e.g., Fortran, C, C++) and familiar with algorithm analysis (i.e., big-O).

Grading scale

Grades will be earned according to the standard scale:

A 90-100%

B 80-89%

C 70-79%

D 60-69%

F 0-59%

  • Passing grades for Pass/Fail students are C or better.
  • Auditing students are welcome to come to office hours to discuss exercises and projects; they may also submit code components of homework for automatic evaluation (e.g., we will put their submissions into the queue for benchmarking scripts). They will not, however, receive graded feedback.

Detailed schedule

Here is a link to a detailed schedule.

Here is a link to a calendar option.


Here is a link to the full syllabus.

Lecture materials

Lecture topicPresenterSlidesVideo (all on the course YT channel)
Lecture 1 – Course logistics & Intro to HPCHelenLecture 1 slideslink
Review session – Intro to cluster computing & CKaan and AbhishekReview session slideslink
Lecture 2 – Locality & Matrix MultiplicationHelenLecture 2 slideslink
Lecture 3 – More Matrix Multiplication & Roofline ModelHelenLecture 3 slideslink
Lecture 4 – I/O-friendly Data Structures I (Trees, skip lists)HelenLecture 4 slideslink
Lecture 5 – I/O-friendly Data Structures II (LSM trees, Filters, PMA)HelenLecture 5 slideslink
Lecture 6 – Shared-Memory ProgrammingHelenLecture 6 slideslink
Lecture 7 – Locking and NondeterminismHelenLecture 7 slideslink
Lecture 8 – Analysis of Multithreaded AlgorithmsHelenLecture 8 slideslink
Lecture 9 – Theory of Data-Parallel AlgorithmsHelenLecture 9 slideslink
Lecture 10 – Practical VectorizationBrian WheatmanVectorization noteslink
Lecture 11 – MPI IHelenLecture 11 slideslink
Lecture 12 – MPI IIHelenLecture 12 slideslink
Lecture 13 – GPU ProgrammingHelenLecture 13 slideslink
Lecture 14 – KokkosDamien Lebrun-GrandieLecture 14 slideslink
Lecture 15 – Sparse Matrix-Vector MultiplicationHelenLecture 15 slideslink
Lecture 16 – Graph OptimizationHelenLecture 16 slideslink
Lecture 17 – Accelerating GNN Dataloading on Multi-GPU Systems: dgl.graphboltFatih BalinLecture 17 slideslink
Lecture 18 – Graph RepresentationsHelenLecture 18 slideslink
Lecture 19 – GPU ProvisioningMichael PellauerLecture 19 slideslink
Lecture 20 – Computational Journeys in a Sparse UniverseAydin BulucLecture 20 slideslink
Lecture 21 – BP-treeHelenLecture 21 slideslink
Lecture 22 – Exploiting Data Sparsity in Scientific ApplicationsYuxi HongLecture 22 slideslink
Lecture 23 – Synchronization Without LocksHelenLecture 23 slideslink
Lecture 24 – FiltersHelenLecture 24 slideslink