Course Objective
Achieving the full potential of modern computing platforms at every scale, from cell phones and laptops up to supercomputers, is imperative for meeting the ever-growing demand for computational power. Unfortunately, that power comes with a great deal of complexity: exploiting even a fraction of a platform’s potential requires a deep understanding of every layer of the underlying system as well as of the application at hand. Mastering modern machines is therefore essential for developing high-performance code that processes large volumes of data orders of magnitude more efficiently, a skill whose importance will only increase as data and problem sizes continue to grow.
This course teaches hands-on, practical performance engineering on high-performance computer systems, starting with single-processor multicore platforms and scaling up to large-scale distributed systems. Topics include cache-friendly algorithm analysis and implementation, parallel programming, concurrent data structures, tools for effective performance engineering, and modern high-performance computing (HPC) applications.
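As a small taste of the material, consider the classic locality example from the matrix-multiplication lectures. The C sketch below is illustrative only, not course code: both functions compute the same product, but the ikj loop order accesses memory with unit stride and is typically several times faster on cache-based machines.

```c
/* Illustrative sketch only -- not official course material.
 * Both functions compute C = A * B for n-by-n row-major matrices.
 * matmul_ijk walks B column-wise (stride-n, cache-unfriendly);
 * matmul_ikj streams through B and C row-wise (unit stride). */
#include <stdio.h>
#include <stdlib.h>

void matmul_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j]; /* stride-n access to B */
            C[i * n + j] = sum;
        }
}

void matmul_ikj(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j]; /* unit-stride access */
        }
}

int main(void) {
    int n = 512;
    double *A = malloc(n * n * sizeof *A);
    double *B = malloc(n * n * sizeof *B);
    double *C = malloc(n * n * sizeof *C);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    matmul_ikj(n, A, B, C);      /* swap in matmul_ijk to compare timings */
    printf("C[0] = %f\n", C[0]); /* expect n * 1.0 * 2.0 = 1024.0 */
    free(A); free(B); free(C);
    return 0;
}
```

Explaining and measuring exactly this kind of gap is what the performance models and tools covered in the course are for.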
Course Goals and Learning Outcomes
After completing this course, students should be able to:
- Design high-performance code. Good design requires not only good algorithms for the problem at hand but also a good understanding of the underlying computer system, captured in performance models. This course will discuss practical performance modeling of current architectures, including multicore, coprocessor (e.g., GPU), and many-core designs.
- Implement fast code for shared-memory machines, distributed-memory machines, and GPUs.
- Become familiar with a variety of use cases for high-performance computing from different domains.
- Present and write up their proposed work and results clearly.
Prerequisites
- Proficiency in C/C++ programming.
- Algorithms and data structures. Preferred courses:
  - Algorithms (one of CS 6550 / CSE 6140 / CSE 6220)
  - Data structures (e.g., CS 1332)
- Operating systems basics (optional but preferred).
The amount of time students spend on this class will vary greatly depending on their background. Ideally, students should be comfortable with sequential programming in “mainstream” general-purpose compiled languages (e.g., Fortran, C, C++) and familiar with algorithm analysis (i.e., big-O notation).
Grading scale
Grades will be earned according to the standard scale:
| Grade | Score |
| --- | --- |
| A | 90-100% |
| B | 80-89% |
| C | 70-79% |
| D | 60-69% |
| F | 0-59% |
- A passing grade for Pass/Fail students is C or better.
- Auditing students are welcome to come to office hours to discuss exercises and projects; they may also submit code components of homework for automatic evaluation (e.g., we will put their submissions into the queue for benchmarking scripts). They will not, however, receive graded feedback.
Detailed schedule
Here is a link to a detailed schedule.
Here is a link to a calendar view of the schedule.
Syllabus
Here is a link to the full syllabus.
Lecture materials
| Lecture topic | Presenter | Slides | Video (all on the course YouTube channel) |
| --- | --- | --- | --- |
| Lecture 1 – Course logistics & Intro to HPC | Helen | Lecture 1 slides | link |
| Review session – Intro to cluster computing & C | Kaan and Abhishek | Review session slides | link |
| Lecture 2 – Locality & Matrix Multiplication | Helen | Lecture 2 slides | link |
| Lecture 3 – More Matrix Multiplication & Roofline Model | Helen | Lecture 3 slides | link |
| Lecture 4 – I/O-friendly Data Structures I (Trees, skip lists) | Helen | Lecture 4 slides | link |
| Lecture 5 – I/O-friendly Data Structures II (LSM trees, Filters, PMA) | Helen | Lecture 5 slides | link |
| Lecture 6 – Shared-Memory Programming | Helen | Lecture 6 slides | link |
| Lecture 7 – Locking and Nondeterminism | Helen | Lecture 7 slides | link |
| Lecture 8 – Analysis of Multithreaded Algorithms | Helen | Lecture 8 slides | link |
| Lecture 9 – Theory of Data-Parallel Algorithms | Helen | Lecture 9 slides | link |
| Lecture 10 – Practical Vectorization | Brian Wheatman | Vectorization notes | link |
| Lecture 11 – MPI I | Helen | Lecture 11 slides | link |
| Lecture 12 – MPI II | Helen | Lecture 12 slides | link |
| Lecture 13 – GPU Programming | Helen | Lecture 13 slides | link |
| Lecture 14 – Kokkos | Damien Lebrun-Grandie | Lecture 14 slides | link |
| Lecture 15 – Sparse Matrix-Vector Multiplication | Helen | Lecture 15 slides | link |
| Lecture 16 – Graph Optimization | Helen | Lecture 16 slides | link |
| Lecture 17 – Accelerating GNN Dataloading on Multi-GPU Systems: dgl.graphbolt | Fatih Balin | Lecture 17 slides | link |
| Lecture 18 – Graph Representations | Helen | Lecture 18 slides | link |
| Lecture 19 – GPU Provisioning | Michael Pellauer | Lecture 19 slides | link |
| Lecture 20 – Computational Journeys in a Sparse Universe | Aydin Buluc | Lecture 20 slides | link |
| Lecture 21 – BP-tree | Helen | Lecture 21 slides | link |
| Lecture 22 – Exploiting Data Sparsity in Scientific Applications | Yuxi Hong | Lecture 22 slides | link |
| Lecture 23 – Synchronization Without Locks | Helen | Lecture 23 slides | link |
| Lecture 24 – Filters | Helen | Lecture 24 slides | link |
Assignments