## Announcements / Reminders

-Pre-proposal due this Thursday (Jan 18) - use your GT login to access papers, if necessary
-HW1 due next Monday (Jan 22)
-PACE ICE down Jan 23-25

- Originally the plan was to release HW2 on Jan 25th due to PACE maintenance - I will release it earlier (on Jan 22nd) to allow extra time to read the handout. The deadline will stay the same (Feb 6)
-Please search for project partners via Ed discussion thread


## Previously on CSE 6230

Memory hierarchy
Caches (and associativity) -> also, a short sidebar to answer a question from last lecture

Memory latency and bandwidth
Ideal-Cache model
Matrix multiplication
Cache-aware algorithms
TO FINISH -> Cache-oblivious algorithms

## Set associativity and resource augmentation

Given an $\alpha$-way set-associative cache of total size k , assuming sets are chosen with a fully-random hash function,

Recall: tradeoff in $\alpha$ between hit rate and search latency

- For $\alpha=\omega(\log k)$, the paging cost of an $\alpha$-way set-associative LRU cache is within additive $O(1)$ of that a fully-associative LRU cache of size $(1-o(1)) k$, with probability $1-1 / \operatorname{poly}(k)$, for all request sequences of length poly $(k)$.
- For $\alpha=o(\log k)$, and for all $c=O(1)$ and $r=O(1)$, the paging cost of an $\alpha$-way setassociative LRU cache is not within a factor $c$ of that a fully-associative LRU cache of size $k / r$, for some request sequence of length $O\left(k^{1.01}\right)$.
- For $\alpha=\omega(\log k)$, if the hash function can be occasionally changed, the paging cost of an $\alpha$-way set-associative LRU cache is within a factor $1+o(1)$ of that a fully-associative LRU cache of size $(1-o(1)) k$, with probability $1-1 / \operatorname{poly}(k)$, for request sequences of arbitrary (e.g., super-polynomial) length.


## Recall: Tiled (Blocked) matrix multiply

void Tiled_Mult (double ${ }^{*}$ C, double ${ }^{*}$ A, doubld Challenging in multilevel
for (int $64 \_$i $1=0$; i1<n/s; i1+=s) for (int64_t j1=0; j1<n; j1+=s) for (int64_t k1=0; k1<n; k1+=s)
for (int64_t i=i1;


How?
n; j++) for (int64-t $j=j$
for (int64_t k $C\left[i^{*} n+j\right]+=$
n+j];


Tune s so that the $s=\Theta\left(M^{1 / 2}\right)$.
Submatrix Caching Lemma it misses per submatrix. $Q(n)=\Theta\left((n / s)^{3}\left(s^{2} / B\right)=\Theta\left(n^{3} /\left(B M^{1 / 2}\right)\right.\right.$.

Can we achieve cache optimality with none (or at least fewer) tuning che:

## parameters?

 environments

## Divide and Conquer

## Recursive matrix multiplication

Divide-and-conquer on nxn matrices:


8 multiply-adds of $(n / 2) \times(n / 2)$ matrices.

## Recursive code

```
// Assume that n is an exact power of 2.
void Rec_Mult(double *C, double *A, double *B,
            int64_t n, int64_t rowsize) {
    if (n == 1)
        C[0] += A[0] * B[0];
    else {
        int64_t d11 = 0;
        int64_t d12 = n/2;
```

Coarsen base case to overcome functioncall overheads.

```
    int64_t d21 = (n/2) * rowsize;
    int64_t d22 = (n/2) * (rowsize+1);
    Rec_Mult(C+d11, A+d11, B+d11, n/2, rowsize);
    Rec_Mult(C+d11, A+d12, B+d21, n/2, rowsize);
    Rec_Mult(C+d12, A+d11, B+d12, n/2, rowsize);
    Rec_Mult(C+d12, A+d12, B+d22, n/2, rowsize);
    Rec_Mult(C+d21, A+d21, B+d11, n/2, rowsize);
    Rec_Mult(C+d21, A+d22, B+d21, n/2, rowsize);
    Rec_Mult(C+d22, A+d21, B+d12, n/2, rowsize);
    Rec_Mult(C+d22, A+d22, B+d22, n/2, rowsize);
} }
```


## Recursive code

```
// Assume that n is an exact power of 2.
void Rec_Mult(double *C, double *A, double *B,
            int64_t n, int64_t rowsize) {
    if (n == 1)
        C[0] += A[0] * B[0];
    else {
        int64_t d11 = 0;
        int64_t d12 = n/2;
    int64_t d21 = (n/2) * rowsize;
    int64_t d22 = (n/2) * (rowsize+1);
Rec_Mult(C+d11, A+d11, B+d11, n/2, rowsize);
Rec_Mult(C+d11, A+d12, B+d21, n/2, rowsize);
Rec_Mult(C+d12, A+d11, B+d12, n/2, rowsize);
Rec_Mult(C+d12, A+d12, B+d22, n/2, rowsize);
Rec_Mult( \(C+d 21, A+d 21, B+d 11, n / 2\), rowsize);
Rec_Mult(C+d21, A+d22, B+d21, n/2, rowsize);
Rec_Mult(C+d22, A+d21, B+d12, n/2, rowsize);
Rec_Mult(C+d22, A+d22, B+d22, n/2, rowsize);
\} \}
```



## Analysis of work

```
// Assume that n is an exact power of 2.
void Rec_Mult(double *C, double *A, double *B,
            int64_t n, int64_t rowsize) {
    if (n == 1)
    C[0] += A[0] * B[0];
    else {
    int64_t d11 = 0;
    int64_t d12 = n/2;
    int64_t d21 = (n/2) * rowsize;
    int64_t d22 = (n/2) * (rowsize+1);
int64_t d21 = (n/2) * rowsize;
Rec_Mult(C+d11, A+d11, B+d11, \(\mathrm{n} / 2\), rowsize);
Rec_Mult(C+d11, A+d12, B+d21, n/2, rowsize); Rec_Mult(C+d12, A+d11, B+d12, n/2, rowsize); Rec_Mult(C+d12, A+d12, B+d22, n/2, rowsize); Rec_Mult(C+d21, A+d21, B+d11, n/2, rowsize); Rec_Mult(C+d21, A+d22, B+d21, n/2, rowsize); Rec_Mult( \(C+d 22, A+d 21, B+d 12, n / 2\), rowsize); Rec_Mult(C+d22, A+d22, B+d22, n/2, rowsize);
} }
```



$$
\begin{aligned}
W(n)= & 8 W(n / 2)+\Theta(1) \\
= & \Theta\left(n^{3}\right)
\end{aligned}
$$

## Analysis of work

$$
W(n)=8 W(n / 2)+\Theta(1)
$$

recursion tree $\quad W(n)$

## Analysis of work

$$
W(n)=8 W(n / 2)+\Theta(1)
$$

recursion tree

$$
W(n / 2) \quad W(n / 2) \quad \cdots \quad W(n / 2)
$$

Analysis of work

$$
W(n)=8 W(n / 2)+\Theta(1)
$$



## Analysis of work



## Analysis of cache misses

```
// Assume that n is an exact power of 2.
void Rec_Mult(double *C, double *A, double *B,
    int64_t n, int64_t rowsize) {
    if (n == 1)
        C[0] += A[0] * B[0];
    else {
        int64_t d11 = 0;
        int64_t d12 = n/2;
        int64_t d21 = (n/2) * rowsize;
        int64_t d22 = (n/2) * (rowsize+1);
    Rec_Mult(C+d11, A+d11, B+d11, n/2, rowsize);
    Rec_Mult(C+d11, A+d12, B+d21, n/2, rowsize);
    Rec_Mult(C+d12, A+d11, B+d12, n/2, rowsize);
    Rec_Mult(C+d12, A+d12, B+d22, n/2, rowsize);
    Rec_Mult(C+d21, A+d21, B+d11, n/2, rowsize)
    Rec_Mult(C+d21, A+d22, B+d21, n/2, rowsize);
    Rec_Mult(C+d22, A+d21, B+d12, n/2, nowsize);
    Rec_Mult(C+d22, A+d22, B+d22, n/2, rowsize);
} }
\[
Q(n)=\left\{\begin{array}{l}
\Theta\left(n^{2} / \mathcal{B}\right) \text { if } n^{2}<c \mathcal{M} \text { for suff. small const } c \leq 1, \\
8 Q(n / 2)+\Theta(1) \text { otherwise. }
\end{array}\right.
\]
```


## Analysis of cache misses

$$
Q(n)=\left\{\begin{array}{l}
\Theta\left(n^{2} / \mathcal{B}\right) \text { if } n^{2}<c \mathcal{M} \text { for suff. small const } c \leq 1, \\
8 Q(n / 2)+\Theta(1) \text { otherwise. }
\end{array}\right.
$$

recursion tree $Q(n)$

## Analysis of cache misses

$$
Q(n)=\left\{\begin{array}{l}
\Theta\left(n^{2} / \mathcal{B}\right) \text { if } n^{2}<c \mathcal{M} \text { for suff. small const } c \leq 1, \\
8 Q(n / 2)+\Theta(1) \text { otherwise. }
\end{array}\right.
$$

recursion tree


Analysis of cache misses

$$
Q(n)=\left\{\begin{array}{l}
\Theta\left(n^{2} / \mathcal{B}\right) \text { if } n^{2}<c \mathcal{M} \text { for suff. small const } c \leq 1, \\
8 Q(n / 2)+\Theta(1) \text { otherwise. }
\end{array}\right.
$$



## Analysis of cache misses

$$
Q(n)=\left\{\begin{array}{l}
\Theta\left(n^{2} / \mathcal{B}\right) \text { if } n^{2}<c \mathcal{M} \text { for suff. small const } c \leq 1, \\
8 Q(n / 2)+\Theta(1) \text { otherwise. }
\end{array}\right.
$$



Same cache misses as with tiling! $\quad Q(n)=\Theta\left(n^{3} / \mathcal{B} \mathcal{M}^{1 / 2}\right)$

## Efficient cache-oblivious algorithms

- No tuning parameters.
- No explicit knowledge of cache sizes.
- Handle multilevel caches automatically (asymptotically optimally).
- Good in multiprogrammed environments (see: cache-adaptive algorithms).



## Next on CSE 6230

Today: How do the theoretical models and analysis we learned last class pan out in practice?
How can we further model specific machines and applications?

# Lecture 3: Practical Matrix Multiplication and the Roofline Model <br> Helen Xu hxu615@gatech.edu 

Gr
Georgia Tech College of Computing
School of Computational
Science and Engineering

# Case study: Matrix multiplication <br> (From MIT 6.172 OCW) 

## (Square) Matrix Multiplication

```
void Mult(double *C, double *A, double *B, int64_t n) {
    for (int64_t i=0; i < n; i++)
        for (int64_t j=0; j < n; j++)
            for (int64_t k=0; k < n; k++)
            C[i*n+j] += A[i*n+k] * B[k*n+j];
```

\}

$$
\begin{array}{r}
\left(\begin{array}{cccc}
c_{11} & c_{12} & \cdots & c_{1 n} \\
c_{21} & c_{22} & \cdots & c_{2 n} \\
\vdots & \vdots & \ddots & \vdots \\
c_{n 1} & c_{n 2} & \cdots & c_{n n}
\end{array}\right)= \\
\text { C }
\end{array}=\begin{array}{cccc}
\left(\begin{array}{ccccc}
a_{11} & a_{12} & \cdots & a_{1 n} \\
a_{21} & a_{22} & \cdots & a_{2 n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n 1} & a_{n 2} & \cdots & a_{n n}
\end{array}\right) \cdot\left(\begin{array}{cccc}
b_{11} & b_{12} & \cdots & b_{1 n} \\
b_{21} & b_{22} & \cdots & b_{2 n} \\
\vdots & \vdots & \ddots & \vdots \\
b_{n 1} & b_{n 2} & \cdots & b_{n n}
\end{array}\right) \\
B
\end{array}
$$

## AWS c4.8xlarge Machine Specs

| Feature | Specification |
| :--- | :--- |
| Microarchitecture | Haswell (Intel Xeon E5-2666 v3) |
| Clock frequency | 2.9 GHz |
| Processor chips | 2 |
| Processing cores | 9 per processor chip |
| Hyperthreading | 2 way |
| Floating-point unit | 8 double-precision operations, including <br> fused-multiply-add, per core per cycle |
| Cache-line size | 64 B |
| L1-icache | 32 KB private 8 -way set associative |
| L1-dcache | 32 KB private 8 -way set associative |
| L2-cache | 256 KB private 8 -way set associative |
| L3-cache (LLC) | 25 MB shared 20-way set associative |
| DRAM | 60 GB |

Peak $=\left(2.9 \times 10^{9}\right) \times 2 \times 9 \times 2 \times 8=839$ GFLOPS

## Version 1: Nested loops in Python

```
import sys, random
from time import *
n = 4096
A = [[random.random()
    for row in xrange(n)]
    for col in xrange(n)]
B = [[random.random()
    for row in xrange(n)]
    for col in xrange(n)]
C = [[0 for row in xrange(n)]
    for col in xrange(n)]
start = time()
for i in xrange(n):
    for j in xrange(n):
        for k in xrange(n):
            C[i][j] += A[i][k] * B[k][j]
end = time()
print '%0.6f' % (end - start)
```

Running time
$=21042$ seconds
$\approx 6$ hours
Is this fast?
Should we expect more from our machine?

## Version 1: Nested loops in Python

```
import sys, random
```

import sys, random
from time import *
from time import *
n = 4096
n = 4096
A = [[random.random()
A = [[random.random()
for row in xrange(n)]
for row in xrange(n)]
Back-of-the-envelope calculation
Back-of-the-envelope calculation
2n}\mp@subsup{n}{}{3}=2(\mp@subsup{2}{}{12}\mp@subsup{)}{}{3}=\mp@subsup{2}{}{37}\mathrm{ floating-point operations
Running time = 21042 seconds
\thereforePython gets 277/21042\approx6.25 MFLOPS
Peak \approx 836 GFLOPS
Python gets }\approx0.00075%\mathrm{ of peak

```
```

print '%0.6f' % (end - start)

```
```

print '%0.6f' % (end - start)

```

\section*{Version 2: Java}
```

import java.util.Random;
public class mm_java {
static int n = 4096;
static double[][] A = new double[n][n];
static double[][] B = new double[n][n];
static double[][] C = new double[n][n];
public static void main(String[] args) {
Random r = new Random();
for (int i=0; i<n; i++) {
for (int j=0; j<n; j++) {
A[i][j] = r.nextDouble();
B[i][j] = r.nextDouble();
C[i][j] = 0;
}
}
long start - Sy>cem.nanoTime();

```
            Running time \(=2,738\) seconds
                \(\approx 46\) minutes
... about \(8.8 \times\) faster than Python.

\section*{Version 3: C}


\section*{Where we stand so far}
\begin{tabular}{|c|l|r|r|r|r|r|}
\hline Version & Implementation & \begin{tabular}{r} 
Running \\
time \((s)\)
\end{tabular} & \begin{tabular}{r} 
Relative \\
speedup
\end{tabular} & \begin{tabular}{c} 
Absolute \\
Speedup
\end{tabular} & GFLOPS & \begin{tabular}{r} 
Percent \\
of peak
\end{tabular} \\
\hline 1 & Python & 21041.67 & 1.00 & 1 & 0.007 & 0.001 \\
\hline 2 & Java & 2387.32 & 8.81 & 9 & 0.058 & 0.007 \\
\hline 3 & C & 1155.77 & 2.07 & 18 & 0.119 & 0.014 \\
\hline
\end{tabular}

Why is Python so slow and C so fast?
- Python is interpreted.
- C is compiled directly to machine code.
- Java is compiled to byte-code, which is then interpreted and just-in-time (JIT) compiled to machine code.

\section*{Interpreters are versatile but slow}
- The interpreter reads, interprets, and performs each program statement and updates the machine state.
- Interpreters can easily support high-level programming features - such as dynamic code alteration - at the cost of performance.


\section*{JIT Compilation}

JIT compilers can recover some of the performance lost by interpretation.
- When code is first executed, it is interpreted.
- The runtime system keeps track of how often the various pieces of code are executed.
- Whenever some piece of code executes sufficiently frequently, it gets compiled to machine code in real time.

https://keeplearning.dev/jit-compiler-in-php-655d690f976c
- Future executions of the code use the more-efficient compiled version.

\section*{Recall: Loop order}

We can change the order of the loops in this program without affecting correctness.
```

void Mult(double *C, double *A, double *B, int64_t n) {
for (int64_t i=0; i < n; i++)
for (int64_t j=0; j < n; j++)
for (int\overline{64_t k=0; k < n; k++)}
C[i*n+j] += A[i*n+k] * B[k*n+j];
}

```
```

void Mult(double *C, double *A, double *B, int64_t n) {
for (int64_t i=0; i < n; i++)
for (int64_t k=0; k < n; k++)
for (int64_t j=0; j < n; j++)
C[i*n+j] += A[i*n+k] * B[k*n+j];

```

\section*{Performance of different loop orders}
\begin{tabular}{|lr|}
\hline Loop order \\
(outer to inner)
\end{tabular}\(r\)\begin{tabular}{rr} 
Running \\
time (s)
\end{tabular}\(|\)\begin{tabular}{lr}
1155.77 \\
\hline i, j, k & 177.68 \\
\hline i, k, j & 1080.61 \\
\hline j, i, k & 3056.63 \\
\hline j, k, i & 179.21 \\
\hline k, i, j & 3032.82 \\
\hline k, j, i & \\
\hline
\end{tabular}

\footnotetext{
Loop order affects running time by a factor of 18 !
}

\section*{Recall: Row-major order}

In this matrix-multiplication code, matrices are laid out in memory in row-major order.

\section*{Matrix}
\begin{tabular}{|c|}
\hline Row 1 \\
\hline Row 2 \\
\hline Row 3 \\
\hline Row 4 \\
\hline Row 5 \\
\hline Row 6 \\
\hline Row 7 \\
\hline Row 8 \\
\hline
\end{tabular}

Row 1 Row 2 Row 3

\section*{Access pattern for order i, j, k}
```

for (int i = 0; i < n; ++i)
for (int j = 0; j< n; ++j)
for (int k = 0; k < n; ++k)
C[i][j] += A[i][k] * B[k][j];

```

Running time:
1155.77 s


In-memory layout Excellent spatial locality


\section*{Access pattern for order \(\mathbf{i}, \mathbf{k}, \mathbf{j}\)}
```

for (int i = 0; i < n; ++i)
for (int k = 0; k < n; ++k)
for (int j = 0; j < n; ++j)
C[i][j] += A[i][k] * B[k][j];

```


In-memory layout

        \(=\)


\section*{Access pattern for order j, k, i}
```

for (int j = 0; j < n; ++j)
for (int k = 0; k< n; ++k)
for (int i = 0; i < n; ++i)
C[i][j] += A[i][k] * B[k][j];

```

Running time: 3056.63s


In-memory layout
\(=\)
 B


\section*{Cache performance with different orders}

We can measure the effect of different access patterns using the Cachegrind cache simulator:
```

\$ valgrind --tool=cachegrind ./mm

```
\begin{tabular}{lrr}
\hline \begin{tabular}{l} 
Loop order \\
(outer to inner)
\end{tabular} & \begin{tabular}{r} 
Running \\
time (s)
\end{tabular} & \begin{tabular}{r} 
Last-level-cache \\
miss rate
\end{tabular} \\
\hline i, j, k & 1155.77 & \(7.7 \%\) \\
\hline i, k, j & 177.68 & \(1.0 \%\) \\
\hline j, i, k & 1080.61 & \(8.6 \%\) \\
\hline j, k, i & 3056.63 & \(15.4 \%\) \\
\hline k, i, j & 179.21 & \(1.0 \%\) \\
\hline k, j, i & 3032.82 & \(15.4 \%\) \\
\hline
\end{tabular}

\section*{Version 4: Interchange loops}
\(\left.\)\begin{tabular}{|crrrrrr|}
\hline Version Implementation & \begin{tabular}{r} 
Running \\
time \((\mathrm{s})\)
\end{tabular} & \begin{tabular}{rl} 
Relative \\
speedup \\
spsolute
\end{tabular} & Speedup
\end{tabular} \begin{tabular}{r} 
GFLOPS
\end{tabular} \begin{tabular}{r} 
Percent \\
of peak
\end{tabular} \right\rvert\,

What other simple changes can we try?

\section*{Compiler optimization}

Clang provides a collection of optimization switches. You can specify a switch to the compiler to ask it to optimize.
\begin{tabular}{|c|r|}
\hline Opt. level & Meaning \\
\hline-00 & Do not optimize \\
\hline-01 & Optimize \\
\hline-02 & Optimize even more \\
\hline-03 & Optimize yet more \\
\hline & 54.54 \\
\hline
\end{tabular}

Clang also supports optimization levels for special purposes, such as -0s, which aims to limit code size, and -0g, for debugging purposes.

\section*{Version 5: Optimization switches}
\begin{tabular}{|lrrrrrr|}
\hline Version Implementation & \begin{tabular}{r} 
Running \\
time \((s)\)
\end{tabular} & \begin{tabular}{r} 
Relative \\
speedup
\end{tabular} & \begin{tabular}{c} 
Absolute \\
Speedup
\end{tabular} & GFLOPS & \begin{tabular}{r} 
Percent \\
of peak
\end{tabular} \\
\hline 1 & Python & 21041.67 & 1.00 & 1 & 0.006 & 0.001 \\
\hline 2 & Java & 2387.32 & 8.81 & 9 & 0.058 & 0.007 \\
\hline 3 & C & 1155.77 & 2.07 & 18 & 0.118 & 0.014 \\
\hline 4 & + interchange loops & 177.68 & 6.50 & 118 & 0.774 & 0.093 \\
\hline 5 & + optimization flags & 54.63 & 3.25 & 385 & 2.516 & 0.301 \\
\hline
\end{tabular}

With simple code and compiler technology, we can achieve \(0.3 \%\) of the peak performance of the machine.

\section*{What's causing the low performance?}

\section*{Multicore parallelism}


Intel Haswell E5: 9 cores per chip

The test machine has 2 of these chips.

We're running on just 1 of the 18 parallel processing cores on the system. What happens if we use them all?

\section*{Parallel loops}

Many parallel programming methods (e.g., OpenMP, Cilk, TBB, etc.) provide parallel for loops that allow all iterations of the loop to execute in parallel.


Which loops should we parallelize?

\section*{Experimenting with parallel loops}

Parallel i loop
```

parallel_for (int64_t i=0; i < n; i++)
for (int64_t k=0; k < n; k++)
for (int64_t j=0; j < n; j++)

```
            C[i*n+j] += A[i*n+k] * B[k*n+j];
```

```
```

            C[i*n+j] += A[i*n+k] * B[k*n+j];
    ```
```

Runtime: 3.04s

Parallel j loop

```
for (int64_t i=0; i < n; i++)
    for (int64_t k=0; k < n; k++)
        parallel_for (int64_t j=0; j < n; j++)
            C[i*n+j] += A[i*n+k] * B[k*n+j];
```

Parallel i and j loop

```
parallel_for (int64_t i=0; i < n; i++)
    for (int64_t k=0; k < n; k++)
        parallel_for (int64_t j=0; j < n; j++)
        C[i*n+j] += A[i*n+k] * B[k*n+j];
```


## Version 6: Parallel loops

| Version | Implementation | Running time (s) | Relative speedup | Absolute Speedup | GFLOPS | Percent of peak |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | Python | 21041.67 | 1.00 | 1 | 0.006 | 0.001 |
| 2 | Java | 2387.32 | 8.81 | 9 | 0.058 | 0.007 |
| 3 | C | 1155.77 | 2.07 | 18 | 0.118 | 0.014 |
| 4 | + interchange loops | 177.68 | 6.50 | 118 | 0.774 | 0.093 |
| 5 | + optimization flags | 54.63 | 3.25 | 385 | 2.516 | 0.301 |
| 6 | Parallel loops | 3.04 | 17.97 | 6,921 | 45.211 | 5.408 |

Using parallel loops gets us almost $18 \times$ speedup on 18 cores! (Disclaimer: Not all code is so easy to parallelize effectively.)

Why are we still getting just $5 \%$ of peak?

## Recall: Cache basics

Cache is fast (expensive) memory which keeps a copy of the data; it is hidden from software.

Cache-line length: number of bytes loaded together in one entry (often 64 bytes).
Simple example: data at address $\mathrm{xxxxx10}$ is stored at cache location 10.


Cache hit: access to a memory address in cache - cheap Cache miss: non-cached memory access - expensive
Need to look in next, slower level of memory.

## Data reuse: Loops

How many memory accesses must the looping code perform to fully compute 1 row of C ?

- 4096 * $1=4096$ writes to C,
- 4096 * $1=4096$ reads from A, and
- 4096 * $4096=16,777,216$ reads from B, which is
- 16,785,408 memory accesses total.



## Data reuse: Blocks

How about to compute a $64 \times 64$ block of C ?

- $64 \cdot 64=4096$ writes to C,
- $64 \cdot 4096=262,144$ reads from A, and
- $4096 \cdot 64=262,144$ reads from B, or
- 528,384 memory accesses total.



## Tiled matrix multiplication

```
void Tiled_Mult(double *C, double *A, double *B, int64_t n) {
    parallel_for (int64_t i1=0; i1<n; i1+=s)
        parallel_for (int64_t j1=0; j1<n; j1+=s)
        for (int64_t k1=0; k1<n; k1+=s)
            for (int64_t i=i1; i<i1+s && i<n; i++)
                        for (int64_t k=k1; k<k1+s && k<n; k++)
                    for (int64_t j=j1; j<j1+s && j<n; j++)
```

                        \(C\left[i^{*} n+j\right]+=A\left[i^{*} n+k\right] * B[k * n+j] ;\)
    \}


| Tile size | Running <br> time $(\mathrm{s})$ |
| ---: | ---: |
| 4 | 6.74 |
| 8 | 2.76 |
| 16 | 2.49 |
| 32 | 1.74 |
| 64 | 2.33 |
| 128 | 2.13 |

## Version 7: Tiling

| Version | Implementation | Running time (s) | Relative speedup | Absolute Speedup | GFLOPS | Percent of peak |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | Python | 21041.67 | 1.00 | 1 | 0.006 | 0.001 |
| 2 | Java | 2387.32 | 8.81 | 9 | 0.058 | 0.007 |
| 3 | C | 1155.77 | 2.07 | 18 | 0.118 | 0.014 |
| 4 | + interchange loops | 177.68 | 6.50 | 118 | 0.774 | 0.093 |
| 5 | + optimization flags | 54.63 | 3.25 | 385 | 2.516 | 0.301 |
| 6 | Parallel loops | 3.04 | 17.97 | 6,921 | 45.211 | 5.408 |
| 7 | + tiling | 1.79 | 1.70 | 11,772 | 76.782 | 9.184 |


| Implementation | Cache references <br> (millions) | Li-d cache <br> misses (millions) | Last-level cache <br> misses (millions) |
| :--- | ---: | ---: | ---: |
| Parallel loops | 104,090 | 17,220 | 8,600 |
| + tiling | 64,690 | 11,777 | 416 |

The tiled implementation performs about $62 \%$ fewer cache references and incurs 68\% fewer cache misses.

## Recall: Divide-and-conquer matrix multiplication

IDEA: Tile for every power of 2 simultaneously.

$$
\begin{aligned}
{\left[\begin{array}{ll}
C_{00} & C_{01} \\
C_{10} & C_{11}
\end{array}\right] } & =\left[\begin{array}{ll}
A_{00} & A_{01} \\
A_{10} & A_{11}
\end{array}\right] \cdot\left(\begin{array}{ll}
\mathrm{B}_{00} & B_{01} \\
B_{10} & B_{11}
\end{array}\right] \\
& =\left(\begin{array}{ll}
A_{00} B_{00} & A_{00} B_{01} \\
A_{10} B_{00} & A_{10} B_{01}
\end{array}\right]+\left[\begin{array}{ll}
A_{01} B_{10} & A_{01} B_{11} \\
A_{11} B_{10} & A_{11} B_{11}
\end{array}\right]
\end{aligned}
$$

8 multiplications of $\mathrm{n} / 2 \times \mathrm{n} / 2$ matrices.
1 addition of $\mathrm{n} \times \mathrm{n}$ matrices.

## Serial cache-oblivious matrix multiplication

```
// C += A * B
void mm_dac(double *C, int n_C, double *A, int n_A, double *B, int n_B, int n) {
    assert((n & (-n)) == n);
    if (n<= 1) {
        *C += *A * *B;
    } else {
#define X(M,r,c) (M + (r*(n_ ## M) + c)*(n/2))
        mm_dac(X(C,0,0), n_C, X(A,0,0), n_A, X(B,0,0), n_B, n/2);
        mm_dac(X(C,0,1), n_C, X(A,0,0), n_A, X(B,0,1), n_B, n/2);
        mm_dac(X(C,1,0), n_C, X(A,1,0), n_A, X(B,0,0), n_B, n/2);
        mm_dac(X(C,1,1), n_C, X(A,1,0), n_A, X(B,0,1), n_B, n/2);
        mm_dac(X(C,0,0), n_C, X(A,0,1), n_A, X(B,1,0), n_B, n/2);
        mm_dac(X(C,0,1), n_C, X(A,0,1), n_A, X(B,1,1), n_B, n/2);
        mm_dac(X(C,1,0), n_C, X(A,1,1), n_A, X(B,1,0), n_B, n/2);
        mm_dac(X(C,1,1), n_C, X(A,1,1), n_A, X(B,1,1), n_B, n/2);
    }
}
```


## Parallelizing divide-and-conquer matrix multiply



## Parallelizing divide-and-conquer matrix multiply

```
// C += A * B
void mm_dac(double *C, int n_C,
    assert((n & (-n)) == n);
    if (n<= 1) {
        *C += *A * *B;
    } else {
```



```
        parallel_spam The base case is too small.
        parallel_spawr We must coarsen to
        parallel_spawr
        parallel_spawr
        parallel_sync
        overcome function-call
                    overheads.
        parallel_spawn mm_dac(X(C,0,0), n_C, X(A,0,1), n_A, X(B,1,0), n_B, n/2);
        parallel_spawn mm_dac(X(C,0,1), n_C, X(A,0,1), n_A, X(B,1,1), n_B, n/2);
        parallel_spawn mm_dac(X(C,1,0), n_C, X(A,1,1), n_A, X(B,1,0), n_B, n/2);
        parallel_spawn mm_dac(X(C,1,1), n_C, X(A,1,1), n_A, X(B,1,1), n_B, n/2);
        parallel_sync;
    }
}
```


## Coarsening the recursion



## Coarsening the recursion



## Coarsening the recursion



# Version 8: Divide and Conquer 

| Version | Implementation | Running time (s) | Relative speedup | Absolute Speedup | GFLOPS | Percent of peak | carefully-tuned cacheaware programs are |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | Python | 21041.67 | 1.00 | 1 | 0.006 | 0.001 | faster than carefully- |
| 2 | Java | 2387.32 | 8.81 | 9 | 0.058 | 0.007 | d cache-oblivious |
| 3 | C | 1155.77 | 2.07 | 18 | 0.118 | 0.014 |  |
| 4 | + interchange loops | 177.68 | 6.50 | 118 | 0.774 | 0.093 | See: "An experimental |
| 5 | + optimization flags | 54.63 | 3.25 | 385 | 2.516 | 0.301 | comparison of cacheoblivious and cache- |
| 6 | Parallel loops | 3.04 | 17.97 | 6,921 | 45.211 | 5.408 | conscious programs" |
| 7 | + tiling | 1.79 | 1.70 | 11,772 | 76.782 | 9.184 | by Yotov et al. '07 |
| 8 | Parallel divide-and-conquer | 1.30 | 1.38 | 16,197 | 105.722 | 12.646 |  |


| Implementation | Cache references <br> (millions) | L1-d cache <br> misses (millions) | Last-level cache <br> misses (millions) |
| :--- | ---: | ---: | ---: | ---: |
| Parallel loops | 104,090 | 17,220 | 8,600 |
| + tiling | 64,690 | 11,777 | 416 |
| Parallel divide-and-conquer | 58,230 | 9,407 | 64 |

## Vector Hardware

Modern microprocessors incorporate vector hardware to process data in single-instruction stream, multiple-

We will have more indepth lectures later on SIMD in theory and in practice data stream (SIMD) fashion.


## Vectorization flags

Programmers can direct the compiler to use modern vector instructions using compiler flags such as the following:

- -mavx: Use Intel AVX vector instructions.
- -mavx2: Use Intel AVX2 vector instructions.
- -mfma: Use fused multiply-add vector instructions.
- -march=<string>: Use whatever instructions are available on the specified architecture.
- -march=native: Use whatever instructions are available on the architecture of the machine doing compilation.

Due to restrictions on floating-point arithmetic, additional flags, such as -ffast-math, might be needed for these vectorization flags to have an effect.

## Version 9: Compiler vectorization

| Version Implementation | Running <br> time $(\mathrm{s})$ | Relative <br> speedup | Absolute | Speedup | GFLOPS | Percent <br> of peak |
| ---: | :--- | ---: | ---: | ---: | ---: | ---: |
| $\mathbf{1}$ | Python | 21041.67 | 1.00 | 1 | 0.006 | 0.001 |
| 2 | Java | 2387.32 | 8.81 | 9 | 0.058 | 0.007 |
| 3 | C | 1155.77 | 2.07 | 18 | 0.118 | 0.014 |
| 4 | + interchange loops | 177.68 | 6.50 | 118 | 0.774 | 0.093 |
| 5 | + optimization flags | 54.63 | 3.25 | 385 | 2.516 | 0.301 |
| 6 | Parallel loops | 3.04 | 17.97 | 6,921 | 45.211 | 5.408 |
| 7 | + tiling | 1.79 | 1.70 | 11,772 | 76.782 | 9.184 |
| 8 | Parallel divide-and-conquer | 1.30 | 1.38 | 16,197 | 105.722 | 12.646 |
| 9 | + compiler vectorization | 0.70 | 1.87 | 30,272 | 196.341 | 23.486 |

Using the flags -march=native -ffast-math nearly doubles the program's performance!

Can we be smarter than the compiler?

## Intel vector intrinsics

Intel provides C-style functions, called intrinsic instructions, that provide direct access to hardware vector operations:
https://software.intel.com/sites/landingpage/IntrinsicsGuide
Inte ${ }^{\ominus}$ Intrinsics Guide
Updated
12/07/2023
Version
3.6.7


## Version 10: AVX Intrinsics

| Version Implementation | Running <br> time $(\mathrm{s})$ | Relative <br> speedup | Absolute <br> Speedup | GFLOPS | Percent <br> of peak |  |
| :---: | :--- | ---: | ---: | ---: | ---: | ---: |
| 1 | Python | 21041.67 | 1.00 | 1 | 0.006 | 0.001 |
| 2 | Java | 2387.32 | 8.81 | 9 | 0.058 | 0.007 |
| 3 | C | 1155.77 | 2.07 | 18 | 0.118 | 0.014 |
| 4 | + interchange loops | 177.68 | 6.50 | 118 | 0.774 | 0.093 |
| 5 | + optimization flags | 54.63 | 3.25 | 385 | 2.516 | 0.301 |
| 6 | Parallel loops | 3.04 | 17.97 | 6,921 | 45.211 | 5.408 |
| 7 | + tiling | 1.79 | 1.70 | 11,772 | 76.782 | 9.184 |
| 8 | Parallel divide-and-conquer | 1.30 | 1.38 | 16,197 | 105.722 | 12.646 |
| 9 | + compiler vectorization | 0.70 | 1.87 | 30,272 | 196.341 | 23.486 |
| 10 | + AVX intrinsics | 0.39 | 1.76 | 53,292 | 352.408 | 41.677 |

## Comparison with Intel MKL

| Version Implementation | Running <br> time $(\mathrm{s})$ | Relative <br> speedup | Absolute <br> Speedup | GFLOPS | Percent <br> of peak |  |
| ---: | :--- | ---: | ---: | ---: | ---: | ---: |
| 1 | Python | 21041.67 | 1.00 | 1 | 0.006 | 0.001 |
| 2 | Java | 2387.32 | 8.81 | 9 | 0.058 | 0.007 |
| 3 | C | 1155.77 | 2.07 | 18 | 0.118 | 0.014 |
| 4 | + interchange loops | 177.68 | 6.50 | 118 | 0.774 | 0.093 |
| 5 | + optimization flags | 54.63 | 3.25 | 385 | 2.516 | 0.301 |
| 6 | Parallel loops | 3.04 | 17.97 | 6,921 | 45.211 | 5.408 |
| 7 | + tiling | 1.79 | 1.70 | 11,772 | 76.782 | 9.184 |
| 8 | Parallel divide-and-conquer | 1.30 | 1.38 | 16,197 | 105.722 | 12.646 |
| 9 | + compiler vectorization | 0.70 | 1.87 | 30,272 | 196.341 | 23.486 |
| 10 | + AVX intrinsics | 0.39 | 1.76 | 53,292 | 352.408 | 41.677 |
| 11 | Intel MKL | 0.41 | 0.97 | 51,497 | 335.217 | 40.098 |

Version 10 is competitive with Intel's professionally engineered Math Kernel Library!

## Comparison with Intel MKL

| Version Implementation | Running <br> time $(s)$ | Relative <br> speedup | Absolute | Speedup |
| :---: | ---: | ---: | ---: | ---: | ---: | ---: | GFLOPS | Percent |
| ---: | :--- |
| of peak |$|$

You generally won't see such extreme performance improvements ( $\sim 50,000 x$ ) as we did for matrix multiplication.

But in CSE 6230, we will learn general techniques for performance improvement of a wide variety of applications.

| 9 | + compiler vectorization | 0.70 | 1.87 | 30,272 | 196.341 | 23.486 |
| :---: | :--- | :---: | :---: | :---: | :---: | :---: |
| 10 | + AVX intrinsics | 0.39 | 1.76 | 53,292 | 352.408 | 41.677 |
| 11 | Intel MKL | 0.41 | 0.97 | 51,497 | 335.217 | 40.098 |

## Version 10 is competitive with Intel's professionally engineered Math Kernel Library!

The roofline model for performance (Slides inspired by UC Berkeley CS267)

## What is a performance model?

A performance model is a mathematical description based on a simplified machine model (ignoring many details) that aims to come up with a quantitative estimate for expected performance.

## Energy use

Bandwidth
Clock speed
Memory footprint


Performance
(Often runtime)

Total work (in flops)
Fraction of peak
...and other factors

## Why use a performance model?

Performance models help us to understand and predict performance behavior without confounding factors from architectures, programming models, implementations, etc.

?


?

?


## Why use a performance model?



Help identify performance bottlenecks


Find where the bottleneck is (software? hardware?
algorithm?)


Determine when we're done optimizing

## Roofline model

Main idea: applications are limited by either compute peak or memory bandwidth.
Bandwidth bound - e.g., matrix-vector multiply
Compute bound - e.g., matrix-matrix multiply


Sam Williams

## 2847 citations!

Samuel Williams, Andrew Waterman, David Patterson. "Roofline: an insightful visual performance model for multicore architectures." Communications of the ACM 52.4 (2009): 65-76.

## Components of roofline model



Memory bandwidth (bytes/s): Not including latency (best case)


Maximum achievable performance for your application

## Roofline performance model

Example machine:


## Roofline performance model

Example machine:


## How good is flops/s as a model?



## Machine balance

Machine balance =
Peak Performance (flop/s)
Peak Bandwidth (bytes/s)

Ideally, balance $\sim 1$, but usually, it is much greater than 1 (and increasing).

Typical is 5-10 flops/byte.

https://www.hpcwire.com/2016/11/07/mccalpin-traces-hpc-system-balance-trends/

## Computational (Arithmetic) Intensity

| Depends on <br> application |
| :---: |
| Computational intensity $(\mathrm{CI})=\frac{\text { Flops performed }}{\text { Data moved (in bytes) }}$. |


| Operation | FLOPs | Data | Cl |
| :--- | :--- | :--- | :--- |
| Dot Prod | $\mathrm{O}(\mathrm{n})$ | $\mathrm{O}(\mathrm{n})$ | $\mathrm{O}(1)$ |
| Mat Vec | $\mathrm{O}\left(\mathrm{n}^{2}\right)$ | $\mathrm{O}\left(\mathrm{n}^{2}\right)$ | $\mathrm{O}(1)$ |
| MatMul | $\mathrm{O}\left(\mathrm{n}^{3}\right)$ | $\mathrm{O}\left(\mathrm{n}^{2}\right)$ | $\mathrm{O}(\mathrm{n})$ |
| (infinite |  |  |  |
| N-Body | $\mathrm{O}\left(\mathrm{n}^{2}\right)$ | $\mathrm{O}(\mathrm{n})$ | $\mathrm{O}(\mathrm{n})$ |
| FFT | $\mathrm{O}(\mathrm{n} \log \mathrm{n})$ | $\mathrm{O}(\mathrm{n})$ | $\mathrm{O}(\log \mathrm{n})$ |



## (DRAM) Roofline

Assuming idealized processors/caches and cold start (data in DRAM):


## Roofline performance model

Example machine:


## Roofline performance model

Example machine:


## Roofline performance model

Example machine:


## Roofline performance model

Example machine:


## Three categories of software optimization

## Maximizing attained in-core performance



No fused multiply-add (FMA), SIMD, instruction-level parallelism (ILP) will lower what is attainable.

Software optimizations such as explicit SIMD can increase the horizontal ceiling (what can be expected from the compiler).

Other examples include loop unrolling, reordering, etc.

## Maximizing attained memory bandwidth

Example machine:


Compilers won't give great out-of-the-box bandwidth.

Increase bandwidth ceilings:

- NUMA aware allocation and parallelization
- Software prefetching
- Maximize memory-level parallelism (MLP)


## Cache prefetching

Prefetching is a technique that fetches instructions / data from their original storage before they are needed.

Hardware prefetching - triggered by two or more consecutive cache misses in succeeding or preceding cache lines (not necessarily unit stride).

Software prefetching - the compiler can insert instructions (or the programmer can manually add


## Minimizing memory traffic



Use performance counters to measure empirical arithmetic intensity (flops/byte).

Might be much lower than the best case:

- Cache capacity / associativity
- Pad structures to avoid conflict misses
- Use cache blocking to avoid capacity misses

Effective Roofline (before and after)
Example machine:


Before optimization, traffic, and limited bandwidth performance is limited to a very narrow window.

After optimization, ideally, performance is significantly better.

