

## **Heterogeneous Integration for Artificial Intelligence**

Saibal Mukhopadhyay, S. Yalamanchili and Madhavan Swaminathan

#### Students:

- H. M. Torun
- Y. Long
- B. Mudassar
- C. S. Nair
- B. H. DeProspo
- D. Kim


#### Faculty:

- M. Kathaperumal
- V. Smet

Ref: S. Mukhopadhyay, Y. Long, B. Mudassar, C. S. Nair, B. H. DeProspo, H. M. Torun, M. Kathaperumal, V. Smet, D. Kim, S. Yalamanchili and M. Swaminathan, "Heterogeneous Integration for Artificial Intelligence: Challenges and Opportunities", IBM Journal of R&D, 2019.

Nov. 7-8, 2019

## Outline



- Bandwidth Limitation of Neural Networks
- High Bandwidth Memory with Silicon and Glass Interposer
- Electrical Analysis and Optimization
- Architectural Alternatives to HBM for Neural Networks
- Summary

#### **PRC** Confidential

Georgia Tech

### Bandwidth Limitation of NNs



- Neural network computation is highly parallelizable.
- Overall performance is generally limited by insufficient <u>memory bandwidth</u> and <u>latency</u>.
- On-chip memory bandwidth is not enough for processing large image files.
- The burden is on off-chip bandwidth between logic and memory.


### Glass vs Silicon 2.5D Interposer: A Comparison





- The commonly used integration technology for AI applications is High Bandwidth Memory (HBM).
  - Heterogeneous integration of GPU and memory dies on an interposer.
- RDL layers interconnect a large GPU die (20 mm x 20 mm) to multiple HBMs (7 mm x 5 mm).
- We compare silicon and glass interposer technologies for this scheme.
- Glass interposer has several advantages:
  - Low-Dk polymers instead of SiO$_2$.
  - High aspect ratio lines.
  - Eliminates the need for an organic package.

Ref: Heterogeneous Integration for Artificial Intelligence: Challenges and Opportunities, IBM Journal of Research & Development, 2019




**PRC IAB Meeting** 

- A signaling channel between GPU and memory.
- Interconnect lengths are between 1 mm and 6 mm.
- Current technologies allow for a 55 $\mu m$ bump pitch.
  - RDL density: 200 IOs/mm/layer to 400 IOs/mm/layer
- Intermetal dielectric material used in simulations:
  - For Si interposer: SiO$_2$ ($\varepsilon_r = 3.9$)
  - For glass interposer: low-Dk polymer ($\varepsilon_r = 2.4$)
- Commercially used data rate per signal line for silicon interposer: 2 Gbps
  - Can we go beyond 2 Gbps per line?


- Interconnects on both silicon and glass interposers have W/S/AR = 2 $\mu m$ / 2 $\mu m$ / 1.
- For an eye height of 800 mV on a 6 mm channel, the maximum achievable data rates are:
  - Si interposer: 3.2 Gbps
  - Glass interposer: 6 Gbps
- Aggregate total bandwidth: 1.63 TB/s and 3.07 TB/s for silicon and glass interposers, respectively.
  - 4 memory stacks, each with 8 channels that contain a 128-bit data interface.
- Energy per transmitter:
  - Glass: 3.3 pJ/bit @ 6 Gbps
  - Si: 5.4 pJ/bit @ 3.2 Gbps
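The aggregate bandwidth figures follow from the interface width and the per-line data rate. The sketch below redoes that arithmetic, assuming 1 TB = 10^12 bytes and that every data line runs at the full rate; the function names are illustrative and do not come from the slides. With these assumptions the silicon case evaluates to about 1.64 TB/s, which is close to the quoted 1.63 TB/s. The second function converts energy per bit into average transmitter power, assuming continuous transmission.

```python
# Interface parameters quoted on the slide.
STACKS = 4
CHANNELS_PER_STACK = 8
BITS_PER_CHANNEL = 128


def aggregate_bandwidth_tbps(rate_gbps: float) -> float:
    """Aggregate bandwidth in TB/s (assuming 1 TB = 1e12 bytes)."""
    lines = STACKS * CHANNELS_PER_STACK * BITS_PER_CHANNEL  # 4096 data lines
    total_gbit_per_s = lines * rate_gbps
    return total_gbit_per_s / 8 / 1000  # bits -> bytes, GB -> TB


def transmitter_power_mw(energy_pj_per_bit: float, rate_gbps: float) -> float:
    """Average power of one transmitter in mW: (pJ/bit) * (Gbit/s) = mW.

    Assumes the line transmits continuously at the full data rate.
    """
    return energy_pj_per_bit * rate_gbps


if __name__ == "__main__":
    # (name, per-line rate in Gbps, energy in pJ/bit), all from the slide.
    for name, rate, energy in [("Si", 3.2, 5.4), ("Glass", 6.0, 3.3)]:
        bw = aggregate_bandwidth_tbps(rate)
        power = transmitter_power_mw(energy, rate)
        print(f"{name}: {bw:.2f} TB/s aggregate, {power:.2f} mW per transmitter")
```

Under these assumptions, the glass transmitter draws slightly more power per line than the silicon one, but it moves nearly twice as many bits per second.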


### Glass vs Silicon for HBM





- Simulation settings:
  - Statistical simulation at BER = 1E-12
  - Rise/Fall Time = 10% UI
  - TX Impedance = 50 Ω
  - RX termination at $\mu$-bumps
  - Channel length = 6 mm
  - Bump Pitch/Diameter = 55/20 $\mu m$
  - Line Width/Spacing/Thickness = 2 $\mu m$
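The rise/fall time is specified as a fraction of the unit interval (UI), so its absolute value depends on the data rate. The sketch below converts the 10% UI setting into picoseconds, assuming one bit per UI (NRZ signaling, which the slides do not state explicitly). The function names are illustrative.

```python
def unit_interval_ps(rate_gbps: float) -> float:
    """Unit interval in picoseconds, assuming one bit per UI (NRZ)."""
    return 1000.0 / rate_gbps


def rise_time_ps(rate_gbps: float, fraction_of_ui: float = 0.10) -> float:
    """Rise/fall time in picoseconds for a given fraction of the UI."""
    return fraction_of_ui * unit_interval_ps(rate_gbps)


if __name__ == "__main__":
    # Per-line data rates reported for the silicon and glass interposers.
    for rate in (3.2, 6.0):
        print(f"{rate} Gbps: UI = {unit_interval_ps(rate):.1f} ps, "
              f"rise/fall = {rise_time_ps(rate):.2f} ps")
```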

Low-Dk polymers used in the glass interposer enable a significant signal integrity improvement!


### Optimizing Aspect Ratio using Machine Learning for Glass Interposer



- Fabrication of high aspect ratio (AR) lines enables flexibility in interconnect design.
- Multiple trade-offs must be considered to determine the optimal line characteristics.
  - High AR decreases R, but increases L and C as well as mutual C.
- We use ML-based optimization to determine the optimal trade-off.
  - Under the constraint that routing density is fixed at 333 signal lines per layer.
- Optimized interconnects on glass:
  - 15 Gbps per signal line (8.19 TB/s total bandwidth) at 2.6 pJ/bit.
  - Interconnect geometry → W/S/AR: 0.5/4/4.2
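The resistance side of the trade-off can be illustrated with a first-order DC estimate. This is not the ML-based optimization used in the study, only a minimal sketch under the following assumptions: a rectangular copper conductor with bulk resistivity (1.68e-8 Ω·m), AR defined as thickness divided by width, and no skin effect or surface scattering. The capacitance and inductance penalties of a taller line are not modeled here.

```python
RHO_CU_OHM_M = 1.68e-8  # bulk copper resistivity (assumed conductor material)


def dc_resistance_ohm(width_um: float, aspect_ratio: float, length_mm: float) -> float:
    """DC resistance of a rectangular line, with AR assumed to be thickness/width."""
    width_m = width_um * 1e-6
    thickness_m = aspect_ratio * width_m
    area_m2 = width_m * thickness_m
    return RHO_CU_OHM_M * (length_mm * 1e-3) / area_m2


if __name__ == "__main__":
    # Sweep AR at a fixed 2 um width over the 6 mm channel from the slides.
    for ar in (1.0, 2.0, 4.0):
        r = dc_resistance_ohm(width_um=2.0, aspect_ratio=ar, length_mm=6.0)
        print(f"W = 2 um, AR = {ar}: R_dc = {r:.1f} ohm")
```

At fixed width, doubling the AR halves the DC resistance, which is the benefit that has to be weighed against the higher self and mutual capacitance.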




- Micron's Hybrid Memory Cube (HMC): stacks DRAM dies and a single base logic die, connected through TSVs.
  - Enables parallel access to memory for high performance.
- Neurocube integrates a logic layer within the 3D high-density memory package of HMC.
  - Heterogeneous data flow architecture for different data types/sizes.
  - Logic and memory dies can be fabricated using different process technologies.


### Neurocube: Communication Arch. and Performance





Figure panels: (c) Partitioning, (d) Merging

- Heterogeneous data flow architecture of the Neurocube enables significantly <u>improved throughput compared to GPUs</u>.

Ref: D. Kim et al., IEEE TCAD 2018




- Near-memory processing (NMP) is still bounded by logic-centric computation, where data and logic are separated.
- A more aggressive approach is processing-in-memory (PIM).
  - Direct computation inside memory.
- For very complex/deep networks, it is not feasible to map the whole network to on-chip memory.
- A multi-chip PIM architecture integrated on an interposer is considered.
  - Each chip is responsible for the computation of one layer in a deep network.


### Case Study: PIM with Glass and Silicon







- Performance of the multi-chip design depends greatly on interconnect characteristics.
- For the AlexNet architecture and varying channel lengths:
  - Si interposer:
    - 1.6 TB/s to 10.2 TB/s throughput
    - 1.4 pJ to 5.4 pJ energy per bit
  - Glass interposer:
    - 3.1 TB/s to 17.4 TB/s throughput
    - 1.1 pJ to 3.3 pJ energy per bit
- As the channel length increases, the performance gain through multi-chip design increases.


### Summary

- Discussed the potential of heterogeneous integration for next-generation energy-efficient AI/ML platforms.
- Packaging technology plays a significant role in achievable throughput.
- Glass interposer-based designs showed superior performance compared to silicon for HBM.
  - ~2X higher bandwidth at ~2X reduced energy per bit.
- Near-memory processing architectures show great potential for energy efficiency.
  - Heterogeneous integration of CMOS and non-volatile memory within a 3D stack.
- Processing-in-memory architectures can provide orders-of-magnitude gains in efficiency.
- Future research is required to translate the potential of heterogeneous integration into real systems.


