# The Future of Heterogeneous Systems Design

Valeria Bertacco, ADA Center Director





This work is supported by the Semiconductor Research Corporation (SRC) and DARPA



#### **Technology innovation – limited benefits from device scaling**



### **Emergence of specialized architectures**

- +Growing domain offerings
- +Great performance/energy boosts
- 1 app  $\rightarrow$  1 accelerator
- Ad-hoc interfaces



**2016** - Lee et al. image processing accelerator





**2020** - Skadron et al. Fulcrum: bit-level parallel PIM accelerator



**2021** - Austin et al. sequestered encryption accelerator



#### **Accelerators from the ADA Center**




Batten - CAPE for bit-serial arithmetic and search

### Applications Driving Architectures (ADA) Research Center



5-year endeavor: 2018-2022

21 faculty members, 130 graduate students

Co-sponsored by SRC and DARPA







GOAL: reignite computing system innovation for the 2030-2040 decade through:

- sustained scalability and
- sustained value creation



### Managing design under vast heterogeneity

1. [ENABLE MORE IDEAS TO TRANSFORM INTO NEW DESIGNS] Lower expertise required to design hardware systems
   → reignite innovation
2. [BOOST OPTIMIZATION OPPORTUNITIES] Blur hardware abstraction layers and cross-optimize
3. [IMPROVE SILICON USE EFFICIENCY] Need flexible fabric for specialized accelerator synthesis
   → lower carbon emissions associated with computing



#### Solution 1: Lower expertise needed to design hardware systems

- Domain-specific languages (DSL) boost programmer's productivity
- DSLs are approachable by a broad population of software engineers
- High-level compilers today are unable to leverage specialized accelerator hardware (calling accelerators through APIs is the current practice)

#### How does ADA approach this goal?

- Enabling compilation flows from DSLs to accelerator-rich heterogeneous architectures (3LA)



#### Innovation is powered by people

#### **University of Michigan – selected EECS course enrollments**






#### ADA today – design flows for the 2030s



### **Mapping DSLs to accelerators: 3LA**

Applications written in different DSLs (TensorFlow, PyTorch, ONNX) are provided as a **Relay IRModule** (computation graph) in TVM. Relay-to-HW program fragment mappings pair software primitives with hardware functions on heterogeneous hardware backends:

- General purpose: CPU, CUDA
- Application specific: FlexNLP, NVDLA, HLSCNN, VTA

The resulting executable/runtime interleaves ordinary CPU instructions (e.g., `CMP r0, r1`, `SUBGT r0, r0, r1`, `BNE loop`) with memory-mapped I/O (MMIO) accesses to the accelerators, e.g., `STR r2, 0xffff0000` and `LDR r3, 0xffff0010` for accelerator 1, `STR r4, 0xffff0100` and `LDR r5, 0xffff0110` for accelerator 2.

End-to-end compilation steps:

1. Take Relay as the representation
2. Define Relay & HW ILA formal models
3. Provide Relay-to-HW program fragment mappings for each accelerator/functionality
4. Verify the correctness of the program fragment pairs via ILA-based methodology
5. Pattern matching and code generation

ILA program fragments are translated into:

- SystemC modules for simulation validation
- SMT formulas for formal verification

[Malik, Tatlock, Wei]
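The pattern-matching step can be pictured with a toy lowering pass. The sketch below is only an illustration under stated assumptions: the `Op` class, the operator names, and the `lower` function are hypothetical and are not part of 3LA, TVM, or Relay. Only the MMIO addresses are taken from the slide's example fragments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One node of a toy computation graph (hypothetical, not a Relay type)."""
    name: str


# Hypothetical table: operator name -> (accelerator, MMIO program fragment).
# Addresses echo the slide's examples; the operator names are invented.
ACCEL_FRAGMENTS = {
    "lstm": ("FlexNLP", [("STR", "r2", 0xFFFF0000), ("LDR", "r3", 0xFFFF0010)]),
    "conv2d": ("NVDLA", [("STR", "r4", 0xFFFF0100), ("LDR", "r5", 0xFFFF0110)]),
}


def lower(graph):
    """Replace each supported op with its MMIO fragment; leave the rest on the CPU."""
    program = []
    for op in graph:
        if op.name in ACCEL_FRAGMENTS:
            accel, fragment = ACCEL_FRAGMENTS[op.name]
            program.append((accel, fragment))
        else:
            program.append(("CPU", [("CALL", op.name)]))
    return program


if __name__ == "__main__":
    for target, fragment in lower([Op("relu"), Op("lstm"), Op("conv2d")]):
        print(target, fragment)
```

A real flow would match multi-operator patterns rather than single names, and would only use a mapping whose correctness has been verified (step 4).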

#### **Enabling agile research: Test chip frameworks**



Allow new accelerator chips to be rapidly deployed into a platform without having to design custom systems that support them. The framework includes:

- Logical and physical socket
- I/O links and networks
- Accelerator motherboard

#### SoC Scaffold Framework



SoC scaffold library and example to simplify SoC design integration

- AXI + protocol checker
- Fully synthesizable, all-digital DDR1 PHY and memory controller
- Programmable DMA controller
- SoC examples to highlight the HLS flow (e.g., FlexASR)


#### [Brooks, Taylor, Wei]



#### Solution 2: blur hardware abstraction layers, cross-optimize

Why: it provides many additional optimization opportunities, which have traditionally been overlooked

#### How does ADA approach this?

1. Explore cross-optimization while compiling in end-to-end design flows (PriMax)
2. Design exploration tools that allow computer architects to explore device parameters (NVMExplorer)






### **PRIMAX: selective primitive mapping**



Mapping DSL primitives → accelerator functions leads to mixed performance results. PRIMAX identifies when the mapping is beneficial and applies it selectively.

**Case study:** Breadth-First Search, written in the GraphIt DSL and mapped to the OMEGA and GraphPull accelerators.

    func updateEdge(src : Vertex, dst : Vertex) -> output : bool
        output = CAS(&parent[dst], -1, src);
    end

    func toFilter(v : Vertex) -> output : bool
        output = parent[v] == -1;
    end

    func main()
        % Declare an active set and make Vert 0 the starting point
        var active : vertexset{Vertex} = new vertexset{Vertex}(0);
        active.addVertex(0);
        parent[0] = 0;
        % Loop until the active set is empty
        while (active.getVertexSetSize() != 0)
            #s1# active = edges.from(active).applyModified(updateEdge, parent);
        end
    end

    schedule:
        program->configApplyParallelization("s1", "dynamic-vertex-parallel");
        program->configApplyDirection("s1", "SparsePush");
        program->configApplyAcceleration("s1", "OMEGA");

Mapping decisions for `updateEdge`:

- `parent[dst]`: irregular access → map to SPM
- `CAS`: atomic on SPM data → map to PISC

Figure: geomean speedups for PriMax compared with Optimal: Both Targets, Optimal: OMEGA Only, and Optimal: GraphPull Only.

[Bertacco]
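For readers unfamiliar with the case study, the BFS kernel can be restated in plain Python. This is a minimal sequential sketch, not GraphIt or PRIMAX code: `compare_and_swap` is a hypothetical stand-in for the atomic `CAS` primitive, and the adjacency-list input format is an assumption made for illustration.

```python
def compare_and_swap(array, index, expected, new):
    """Sequential stand-in for an atomic CAS: write `new` only if the slot holds `expected`."""
    if array[index] == expected:
        array[index] = new
        return True
    return False


def bfs_parents(adjacency, source):
    """Push-style BFS that records each vertex's parent, mirroring updateEdge."""
    parent = [-1] * len(adjacency)
    parent[source] = source
    active = [source]
    while active:  # loop until the active set is empty
        next_active = []
        for src in active:
            for dst in adjacency[src]:
                # updateEdge: claim dst for src if dst is still unvisited.
                if compare_and_swap(parent, dst, -1, src):
                    next_active.append(dst)
        active = next_active
    return parent


if __name__ == "__main__":
    graph = [[1, 2], [3], [3], []]
    print(bfs_parents(graph, 0))  # [0, 0, 0, 1]
```

The write to `parent[dst]` is the irregular access, and the conditional update is the atomic operation; these are the two pieces the slide maps to specialized hardware.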



### **Design exploration tools: NVMExplorer**



# Solution 3: Flexible fabrics for accelerator synthesis

- It is impractical to produce chips with hundreds of accelerator types
- Many computing systems must be capable of running a wide range of applications



Must fit many different accelerators in a small silicon footprint



#### **ADA: designing for reconfigurable hardware**

- [Kasikci] SignalCat & LossCheck: debugging support for FPGA designs. Monitor signals over time, identify data losses in datapaths.
- [Tatlock] Lakeroad: ISA synthesis for FPGAs. Make FPGA synthesis similar to software compilation, to improve compiler predictability.
- [Taylor] BaseJUMP flow for FPGAs





#### In summary:

1. [ENABLE MORE IDEAS TO TRANSFORM INTO NEW DESIGNS] Lower expertise required to design hardware systems
   → reignite innovation
2. [BOOST OPTIMIZATION OPPORTUNITIES] Blur hardware abstraction layers and cross-optimize
3. [BETTER SILICON EFFICIENCY] Need flexible fabric for specialized accelerator synthesis
   → lower carbon emissions associated with computing



#### Thank you







# Intelligent Memory and Storage

Kevin Skadron

Director, CRISP Center

Dept. of Computer Science

University of Virginia



## Why Intelligent Memory and Storage?

- "Memory wall" has been discussed for nearly 30 years
  - But caches, interfaces etc. can no longer hide this wall
  - Big data, irregular access patterns, poor reuse
  - High energy costs to move large volumes of data
  - More algorithms that are data-intensive (i.e., low ops/byte)
  - More and more tasks are stalled on memory/storage access
    - Tail latencies also getting worse
- Memory and storage have much higher internal bandwidth than they can transmit
- The closer computation is to the data, the lower the power



## **Design Questions**

- Where to put the intelligence? Huge design space!
  - In the bitcells? At the chip interface? In the controller? Etc.
  - As we move further away from the bitcells, we lose bandwidth but also reduce design and area overhead
  - CRISP identified several candidate designs at different performance/complexity design points
- How to orchestrate placement of data and compute?
  - "Near data computing" is hard in heterogeneous/distributed systems if inputs are in different places
  - Important to look at workflows, not just kernels
- For memory, do we want
  - Memory that can accelerate some computations?
  - Accelerators that happen to use memory technology?
- For storage, do we still need overheads of a block-based interface?
- What does it take for emerging device technologies to find a market?
- Making the programmer's life easy is essential, or nobody will use it
  - High-level, portable abstractions








Computing on Network Infrastructure for Pervasive Perception, Cognition, and Action

### **CONIX** Perspective on Advances and **Challenges in Semiconductor Design**

Anthony Rowe, Carnegie Mellon University















- Carnegie Mellon University
- George Washington University
- University of California, Berkeley
- University of California, Los Angeles
- University of California, San Diego
- University of Southern California
- University of Washington

### **CONIX: A Distributed Compute Paradigm Shift**






**Next Generation Systems** 

conix.io

### Simply bringing cloud(-native) to the edge won't cut it...











#### Hardware design cycles are still too slow...



### Thanks!



- <sup>1</sup> Carnegie Mellon University (headquarters)
- <sup>2</sup> University of California, Berkeley
- <sup>3</sup> University of California, Los Angeles
- <sup>4</sup> University of California, San Diego
- <sup>5</sup> University of Southern California
- <sup>6</sup> University of Washington










#### Collaboration towards Decadal Plan Goals: Advances and Challenges in Semiconductor Design Panel

Ada Gavrilovska

School of Computer Science, Georgia Tech

Applications Driving Architectures Center (ADA)





This work is supported by the Semiconductor Research Corporation (SRC) and DARPA



### Growth in data movement demand

- Increase in traffic volume, number of devices, wireless




New bandwidth-intensive and latency-sensitive workloads



- Smart city, automation
- High-definition video
- AR/VR




## What does this mean?

#### Past and recent datapoints:

- 70 TWh to run the Internet (LBNL, 06/2016)
- 50 TWh to run China's mobile network (Huawei, 07/2020)
- Updated traffic predictions: no slowdown!
- EB/month cost?
  - Wide range based on factors: technology, distance, system scope, ... \*
  - 1.8 TWh/EB
  - => 1.2 million tons of CO2 per EB (EPA calculator)



Figure 8: Mobile data and FWA traffic

Ericsson Mobility Report (11/2020)



### What does this mean?

EPA calculator output: the sum of the greenhouse gas emissions entered is 1,275,628 metric tons of carbon dioxide equivalent. This is equivalent to:

Greenhouse gas emissions from:

- 277,424 passenger vehicles driven for one year
- 3,205,906,725 miles driven by an average passenger vehicle

CO<sub>2</sub> emissions from:

- 143,538,704 gallons of gasoline consumed
- 125,307,314 gallons of diesel consumed
- 1,409,931,078 pounds of coal burned
- 16,887 tanker trucks' worth of gasoline
- 153,615 homes' energy use for one year
- 231,709 homes' electricity use for one year
- 7,036 railcars' worth of coal burned
- 2,953,350 barrels of oil consumed
- 52,147,322 propane cylinders used for home barbeques
- 0.322 coal-fired power plants in one year
- 155,170,821,55 smartphones charged

Greenhouse gas emissions avoided by:

- 433,887 tons of waste recycled instead of landfilled
- 61,984 garbage trucks of waste recycled instead of landfilled
- 54,277,338 trash bags of waste recycled instead of landfilled
- 265 wind turbines running for a year
- 48,347,610 incandescent lamps switched to LEDs

Carbon sequestered by:

- 21,092,787 tree seedlings grown for 10 years
- 1,562,871 acres of U.S. forests in one year
- 8,721 acres of U.S. forests preserved from conversion to cropland in one year

Impact of 1 EB of mobile data @ 1.8 TWh/EB

#### 2030 forecast: 200-300 EB/month => ~3000 EB/year
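The arithmetic behind the forecast can be checked with a short sketch. It takes the talk's own figures at face value (1.8 TWh/EB, and 1,275,628 metric tons CO2e for 1.8 TWh) and derives the implied emission factor from them. The function name `yearly_footprint` is hypothetical, and since the talk notes that the energy cost per EB varies widely, the outputs are illustrative rather than predictive.

```python
TWH_PER_EB = 1.8            # energy cost per exabyte, as quoted in the talk
TONNES_PER_EB = 1_275_628   # calculator result for 1.8 TWh, as quoted in the talk
KWH_PER_TWH = 1e9

# Emission factor implied by the two figures above (metric tons CO2e per kWh).
tonnes_per_kwh = TONNES_PER_EB / (TWH_PER_EB * KWH_PER_TWH)


def yearly_footprint(eb_per_month):
    """Return (EB/year, TWh/year, metric tons CO2e/year) for a monthly traffic volume."""
    eb_per_year = eb_per_month * 12
    twh = eb_per_year * TWH_PER_EB
    tonnes = twh * KWH_PER_TWH * tonnes_per_kwh
    return eb_per_year, twh, tonnes


if __name__ == "__main__":
    for monthly in (200, 250, 300):
        eb, twh, tonnes = yearly_footprint(monthly)
        print(f"{monthly} EB/month -> {eb} EB/yr, {twh:.0f} TWh/yr, {tonnes / 1e9:.2f} Gt CO2e/yr")
```

The midpoint of 250 EB/month gives the ~3000 EB/year on the slide, which at 1.8 TWh/EB corresponds to about 5400 TWh/year.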



### **Edge Computing and NextG Networking Opportunities**

New technologies => Energy efficiency in the data path

- 5+G/6+WiFi/..., software functions/network server, ...
- Edge computing => Reduce/remove data movement
  - Enabler for new applications
  - Aligned with UN SDG, Exponential Energy Roadmap

| Use case category | Description | Example use cases |
|-------------------|-------------|-------------------|
| Industrial control | Open or closed-loop control of industrial automation systems | Process monitoring; PLC to robot controller; motion control; control to control in production line; machine vision for robotics; closed-loop process control |
| Mobility automation | Automated control loops for mobile vehicles and robots | Automated container; cooperative maneuvering of vehicles; cloud motion control of AGVs; cooperative AGVs in a production line; machine vision for intersection safety; collaborative mobile robots |
| Remote control | Human control of remote devices | Remote control with video/audio; remote control with AR overlay; remote control with haptic feedback |
| Real-time media | Real, virtual and combined environments | Cloud-assisted AR; interactive VR; cloud gaming; premium experience; media production |

Time-criticality ranges from 10s of ms latency to 1s of ms latency, at 99.999% reliability.

Coverage: local area, confined wide, general wide.

Industries: media production, forestry, public safety, utilities, oil & gas, railways, agriculture, manufacturing, warehousing, mining, ports, construction.










### Edge Computing and NextG Networking Challenges

- Growth in demand
  - Huawei estimate: the 5G transition takes the mobile network from 50 TWh to 100 TWh
- Deployment cost, scale, and challenges
  - O(US\$1000) per location
  - Densification of infrastructure, urban deployment, ensuring coverage
- Datacenter-native technologies
  - Natural cooling? PUE efficiency?
- Sustainability of access




### **End-to-end Benchmarks for Edge Computing**




#### JUMP and the Decadal Plan: the challenges and opportunities for HW security

JUMP has contributed many advances to the field of HW security across the stack:

- ADA: improving the performance, communication and storage of privacy-enhancing techniques
- CONIX: major contributions to the security of accelerators, and securing Wasm for distributed compute
- CRISP: advancing security issues related to in- and near-memory processing

But there are multiple challenges ahead:

- Increasing complexity: accelerators, chiplets, individual components
- Increasing connectivity: more 'smart' things (29.2 bn Arm chips shipped in 2021)
- Increasing specialization: there is no standard next-gen chip any more

#### What does all of this mean for hardware security?

How and where can the academic community make meaningful contributions?



Andrea Kells, Director Research Ecosystem © 2022 Arm

#### JUMP and the Decadal Plan: the challenges and opportunities for HW security

Security has become everyone's responsibility:

 Growing number and diversity of attack surfaces, increasing (potential) impact of breaches, complex global supply chains

The opportunities for academic contributions are therefore huge, e.g.:

- Security v. energy efficiency
- Improving memory protection architecture
- Confidential compute still in its infancy
- Self-healing components and systems
- Aging, reliability and security

Solutions will require a holistic approach, and therefore collaboration.

JUMP Centres are ideally placed: scale, convening power, visibility, reputation
