Computing in memory or near memory: a complete overview

2024-01-23 23:00:00

For several decades, the gap between processor and DRAM performance, known as the “memory wall,” has continued to grow. Different techniques are used to limit the growth of this gap:

  1. cache hierarchies, to bring instructions and data closer to the processor;

  2. hardware multithreading to limit memory waits;

  3. increase in DRAM speeds with successive generations: DDR, GDDR, HBM.
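
To make the “memory wall” concrete, here is a minimal illustration: a streaming kernel whose arithmetic intensity is so low that none of the techniques above can fully hide the DRAM bottleneck. The bandwidth figure in the comment is an assumed round number, given only for order of magnitude.

    #include <stddef.h>

    /* Streaming reduction: one addition per 8 bytes read from memory
     * (arithmetic intensity of 1/8 flop per byte).  Assuming, as a round
     * figure, ~25.6 GB/s for a single DDR4-3200 channel, the loop cannot
     * exceed roughly 3.2 Gflop/s no matter how fast the core runs: the
     * DRAM interface, not the ALU, is the limit, and caches do not help
     * because each element is touched only once. */
    double stream_sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }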

Bringing computation closer to the data in memory is a technique that has been studied since the 1960s, with designs such as Vector IRAM proposed in the 1990s. Computing in memory or near memory is now becoming relevant because of two phenomena:

  1. Many modern applications use gigantic data sets, so minimizing transfers between the CPU and DRAM main memory becomes a necessity (a back-of-envelope estimate follows this list).

  2. Circuit manufacturing techniques, such as the 3D stacking of dies used in HBM (High Bandwidth Memory) DRAMs, make it easier to place computation close to the DRAM.
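
A rough back-of-envelope, with assumed round bandwidth figures, shows why both phenomena matter: a single pass over a large working set is dominated by the memory interface, and in-package HBM stacks shrink that cost considerably.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed round figures: ~25.6 GB/s for one DDR4-3200 channel,
         * ~460 GB/s for one HBM2E stack.  Only the orders of magnitude
         * matter here, not the exact numbers. */
        const double dataset_gb = 1000.0;   /* a 1 TB working set       */
        const double ddr_gbps   = 25.6;     /* one DDR channel          */
        const double hbm_gbps   = 460.0;    /* one in-package HBM stack */

        printf("One pass over 1 TB via a DDR channel: %.1f s\n", dataset_gb / ddr_gbps);
        printf("One pass over 1 TB via an HBM stack:  %.1f s\n", dataset_gb / hbm_gbps);
        return 0;
    }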

Computing near or in memory raises a number of questions:

  1. Where to perform the calculation?

  2. How much calculation is required?

  3. How should the coordination between the master CPU and the hardware accelerator in or near the memory be organized?

These questions are examined in detail; the coordination point is illustrated just below.
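
Here is a minimal host-side sketch of a generic offload flow. The pim_* names are hypothetical placeholders, not a real vendor API; they simply mark the steps that any master CPU / near-memory accelerator pair has to go through.

    #include <stddef.h>

    /* Hypothetical near-memory offload interface (placeholders only). */
    typedef struct pim_device pim_device;

    pim_device *pim_alloc(int n_units);                            /* reserve compute units      */
    void pim_copy_to(pim_device *d, const void *src, size_t len);  /* host -> memory-side units  */
    void pim_launch(pim_device *d, const char *kernel_name);       /* start the offloaded kernel */
    void pim_sync(pim_device *d);                                  /* wait for completion        */
    void pim_copy_from(pim_device *d, void *dst, size_t len);      /* memory-side units -> host  */

    /* The master CPU only orchestrates; the bulky data stays near memory. */
    void offload_histogram(const unsigned char *data, size_t n, unsigned long hist[256])
    {
        pim_device *d = pim_alloc(64);                /* where: choose the compute units        */
        pim_copy_to(d, data, n);                      /* distribute the input across the banks  */
        pim_launch(d, "histogram");                   /* the kernel itself runs in or near DRAM */
        pim_sync(d);                                  /* coordination: the CPU waits            */
        pim_copy_from(d, hist, 256 * sizeof hist[0]); /* gather the (small) result              */
    }

In a real deployment the input would typically already be resident in the memory-side banks, so only the small histogram has to travel back to the host.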

Five recent implementations are discussed:

  1. The Untether AI Boqueria architecture is an accelerator for neural-network inference. It consists of a 2D grid of 729 SRAM blocks, each block comprising 512 SRAMs of 640 bytes and 512 elementary processors, so that computation takes place right next to the SRAM.

  2. The Cerebras WSE-2 is a wafer-scale circuit for deep learning comprising 850,000 cores (2.6×10¹² transistors). The cores, interconnected in a 2D grid across the wafer, devote roughly equal area (50:50) to logic (computation) and SRAM memory.

  3. The Ambit project modifies the internal structure of a DRAM so that it can perform a number of basic bulk operations directly in the memory arrays: copy, NOT, AND, OR, etc. (a CPU baseline for this kind of bulk operation is sketched after the list).

  4. The company UPMEM designed and tested PIM chips that place, alongside the DRAM memory banks, a processor implemented in DRAM technology with a complete instruction set for general-purpose computation, though without floating-point or SIMD instructions. The computation thus happens near the DRAM banks.

  5. Samsung’s Aquabolt-XL circuit stacks DRAM dies using TSV technology and inserts compute units between the memory banks inside the stack. Each compute unit executes a small set of 32-bit RISC-style instructions, driving in particular SIMD addition and multiplication on 16-bit floats.
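
For scale, here is the CPU baseline of the kind of bulk bitwise operation that Ambit carries out inside the DRAM arrays. On a conventional system both source bitmaps have to cross the memory bus; executed in DRAM, almost no data reaches the processor. This is only an illustrative sketch, not code from the Ambit paper.

    #include <stddef.h>
    #include <stdint.h>

    /* Bulk bitwise AND of two large bitmaps (e.g. a bitmap-index
     * intersection).  On the CPU, two words are read and one is written
     * for every AND; with Ambit the same operation is performed
     * row by row inside the DRAM subarrays. */
    void bitmap_and(uint64_t *dst, const uint64_t *a,
                    const uint64_t *b, size_t n_words)
    {
        for (size_t i = 0; i < n_words; i++)
            dst[i] = a[i] & b[i];
    }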
