Before we worry too much about new memory technologies, we can start with new designs. In this case, compute-in-memory (CIM). Unlike traditional architectures that constantly shuttle data between memory and processing units, or High Bandwidth Memory (HBM), which still relies on moving data off-die, CIM performs calculations directly within the memory array where the data resides. We can get into the trade-offs for different memories in a moment, but the thing to note is that CIM can be implemented using conventional memory technologies (SRAM, DRAM, flash) or emerging non-volatile memories like ReRAM, PCM, FeFET, and MRAM. The key distinction is that CIM integrates dedicated compute into the memory itself, minimizing the energy-intensive process of moving data back and forth.
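To make the data-movement point concrete, here is a back-of-envelope sketch in Python. The energy constants are illustrative assumptions about rough orders of magnitude (off-chip DRAM access costing far more per bit than an in-array multiply-accumulate), not figures for any specific device.

```python
# Back-of-envelope energy model: von Neumann vs. compute-in-memory (CIM)
# for a single matrix-vector multiply. All constants are illustrative
# assumptions (order of magnitude only), not vendor specifications.

DRAM_ENERGY_PER_BIT_PJ = 10.0   # assumed cost of an off-chip DRAM access, pJ/bit
MAC_ENERGY_PJ = 0.5             # assumed cost of one 8-bit on-die multiply-accumulate
CIM_MAC_ENERGY_PJ = 0.1         # assumed cost of one in-array MAC, local read included

def von_neumann_energy_pj(rows: int, cols: int, bits: int = 8) -> float:
    """Weights are fetched from DRAM on every pass, then multiplied on-die."""
    weight_fetch = rows * cols * bits * DRAM_ENERGY_PER_BIT_PJ
    compute = rows * cols * MAC_ENERGY_PJ
    return weight_fetch + compute

def cim_energy_pj(rows: int, cols: int) -> float:
    """Weights stay resident in the memory array; MACs happen in place."""
    return rows * cols * CIM_MAC_ENERGY_PJ

if __name__ == "__main__":
    rows, cols = 4096, 4096  # one transformer-sized weight matrix
    vn, cim = von_neumann_energy_pj(rows, cols), cim_energy_pj(rows, cols)
    print(f"von Neumann: {vn / 1e6:.1f} uJ, CIM: {cim / 1e6:.1f} uJ, ~{vn / cim:.0f}x gap")
```

Under these assumptions the gap is dominated entirely by the weight fetch, which is the whole argument for keeping compute next to the data.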
There are two main approaches to CIM. Digital CIM (D-CIM) designs, where one device corresponds to one bit, provide high accuracy but limited throughput due to constraints on how many rows can operate simultaneously. Analog CIM (A-CIM), conversely, offers higher weight density and can operate more rows simultaneously, though at the cost of dealing with noisy weights. The newer non-volatile technologies typically deliver higher performance, density, and lower power consumption, with the added benefit of retaining data without constant power, while traditional CMOS-based approaches benefit from mature manufacturing processes and lower costs. In practical applications, CIM can achieve energy efficiencies of 10-100 TOPS/W for low-precision operations (compared to 1-10 TOPS/W in traditional architectures) and theoretical bandwidth exceeding 10 Tbps. Real-world implementations have shown promising results: Upmem's DRAM-based processing-in-memory has demonstrated 20x speedups in genomics sequencing, while Samsung's HBM-PIM architecture has achieved 2x performance improvements and 70% energy reductions on machine learning models.
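The D-CIM versus A-CIM accuracy trade-off is easy to see in simulation: compute a matrix-vector product once with exact weights (the digital case) and once with weights perturbed by analog variability, then compare. The 5% noise level and matrix shape below are arbitrary assumptions for illustration only.

```python
# Sketch of the A-CIM accuracy trade-off: analog arrays activate more rows
# in parallel, but the stored weights are effectively noisy.
# The noise magnitude (5% of the weight std-dev) is an arbitrary assumption.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)  # weight matrix
x = rng.standard_normal(512).astype(np.float32)         # input activations

exact = W @ x                                            # digital (D-CIM) reference

noise_frac = 0.05                                        # assumed analog variability
W_noisy = W + noise_frac * W.std() * rng.standard_normal(W.shape).astype(np.float32)
analog = W_noisy @ x                                     # analog (A-CIM) approximation

rel_err = np.linalg.norm(analog - exact) / np.linalg.norm(exact)
print(f"relative error introduced by analog weight noise: {rel_err:.2%}")
```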
As with any new hardware, CIM faces significant technical and practical limitations. Analog computation accuracy remains a major challenge, with current implementations struggling beyond 8-bit operations, limiting their use in high-precision tasks, although Microsoft's work on 1-bit quantized LLMs may well ease those precision requirements. Manufacturing integration poses a more substantial hurdle: adding logic to memory complicates fabrication, potentially reducing memory density by 20-30% and increasing costs by 2-5x compared to standard DRAM. Thermal management becomes critical as power densities can increase by 50-100%, requiring more sophisticated cooling. The technology also faces ecosystem challenges: current software stacks aren't optimized for in-memory computation, many implementations have limited reprogrammability, and interfacing with external systems can introduce latencies that offset CIM's benefits. These interface delays, ranging from tens to hundreds of nanoseconds, can significantly impact time-sensitive AI workloads, so whether CIM pays off comes down to the specific application.
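Whether those interface delays actually erase CIM's advantage depends on how much work each transfer amortizes. A toy break-even model, with assumed throughput and latency constants, makes the point: small kernels stay on the host, large ones justify the round trip.

```python
# Toy latency model: does CIM still win once the interface round trip is counted?
# All constants are illustrative assumptions, not measured device figures.

CIM_INTERFACE_NS = 200.0   # assumed round trip to the CIM device ("hundreds of ns")
CIM_OPS_PER_NS = 100.0     # assumed in-array throughput
HOST_OPS_PER_NS = 50.0     # assumed conventional accelerator throughput

def cim_time_ns(ops: float) -> float:
    return CIM_INTERFACE_NS + ops / CIM_OPS_PER_NS

def host_time_ns(ops: float) -> float:
    return ops / HOST_OPS_PER_NS

for ops in (1e3, 1e4, 1e5, 1e6):
    winner = "CIM" if cim_time_ns(ops) < host_time_ns(ops) else "host"
    print(f"{ops:>9.0f} ops -> CIM {cim_time_ns(ops):9.1f} ns, "
          f"host {host_time_ns(ops):9.1f} ns  ({winner} wins)")
```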
Worth watching:
Now to the hard stuff: which emerging memory technology will replace HBM for AI training and inference? Trade-offs. It’s always trade-offs.
Technology | Bandwidth (GB/s) | Read (ns) | Write (ns) | Density (Gb/cm²) | Energy/bit (pJ) | Cell Size (F²) | Temp Range (°C) | Endurance (cycles) | Cost ($/GB) |
---|---|---|---|---|---|---|---|---|---|
GDDR6X DRAM | 1000-1500 | 12-15 | 12-15 | 8-10 | 15-20 | 6-8 | 0 to 95 | >10¹⁵ | 8-10 |
HBM3E | 1000-1200 | 8-12 | 8-12 | 8-10 | 3.5-4.5 | 6-8 | 0 to 105 | >10¹⁵ | 15-20 |
HBM3 | 800-900 | 9-13 | 9-13 | 8-10 | 4-5 | 6-8 | 0 to 105 | >10¹⁵ | 12-15 |
SRAM | 100-250 | 1-10 | 1-10 | 2-4 | 0.1-0.3 | 120-140 | -40 to 125 | >10¹⁶ | 20-30 |
DDR5 DRAM | 50-85 | 10-15 | 10-15 | 10-12 | 10-15 | 6-8 | 0 to 85 | >10¹⁵ | 4-6 |
LPDDR5X DRAM | 60-75 | 14-20 | 14-20 | 8-10 | 8-12 | 6-8 | -40 to 105 | >10¹⁵ | 5-7 |
3D XPoint | 20-40 | 100 | 100-1000 | 8-10 | 25-50 | 8-12 | 0 to 85 | 10⁸-10⁹ | 6-8 |
ReRAM | 10-20 | 10-50 | 50-100 | 8-10 | 0.1-1 | 4-12 | -40 to 125 | 10⁹-10¹⁰ | 8-12 |
STT-MRAM | 3-5 | 2-20 | 10-20 | 4-6 | 0.5-2 | 12-20 | -40 to 150 | >10¹⁵ | 15-20 |
QLC NAND | 2-3 | 50,000 | 1,000,000 | 100-200 | 100-1000 | 4 | 0 to 70 | 10³-10⁴ | 0.05-0.08 |
TLC NAND | 2-3 | 30,000 | 500,000 | 50-100 | 100-1000 | 4 | 0 to 70 | 10⁴-10⁵ | 0.08-0.12 |
MLC NAND | 2-3 | 20,000 | 200,000 | 20-50 | 100-1000 | 4 | 0 to 70 | 10⁵-10⁶ | 0.12-0.18 |
FeFET | 1-2 | 20-30 | 30-50 | 4-8 | 0.1-1 | 6-12 | -40 to 125 | 10¹⁰-10¹² | 20-25 |
FeRAM | 1-2 | 20-80 | 50-100 | 1-2 | 5-10 | 15-40 | -40 to 125 | 10¹⁴-10¹⁵ | 25-35 |
PCM | 1-2 | 20-50 | 50-500 | 8-16 | 100-200 | 4-12 | 0 to 85 | 10⁸-10⁹ | 10-15 |
Notes:
- Bandwidth figures are per die/layer for fair comparison
- Read/write speeds for NAND are converted from µs/ms to ns for consistency
- Temperature ranges are typical operating ranges for commercial versions
- Energy/bit includes both read and write operations
- Cell size is in F² (F = feature size of the manufacturing process)
- Bandwidth for HBM includes the stacked-die advantage
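One way to read the table is to collapse a few columns into a single figure of merit. The sketch below takes midpoints of the ranges above and scores each technology as bandwidth divided by (energy per bit times cost); the weighting is arbitrary, and a real training or inference workload would weight the columns very differently.

```python
# Rough figure-of-merit ranking built from midpoints of the table above.
# The score (bandwidth / (energy * cost)) is an arbitrary illustrative choice.

memories = {
    # name: (bandwidth GB/s, energy pJ/bit, cost $/GB) -- midpoints of table ranges
    "HBM3E":    (1100.0, 4.00, 17.5),
    "HBM3":     (850.0,  4.50, 13.5),
    "GDDR6X":   (1250.0, 17.5, 9.0),
    "DDR5":     (67.5,   12.5, 5.0),
    "ReRAM":    (15.0,   0.55, 10.0),
    "STT-MRAM": (4.0,    1.25, 17.5),
    "PCM":      (1.5,    150.0, 12.5),
}

def figure_of_merit(bandwidth: float, energy: float, cost: float) -> float:
    return bandwidth / (energy * cost)

for name, specs in sorted(memories.items(),
                          key=lambda kv: figure_of_merit(*kv[1]),
                          reverse=True):
    print(f"{name:9s} FoM = {figure_of_merit(*specs):8.2f}")
```

Even under this crude weighting, HBM3E comes out on top, which is consistent with the short-term outlook below; ReRAM's case rests on the columns the score ignores: density, compute-in-memory capability, and non-volatility.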
Before looking at alternatives, it's worth noting HBM3e's current position: it combines very high per-layer bandwidth (1,000-1,200 GB/s) with among the lowest energy per bit (3.5-4.5 pJ) and effectively unlimited endurance, but at a premium of roughly $15-20/GB.
The two primary candidates are Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) and Resistive Random Access Memory (ReRAM). Each has strengths and weaknesses. The truth is that in the short term (3-5 years), HBM will likely remain dominant, with incremental improvements through newer generations and 3D stacking techniques. In the medium term (5-10 years), we will likely see hybrid memory systems that combine HBM for high-bandwidth, training-intensive operations; ReRAM for inference and compute-in-memory operations; and, potentially, STT-MRAM as a fast, non-volatile storage tier. If I were a betting man, ReRAM looks like the most promising complete replacement, provided its endurance and reliability issues can be solved, thanks to density comparable to HBM, native compute-in-memory capabilities, a simpler manufacturing process, and better scaling potential.
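A hybrid system like that ultimately comes down to a placement policy: decide which tier each tensor lives in based on how often it is written, how fast it must be read, and whether it has to persist. The sketch below is a hypothetical policy with made-up thresholds, not any vendor's API; the tiers mirror the split described above.

```python
# Hypothetical placement policy for a hybrid memory system: HBM for hot
# training state, ReRAM (with CIM) for low-write inference weights, and
# STT-MRAM as a fast non-volatile tier. Thresholds are illustrative only.
from dataclasses import dataclass
from enum import Enum, auto

class Tier(Enum):
    HBM = auto()        # highest bandwidth, volatile, expensive
    RERAM_CIM = auto()  # dense, compute-in-memory capable, limited write endurance
    STT_MRAM = auto()   # fast, non-volatile, very high endurance

@dataclass
class Tensor:
    name: str
    writes_per_sec: float  # update rate (gradients vs. frozen weights)
    persistent: bool       # must survive power loss (e.g. checkpoints)

def place(t: Tensor) -> Tier:
    if t.persistent:
        return Tier.STT_MRAM        # non-volatile tier for state that must survive
    if t.writes_per_sec < 0.1:
        return Tier.RERAM_CIM       # rarely-written inference weights sidestep
                                    # ReRAM's endurance limits and can use CIM
    return Tier.HBM                 # hot training state: activations, gradients

if __name__ == "__main__":
    for t in [
        Tensor("optimizer_state", writes_per_sec=1e3, persistent=False),
        Tensor("frozen_llm_weights", writes_per_sec=0.0, persistent=False),
        Tensor("checkpoint", writes_per_sec=0.01, persistent=True),
    ]:
        print(f"{t.name:20s} -> {place(t).name}")
```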