Before we worry too much about new memory technologies, we can start with new designs. In this case, compute-in-memory (CIM). Unlike traditional architectures that constantly shuttle data between memory and processing units, or High Bandwidth Memory (HBM), which still keeps the memory off-chip (just closer and wider), CIM performs calculations directly within the memory array where the data resides. We can get into the trade-offs for different memories in a moment, but the thing to note is that CIM can be implemented using conventional memory technologies (SRAM, DRAM, flash) or emerging non-volatile memories like ReRAM, PCM, FeFET, and MRAM. The key distinction is that CIM integrates dedicated computing units into the memory itself, minimizing the energy-intensive process of moving data back and forth.
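To see why the data movement matters, here's a rough back-of-envelope in Python. The per-bit and per-op energy figures are illustrative assumptions in the ballpark of commonly cited numbers for modern processes, not measurements:

```python
# Back-of-envelope: energy of moving data off-chip vs. computing in place.
# The pJ figures below are illustrative assumptions, not measured values.

DRAM_ACCESS_PJ_PER_BIT = 10.0   # off-chip DRAM read, rough order of magnitude
MAC_PJ_PER_OP = 0.5             # 8-bit multiply-accumulate in on-chip logic

def data_movement_ratio(bits_moved_per_op: float) -> float:
    """Energy spent moving operands relative to the compute itself."""
    return (bits_moved_per_op * DRAM_ACCESS_PJ_PER_BIT) / MAC_PJ_PER_OP

# A MAC that fetches two 8-bit operands from off-chip DRAM:
print(f"movement/compute energy ratio: {data_movement_ratio(16):.0f}x")
# Movement dominates by orders of magnitude, which is exactly the
# gap CIM tries to close by computing where the data already is.
```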

There are two main approaches to CIM. Digital CIM (D-CIM) designs, where one device corresponds to one bit, provide high accuracy but limited throughput due to constraints on simultaneous operation. Analog CIM (A-CIM), conversely, offers higher weight density and can operate more rows simultaneously, though at the cost of dealing with noisy weights. The newer non-volatile technologies typically deliver higher performance, higher density, and lower power consumption, with the added benefit of retaining data without constant power, while traditional CMOS-based approaches benefit from mature manufacturing processes and lower costs. In practical applications, CIM can achieve energy efficiencies of 10-100 TOPS/W for low-precision operations (compared to 1-10 TOPS/W in traditional architectures) and theoretical bandwidth exceeding 10 Tbps. Real-world implementations have shown promising results: UPMEM's DRAM-based processing-in-memory has demonstrated 20x speedups in genomics sequencing, while Samsung's HBM-PIM architecture has achieved 2x performance improvements and 70% energy reductions in machine learning workloads.
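To make the D-CIM/A-CIM trade-off concrete, here's a minimal NumPy sketch that models an analog crossbar's matrix-vector product as an exact product with Gaussian device noise added to the stored weights. The noise levels are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dcim_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Digital CIM: exact bitwise arithmetic, one device per bit."""
    return W @ x

def acim_matvec(W: np.ndarray, x: np.ndarray, noise_std: float) -> np.ndarray:
    """Analog CIM: all rows summed at once on the bitline, but each stored
    conductance deviates from its target weight (device variation)."""
    W_noisy = W + rng.normal(0.0, noise_std * np.abs(W).max(), W.shape)
    return W_noisy @ x

W = rng.standard_normal((256, 256))
x = rng.standard_normal(256)

exact = dcim_matvec(W, x)
for std in (0.01, 0.05, 0.10):
    approx = acim_matvec(W, x, noise_std=std)
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"weight noise {std:>4.0%} -> relative output error {rel_err:.1%}")
```

Even a few percent of weight noise shows up directly as output error, which is why A-CIM designs lean on calibration, redundancy, or noise-tolerant low-precision models.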

As with any new hardware, CIM faces significant technical and practical limitations. Analog computation accuracy remains a major challenge, with current implementations struggling beyond 8-bit operations, limiting their use in high-precision tasks, although Microsoft's work on 1-bit quantized LLMs (BitNet) will likely ease those precision requirements. Manufacturing integration poses a more substantial hurdle - adding logic to memory complicates fabrication, potentially reducing memory density by 20-30% and increasing costs by 2-5x compared to standard DRAM. Thermal management becomes critical as power densities can increase by 50-100%, requiring sophisticated cooling solutions. The technology also faces ecosystem challenges: current software stacks aren't optimized for in-memory computation, many implementations have limited reprogrammability, and interfacing with external systems can introduce latencies that offset CIM's benefits. These interface delays, ranging from tens to hundreds of nanoseconds, can significantly impact time-sensitive AI workloads, so the balance between CIM's benefits and limitations has to be weighed application by application.
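For a sense of why 1-bit quantization sidesteps the analog precision problem, here's a minimal sketch of ternary absmean quantization in the style of BitNet b1.58 (the function name and test setup are mine). Weights collapse to {-1, 0, +1}, which needs only one stable level per memory cell, and the matrix-vector product reduces to additions and subtractions:

```python
import numpy as np

def absmean_quantize(W: np.ndarray):
    """Ternary (1.58-bit) weight quantization in the style of BitNet b1.58:
    scale by the mean absolute value, then round each weight to -1, 0, or +1."""
    scale = np.mean(np.abs(W)) + 1e-8
    W_q = np.clip(np.round(W / scale), -1, 1)
    return W_q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)) * 0.05
x = rng.standard_normal(128)

W_q, scale = absmean_quantize(W)
exact = W @ x
approx = (W_q @ x) * scale   # CIM-friendly: adds/subtracts, one level per cell
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"ternary weights, relative error: {rel_err:.1%}")
```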

Memory Technologies Comparison

Now to the hard stuff: which emerging memory technology will replace HBM for AI training and inference? Trade-offs. It’s always trade-offs.

| Technology | Bandwidth (GB/s) | Read (ns) | Write (ns) | Density (Gb/cm²) | Energy/bit (pJ) | Cell Size (F²) | Temp Range (°C) | Endurance (cycles) | Cost ($/GB) |
|---|---|---|---|---|---|---|---|---|---|
| GDDR6X DRAM | 1000-1500 | 12-15 | 12-15 | 8-10 | 15-20 | 6-8 | 0 to 95 | >10¹⁵ | 8-10 |
| HBM3E | 1000-1200 | 8-12 | 8-12 | 8-10 | 3.5-4.5 | 6-8 | 0 to 105 | >10¹⁵ | 15-20 |
| HBM3 | 800-900 | 9-13 | 9-13 | 8-10 | 4-5 | 6-8 | 0 to 105 | >10¹⁵ | 12-15 |
| SRAM | 100-250 | 1-10 | 1-10 | 2-4 | 0.1-0.3 | 120-140 | -40 to 125 | >10¹⁶ | 20-30 |
| DDR5 DRAM | 50-85 | 10-15 | 10-15 | 10-12 | 10-15 | 6-8 | 0 to 85 | >10¹⁵ | 4-6 |
| LPDDR5X DRAM | 60-75 | 14-20 | 14-20 | 8-10 | 8-12 | 6-8 | -40 to 105 | >10¹⁵ | 5-7 |
| 3D XPoint | 20-40 | 100 | 100-1000 | 8-10 | 25-50 | 8-12 | 0 to 85 | 10⁸-10⁹ | 6-8 |
| ReRAM | 10-20 | 10-50 | 50-100 | 8-10 | 0.1-1 | 4-12 | -40 to 125 | 10⁹-10¹⁰ | 8-12 |
| STT-MRAM | 3-5 | 2-20 | 10-20 | 4-6 | 0.5-2 | 12-20 | -40 to 150 | >10¹⁵ | 15-20 |
| QLC NAND | 2-3 | 50,000 | 1,000,000 | 100-200 | 100-1000 | 4 | 0 to 70 | 10³-10⁴ | 0.05-0.08 |
| TLC NAND | 2-3 | 30,000 | 500,000 | 50-100 | 100-1000 | 4 | 0 to 70 | 10⁴-10⁵ | 0.08-0.12 |
| MLC NAND | 2-3 | 20,000 | 200,000 | 20-50 | 100-1000 | 4 | 0 to 70 | 10⁵-10⁶ | 0.12-0.18 |
| FeFET | 1-2 | 20-30 | 30-50 | 4-8 | 0.1-1 | 6-12 | -40 to 125 | 10¹⁰-10¹² | 20-25 |
| FeRAM | 1-2 | 20-80 | 50-100 | 1-2 | 5-10 | 15-40 | -40 to 125 | 10¹⁴-10¹⁵ | 25-35 |
| PCM | 1-2 | 20-50 | 50-500 | 8-16 | 100-200 | 4-12 | 0 to 85 | 10⁸-10⁹ | 10-15 |

Notes:

- Bandwidth figures are per die/layer for fair comparison; HBM figures include the stacked-die advantage
- Read/write speeds for NAND are converted from µs/ms to ns for consistency
- Temperature ranges are typical operating ranges for commercial versions
- Energy/bit includes both read and write operations
- Cell size is in F², where F is the feature size of the manufacturing process
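To make the Energy/bit column concrete, here's a quick back-of-envelope using midpoints from the table: the energy to read every weight of a hypothetical 70B-parameter model once, at 8 bits per weight:

```python
# Energy to read every weight of a 70B-parameter model once at 8 bits/weight,
# using midpoints of the Energy/bit column above. Pure back-of-envelope.

PARAMS = 70e9
BITS_PER_WEIGHT = 8

energy_pj_per_bit = {          # midpoints from the table
    "HBM3E":    4.0,
    "DDR5":     12.5,
    "GDDR6X":   17.5,
    "ReRAM":    0.55,
    "STT-MRAM": 1.25,
}

for tech, pj in energy_pj_per_bit.items():
    joules = PARAMS * BITS_PER_WEIGHT * pj * 1e-12
    print(f"{tech:>9}: {joules:6.2f} J per full weight sweep")
```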

Before looking at alternatives, it's worth noting HBM3e's current position: per the table, it leads the DRAM class on bandwidth and energy per bit (3.5-4.5 pJ) with proven >10¹⁵ endurance, but at a premium of $15-20/GB and the packaging complexity of stacked dies.

The two primary candidates are Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) and Resistive Random Access Memory (ReRAM). Each has strengths and weaknesses. The truth is that in the short term (3-5 years), HBM will likely remain dominant, with incremental improvements through newer generations and 3D stacking techniques. In the medium term (5-10 years), we will likely see hybrid memory systems combining HBM for high-bandwidth, training-intensive operations, ReRAM for inference and compute-in-memory operations, and potentially STT-MRAM for fast, non-volatile storage tiers. If I were a betting man, I'd say ReRAM is the most promising complete replacement, provided its endurance and reliability issues can be solved, thanks to density comparable to HBM, native compute-in-memory capabilities, a simpler manufacturing process, and better scaling potential.
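If that hybrid scenario plays out, the interesting work moves to software: something has to decide which tensors live in which tier. Here's a toy placement policy with tier roles mirroring the text above; the tensor attributes and the policy itself are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    written_often: bool     # gradients/activations vs. frozen weights
    latency_critical: bool

def place(t: Tensor) -> str:
    """Toy placement policy for the hybrid scenario sketched above.
    Tier roles mirror the text; the rules are illustrative, not a real API."""
    if t.written_often:
        return "HBM"        # high bandwidth + effectively unlimited endurance
    if t.latency_critical:
        return "STT-MRAM"   # fast, non-volatile, byte-addressable tier
    return "ReRAM"          # dense inference weights, compute-in-memory capable

for t in [Tensor("optimizer_state", True, True),
          Tensor("kv_cache", True, False),
          Tensor("frozen_weights", False, False),
          Tensor("embedding_hot_rows", False, True)]:
    print(f"{t.name:>18} -> {place(t)}")
```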