Before we worry too much about new memory technologies, we can start with new designs. In this case, compute-in-memory (CIM). Unlike traditional architectures that constantly shuttle data between memory and processing units, or High Bandwidth Memory (HBM), which still relies on moving data off-die, CIM performs calculations directly within the memory array where the data resides. We can get into the trade-offs for different memories in a moment, but the thing to note is that CIM can be implemented using conventional memory technologies (SRAM, DRAM, flash) or emerging non-volatile memories like ReRAM, PCM, FeFET, and MRAM. The key distinction is that CIM integrates dedicated compute into the memory itself, minimizing the energy-intensive process of moving data back and forth.
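To make the data-movement point concrete, here is a back-of-envelope sketch in Python. The energy constants are illustrative assumptions about rough orders of magnitude (off-chip DRAM access costing far more per bit than an in-array multiply-accumulate), not figures for any specific device.

```python
# Back-of-envelope energy model: von Neumann vs. compute-in-memory (CIM)
# for a single matrix-vector multiply. All constants are illustrative
# assumptions (order of magnitude only), not vendor specifications.

DRAM_ENERGY_PER_BIT_PJ = 10.0   # assumed cost of an off-chip DRAM access, pJ/bit
MAC_ENERGY_PJ = 0.5             # assumed cost of one 8-bit on-die multiply-accumulate
CIM_MAC_ENERGY_PJ = 0.1         # assumed cost of one in-array MAC, local read included

def von_neumann_energy_pj(rows: int, cols: int, bits: int = 8) -> float:
    """Weights are fetched from DRAM on every pass, then multiplied on-die."""
    weight_fetch = rows * cols * bits * DRAM_ENERGY_PER_BIT_PJ
    compute = rows * cols * MAC_ENERGY_PJ
    return weight_fetch + compute

def cim_energy_pj(rows: int, cols: int) -> float:
    """Weights stay resident in the memory array; MACs happen in place."""
    return rows * cols * CIM_MAC_ENERGY_PJ

if __name__ == "__main__":
    rows, cols = 4096, 4096  # one transformer-sized weight matrix
    vn, cim = von_neumann_energy_pj(rows, cols), cim_energy_pj(rows, cols)
    print(f"von Neumann: {vn / 1e6:.1f} uJ, CIM: {cim / 1e6:.1f} uJ, ~{vn / cim:.0f}x gap")
```

Under these assumptions the gap is dominated entirely by the weight fetch, which is the whole argument for keeping compute next to the data.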
There are two main approaches to CIM. Digital CIM (D-CIM) designs, where one device corresponds to one bit, provide high accuracy but limited throughput due to constraints on how many rows can operate simultaneously. Analog CIM (A-CIM), conversely, offers higher weight density and can operate more rows simultaneously, though at the cost of dealing with noisy weights. The newer non-volatile technologies typically deliver higher performance, density, and lower power consumption, with the added benefit of retaining data without constant power, while traditional CMOS-based approaches benefit from mature manufacturing processes and lower costs. In practical applications, CIM can achieve energy efficiencies of 10-100 TOPS/W for low-precision operations (compared to 1-10 TOPS/W in traditional architectures) and theoretical bandwidth exceeding 10 Tbps. Real-world implementations have shown promising results: Upmem's DRAM-based processing-in-memory has demonstrated 20x speedups in genomics sequencing, while Samsung's HBM-PIM architecture has achieved 2x performance improvements and 70% energy reductions on machine learning models.
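The D-CIM versus A-CIM accuracy trade-off is easy to see in simulation: compute a matrix-vector product once with exact weights (the digital case) and once with weights perturbed by analog variability, then compare. The 5% noise level and matrix shape below are arbitrary assumptions for illustration only.

```python
# Sketch of the A-CIM accuracy trade-off: analog arrays activate more rows
# in parallel, but the stored weights are effectively noisy.
# The noise magnitude (5% of the weight std-dev) is an arbitrary assumption.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)  # weight matrix
x = rng.standard_normal(512).astype(np.float32)         # input activations

exact = W @ x                                            # digital (D-CIM) reference

noise_frac = 0.05                                        # assumed analog variability
W_noisy = W + noise_frac * W.std() * rng.standard_normal(W.shape).astype(np.float32)
analog = W_noisy @ x                                     # analog (A-CIM) approximation

rel_err = np.linalg.norm(analog - exact) / np.linalg.norm(exact)
print(f"relative error introduced by analog weight noise: {rel_err:.2%}")
```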
As with any new hardware, CIM faces significant technical and practical limitations. Analog computation accuracy remains a major challenge, with current implementations struggling beyond 8-bit operations, limiting their use in high-precision tasks, although Microsoft's work on 1-bit quantized LLMs may well ease those precision requirements. Manufacturing integration poses a more substantial hurdle: adding logic to memory complicates fabrication, potentially reducing memory density by 20-30% and increasing costs by 2-5x compared to standard DRAM. Thermal management becomes critical as power densities can increase by 50-100%, requiring more sophisticated cooling. The technology also faces ecosystem challenges: current software stacks aren't optimized for in-memory computation, many implementations have limited reprogrammability, and interfacing with external systems can introduce latencies that offset CIM's benefits. These interface delays, ranging from tens to hundreds of nanoseconds, can significantly impact time-sensitive AI workloads, so whether CIM pays off comes down to the specific application.
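Whether those interface delays actually erase CIM's advantage depends on how much work each transfer amortizes. A toy break-even model, with assumed throughput and latency constants, makes the point: small kernels stay on the host, large ones justify the round trip.

```python
# Toy latency model: does CIM still win once the interface round trip is counted?
# All constants are illustrative assumptions, not measured device figures.

CIM_INTERFACE_NS = 200.0   # assumed round trip to the CIM device ("hundreds of ns")
CIM_OPS_PER_NS = 100.0     # assumed in-array throughput
HOST_OPS_PER_NS = 50.0     # assumed conventional accelerator throughput

def cim_time_ns(ops: float) -> float:
    return CIM_INTERFACE_NS + ops / CIM_OPS_PER_NS

def host_time_ns(ops: float) -> float:
    return ops / HOST_OPS_PER_NS

for ops in (1e3, 1e4, 1e5, 1e6):
    winner = "CIM" if cim_time_ns(ops) < host_time_ns(ops) else "host"
    print(f"{ops:>9.0f} ops -> CIM {cim_time_ns(ops):9.1f} ns, "
          f"host {host_time_ns(ops):9.1f} ns  ({winner} wins)")
```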
Worth watching:
Now to the hard stuff: which emerging memory technology will replace HBM for AI training and inference? Trade-offs. It’s always trade-offs.
Technology | Bandwidth (GB/s) | Read (ns) | Write (ns) | Density (Gb/cm²) | Energy/bit (pJ) | Cell Size (F²) | Temp Range (°C) | Endurance (cycles) | Cost ($/GB) |
---|---|---|---|---|---|---|---|---|---|
GDDR6X DRAM | 1000-1500 | 12-15 | 12-15 | 8-10 | 15-20 | 6-8 | 0 to 95 | >10¹⁵ | 8-10 |
HBM3E | 1000-1200 | 8-12 | 8-12 | 8-10 | 3.5-4.5 | 6-8 | 0 to 105 | >10¹⁵ | 15-20 |
HBM3 | 800-900 | 9-13 | 9-13 | 8-10 | 4-5 | 6-8 | 0 to 105 | >10¹⁵ | 12-15 |
SRAM | 100-250 | 1-10 | 1-10 | 2-4 | 0.1-0.3 | 120-140 | -40 to 125 | >10¹⁶ | 20-30 |
DDR5 DRAM | 50-85 | 10-15 | 10-15 | 10-12 | 10-15 | 6-8 | 0 to 85 | >10¹⁵ | 4-6 |
LPDDR5X DRAM | 60-75 | 14-20 | 14-20 | 8-10 | 8-12 | 6-8 | -40 to 105 | >10¹⁵ | 5-7 |
3D XPoint | 20-40 | 100 | 100-1000 | 8-10 | 25-50 | 8-12 | 0 to 85 | 10⁸-10⁹ | 6-8 |
ReRAM | 10-20 | 10-50 | 50-100 | 8-10 | 0.1-1 | 4-12 | -40 to 125 | 10⁹-10¹⁰ | 8-12 |
STT-MRAM | 3-5 | 2-20 | 10-20 | 4-6 | 0.5-2 | 12-20 | -40 to 150 | >10¹⁵ | 15-20 |
QLC NAND | 2-3 | 50,000 | 1,000,000 | 100-200 | 100-1000 | 4 | 0 to 70 | 10³-10⁴ | 0.05-0.08 |
TLC NAND | 2-3 | 30,000 | 500,000 | 50-100 | 100-1000 | 4 | 0 to 70 | 10⁴-10⁵ | 0.08-0.12 |
MLC NAND | 2-3 | 20,000 | 200,000 | 20-50 | 100-1000 | 4 | 0 to 70 | 10⁵-10⁶ | 0.12-0.18 |
FeFET | 1-2 | 20-30 | 30-50 | 4-8 | 0.1-1 | 6-12 | -40 to 125 | 10¹⁰-10¹² | 20-25 |
FeRAM | 1-2 | 20-80 | 50-100 | 1-2 | 5-10 | 15-40 | -40 to 125 | 10¹⁴-10¹⁵ | 25-35 |
PCM | 1-2 | 20-50 | 50-500 | 8-16 | 100-200 | 4-12 | 0 to 85 | 10⁸-10⁹ | 10-15 |
Notes:
- Bandwidth figures are per die/layer for fair comparison
- Read/write speeds for NAND are converted from µs/ms to ns for consistency
- Temperature ranges are typical operating ranges for commercial versions
- Energy/bit includes both read and write operations
- Cell size is in F² (F = feature size of the manufacturing process)
- Bandwidth for HBM includes the stacked-die advantage
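One way to read the table is to collapse a few columns into a single figure of merit. The sketch below takes midpoints of the ranges above and scores each technology as bandwidth divided by (energy per bit times cost); the weighting is arbitrary, and a real training or inference workload would weight the columns very differently.

```python
# Rough figure-of-merit ranking built from midpoints of the table above.
# The score (bandwidth / (energy * cost)) is an arbitrary illustrative choice.

memories = {
    # name: (bandwidth GB/s, energy pJ/bit, cost $/GB) -- midpoints of table ranges
    "HBM3E":    (1100.0, 4.00, 17.5),
    "HBM3":     (850.0,  4.50, 13.5),
    "GDDR6X":   (1250.0, 17.5, 9.0),
    "DDR5":     (67.5,   12.5, 5.0),
    "ReRAM":    (15.0,   0.55, 10.0),
    "STT-MRAM": (4.0,    1.25, 17.5),
    "PCM":      (1.5,    150.0, 12.5),
}

def figure_of_merit(bandwidth: float, energy: float, cost: float) -> float:
    return bandwidth / (energy * cost)

for name, specs in sorted(memories.items(),
                          key=lambda kv: figure_of_merit(*kv[1]),
                          reverse=True):
    print(f"{name:9s} FoM = {figure_of_merit(*specs):8.2f}")
```

Even under this crude weighting, HBM3E comes out on top, which is consistent with the short-term outlook below; ReRAM's case rests on the columns the score ignores: density, compute-in-memory capability, and non-volatility.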
Before looking at alternatives, it's worth noting HBM3e's current position: it combines very high per-layer bandwidth (1,000-1,200 GB/s) with among the lowest energy per bit (3.5-4.5 pJ) and effectively unlimited endurance, but at a premium of roughly $15-20/GB.
The two primary candidates are Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM) and Resistive Random Access Memory (ReRAM). Each has strengths and weaknesses. The truth is that in the short term (3-5 years), HBM will likely remain dominant, with incremental improvements through newer generations and 3D stacking techniques. In the medium term (5-10 years), we will likely see hybrid memory systems that combine HBM for high-bandwidth, training-intensive operations; ReRAM for inference and compute-in-memory operations; and, potentially, STT-MRAM as a fast, non-volatile storage tier. If I were a betting man, ReRAM looks like the most promising complete replacement, provided its endurance and reliability issues can be solved, thanks to density comparable to HBM, native compute-in-memory capabilities, a simpler manufacturing process, and better scaling potential.
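A hybrid system like that ultimately comes down to a placement policy: decide which tier each tensor lives in based on how often it is written, how fast it must be read, and whether it has to persist. The sketch below is a hypothetical policy with made-up thresholds, not any vendor's API; the tiers mirror the split described above.

```python
# Hypothetical placement policy for a hybrid memory system: HBM for hot
# training state, ReRAM (with CIM) for low-write inference weights, and
# STT-MRAM as a fast non-volatile tier. Thresholds are illustrative only.
from dataclasses import dataclass
from enum import Enum, auto

class Tier(Enum):
    HBM = auto()        # highest bandwidth, volatile, expensive
    RERAM_CIM = auto()  # dense, compute-in-memory capable, limited write endurance
    STT_MRAM = auto()   # fast, non-volatile, very high endurance

@dataclass
class Tensor:
    name: str
    writes_per_sec: float  # update rate (gradients vs. frozen weights)
    persistent: bool       # must survive power loss (e.g. checkpoints)

def place(t: Tensor) -> Tier:
    if t.persistent:
        return Tier.STT_MRAM        # non-volatile tier for state that must survive
    if t.writes_per_sec < 0.1:
        return Tier.RERAM_CIM       # rarely-written inference weights sidestep
                                    # ReRAM's endurance limits and can use CIM
    return Tier.HBM                 # hot training state: activations, gradients

if __name__ == "__main__":
    for t in [
        Tensor("optimizer_state", writes_per_sec=1e3, persistent=False),
        Tensor("frozen_llm_weights", writes_per_sec=0.0, persistent=False),
        Tensor("checkpoint", writes_per_sec=0.01, persistent=True),
    ]:
        print(f"{t.name:20s} -> {place(t).name}")
```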