“DRAM doesn’t scale anymore. In the glory days, memory bit density doubled every 18 months – outpacing even logic. That translates to just over 100x density increase every decade. But in this last decade, scaling has slowed so much that density has increased just 2x”. So says Dylan Patel. While logic chips keep improving dramatically in density and cost per transistor function, DRAM improvements have been minor, and increased bandwidth has come from expensive packaging, not scaling. Memory is unlikely to be a long-term bottleneck to AI data center scaling; the question is one of economic viability. The DRAM roadmap hints at brutal trade-offs in cost and power to achieve the throughput required for trillion+ parameter models.

Today, High Bandwidth Memory (HBM) is the solution for almost every AI accelerator. It prioritizes bandwidth and power efficiency but is expensive, at roughly 3x the price of DDR5 per GB. HBM3e can deliver 36 GB of capacity and about 1.2 TB/s of bandwidth per stack, and this latest generation is the only game in town for data center AI accelerators. Other DRAM varieties like DDR5, LPDDR5X, and GDDR6X target different cost, performance, and power requirements, and some companies combine the high performance and high cost of HBM with the lower performance and lower cost of LPDDR. This is all fine, but the truth is HBM is a hack: a packaging solution that increases density to work around DRAM’s inherent bandwidth and power problems rather than fixing them.
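To put rough numbers on both claims, here is a back-of-the-envelope sketch in Python. The doubling cadence and the HBM3e figures come from the text above; the model size, weight precision, and token rate are illustrative assumptions.

```python
# Back-of-the-envelope numbers behind the DRAM scaling argument.

# Historical scaling: bit density doubling every 18 months (per the quote).
doublings_per_decade = 10 / 1.5
print(f"Density gain per decade: {2 ** doublings_per_decade:.0f}x")  # ~102x

# Bandwidth pressure: streaming every weight of a 1T-parameter model once
# per generated token, a crude lower bound for memory-bound inference.
params = 1e12               # trillion-parameter model (assumption)
bytes_per_param = 1         # 8-bit weights (assumption)
tokens_per_second = 50      # target decode speed (assumption)
required_bw = params * bytes_per_param * tokens_per_second   # bytes/s

hbm3e_stack_bw = 1.2e12     # ~1.2 TB/s per HBM3e stack (from the text)
print(f"Required bandwidth: {required_bw / 1e12:.0f} TB/s")        # 50 TB/s
print(f"HBM3e stacks needed: {required_bw / hbm3e_stack_bw:.0f}")  # ~42
```

At roughly 3x the per-GB price of DDR5, that stack count is exactly the economic-viability question the roadmap raises.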

Opportunities

Compute-in-memory (CIM)

Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM)

Resistive Random Access Memory (ReRAM)

Others

FeRAM: Employs ferroelectric materials (typically PZT or HfO2) that maintain polarization states for non-volatile storage. Achieves read/write speeds of 20-80ns with extremely high endurance (>10¹⁴ cycles) and low power consumption due to voltage-based switching. Cell size remains large (15-40F²) due to the capacitor structure, limiting density to 1-2 Gb/cm². Manufacturing challenges include integrating ferroelectric materials with CMOS and scaling below 130nm nodes. Cost remains 4-5x higher than DRAM due to specialized materials and process complexity, confining use to niche applications like industrial control systems and automotive. Development focus is shifting to HfO2-based implementations for better CMOS compatibility and scaling potential.
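The density ceiling follows directly from the cell-size figure: a cell of k·F² stores one bit in k times the squared feature size F. A minimal sketch, where the process nodes are illustrative assumptions (the 1-2 Gb/cm² figure above roughly matches a dense 15F² cell at a sub-100nm node):

```python
# Areal density implied by a memory cell occupying k * F^2,
# where F is the process feature size. Node choices are assumptions.

def density_gb_per_cm2(cell_factor: float, feature_nm: float) -> float:
    """Bits per cm^2 for a cell of cell_factor * F^2, in Gb/cm^2."""
    f_cm = feature_nm * 1e-7                  # nm -> cm
    cell_area_cm2 = cell_factor * f_cm ** 2
    return (1.0 / cell_area_cm2) / 1e9        # bits -> gigabits

for feature_nm in (130, 65):
    for k in (15, 40):
        print(f"F={feature_nm}nm, {k}F^2: "
              f"{density_gb_per_cm2(k, feature_nm):.2f} Gb/cm^2")
# F=130nm, 15F^2: 0.39 Gb/cm^2 ... F=65nm, 15F^2: 1.58 Gb/cm^2
```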

3D DRAM stacking: Employs through-silicon vias (TSVs) to vertically integrate multiple DRAM dies. Achieves bandwidth up to 900 GB/s and capacities of 24-48 GB per stack. Reduces power consumption by 50-70% compared to planar DRAM thanks to shorter interconnects. Thermal density rises with each added die, since the stack dissipates more power into the same footprint, requiring advanced cooling solutions. Manufacturing complexity, particularly TSV formation and die thinning, hurts yield. Cost remains 2-3x higher than conventional DRAM due to complex integration and lower yields.
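To see where a headline figure like 900 GB/s per stack comes from, the interface arithmetic helps; the bus width and per-pin rates below are assumptions chosen to illustrate an HBM-class, TSV-enabled interface rather than figures from the text:

```python
# Stack bandwidth = bus width x per-pin data rate.
# TSV stacking makes a very wide bus practical; values are assumptions.

bus_width_bits = 1024     # wide in-package interface (assumption)
pin_rate_gbps = 7.2       # per-pin data rate, Gb/s (assumption)
print(f"Stacked DRAM: {bus_width_bits * pin_rate_gbps / 8:.0f} GB/s")  # ~922

# Contrast: one planar DDR5 channel, 64 bits at 6.4 Gb/s per pin.
print(f"DDR5 channel: {64 * 6.4 / 8:.1f} GB/s")  # 51.2
```

The wide-but-relatively-slow interface is the point of stacking: thousands of short TSV connections deliver more bandwidth at lower per-bit energy than a narrow, fast off-package bus, consistent with the interconnect-driven power savings noted above.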

Hybrid Memory Systems in AI accelerators: Combines HBM (1-2 TB/s bandwidth, 4-64 GB capacity) with GDDR6/6X (up to 1 TB/s, 8-32 GB) or DDR5 (up to 460 GB/s, 128+ GB). Implements a multi-level memory hierarchy with software-managed data movement between tiers, using cache coherence protocols and page migration algorithms to optimize data placement (sketched below). Requires sophisticated memory controllers that manage multiple interfaces and protocols simultaneously. Enables heterogeneous compute architectures with specialized memory subsystems for different AI operations (e.g., HBM for matrix multiplies, DDR for embedding tables). Increases design complexity and power management challenges due to disparate voltage domains and timing requirements.
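A minimal sketch of the software-managed tiering described above: a hotness-based policy that promotes frequently touched pages to HBM and demotes the coldest resident page when HBM fills. Class names, thresholds, and granularity are illustrative assumptions; real controllers track heat in hardware and must also handle coherence, TLB shootdowns, and migration bandwidth budgets, all elided here.

```python
# Hotness-based page placement across a two-tier (HBM + DDR) memory.
# Everything here is an illustrative sketch, not a real controller API.

from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    capacity_pages: int
    resident: set = field(default_factory=set)

    def has_room(self) -> bool:
        return len(self.resident) < self.capacity_pages

class TieredMemory:
    """Promote hot pages to HBM; demote the coldest HBM page when full."""

    def __init__(self, hbm_pages: int, ddr_pages: int, promote_threshold: int = 4):
        self.hbm = Tier("HBM", hbm_pages)
        self.ddr = Tier("DDR", ddr_pages)       # capacity check elided for brevity
        self.heat: dict[int, int] = {}          # page id -> recent access count
        self.promote_threshold = promote_threshold

    def access(self, page: int) -> str:
        self.heat[page] = self.heat.get(page, 0) + 1
        if page in self.hbm.resident:
            return "HBM hit"
        self.ddr.resident.add(page)             # first touch lands in DDR
        if self.heat[page] >= self.promote_threshold:
            self._promote(page)
            return "promoted to HBM"
        return "DDR hit"

    def _promote(self, page: int) -> None:
        if not self.hbm.has_room():
            coldest = min(self.hbm.resident, key=lambda p: self.heat[p])
            self.hbm.resident.discard(coldest)  # demote coldest page to DDR
            self.ddr.resident.add(coldest)
        self.ddr.resident.discard(page)
        self.hbm.resident.add(page)

# Usage: a hot attention-weight page migrates to HBM; a rarely touched
# embedding-table page stays in DDR, mirroring the split described above.
mem = TieredMemory(hbm_pages=2, ddr_pages=64)
for _ in range(4):
    mem.access(0)
print(mem.access(0))   # -> HBM hit
print(mem.access(7))   # -> DDR hit
```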

Questions

  1. What specific circuit-level techniques and materials innovations are required for Compute-In-Memory (CIM) architectures to achieve <0.1% accuracy degradation for 8-bit matrix multiplications in 1024x1024 arrays, while maintaining >100 TOPS/W efficiency for convolutional neural network inference tasks?
  2. How can we engineer ferroelectric materials and domain structures to achieve a critical thickness below 2nm and a remanent polarization >20 µC/cm² in FeRAM devices, enabling >64 Gb/cm² density while maintaining sub-10ns write speeds and >10¹⁵ endurance cycles for AI model weight storage?
  3. What breakthroughs in spin-orbit torque materials or device structures are needed to reduce MRAM's write energy below 0.1 pJ/bit and current density below 10⁶ A/cm², while maintaining thermal stability (Δ > 60) for reliable operation in high-update-frequency AI training workloads?
  4. How can we develop monolithic 3D integration processes for DRAM that enable >16 active layers with <50nm inter-layer vias, maintaining sub-2ns random access times and <10 pJ/bit energy consumption, to support memory-intensive LLMs?
  5. What specific memory controller architectures and page migration algorithms can optimize data placement across HBM, GDDR6X, and PCIe Gen5 NVMe storage in hybrid memory systems, achieving <5% performance penalty compared to ideal data placement for evolving transformer architectures with varying attention mechanism compute patterns?