Logic and memory are fine if you can process everything on a single chip. Doing a Graphcore, as it's known. But your Claudes and your ChatGPTs are too large to fit on a single GPU: they contain hundreds of billions of parameters and need far more memory (often 350 GB or more) than the 80 GB of VRAM available on top-tier GPUs. The computational load of running these models likewise exceeds what a single chip can provide. To handle both the storage and the computation, these models are distributed across many GPUs, using techniques like model parallelism to share the workload.

The Nvidia DGX-1 isn't winning because it has the fastest processor or memory. It is a world-leading package. And key to the package, arguably Nvidia's biggest moat, is NVLink, its GPU interconnect. Unlike traditional PCIe switches, which have limited bandwidth, NVLink enables high-speed direct interconnection between GPUs within the server, offering over 3x the per-lane bandwidth of PCIe Gen5 (112 Gbps versus 32 GT/s). This enables tightly coupled multi-chip modules that can function as a single, more powerful logical GPU.

UALink, a joint effort between the hyperscalers and AMD, Intel, Broadcom, Cisco, and others, should eventually commoditize NVLink. But scaling the 100 GW cluster will be about faster, higher-bandwidth interconnects. Turning it up to 11. Interesting solutions to explore are silicon photonics, advanced packaging with Through-Silicon Vias (TSVs), and chiplet architectures with advanced interconnect fabrics. As with photonic processors for AI accelerators, we turn our attention to photons instead of electrons again. Photons just move faster. We already move data over long distances with light through fiber-optic cables; surely we should be able to move light over short distances, too?
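To make the sizes concrete, here is a back-of-envelope sketch in Python. The parameter count, precision, and VRAM figure are illustrative assumptions chosen to match the numbers quoted above, not vendor specs:

```python
# Back-of-envelope: why frontier models can't live on one GPU.
# All figures below are illustrative assumptions, not vendor specs.

PARAMS = 175e9          # assumed parameter count (hundreds of billions)
BYTES_PER_PARAM = 2     # fp16/bf16 weights
GPU_VRAM_GB = 80        # top-tier GPU memory (80 GB class)

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weights alone: {weights_gb:.0f} GB")       # -> 350 GB

min_gpus = -(-weights_gb // GPU_VRAM_GB)           # ceiling division
print(f"Minimum GPUs just to hold the weights: {min_gpus:.0f}")

# Per-lane interconnect bandwidth, as quoted in the text:
nvlink_lane_gbps = 112    # NVLink SerDes lane (PAM4)
pcie5_lane_gtps = 32      # PCIe Gen5 lane (32 GT/s, ~32 Gbps raw)
print(f"NVLink lane vs PCIe Gen5 lane: {nvlink_lane_gbps / pcie5_lane_gtps:.1f}x")
```

Even this floor of five GPUs ignores activations, optimizer state, and KV caches, which in practice push the count higher; hence the premium on the interconnect stitching them together.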
Opportunities
Silicon photonics
Optical interposers
Chiplet packaging
Others
- Graphene-based interconnects: Graphene, a two-dimensional carbon material, shows promise for ultra-high-speed interconnects due to its exceptional electrical and thermal properties. Graphene interconnects potentially offer higher current-carrying capacity and lower resistance than traditional copper interconnects, enabling faster signal propagation and reduced power consumption. Research demonstrates potential for sub-picosecond switching times and terahertz operating frequencies, though challenges remain in large-scale manufacturing and integration with existing semiconductor processes. (A first-order sanity check of the figures in these three bullets follows the list.)
- Wireless chip-to-chip communication: This approach uses on-chip antennas to transmit data between chips using millimeter-wave or terahertz frequencies. Wireless interconnects could potentially overcome limitations of physical connections, offering flexible chip placement and reduced packaging complexity. Recent advancements have demonstrated data rates exceeding 100 Gbps over short distances. However, challenges include antenna design in limited chip area, power efficiency of transceivers, and managing interference in dense multi-chip environments.
- Spintronic interconnects: These leverage electron spin rather than charge for information transfer, potentially offering lower power consumption and higher data rates than charge-based electronics. Such interconnects could enable non-volatile signal propagation and logic operations. Recent research has demonstrated spin wave propagation over distances of several micrometers with sub-nanosecond delays. Challenges include efficient spin injection and detection, managing spin coherence over longer distances, and integrating spintronic components with conventional CMOS technology.
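The claims above can be sanity-checked with first-order physics. A hedged sketch, where every geometry, capacitance, frequency, and velocity value is an assumption chosen for illustration rather than measured data:

```python
import math

# 1) Graphene vs copper: distributed-RC delay of a 1 mm on-chip wire.
#    Geometry and capacitance are assumed; the 2x resistance advantage
#    for graphene is a placeholder, not a measured figure.
rho_cu = 1.7e-8               # copper bulk resistivity, ohm*m
length = 1e-3                 # 1 mm wire
area = (100e-9) ** 2          # 100 nm x 100 nm cross-section
cap = 0.2e-15 * 1e6 * length  # assume 0.2 fF/um of wire capacitance
r_cu = rho_cu * length / area
delay_cu = 0.38 * r_cu * cap  # Elmore delay of a distributed RC line
print(f"Copper 1 mm RC delay: {delay_cu * 1e12:.0f} ps")
print(f"Hypothetical 2x-lower-resistance line: {0.5 * delay_cu * 1e12:.0f} ps")

# 2) Wireless chip-to-chip: free-space path loss at 140 GHz over 10 mm.
f, d = 140e9, 0.01
lam = 3e8 / f
fspl_db = 20 * math.log10(4 * math.pi * d / lam)
print(f"FSPL at {f / 1e9:.0f} GHz over {d * 1e3:.0f} mm: {fspl_db:.1f} dB")

# 3) Spintronics: propagation delay of a spin wave over a few micrometers,
#    assuming ~10 km/s group velocity (order-of-magnitude assumption).
v_spin, dist = 1e4, 5e-6
print(f"Spin-wave delay over {dist * 1e6:.0f} um: {dist / v_spin * 1e9:.2f} ns")
```

Under these assumptions the copper wire lands around 130 ps per millimeter, the millimeter-scale wireless hop loses roughly 35 dB before the antennas even start working, and the spin wave arrives in about half a nanosecond, consistent with the sub-nanosecond delays cited above.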
Questions
- What specific performance metrics (e.g., bandwidth, latency, power efficiency) must UALink achieve to surpass NVLink's current 900 GB/s bandwidth and <200 ns latency, and what adoption rate among top 5 GPU and ASIC manufacturers is required by 2025 to establish it as a viable open standard for AI accelerator interconnects?
- How can silicon photonics overcome the challenge of >50 K/mm thermal gradients in chiplet-based systems to maintain <0.1 dB/cm waveguide losses and <0.5 dB coupling-efficiency variation across a 50 mm² interposer, while achieving a manufacturing yield >90% for 5000 optical I/Os per chiplet?
- What innovations in heterogeneous integration and advanced packaging are needed to enable chiplet-based architectures with >20 TB/s aggregate bandwidth between compute and memory dies, sub-pJ/bit energy efficiency, and <10 μm alignment precision, while maintaining thermal stability across >1000 mm² packages?
- How can optical interposers be optimized to achieve <$0.01 per Gb/s cost at volume production while delivering >100 Tb/s aggregate bandwidth and <10 ns end-to-end latency for AI clusters with >1000 nodes, considering both silicon and polymer waveguide technologies?
- What are the fundamental physical limits for scaling electrical interconnects beyond 112 Gb/s per lane, and at what point does the crossover occur where optical interconnects become more energy-efficient and cost-effective for chip-to-chip and board-to-board communications in AI systems, considering both NRZ and PAM4 signaling? (A toy crossover model is sketched after this list.)
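The crossover in the last question can be framed with a toy energy model. A minimal sketch, assuming (purely for illustration) that electrical link energy grows roughly linearly with reach as channel loss mounts, while optical links pay a fixed laser-and-modulator overhead that is nearly distance-independent; all coefficients are invented placeholders:

```python
# Toy energy-per-bit model for the electrical-vs-optical crossover.
# All coefficients are illustrative assumptions, not measured values.

def electrical_pj_per_bit(distance_mm, base=0.5, per_mm=0.1):
    """Electrical SerDes: fixed cost plus channel-loss-driven growth."""
    return base + per_mm * distance_mm

def optical_pj_per_bit(distance_mm, fixed=2.0, per_mm=0.001):
    """Optical link: laser + modulator overhead, negligible reach cost."""
    return fixed + per_mm * distance_mm

# Find the approximate distance where optical wins.
for d in range(1, 201):
    if optical_pj_per_bit(d) < electrical_pj_per_bit(d):
        print(f"Crossover around {d} mm under these assumptions")
        break
```

With these placeholder numbers the crossover lands around 16 mm; the real answer depends on SerDes equalization cost, fiber/waveguide coupling loss, and laser wall-plug efficiency, which is exactly what the question is probing.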