Logic and memory are fine if you can process everything on a single chip. Doing a Graphcore, as it's known (keeping the whole model in on-chip memory). But your Claudes and your ChatGPTs are too large to fit on a single GPU: they contain hundreds of billions of parameters, requiring far more memory (often 350 GB or more) than the 80 GB of VRAM typically available on a top-tier GPU. And running these models demands far more compute than a single chip can deliver. To handle both the storage and the computation, the models are distributed across many GPUs, using techniques like model parallelism to share the workload.

The Nvidia DGX-1 isn't winning because it has the fastest processor or memory. It wins as a world-leading package. And key to that package, arguably the biggest moat Nvidia has, is NVLink, its GPU interconnect. Unlike traditional PCIe switches, which have limited bandwidth, NVLink provides high-speed direct interconnection between GPUs within a server. At 112 Gbps per lane, it offers roughly 3x the bandwidth of a PCIe Gen5 lane. This enables tightly coupled multi-GPU systems that can function as a single, more powerful logical GPU. UALink, a joint effort between the hyperscalers and AMD, Intel, Broadcom, Cisco and others, should eventually commoditize NVLink, but scaling to the 100 GW cluster will be about faster, higher-bandwidth interconnects. Turning it up to 11.

Interesting solutions to explore are silicon photonics, advanced packaging with Through-Silicon Vias (TSVs), and chiplet architectures with advanced interconnect fabrics. As with photonic processors for AI accelerators, we turn our attention to photons instead of electrons again. Photons just move faster. We already move data over long distances as light through fiber-optic cables; surely we should be able to move it as light over short distances, too?
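To put rough numbers on this, here is a back-of-envelope sketch in Python. The parameter count, bytes per parameter, and per-lane rates are illustrative assumptions chosen to match the figures above, not vendor specifications.

```python
import math

# Back-of-envelope: why a frontier model spills across many GPUs,
# and why the links between those GPUs matter so much.
# Parameter count and byte widths are illustrative assumptions.

def model_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint in GB (2 bytes/param = FP16/BF16); optimizer
    state and activations push the real number considerably higher."""
    return params_billions * 1e9 * bytes_per_param / 1e9

def gpus_needed(model_gb: float, vram_per_gpu_gb: float = 80) -> int:
    """Minimum GPU count just to hold the weights."""
    return math.ceil(model_gb / vram_per_gpu_gb)

if __name__ == "__main__":
    params_b = 175  # "hundreds of billions" of parameters, illustrative
    weights_gb = model_memory_gb(params_b)  # ~350 GB at FP16
    print(f"{params_b}B params -> ~{weights_gb:.0f} GB of weights alone")
    print(f"-> at least {gpus_needed(weights_gb)} x 80 GB GPUs before activations")

    # Per-lane signaling rates cited above
    nvlink_lane_gbps = 112
    pcie_gen5_lane_gbps = 32  # 32 GT/s per PCIe Gen5 lane
    print(f"NVLink vs PCIe Gen5, per lane: {nvlink_lane_gbps / pcie_gen5_lane_gbps:.1f}x")
```

Five GPUs just to hold the weights, before activations, KV caches and optimizer state, which is why the bandwidth between those GPUs becomes the bottleneck.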

Opportunities

Silicon photonics

Optical interposers

Chiplet packaging

Others

Questions

  1. What specific performance metrics (e.g., bandwidth, latency, power efficiency) must UALink achieve to surpass NVLink's current 900 GB/s bandwidth and <200 ns latency, and what adoption rate among top 5 GPU and ASIC manufacturers is required by 2025 to establish it as a viable open standard for AI accelerator interconnects?
  2. How can silicon photonics overcome the challenge of >50 K/mm thermal gradients in chiplet-based systems to maintain <0.1 dB/cm waveguide losses and <0.5 dB coupling efficiency variation across a 50 mm² interposer, while achieving a manufacturing yield >90% for 5000 optical I/Os per chiplet?
  3. What innovations in heterogeneous integration and advanced packaging are needed to enable chiplet-based architectures with >20 TB/s aggregate bandwidth between compute and memory dies, sub-pJ/bit energy efficiency, and <10 μm alignment precision, while maintaining thermal stability across >1000 mm² packages?
  4. How can optical interposers be optimized to achieve <$0.01 per Gb/s cost at volume production while delivering >100 Tb/s aggregate bandwidth and <10 ns end-to-end latency for AI clusters with >1000 nodes, considering both silicon and polymer waveguide technologies?
  5. What are the fundamental physical limits for scaling electrical interconnects beyond 112 Gb/s per lane, and at what point does the crossover occur where optical interconnects become more energy-efficient and cost-effective for chip-to-chip and board-to-board communications in AI systems, considering both NRZ and PAM4 signaling? (A toy crossover sketch follows this list.)
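On question 5, a toy model shows the shape of the crossover argument. Every coefficient below is a placeholder assumption, chosen only to illustrate how a roughly fixed optical overhead (laser, modulator, receiver) trades against electrical losses that grow with reach; none of these are measured values.

```python
# Toy crossover model for electrical vs optical link energy per bit.
# All coefficients are placeholder assumptions for illustration only.

def electrical_pj_per_bit(reach_cm: float, base_pj: float = 1.0,
                          pj_per_cm: float = 0.5) -> float:
    """Copper SerDes: channel loss grows with reach, so equalization and
    retiming energy is modeled as rising roughly linearly with distance."""
    return base_pj + pj_per_cm * reach_cm

def optical_pj_per_bit(reach_cm: float, fixed_pj: float = 3.0) -> float:
    """Optical link: pay a fixed laser/modulator/receiver overhead, but fiber
    and waveguide loss is negligible at these distances, so energy is ~flat."""
    return fixed_pj

if __name__ == "__main__":
    for reach in (1, 5, 10, 20, 50, 100):  # cm: on-package up to board-to-board
        e = electrical_pj_per_bit(reach)
        o = optical_pj_per_bit(reach)
        winner = "optical" if o < e else "electrical"
        print(f"{reach:>4} cm: electrical {e:4.1f} pJ/b, optical {o:4.1f} pJ/b -> {winner}")
```

Under these made-up numbers the crossover sits at a few centimeters; the real question is where it sits once actual SerDes, laser wall-plug efficiency, and packaging costs are plugged in.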