Frontier models are really big, and they are almost entirely linear algebra. More precisely, they spend most of their time multiplying matrices. So what if we built chips that do only that? Well, we already have optimized AI chips: Nvidia's datacenter GPUs are no longer general-purpose parts that also go in gaming PCs; the chip and the server around it have been optimized to run lots of matrix multiplications really fast. Google's TPUs, now in their fourth generation, have shown consistent improvements in compute efficiency, with TPU v4 pods delivering more than 1 exaflop of peak compute. Every generation from Nvidia, Google, and AMD improves performance per watt.

Groq, Cerebras, Tenstorrent, and SambaNova are already competing with Nvidia with “more” optimized designs. Others like D-Matrix and Rain are using digital in-memory designs to reduce latency and power consumption. Etched.ai is going all in on transformer ASICs. But with a run-of-the-mill 5x gain in performance or power, you are ngmi on margins versus Nvidia's scale and CUDA moat.

So I'm looking for the 10x or 100x leap that makes the extra cost and hassle of switching worth it. Honestly, at datacentre scale, your analog and neuromorphic chips will get crushed on raw performance. I'm looking at photonic chips, and while we're at it, let's go all in: aim for the Landauer limit with reversible computing, or compute with time instead. Incremental gains are not taking us where we want to go. Everything's on the table to build our new God.
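To put the Landauer limit in numbers, here's a quick back-of-envelope in Python. The ~1 pJ per multiply-accumulate and the bits-erased-per-MAC count are rough assumptions of mine, not measurements from any real chip; the point is only the orders of magnitude of headroom left before thermodynamics forces you into reversible logic.

```python
# Back-of-envelope: how far is today's digital logic from the Landauer limit?
# The ~1 pJ per multiply-accumulate figure below is a rough assumption for a
# modern digital accelerator, not a measured number for any specific chip.

import math

k_B = 1.380649e-23          # Boltzmann constant, J/K
T = 300.0                   # room temperature, K

landauer_limit = k_B * T * math.log(2)   # minimum energy to erase one bit, ~2.87e-21 J

assumed_energy_per_mac = 1e-12           # ~1 pJ per MAC (assumption, order of magnitude)
bits_erased_per_mac = 100                # very rough assumption: a MAC irreversibly overwrites tens to hundreds of bits

gap = assumed_energy_per_mac / (landauer_limit * bits_erased_per_mac)
print(f"Landauer limit at 300 K: {landauer_limit:.2e} J per bit erased")
print(f"Assumed digital MAC:     {assumed_energy_per_mac:.0e} J")
print(f"Rough headroom factor:   {gap:.1e}x")
```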
Opportunities
Photonic chips
Reversible computing
Temporal computing
Others
- 3D stacking leverages advanced packaging technologies to vertically integrate multiple chip layers, significantly increasing transistor density and reducing signal propagation distances. This approach can potentially improve performance by 30-50% and reduce power consumption by 20-30% compared to traditional 2D designs. Recent advancements in through-silicon vias (TSVs) and hybrid bonding techniques are enabling more complex 3D-stacked AI processors with high-bandwidth memory (HBM) integration.
- In-memory computing aims to overcome the von Neumann bottleneck by performing computations directly within memory arrays. This approach can reduce energy consumption by up to 100x for certain AI workloads by minimizing data movement (see the rough energy tally after this list). Recent demonstrations using resistive RAM (ReRAM) and phase-change memory (PCM) have shown promising results for implementing neural network inference with significantly improved energy efficiency.
- Neuromorphic architectures mimic the structure and function of biological neural networks, potentially offering orders of magnitude improvements in energy efficiency for certain AI tasks (a minimal spiking-neuron sketch follows this list). IBM's TrueNorth and Intel's Loihi chips demonstrate the potential of this approach, with energy efficiencies in the picojoules per synaptic operation range. Ongoing research focuses on scaling these architectures and developing software ecosystems to support them.
- Cryogenic computing operates processors at extremely low temperatures, typically using liquid nitrogen or helium cooling. This approach can reduce electrical resistance, enabling higher clock speeds and lower power consumption. Recent experiments with cryogenic CMOS have demonstrated up to 10x improvement in energy efficiency. While currently limited by cooling infrastructure challenges, advancements in cryogenic systems could make this viable for specialized AI accelerators in data centers.
- Spin-based computing utilizes the intrinsic angular momentum of electrons to perform logic operations and store information. This technology promises ultra-low power consumption, with theoretical switching energies as low as 1 attojoule per operation. Recent demonstrations of spin-orbit torque (SOT) devices have shown the potential for sub-nanosecond switching times, making them promising candidates for future energy-efficient AI processors.
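The energy tally behind the in-memory computing pitch above, as a quick sketch. The per-operation energies are order-of-magnitude assumptions in the spirit of often-quoted 45 nm estimates, not measurements of any specific device:

```python
# Rough energy tally for the von Neumann bottleneck argument behind in-memory
# computing. The per-operation energies below are order-of-magnitude ballparks
# (assumed, loosely in the spirit of widely cited 45 nm figures), not numbers
# for any real chip.

ENERGY_PJ = {
    "32-bit DRAM read": 640.0,   # off-chip memory access dominates
    "32-bit SRAM read": 5.0,     # on-chip cache access
    "32-bit float MAC": 4.0,     # the actual arithmetic
}

def energy_per_weight_use(fetch: str) -> float:
    """Energy (pJ) to fetch one weight via `fetch` and use it in one MAC."""
    return ENERGY_PJ[fetch] + ENERGY_PJ["32-bit float MAC"]

dram_bound = energy_per_weight_use("32-bit DRAM read")
compute_only = ENERGY_PJ["32-bit float MAC"]  # idealized: the weight never moves

print(f"DRAM-bound MAC:   {dram_bound:6.1f} pJ")
print(f"Compute only:     {compute_only:6.1f} pJ")
print(f"Potential saving: {dram_bound / compute_only:.0f}x")  # why 'up to 100x' claims appear
```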
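And, for the neuromorphic bullet, a minimal leaky integrate-and-fire (LIF) layer update, the kind of primitive chips like Loihi implement. The parameters are arbitrary illustrative assumptions and the dense matrix-vector product is only for clarity; real neuromorphic hardware is event-driven and only touches synapses whose inputs actually spiked.

```python
# Minimal leaky integrate-and-fire (LIF) neuron update. Illustrative sketch
# only; parameter values are arbitrary and not taken from any real chip.

import numpy as np

def lif_step(v, spikes_in, weights, leak=0.9, threshold=1.0):
    """One timestep for a layer of LIF neurons.

    v         : membrane potentials, shape (n_out,)
    spikes_in : binary input spikes, shape (n_in,)
    weights   : synaptic weights, shape (n_out, n_in)
    Returns updated potentials and output spikes. In hardware the update is
    event-driven (work only for inputs that spiked); this dense version just
    shows the math.
    """
    v = leak * v + weights @ spikes_in      # leak, then integrate weighted input spikes
    spikes_out = (v >= threshold).astype(float)
    v = np.where(spikes_out > 0, 0.0, v)    # reset neurons that fired
    return v, spikes_out

# Example: 4 output neurons, 3 inputs, one sparse input spike vector
rng = np.random.default_rng(0)
v = np.zeros(4)
w = rng.normal(0.5, 0.2, size=(4, 3))
v, out = lif_step(v, np.array([1.0, 0.0, 1.0]), w)
print(out)
```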
Open Questions
- What specific architectural innovations and manufacturing processes are required for emerging analog AI processors to achieve >10 TOPS/W (Tera Operations Per Second per Watt) efficiency while maintaining accuracy within 1% of state-of-the-art digital implementations for convolutional neural networks and transformer models?
- What innovations in reversible logic gate design can reduce the energy-delay product below 10^-21 J·s for arithmetic operations in deep neural network accelerators?
- How can we implement efficient, low-latency interfaces between superconducting quantum processors and room-temperature CMOS circuits to enable hybrid classical-quantum algorithms for machine learning tasks?
- What compiler optimizations and intermediate representations are needed to efficiently map tensor operations from popular deep learning frameworks onto novel neuromorphic architectures with >1 million spiking neurons?
- How can we design scalable interconnect topologies that maintain <100 ns end-to-end latency for distributed training of transformer models across heterogeneous compute nodes combining GPUs, TPUs, and photonic accelerators?