Frontier models are really big, and under the hood they are almost all linear algebra. More precisely, they are mostly just multiplying matrices. What if we made logic chips that do only that? Well, we already have optimized AI chips: Nvidia GPUs are no longer general-purpose parts that also go in gaming PCs; the chip and the server around it have been tuned to run lots of matrix multiplications really fast. Google's TPUs have shown consistent efficiency gains generation over generation, with a TPU v4 pod delivering more than 1 exaflop of compute. Every generation from Nvidia, Google, and AMD improves performance per watt.

We also already have Groq, Cerebras, Tenstorrent, and SambaNova competing with Nvidia on "more" optimized designs. Others like d-Matrix and Rain are using digital in-memory designs to reduce latency and power consumption. Etched.ai is going all-in on transformer ASICs. But with a run-of-the-mill 5x gain in performance or power, you are ngmi on margins against Nvidia's scale and the CUDA moat. So I'm looking for the 10x or 100x leap that makes the extra cost and hassle of switching worth it.

Honestly, at datacentre scale, analog and neuromorphic chips will get crushed on raw performance. I'm looking at photonic chips, and while we're at it, let's go all the way: aim for the Landauer limit with reversible computation, or compute with time instead. Incremental gains won't take us where we want to go. Everything's on the table to build our new God.
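
To make the "it's just matrix multiplication" claim concrete, here is a rough FLOP count for one decoder block generating a single token. The layer sizes and the non-matmul estimate are illustrative assumptions (GPT-3-ish dimensions), not figures from any particular model, but the conclusion is robust: the matmuls dominate.

```python
# Rough FLOP breakdown for one transformer decoder block, one token at inference.
# All sizes are illustrative assumptions, not measurements of a real model.
d_model = 12288          # hidden size (assumed)
d_ff = 4 * d_model       # MLP inner size (assumed)
n_ctx = 2048             # context length attended over (assumed)

# Matrix multiplications: Q/K/V/output projections, attention scores and
# weighted sum over values, and the two MLP projections.
matmul_flops = (
    4 * 2 * d_model * d_model      # Q, K, V, and output projections
    + 2 * 2 * n_ctx * d_model      # q·K^T scores and weighted sum over V
    + 2 * 2 * d_model * d_ff       # MLP up- and down-projection
)

# Everything else: softmax, layer norms, residual adds, activation function.
other_flops = 5 * n_ctx + 10 * d_model + 4 * d_ff   # crude elementwise estimate

print(f"matmul share ≈ {matmul_flops / (matmul_flops + other_flops):.4%}")
# On these assumptions the matrix multiplications are >99.9% of the arithmetic.
```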

Opportunities

Photonic chips

Reversible computing

Temporal computing

Others

Open Questions

  1. What specific architectural innovations and manufacturing processes are required for emerging analog AI processors to achieve >10 TOPS/W (tera-operations per second per watt) efficiency while maintaining accuracy within 1% of state-of-the-art digital implementations for convolutional neural networks and transformer models? (See the back-of-the-envelope energy numbers after this list.)
  2. What innovations in reversible logic gate design can reduce the energy-delay product below 10^-21 J·s for arithmetic operations in deep neural network accelerators?
  3. How can we implement efficient, low-latency interfaces between superconducting quantum processors and room-temperature CMOS circuits to enable hybrid classical-quantum algorithms for machine learning tasks?
  4. What compiler optimizations and intermediate representations are needed to efficiently map tensor operations from popular deep learning frameworks onto novel neuromorphic architectures with >1 million spiking neurons?
  5. How can we design scalable interconnect topologies that maintain <100 ns end-to-end latency for distributed training of transformer models across heterogeneous compute nodes combining GPUs, TPUs, and photonic accelerators?
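
For a sense of scale on questions 1 and 2, and on the Landauer limit mentioned up top, here is a back-of-the-envelope sketch. The Boltzmann-constant arithmetic is just physics; the ~1 pJ/FLOP figure for current GPUs is a rough ballpark assumption, not a measured number.

```python
from math import log

# Back-of-the-envelope energy scales behind questions 1 and 2.
k_B = 1.380649e-23        # Boltzmann constant, J/K
T = 300.0                 # room temperature, K

landauer = k_B * T * log(2)        # minimum energy to erase one bit, ~2.9e-21 J
per_op_at_10_tops_w = 1 / 10e12    # 10 TOPS/W -> 1 J buys 1e13 ops -> 1e-13 J/op
gpu_per_flop = 1e-12               # ~1 pJ/FLOP, rough ballpark for current GPUs

print(f"Landauer limit at 300 K:  {landauer:.2e} J per bit erased")
print(f"Budget at 10 TOPS/W:      {per_op_at_10_tops_w:.2e} J per op")
print(f"Rough current GPU energy: {gpu_per_flop:.2e} J per FLOP")
print(f"Headroom above Landauer:  ~{per_op_at_10_tops_w / landauer:.1e}x")
```

Even the 10 TOPS/W target in question 1 sits several orders of magnitude above the Landauer floor, which is why reversible and near-reversible approaches (question 2) are worth asking about at all.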