Okay, what if, and hear me out, we trained models not on a GPU cluster in someone’s cloud, but across thousands, ideally millions, of everyday computers distributed across the globe? It’s a tricky challenge, yes, but much of the groundwork has already been laid by the crypto industry since 2008. Bitcoin’s proof-of-work and Ethereum’s smart contracts established decentralized and distributed computation, but their capabilities fall far short of the intensive demands of frontier models. Pioneering projects like Golem and iExec have offered peer-to-peer compute resources for years, yet widespread adoption has been constrained by limited demand and technological hurdles. However, a convergence of technologies is now setting the stage for decentralized compute to become viable infrastructure at scale. It’s interesting to think that at a micro scale we are already wiring multiple chips together to scale out into clusters of distributed servers. The same trend may lead us to connect smaller data centers, servers and computers around the world, aggregating all of that computing into a “global fungible pool of compute”: distributed compute at macro scale. Fungibility is back. It never went away.
Decentralized AI training protocols aim to replicate, across distributed networks, the efficiency of GPU clusters sitting in a single datacenter. While datacenter training benefits from high-bandwidth, low-latency interconnects like NVLink within a node and InfiniBand between nodes (hundreds of gigabits per second per link), decentralized systems contend with public internet links often below 1 Gbps and latencies in the tens of milliseconds. To bridge this gap, techniques like Ring AllReduce are being adapted for wide-area networks, employing UDP-based protocols with forward error correction. Gradient compression methods, including 1-bit SGD and Deep Gradient Compression, can cut communication volume by more than 99%, though at the cost of extra compute for compression and decompression.
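To make the compression idea concrete, here is a minimal sketch of top-k gradient sparsification with error feedback, in the spirit of Deep Gradient Compression. It assumes PyTorch tensors; the 0.1% density, the function names and the error-feedback buffer are illustrative choices, not the paper’s exact recipe.

```python
import torch

def topk_compress(grad: torch.Tensor, residual: torch.Tensor, density: float = 0.001):
    """Top-k sparsification with error feedback (illustrative, not DGC's exact recipe)."""
    acc = grad + residual                         # add back what we failed to send last round
    flat = acc.flatten()
    k = max(1, int(flat.numel() * density))       # keep e.g. the top 0.1% of coordinates
    _, idx = torch.topk(flat.abs(), k)            # largest-magnitude entries
    values = flat[idx]
    residual = acc.clone()
    residual.view(-1)[idx] = 0.0                  # unsent mass stays local for the next step
    return idx, values, residual

def topk_decompress(idx, values, shape):
    """Rebuild a dense gradient from the sparse (indices, values) payload."""
    flat = torch.zeros(shape, dtype=values.dtype).flatten()
    flat[idx] = values
    return flat.view(shape)

grad = torch.randn(1024, 1024)
residual = torch.zeros_like(grad)
idx, values, residual = topk_compress(grad, residual)
sparse_grad = topk_decompress(idx, values, grad.shape)   # ~1k values sent instead of ~1M
```

Each worker then ships only the (indices, values) pairs per tensor across the slow link, which is where the orders-of-magnitude bandwidth savings come from, paid for with an extra top-k pass on every step.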
These technologies have shown promising results in specific contexts. Facebook's Bluewhale system demonstrated that training ResNet-50 on ImageNet could be completed in 1.5 hours across 256 GPUs spread globally, compared to 1 hour in a single datacenter. EASGD implementations have achieved near-linear scaling for certain models up to 128 nodes. The FedAvg algorithm in federated learning scenarios has shown it can train models on highly distributed datasets without centralizing raw data, which is crucial for privacy-sensitive applications. On a recent Dwarkesh podcast, Dylan Patel claimed Microsoft had “cracked” distributed training, pointing to its acquisition of multiple geographically dispersed datacenters across the US.
Still, despite Dylan’s view on Microsoft, scaling challenges persist, particularly for frontier models with billions, if not trillions, of parameters. The communication-to-computation ratio becomes increasingly unfavorable as model size grows. Current implementations struggle to match the performance of optimized data-parallel training on tightly coupled GPU clusters for very large models. Heterogeneous node capabilities and network conditions introduce load-balancing complexities absent in homogeneous datacenter environments. Additionally, the increased surface area for potential Byzantine faults necessitates robust consensus mechanisms, adding overhead not present in centralized training. While promising for certain use cases, particularly those prioritizing data privacy or leveraging edge compute, decentralized training has yet to demonstrate consistent superiority over centralized approaches for cutting-edge, large-scale model development. The sweet spot may be distributed training across multiple datacenters rather than across a global network of heterogeneous computing devices. But crypto likes nothing better than an economically unviable but technically challenging protocol project. So, let’s see.
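A rough back-of-the-envelope calculation shows why that ratio turns against you. The numbers below (a 10B-parameter model, fp16 gradients, a 1 Gbps consumer link versus 400 Gbps InfiniBand) are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: time to move one full gradient update over different links.
# All figures are illustrative assumptions, not benchmarks.

params = 10e9                                  # 10B-parameter model
bytes_per_grad = 2                             # fp16 gradients
payload_bits = params * bytes_per_grad * 8     # ~160 Gbit per full synchronization

for name, link_bps in [("InfiniBand, 400 Gbps", 400e9),
                       ("Public internet, 1 Gbps", 1e9)]:
    # Ignores all-reduce constant factors, topology, and overlap with compute.
    seconds = payload_bits / link_bps
    print(f"{name}: ~{seconds:.1f} s per uncompressed sync")

# InfiniBand, 400 Gbps:   ~0.4 s per uncompressed sync
# Public internet, 1 Gbps: ~160.0 s per uncompressed sync
```

Even with 100x gradient compression, that wide-area sync is still on the order of seconds per step, which is why a handful of well-connected datacenters looks like a far more plausible sweet spot than millions of laptops for frontier-scale models.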
Worth watching
Scalability in decentralized AI training necessitates the development of consensus mechanisms and distributed optimization algorithms that can efficiently coordinate hundreds to thousands of nodes. Proof-of-Work is ill-suited due to its high latency and energy consumption. Instead, Proof-of-Stake variants and Directed Acyclic Graph (DAG) structures are the most likely candidates for AI workloads. For instance, Avalanche consensus, with its subsampled voting mechanism, offers potential for low-latency finality across large node sets. In parallel, distributed optimization algorithms like Federated Averaging (FedAvg) and Decentralized Parallel Stochastic Gradient Descent (D-PSGD) are being refined to maintain convergence properties at scale.
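For reference, the FedAvg aggregation step itself is almost trivially simple; the hard part is everything around it. Here is a minimal numpy sketch of the server-side weighted average, with client training stubbed out (the function name and toy data are mine, not Google's code):

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Server-side FedAvg step: weighted average of client models.

    client_weights: list of per-client models, each a list of numpy arrays (one per layer)
    client_sizes:   number of local training examples each client used this round
    """
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    new_global = []
    for layer in range(num_layers):
        # Weight each client's layer by its share of the total data.
        layer_avg = sum(w[layer] * (n / total)
                        for w, n in zip(client_weights, client_sizes))
        new_global.append(layer_avg)
    return new_global

# Toy round: three "clients" with one-layer models and different data sizes.
clients = [[np.array([1.0, 2.0])], [np.array([3.0, 4.0])], [np.array([5.0, 6.0])]]
sizes = [100, 200, 700]
print(fedavg_aggregate(clients, sizes))  # weighted toward the 700-example client
```

Everything that makes this hard at scale (client selection, stragglers, adversarial updates, how often to run a round over a slow network) sits outside this function.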
These scalable protocols have shown promising results in controlled environments. Implementations of Horovod, an open-source distributed deep learning framework, have demonstrated near-linear scaling up to 27,000 GPUs for certain models. DAG-based systems have achieved theoretical throughput of over 1,000 transactions per second, potentially allowing for high-frequency parameter updates in large-scale training scenarios. Although, obviously, “it’s not decentralized” will come the cry, a la IOTA. But I dunno man, trade-offs. FedAvg, for example, has been successfully deployed by Google across millions of mobile devices, so we know some coordination schemes can handle extreme node counts, albeit with much less frequent synchronization.
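For comparison, this is roughly what the centralized baseline looks like: a few lines of Horovod wiring turn a single-GPU PyTorch loop into ring-allreduce data parallelism. A minimal sketch, with the model and learning rate as placeholders:

```python
import torch
import horovod.torch as hvd

hvd.init()                                       # one process per GPU
torch.cuda.set_device(hvd.local_rank())          # pin each process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()         # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale LR with workers

# Wrap the optimizer so gradients are averaged via ring-allreduce on every step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# ...then the usual forward/backward/step loop, launched with e.g.:
#   horovodrun -np 8 python train.py
```

The decentralized protocols above are, in effect, trying to reproduce that one `DistributedOptimizer` line without the datacenter underneath it.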
Despite these advancements, we are still far away. As node count increases, so does the probability of stragglers and Byzantine nodes, necessitating robust fault tolerance mechanisms that can impact overall system throughput. The communication overhead for global model synchronization grows quadratically with node count in naive implementations, requiring sophisticated gossip protocols or hierarchical aggregation strategies to mitigate. Moreover, the heterogeneity of compute resources in a truly decentralized network introduces load balancing complexities absent in homogeneous GPU clusters. While promising for certain use cases, particularly those prioritizing decentralization over raw performance, these scalable protocols have yet to consistently outperform optimized data-parallel training in controlled datacenter environments for state-of-the-art model architectures.
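Gossip averaging is the standard answer to that quadratic blow-up: each node exchanges parameters with only a few neighbors per round, and the models still mix toward the global mean. A minimal sketch, assuming a fixed ring topology and numpy vectors standing in for model parameters:

```python
import numpy as np

def gossip_round(models, neighbors):
    """One synchronous gossip-averaging round.

    models:    list of parameter vectors, one per node
    neighbors: neighbors[i] is the list of node indices node i talks to this round
    """
    new_models = []
    for i, params in enumerate(models):
        group = [params] + [models[j] for j in neighbors[i]]
        # Each node averages only with its own neighborhood: O(n * degree)
        # messages per round instead of O(n^2) for naive all-to-all.
        new_models.append(np.mean(group, axis=0))
    return new_models

# Toy example: 6 nodes on a ring, each talking to its two ring neighbors.
n = 6
models = [np.full(3, float(i)) for i in range(n)]        # node i starts at value i
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
for _ in range(10):
    models = gossip_round(models, ring)
print(models[0])  # every node drifts toward the global mean (2.5) after a few rounds
```

Per round that is O(n * degree) messages rather than O(n^2), essentially the trade D-PSGD makes: more rounds to mix, far less traffic per round.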
Worth watching