Okay, what if, and hear me out, we trained models not on a GPU cluster in someone’s cloud, but across thousands, ideally millions, of everyday computers distributed across the globe? It’s a tricky challenge, yes, but much of the groundwork has already been laid by the crypto industry since 2008. Bitcoin’s proof-of-work and Ethereum’s smart contracts established decentralized and distributed computation, but their capabilities fall far short of the intensive demands of frontier models. Pioneering projects like Golem and iExec have offered peer-to-peer compute resources for years, yet widespread adoption has been constrained by limited demand and technological hurdles. However, a convergence of technologies is now setting the stage for decentralized compute to become viable infrastructure at scale. It’s interesting to think that at a micro scale we are already wiring multiple chips together to scale out into clusters of distributed servers. The same trend may lead us to connect smaller data centers, servers and computers around the world, aggregating all of that computing into a “global fungible pool of compute”: distributed compute at macro scale. Fungibility is back. It never went away.
Decentralized AI training protocols aim to replicate, across distributed networks, the efficiency of GPU clusters sitting in a single datacenter. While datacenter training benefits from high-bandwidth, low-latency interconnects like NVLink within a node and InfiniBand between nodes (hundreds of gigabits per second per link), decentralized systems contend with public internet links often below 1 Gbps and latencies in the tens of milliseconds. To bridge this gap, techniques like Ring AllReduce are being adapted for wide-area networks, employing UDP-based protocols with forward error correction. Gradient compression methods, including 1-bit SGD and Deep Gradient Compression, can cut communication volume by more than 99%, though at the cost of extra compute for compression and decompression.
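To make the compression idea concrete, here is a minimal sketch of top-k gradient sparsification with error feedback, in the spirit of Deep Gradient Compression. It assumes PyTorch tensors; the 0.1% density, the function names and the error-feedback buffer are illustrative choices, not the paper’s exact recipe.

```python
import torch

def topk_compress(grad: torch.Tensor, residual: torch.Tensor, density: float = 0.001):
    """Top-k sparsification with error feedback (illustrative, not DGC's exact recipe)."""
    acc = grad + residual                         # add back what we failed to send last round
    flat = acc.flatten()
    k = max(1, int(flat.numel() * density))       # keep e.g. the top 0.1% of coordinates
    _, idx = torch.topk(flat.abs(), k)            # largest-magnitude entries
    values = flat[idx]
    residual = acc.clone()
    residual.view(-1)[idx] = 0.0                  # unsent mass stays local for the next step
    return idx, values, residual

def topk_decompress(idx, values, shape):
    """Rebuild a dense gradient from the sparse (indices, values) payload."""
    flat = torch.zeros(shape, dtype=values.dtype).flatten()
    flat[idx] = values
    return flat.view(shape)

grad = torch.randn(1024, 1024)
residual = torch.zeros_like(grad)
idx, values, residual = topk_compress(grad, residual)
sparse_grad = topk_decompress(idx, values, grad.shape)   # ~1k values sent instead of ~1M
```

Each worker then ships only the (indices, values) pairs per tensor across the slow link, which is where the orders-of-magnitude bandwidth savings come from, paid for with an extra top-k pass on every step.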
These technologies have shown promising results in specific contexts. Facebook's Bluewhale system demonstrated that training ResNet-50 on ImageNet could be completed in 1.5 hours across 256 GPUs spread globally, compared to 1 hour in a single datacenter. EASGD implementations have achieved near-linear scaling for certain models up to 128 nodes. The FedAvg algorithm in federated learning scenarios has shown it can train models on highly distributed datasets without centralizing raw data, which is crucial for privacy-sensitive applications. On a recent Dwarkesh podcast, Dylan Patel claimed Microsoft had “cracked” distributed training, pointing to its acquisition of multiple geographically dispersed datacenters across the US.
Still, despite Dylan’s view on Microsoft, scaling challenges persist, particularly for frontier models with billions, if not trillions, of parameters. The communication-to-computation ratio becomes increasingly unfavorable as model size grows. Current implementations struggle to match the performance of optimized data-parallel training on tightly coupled GPU clusters for very large models. Heterogeneous node capabilities and network conditions introduce load-balancing complexities absent in homogeneous datacenter environments. Additionally, the increased surface area for potential Byzantine faults necessitates robust consensus mechanisms, adding overhead not present in centralized training. While promising for certain use cases, particularly those prioritizing data privacy or leveraging edge compute, decentralized training has yet to demonstrate consistent superiority over centralized approaches for cutting-edge, large-scale model development. The sweet spot may be distributed training across multiple datacenters rather than across a global network of heterogeneous computing devices. But crypto likes nothing better than an economically unviable but technically challenging protocol project. So, let’s see.
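A rough back-of-the-envelope calculation shows why that ratio turns against you. The numbers below (a 10B-parameter model, fp16 gradients, a 1 Gbps consumer link versus 400 Gbps InfiniBand) are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: time to move one full gradient update over different links.
# All figures are illustrative assumptions, not benchmarks.

params = 10e9                                  # 10B-parameter model
bytes_per_grad = 2                             # fp16 gradients
payload_bits = params * bytes_per_grad * 8     # ~160 Gbit per full synchronization

for name, link_bps in [("InfiniBand, 400 Gbps", 400e9),
                       ("Public internet, 1 Gbps", 1e9)]:
    # Ignores all-reduce constant factors, topology, and overlap with compute.
    seconds = payload_bits / link_bps
    print(f"{name}: ~{seconds:.1f} s per uncompressed sync")

# InfiniBand, 400 Gbps:   ~0.4 s per uncompressed sync
# Public internet, 1 Gbps: ~160.0 s per uncompressed sync
```

Even with 100x gradient compression, that wide-area sync is still on the order of seconds per step, which is why a handful of well-connected datacenters looks like a far more plausible sweet spot than millions of laptops for frontier-scale models.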
Worth watching
Scalability in decentralized AI training necessitates the development of consensus mechanisms and distributed optimization algorithms that can efficiently coordinate hundreds to thousands of nodes. Proof-of-Work is ill-suited due to its high latency and energy consumption. Instead, Proof-of-Stake variants and Directed Acyclic Graph (DAG) structures are the most likely candidates for AI workloads. For instance, Avalanche consensus, with its subsampled voting mechanism, offers potential for low-latency finality across large node sets. In parallel, distributed optimization algorithms like Federated Averaging (FedAvg) and Decentralized Parallel Stochastic Gradient Descent (D-PSGD) are being refined to maintain convergence properties at scale.
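For reference, the FedAvg aggregation step itself is almost trivially simple; the hard part is everything around it. Here is a minimal numpy sketch of the server-side weighted average, with client training stubbed out (the function name and toy data are mine, not Google's code):

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Server-side FedAvg step: weighted average of client models.

    client_weights: list of per-client models, each a list of numpy arrays (one per layer)
    client_sizes:   number of local training examples each client used this round
    """
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    new_global = []
    for layer in range(num_layers):
        # Weight each client's layer by its share of the total data.
        layer_avg = sum(w[layer] * (n / total)
                        for w, n in zip(client_weights, client_sizes))
        new_global.append(layer_avg)
    return new_global

# Toy round: three "clients" with one-layer models and different data sizes.
clients = [[np.array([1.0, 2.0])], [np.array([3.0, 4.0])], [np.array([5.0, 6.0])]]
sizes = [100, 200, 700]
print(fedavg_aggregate(clients, sizes))  # weighted toward the 700-example client
```

Everything that makes this hard at scale (client selection, stragglers, adversarial updates, how often to run a round over a slow network) sits outside this function.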
These scalable protocols have shown promising results in controlled environments. Implementations of Horovod, an open-source distributed deep learning framework, have demonstrated near-linear scaling up to 27,000 GPUs for certain models. DAG-based systems have achieved theoretical throughput of over 1,000 transactions per second, potentially allowing for high-frequency parameter updates in large-scale training scenarios. Although, obviously, “it’s not decentralized” will come the cry, a la IOTA. But I dunno man, trade-offs. FedAvg, for example, has been successfully deployed by Google across millions of mobile devices, so we know some coordination schemes can handle extreme node counts, albeit with much less frequent synchronization.
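For comparison, this is roughly what the centralized baseline looks like: a few lines of Horovod wiring turn a single-GPU PyTorch loop into ring-allreduce data parallelism. A minimal sketch, with the model and learning rate as placeholders:

```python
import torch
import horovod.torch as hvd

hvd.init()                                       # one process per GPU
torch.cuda.set_device(hvd.local_rank())          # pin each process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()         # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale LR with workers

# Wrap the optimizer so gradients are averaged via ring-allreduce on every step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# ...then the usual forward/backward/step loop, launched with e.g.:
#   horovodrun -np 8 python train.py
```

The decentralized protocols above are, in effect, trying to reproduce that one `DistributedOptimizer` line without the datacenter underneath it.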
Despite these advancements, we are still far away. As node count increases, so does the probability of stragglers and Byzantine nodes, necessitating robust fault tolerance mechanisms that can impact overall system throughput. The communication overhead for global model synchronization grows quadratically with node count in naive implementations, requiring sophisticated gossip protocols or hierarchical aggregation strategies to mitigate. Moreover, the heterogeneity of compute resources in a truly decentralized network introduces load balancing complexities absent in homogeneous GPU clusters. While promising for certain use cases, particularly those prioritizing decentralization over raw performance, these scalable protocols have yet to consistently outperform optimized data-parallel training in controlled datacenter environments for state-of-the-art model architectures.
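Gossip averaging is the standard answer to that quadratic blow-up: each node exchanges parameters with only a few neighbors per round, and the models still mix toward the global mean. A minimal sketch, assuming a fixed ring topology and numpy vectors standing in for model parameters:

```python
import numpy as np

def gossip_round(models, neighbors):
    """One synchronous gossip-averaging round.

    models:    list of parameter vectors, one per node
    neighbors: neighbors[i] is the list of node indices node i talks to this round
    """
    new_models = []
    for i, params in enumerate(models):
        group = [params] + [models[j] for j in neighbors[i]]
        # Each node averages only with its own neighborhood: O(n * degree)
        # messages per round instead of O(n^2) for naive all-to-all.
        new_models.append(np.mean(group, axis=0))
    return new_models

# Toy example: 6 nodes on a ring, each talking to its two ring neighbors.
n = 6
models = [np.full(3, float(i)) for i in range(n)]        # node i starts at value i
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
for _ in range(10):
    models = gossip_round(models, ring)
print(models[0])  # every node drifts toward the global mean (2.5) after a few rounds
```

Per round that is O(n * degree) messages rather than O(n^2), essentially the trade D-PSGD makes: more rounds to mix, far less traffic per round.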
Worth watching