tldr: yes, it’s a trade-off; it always is. Confidential Computing is the brand name for running workloads inside Trusted Execution Environments (TEEs), isolating sensitive computations from the operating system, hypervisor, and other potentially untrusted software layers. By creating a secure enclave for computation, Confidential Computing protects against both external threats, such as malware or hackers, and insider risks, including malicious administrators. In AI, where vast amounts of sensitive data are processed to train models or run inference, the ability to secure data during computation is crucial, particularly in cloud environments where users do not fully control the infrastructure. Microsoft’s Azure Confidential Computing and Google Cloud’s Confidential VMs, for example, offer platforms for secure data processing that mitigate the risks of outsourcing AI workloads to public clouds.

Distributed TEEs for AI Model Training and Inference

Distributed Trusted Execution Environments (TEEs) are a crucial technology for securely training and running AI models across multiple machines in a decentralized manner. TEEs such as Intel's Software Guard Extensions (SGX) provide hardware-based isolation, allowing sensitive computations to be performed in secure enclaves. These enclaves keep both data and computation confidential even in untrusted environments, encrypting memory and isolating critical portions of the application from the underlying system, including the operating system and hypervisor. This hardware-based security is essential for AI workloads that touch sensitive data, as it mitigates both external and insider threats. Recent advancements, such as Intel’s Total Memory Encryption (TME) and the larger enclave memory available on newer Xeon processors, have expanded what TEEs can handle, allowing larger datasets and more complex AI models. However, when scaled to distributed AI training and inference, TEEs must be deployed across multiple nodes, which introduces new complexity in maintaining security, performance, and scalability across the network.
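To make the trust model concrete, here is a minimal Python sketch of the decision a client makes before handing data to a remote TEE: verify the enclave's attestation evidence, then derive a session key bound to it. The quote format and helper names are hypothetical simplifications; real SGX deployments verify signed quotes through Intel's DCAP or EPID attestation infrastructure, not a bare hash comparison.

```python
import hashlib
import hmac
import os

# Hypothetical known-good code measurement for the enclave build
# (a stand-in for SGX's MRENCLAVE value).
EXPECTED_MEASUREMENT = hashlib.sha256(b"trusted-enclave-build-v1").hexdigest()

def verify_attestation(quote: dict) -> bool:
    """Accept the enclave only if its reported measurement matches the known-good one."""
    return hmac.compare_digest(quote["measurement"], EXPECTED_MEASUREMENT)

def establish_session(quote: dict) -> bytes:
    """Derive a fresh session key only after attestation succeeds."""
    if not verify_attestation(quote):
        raise RuntimeError("attestation failed: refusing to send data")
    # Bind the key to both the enclave measurement and a fresh nonce.
    nonce = os.urandom(16)
    return hashlib.sha256(bytes.fromhex(quote["measurement"]) + nonce).digest()

# A quote as a (simulated) enclave might report it.
quote = {"measurement": EXPECTED_MEASUREMENT}
session_key = establish_session(quote)
print(f"session key established: {session_key.hex()[:16]}...")
```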

Distributed TEEs offer significant potential for enabling privacy-preserving AI model training and inference, particularly in multi-tenant cloud environments where data security is a top priority. By isolating AI computations within secure enclaves distributed across several machines, organizations can safely collaborate on AI model training without exposing sensitive data to third parties, such as cloud providers. For example, companies like Oasis Labs have demonstrated the feasibility of training deep learning models within SGX enclaves, ensuring that both the model and the data remain confidential throughout the entire process. This approach enables AI-as-a-Service (AIaaS) providers to offer secure AI model training on encrypted data, opening new opportunities for collaborative AI across industries such as healthcare, finance, and government, where data privacy is critical. Inference can also be securely distributed, allowing organizations to deploy AI models across different nodes without compromising data security.
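As a rough illustration of that collaboration pattern, the toy sketch below has each party fit a shared model on its own private data, with only the resulting weights (never the raw data) leaving each "enclave" to be averaged. The enclave isolation and encrypted channels are simulated away here; only the data flow is the point.

```python
import random

def local_sgd_step(w, data, lr=0.01):
    """One SGD pass on y = w*x, run entirely on a party's private data (inside its enclave)."""
    for x, y in data:
        grad = 2 * (w * x - y) * x  # dL/dw for squared error
        w -= lr * grad
    return w

def federated_round(global_w, private_datasets):
    """Each enclave trains locally; only the updated weights are shared and averaged."""
    updates = [local_sgd_step(global_w, data) for data in private_datasets]
    return sum(updates) / len(updates)

# Two parties hold private samples of y = 3x plus noise; neither reveals them.
random.seed(0)
parties = [[(x, 3 * x + random.gauss(0, 0.1)) for x in range(1, 6)] for _ in range(2)]

w = 0.0
for _ in range(50):
    w = federated_round(w, parties)
print(f"learned weight: {w:.2f}  (true slope: 3)")
```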

However, despite these advantages, running distributed AI models within TEEs presents several technical challenges. The primary obstacle is scalability, because the secure memory available in current TEE architectures is limited. Classic Intel SGX enclaves, for instance, offer only roughly 128-256 MB of Enclave Page Cache (EPC), while many modern AI models require gigabytes of memory just for inference, let alone the far larger requirements of training. To work within these constraints, developers may need to split models into smaller segments, complicating deployment and potentially reducing efficiency. Moreover, distributed inference and training add another layer of complexity: performing these tasks securely across multiple TEEs requires careful coordination so that security guarantees hold across the entire network. Secure communication protocols between enclaves introduce additional overhead, increasing latency and degrading performance, particularly in latency-sensitive applications such as real-time decision-making in autonomous systems.
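One common workaround for the memory ceiling is to partition the model into enclave-sized segments and page them through the TEE one at a time. The greedy split below sketches that idea; the 128 MB budget and the layer names and sizes are illustrative assumptions, and every extra segment boundary is another costly load into secure memory.

```python
EPC_BUDGET_MB = 128  # assumed usable secure memory per enclave

layers = [  # (name, size in MB) -- a hypothetical model
    ("embeddings", 90), ("block_0", 48), ("block_1", 48),
    ("block_2", 48), ("block_3", 48), ("lm_head", 90),
]

def partition(layers, budget_mb):
    """Greedy split: start a new enclave-sized segment whenever adding a layer would overflow."""
    segments, current, used = [], [], 0
    for name, size in layers:
        if size > budget_mb:
            raise ValueError(f"{name} ({size} MB) cannot fit in a {budget_mb} MB enclave")
        if used + size > budget_mb:
            segments.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        segments.append(current)
    return segments

for i, seg in enumerate(partition(layers, EPC_BUDGET_MB)):
    print(f"segment {i}: {seg}")  # each segment is loaded into the enclave in turn
```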

Additionally, deploying AI models into distributed TEEs involves several non-trivial steps. First, the encrypted model must be transferred into the TEE and decrypted only inside the enclave; decrypting it anywhere else risks data leakage in transit. Careful management is required to keep the model and data secure throughout the process, and vulnerabilities in communication or in memory-access patterns can open the door to side-channel attacks. Furthermore, distributed TEEs lack inter-enclave communication protocols optimized for AI tasks, making it difficult to coordinate complex neural-network computations across multiple machines without significant delays. These limitations suggest that while distributed TEEs are a promising foundation for secure AI workloads, substantial research and development is still needed to optimize the performance, scalability, and security of distributed TEE architectures for high-performance AI applications.
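A hedged sketch of that deployment flow, assuming the third-party cryptography package for AES-GCM and a key-management service that releases the model key only against attestation evidence: the ciphertext can cross untrusted channels, and plaintext weights exist only inside the enclave boundary.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # assumed dependency

# --- Model owner: encrypt the serialized model once, escrow the key with a KMS. ---
MODEL_KEY = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)  # 96-bit nonce, standard for AES-GCM
model_blob = b"serialized model weights ..."
ciphertext = AESGCM(MODEL_KEY).encrypt(nonce, model_blob, b"model-v1")

def kms_release_key(attestation_ok: bool) -> bytes:
    """Stand-in for a key-management service that releases the key only to attested enclaves."""
    if not attestation_ok:
        raise PermissionError("no valid attestation evidence: key withheld")
    return MODEL_KEY

# --- Inside the enclave: present attestation evidence, then decrypt locally. ---
key = kms_release_key(attestation_ok=True)
plaintext = AESGCM(key).decrypt(nonce, ciphertext, b"model-v1")
assert plaintext == model_blob
print("model decrypted only inside the enclave boundary")
```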

Worth watching

MPC with TEEs

A key area of research within Confidential Computing is the integration of Multi-Party Computation (MPC) with Trusted Execution Environments (TEEs), creating a powerful framework for secure collaborative computation. MPC allows multiple parties to jointly compute a function over their private inputs without revealing those inputs to one another, while TEEs provide hardware-based isolation and security. Combining the two gives defense in depth: the data remains private under the MPC protocol's cryptographic guarantees, and the computation itself runs isolated from potential threats on the host. This hybrid approach is particularly valuable where multiple stakeholders, such as hospitals or financial institutions, need to collaborate on sensitive data analysis or AI model training without exposing their underlying data to each other or to third-party cloud providers. The TEE ensures the computation occurs within a protected environment, adding a layer of security beyond the encryption used in the MPC protocol itself.
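For intuition about the MPC half of this hybrid, the sketch below implements additive secret sharing over a prime field, a building block of many MPC protocols: each party splits its private input into random-looking shares, and the parties can jointly compute the sum without anyone seeing another's value. In the hybrid design described above, each party's share arithmetic would additionally run inside its own TEE. The field size and the scenario are illustrative.

```python
import random

P = 2**61 - 1  # a large prime modulus (field size is an assumption)

def share(secret: int, n_parties: int):
    """Split `secret` into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)  # last share makes the sum work out
    return shares

def secure_sum(all_shares):
    """Each party sums the shares it holds; combining the partial sums reveals only the total."""
    partials = [sum(col) % P for col in zip(*all_shares)]
    return sum(partials) % P

# Three hospitals jointly compute a total patient count without revealing their own.
inputs = [120, 340, 95]
shares = [share(v, n_parties=3) for v in inputs]
print(secure_sum(shares))  # 555, yet no party ever saw another's input
```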

The combination of MPC and TEEs provides significant benefits in environments where collaboration between multiple parties is essential but data privacy must be maintained. AMD’s Secure Encrypted Virtualization (SEV), for example, encrypts the memory of entire virtual machines, providing strong protection for data and computation in shared environments. When integrating MPC into SEV-based TEEs, researchers have observed that the added encryption overhead increases latency by roughly 10-20%, depending on the complexity of the computation. In the "Citadel" project, which demonstrated distributed machine learning using MPC within TEEs, researchers reported a performance drop of around 15-25% for the matrix-multiplication workloads common in neural networks, compared to non-TEE environments. That overhead still leaves room for practical applications, especially batch processing or scenarios where real-time speed is not the highest priority. In financial and healthcare settings, the ability to jointly train fraud-detection systems or diagnostic AI models with privacy guarantees outweighs the performance hit, letting organizations leverage broader datasets without violating privacy regulations.

Despite its strong privacy guarantees, the integration of MPC and TEEs faces notable limits in performance and scalability. MPC protocols already introduce significant overhead, typically a 2x to 10x slowdown over the equivalent non-MPC computation, depending on the complexity of the task. Combined with the cost of executing inside a TEE, this latency grows further: secure deep-learning inference within Intel SGX-based TEEs has been shown to incur a latency penalty of up to 30% for large models, driven largely by secure memory access and encryption/decryption operations. This makes real-time applications, such as high-frequency trading or autonomous driving, impractical within this framework. Furthermore, the requirement for secure communication between TEEs across distributed nodes exacerbates the issue: coordination and synchronization between enclaves increase round-trip times, often adding 100-300 milliseconds in distributed AI systems depending on network conditions. These performance penalties, together with the complexity of managing secure enclave environments, limit MPC within TEEs to less latency-sensitive use cases. Reducing the cryptographic and computational overhead of these systems remains a critical area of ongoing research.

Worth watching