I used to think synthetic data was a waste of time. That eventually hardware- and software-based cryptography would improve to the point that we wouldn’t need to create synthetic data to protect privacy at all. But then transformers wanted all the data in the world to train on. And then the data ran out. And we had to feed the beast. So here we are.

fwiw, synthetic data refers to artificially generated information that mimics the statistical properties and patterns of real-world data without containing actual individual records. In the context of AI privacy, synthetic data serves as a tool for mitigating privacy risk during both training and inference. By training models on synthetic datasets that capture the essential characteristics of sensitive real data without incorporating personally identifiable information, organizations can develop AI systems that generalize well to real-world scenarios while minimizing exposure of individual data points. During inference, synthetic data can be used to query models or test system behavior without risking the privacy of real individuals. The key to effective synthetic data lies in preserving the utility of the data for the intended AI task while breaking the one-to-one mapping between synthetic and real data points, thereby providing a layer of privacy protection. For training frontier models, state-of-the-art approaches to synthetic data generation have become integral to the development process, in particular masked language modeling, data augmentation, and instruction tuning.
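To make that contract concrete, here is a minimal sketch of the simplest possible generator: fit a model of the real data's aggregate statistics and sample fresh records from it. The column names and the multivariate-Gaussian choice are mine for illustration only; production pipelines use far richer generative models.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Stand-in for a sensitive table (hypothetical columns: age, income, visits).
real = np.column_stack([
    rng.normal(45, 12, 1_000),        # age
    rng.lognormal(10.5, 0.4, 1_000),  # income
    rng.poisson(3, 1_000),            # clinic visits
])

# Fit a simple generative model: here, just a multivariate Gaussian over the columns.
# It preserves the means and pairwise correlations but stores no individual row.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample synthetic records: each row is drawn from the fitted distribution,
# so there is no one-to-one mapping back to any real record.
synthetic = rng.multivariate_normal(mu, cov, size=1_000)

print("real means:     ", np.round(mu, 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```

A Gaussian copy of the data is obviously too crude for real use, but it illustrates the point: keep the aggregate statistics, discard the individuals.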

Masked language modeling (MLM)

Masked Language Modeling (MLM) is a sophisticated self-supervised learning technique that has become fundamental in the training of large language models. This approach involves presenting the model with text sequences containing strategically masked tokens, which the model must then predict. The process leverages the inherent structure and patterns within language to enable the model to learn contextual representations of words and phrases without the need for explicit labeling. By forcing the model to consider the surrounding context to infer the masked elements, MLM facilitates the development of nuanced language understanding capabilities. The technical viability of MLM lies in its ability to harness the vast amounts of unlabeled text data available, making it particularly well-suited for pre-training large-scale language models.
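As a rough sketch of what the masking step looks like, the function below applies the BERT-style 80/10/10 rule: of the roughly 15% of positions selected, 80% become [MASK], 10% become a random token, and 10% are left unchanged, and the model is asked to predict the originals. The token IDs, mask ID, and vocabulary size are placeholders; real training pipelines delegate this step to a library collator.

```python
import random

MASK_ID = 103        # placeholder [MASK] id, as in BERT's vocabulary
VOCAB_SIZE = 30_522  # placeholder vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    ~15% of positions are selected; of those, 80% become [MASK],
    10% become a random token, and 10% are left unchanged.
    Unselected positions get label -100 so the loss ignores them.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)                 # model must predict the original
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_ID)         # replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.randrange(VOCAB_SIZE))  # random token
            else:
                inputs.append(tok)             # keep, but still predict it
        else:
            inputs.append(tok)
            labels.append(-100)                # ignored by the loss
    return inputs, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7099, 6251, 1012], seed=0)
print(inputs)
print(labels)
```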

MLM yields significant improvements in model performance across a wide range of natural language processing tasks. Models trained using MLM have demonstrated remarkable proficiency in capturing semantic and syntactic relationships, leading to enhanced performance in downstream tasks such as text classification, named entity recognition, and question answering. The self-supervised nature of MLM allows for efficient utilization of unlabeled data, reducing the reliance on expensive and time-consuming manual annotation. Furthermore, the learned representations have shown impressive transfer learning capabilities, enabling models to adapt quickly to new domains and tasks with minimal fine-tuning.

MLM faces several limitations and challenges in its adoption and effectiveness. The artificial nature of the masking process may not fully encapsulate the intricacies of natural language understanding, potentially leading models to exploit masking-specific patterns rather than developing a comprehensive grasp of language. MLM also struggles with capturing long-range dependencies and maintaining global coherence, as its primary focus is on local context. While researchers are exploring variations such as whole word masking and span-based prediction to address these issues, striking a balance between local and global understanding remains a significant challenge. Moreover, the computational resources required for MLM training can be substantial, potentially limiting its accessibility for smaller research teams or organizations. As alternative approaches like contrastive learning and autoregressive models continue to evolve, the relative advantages of MLM in certain scenarios may diminish, necessitating ongoing research to refine and expand its capabilities.
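To illustrate one of the variants mentioned above, here is a toy span-masking routine: rather than masking tokens independently, it masks short contiguous runs, which pushes the model to lean on context further away. The roughly geometric span lengths loosely follow the SpanBERT idea; the constants are illustrative, not tuned.

```python
import random

MASK_ID = 103  # placeholder [MASK] id

def span_mask(token_ids, mask_budget=0.15, mean_span=3, seed=None):
    """Mask contiguous spans until roughly mask_budget of tokens are covered."""
    rng = random.Random(seed)
    n = len(token_ids)
    target = max(1, int(n * mask_budget))
    masked = set()
    while len(masked) < target:
        # Sample a span length (capped by the remaining budget) and a start position.
        length = min(1 + int(rng.expovariate(1 / mean_span)), target - len(masked))
        start = rng.randrange(n)
        masked.update(range(start, min(start + length, n)))
    inputs = [MASK_ID if i in masked else t for i, t in enumerate(token_ids)]
    labels = [t if i in masked else -100 for i, t in enumerate(token_ids)]
    return inputs, labels

inputs, labels = span_mask(list(range(20)), seed=1)
print(inputs)
print(labels)
```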

Worth watching

Data augmentation

Data augmentation techniques represent another powerful approach to enhancing LLM training by artificially expanding and diversifying the training dataset. This methodology involves creating variations of existing data through methods such as back-translation, paraphrasing, and controlled text generation. The technical underpinnings of data augmentation lie in its ability to expose the model to a broader spectrum of linguistic variations without the need for additional manual data collection or annotation. By systematically altering existing text while preserving its essential meaning, data augmentation techniques can significantly increase the effective size and diversity of the training corpus, potentially leading to more robust and generalizable models.
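Back-translation is the easiest of these to sketch: translate a sentence into a pivot language and back, and keep the round-tripped version as a new training example. The Hugging Face pipeline API and the Helsinki-NLP Marian checkpoints below are one convenient choice on my part, not a requirement; any forward/backward translator pair will do.

```python
from transformers import pipeline

# Round-trip English -> French -> English to get a paraphrase-like variant.
# The model names are examples; any forward/backward pair works.
to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    french = to_fr(text)[0]["translation_text"]
    return to_en(french)[0]["translation_text"]

original = "The patient reported mild discomfort after the procedure."
augmented = back_translate(original)
print(original)
print(augmented)  # same meaning, usually a different surface form
```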

This approach can substantially improve a model's ability to handle linguistic variations and edge cases, enhancing its overall robustness and generalization capabilities. By exposing the model to a wider range of syntactic structures and lexical choices, data augmentation helps mitigate overfitting to specific phrasings or styles present in the original dataset. This can lead to improved performance across diverse tasks and domains, particularly in scenarios where the model encounters language use that diverges from its primary training data. Additionally, data augmentation can be particularly valuable in low-resource settings or for specialized domains where obtaining large volumes of high-quality, diverse data may be challenging or expensive.

Despite its potential, data augmentation carries real risks. One significant concern is that the augmentation process itself can introduce artifacts or biases, leading the model to learn spurious correlations rather than genuinely useful linguistic patterns. Ensuring the semantic preservation and naturalness of augmented text remains a substantial challenge, particularly for complex or nuanced content where subtle changes can significantly alter meaning or tone. Moreover, while augmentation can increase dataset size, it does not necessarily introduce truly novel information or concepts, potentially limiting its ability to expand the model's knowledge base.

Compared with Masked Language Modeling (MLM), data augmentation offers more flexibility in generating diverse training examples but may lack MLM's structured approach to learning contextual representations. While MLM focuses on predicting masked tokens within a fixed context, data augmentation aims to create entirely new variations of existing data. This distinction means that data augmentation may be more effective at improving model robustness across different phrasings or styles, while MLM might excel at capturing fine-grained contextual relationships within text. As research progresses, integrating sophisticated augmentation techniques with other training approaches, including MLM, may yield complementary benefits, addressing the limitations of each method individually.

Worth watching