Computer vision models are complex to build and train, and require a large number of carefully labeled images to yield accurate results. For simple use cases, such as classification of common objects, you can use transfer learning techniques with public datasets such as ImageNet. However, for many real world use cases, for example automated face cropping, you will need to create your own dataset, and this can be a major obstacle for your computer vision project.
Synthetic data is a growing field that offers a compelling alternative to manually building image datasets. Instead of collecting, processing, and annotating images by hand, you can auto-generate them with labels built in. We'll explain how this works, the pros and cons of synthetic data, common algorithmic methods for generating synthetic image data, and tips for success.
What Is a Synthetic Dataset?
Synthetic data is information that is artificially generated rather than produced by events in the real world. It can be generated by a variety of algorithms and used to train machine learning models, or as an alternative to using production datasets for training or testing purposes.
When generating a synthetic dataset, the goal is to make it generic and robust enough to help train machine learning models. The dataset needs to be sufficiently similar in its properties to real-world data, and must capture all relevant edge cases needed to fully train the model.
Today it is possible to generate realistic synthetic data of almost any type—categorical, binary, numeric, and even unstructured data. The “holy grail” of synthetic data is to generate realistic images that can be used to train computer vision models. This is a complex task, but is already effectively handled by a range of deep learning techniques.
The most significant benefits of synthetic data are:
- Low cost—compared to manually collecting and cleaning data from real datasets, synthetic data can be generated at a small fraction of the cost.
- Scalability and time to market—regardless of cost, creating large-scale datasets can take months and require teams of data annotators. Synthetic data can, at least in theory, be generated instantly at any scale.
- Privacy—traditional datasets commonly include information belonging to living persons, which raises privacy, security, and compliance issues. Synthetic data is, by definition, “fake” data that does not represent real persons or events.
How Can Synthetic Data Generation Help Computer Vision?
Collecting real-world visual data with the desired characteristics and variety can be expensive and time consuming. After collection, it is important to label the data correctly. If you label your data incorrectly, your model results will be inaccurate. These processes can take months and consume valuable business resources.
Synthetic image data is generated programmatically, which eliminates the need for manual data collection and can provide near-complete annotations automatically.
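To make this concrete, here is a minimal toy sketch of "labels built in": an image of a single bright square is rendered programmatically, so its bounding-box annotation is exact by construction. The function name and label format are illustrative, not from any particular library.

```python
import numpy as np

def synth_image_with_label(size=64, rng=None):
    """Render a toy image containing one bright square and return the
    image together with its bounding-box annotation. Because the scene
    is drawn programmatically, the label is exact by construction and
    no manual annotation step is needed."""
    rng = rng or np.random.default_rng()
    side = int(rng.integers(8, 17))                # square side length
    x0 = int(rng.integers(0, size - side))
    y0 = int(rng.integers(0, size - side))
    img = np.zeros((size, size), dtype=np.float32)
    img[y0:y0 + side, x0:x0 + side] = 1.0          # draw the object
    label = {"bbox": (x0, y0, x0 + side, y0 + side)}  # (xmin, ymin, xmax, ymax)
    return img, label
```

Real pipelines render far richer scenes (3D engines, domain randomization), but the principle is the same: the generator knows the ground truth, so annotation is free.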
Synthesizing human faces and other types of images
Many computer vision models are concerned with processing images of humans, where the most important element is a human face. Generating photorealistic images containing human faces has been a hot topic of research over the past few years.
Generative adversarial networks (GANs) are at the forefront of this research. Examples such as Nvidia’s StyleGAN, CycleGAN, and now FastCUT show that high-quality images can be synthesized based only on labels and existing image data sets. Similar techniques can be used to synthesize images of physical landscapes, scenarios, or day to day objects.
Combating bias in computer vision datasets
Image datasets collected in the real world are often unbalanced. This can happen due to the rarity of real-world events—for example, an autonomous car that learns from road events may not have enough images of car crashes, because crashes occur rarely and are difficult to capture in real time.
Another reason is unconscious bias on the part of the researchers. For example, multiple studies have shown that major image datasets contain a disproportionately small number of images of underrepresented populations.
Synthetic data provides an innovative solution to these problems. Algorithms based on GAN or other technologies can be used to synthesize images of rare events across all relevant edge cases. To take the example above, it is possible to synthesize an image of a car crash across multiple car makes and models, lighting conditions, road configurations, etc.
Synthetic data can also prevent unconscious bias. For example, it is possible to explicitly generate enough images of underrepresented groups to ensure the computer vision model can recognize them correctly.
What Are the Risks of Synthetic Data?
Synthetic data has many benefits in terms of cost effectiveness and privacy, but also has significant limitations. The quality of synthetic data depends on the model from which it was created and the quality of the source data set on which it was based.
Using synthetic data requires additional validation steps, such as comparing the model results to real human annotated data, to ensure the accuracy of the results. Also, synthetic data does not guarantee zero privacy issues, because in some cases, synthetic images may be too similar to images of people in the original dataset.
Another challenge is skepticism within the organization or among users. Many people might have the notion that synthetic data is “fake” data that cannot yield accurate results. Overcoming this cultural hurdle can be a difficult task.
As synthetic data becomes more widely adopted, business leaders will need to consider explainability of data creation methods, and consider how to achieve transparency and accountability in case there are inquiries from regulators or other stakeholders in the future.
Methods for Generating Synthetic Data
Generally speaking, synthetic data algorithms involve learning the features and statistical distribution of a real data set, and generating a new dataset that has similar features and distribution, but is not identical to the original images. Advances in machine learning have provided a variety of models that can learn and synthesize different types of data.
Variational Autoencoders (VAE)
VAE is a generative model that aims to learn the underlying distribution of raw data and is effective at modeling complex distributions. It works in three stages:
- An encoder network transforms the original complex distribution into a latent distribution.
- A decoder network transforms the distribution back into the original space.
- An error is computed reflecting the degree to which the model accurately reconstructed the original distribution.
The goal of VAE training is to minimize this error, while using regularization objectives to control the shape of the latent distribution.
VAE has limited effectiveness with heterogeneous source data: the more variety there is in the original distribution, the more difficult it is to define a reconstruction error that applies to all components of the dataset.
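The training objective described above can be written down compactly. The sketch below (plain numpy, for illustration only; a real VAE would use a deep-learning framework with automatic differentiation) shows the two terms: a reconstruction error, and the closed-form KL divergence that regularizes the encoder's latent distribution toward a standard normal prior. The function names are my own.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z ~ N(mu, sigma^2) via the reparameterization trick,
    which keeps the sampling step differentiable in a real framework."""
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)
    return mu + std * eps

def vae_loss(x, x_recon, mu, logvar):
    """Reconstruction error plus the closed-form KL divergence between
    the encoder's diagonal Gaussian N(mu, sigma^2) and the prior N(0, I).
    Minimizing this sum is the VAE training objective."""
    recon = np.mean((x - x_recon) ** 2)                        # reconstruction term
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))   # regularization term
    return recon + kl
```

Note that when the encoder outputs exactly the prior (mu = 0, logvar = 0), the KL term vanishes and only the reconstruction error remains.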
Generative Adversarial Network (GAN)
GAN is a family of unsupervised, generative models based on deep learning techniques. In general, a GAN architecture includes two neural networks trained simultaneously in an adversarial manner:
- One network is called the generator. It takes random inputs and transforms them into different shapes without looking directly at the original data.
- The second network is called the discriminator. It reviews real images alongside output from the generator and tries to classify each image as “real” or “fake”.
Both networks are connected through training, so the generator has access to the discriminator’s decision. Over thousands of cycles, the generator gradually improves its ability to create images that will “fool” the discriminator, and the discriminator becomes better and better at identifying “fake” images. Eventually, the generator is able to generate images that are indistinguishable from real images.
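The adversarial objective driving this training loop can be sketched in a few lines. Below is the standard binary cross-entropy formulation (numpy, illustrative only): the discriminator is penalized for scoring real images low or fake images high, while the generator is penalized when the discriminator correctly flags its samples as fake.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Discriminator objective: push D(real) toward 1 and D(fake)
    toward 0, in binary cross-entropy form."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator objective: push the discriminator's
    score on generated samples toward 1 ("fool" the discriminator)."""
    return -np.mean(np.log(d_fake))
```

When the generator succeeds completely, the discriminator is reduced to guessing (outputs near 0.5), which is the equilibrium the adversarial game converges toward.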
Neural Radiance Field (NeRF)
Neural rendering is a relatively new approach to generating images and videos using deep learning. It allows you to control image characteristics such as lighting, camera positioning, geometry, and semantic structures. Neural rendering combines physical computer graphics knowledge with machine learning methods to generate photo-realistic, controllable scenes.
Neural Radiance Field (NeRF) is an image generation technique that can produce new views of a complex scene. NeRF ingests several input images of the given scene and interpolates between them to render a new, complete image of the scene.
For each ray cast through the scene, NeRF outputs volume density and color values that vary with the viewpoint and lighting conditions. These per-ray outputs are composited together to render the complete scene.
NeRF represents a static scene as a continuous 5D function: a 3D location plus a 2D viewing direction. It uses a multilayer perceptron (MLP), a fully connected neural network, to approximate this function, mapping each 5D coordinate to a volume density and a view-dependent RGB color.
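The compositing step that turns per-sample densities and colors into a pixel is standard volume rendering, and it can be sketched independently of the MLP. The function below is a minimal numpy implementation of the usual quadrature, where each sample contributes its color weighted by its opacity and by the transmittance of everything in front of it; the function name is my own.

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Composite per-sample (density, RGB) pairs along one ray into a
    single pixel color using the volume-rendering quadrature:
        C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i
    where T_i = exp(-sum_{j<i} sigma_j * delta_j) is the transmittance
    accumulated before sample i, and delta_i is the sample spacing."""
    alpha = 1.0 - np.exp(-densities * deltas)                 # opacity per sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # T_i
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0)
```

Two sanity checks follow from the formula: a ray through empty space (zero density) renders black, and a fully opaque first sample occludes everything behind it.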
Synthetic Data Generation: Tips and Best Practices
Here are some considerations for generating useful synthetic data.
Sufficient Training Data
The volume of training data impacts the quality of a synthetic dataset. The larger the number of examples, the more accurate the model becomes. The training dataset should include at least several thousand examples.
Generating Enough Synthetic Data
Generating more synthetic data makes it easier to evaluate its quality and integrity, and a surplus of images leaves room to discard low-quality samples without shrinking the dataset below what the model needs.
Identifying and addressing anomalies in the training data is important to ensure the synthetic data model is resilient. When anomalies arise, researchers have several options:
- Fine tuning the model and regenerating the training set
- Identifying anomalies and removing them from synthetic data by hand
- If anomalies are significant, rethinking the data generation approach
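The second option above, removing anomalous samples, can be partially automated. The sketch below (illustrative; the function name and the z-score threshold are my own choices, not a standard) drops synthetic samples whose feature vectors lie far from the per-feature mean.

```python
import numpy as np

def drop_anomalies(features, z_thresh=3.0):
    """Remove samples whose feature vector lies more than z_thresh
    standard deviations from the per-feature mean -- a crude automated
    stand-in for removing anomalies by hand."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12   # avoid division by zero
    z = np.abs((features - mu) / sigma)
    keep = (z < z_thresh).all(axis=1)
    return features[keep]
```

For image data, "features" here would typically be embeddings or summary statistics rather than raw pixels, and flagged samples are usually reviewed before deletion rather than dropped blindly.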
Iterate on Synthetic Data
Before using synthetic data to train a model in production, it is critical to evaluate the data. Here are common methods:
- Manual inspection—usually performed by a data analyst familiar with the real-world data, this involves inspecting a sample of records to determine whether they seem realistic. It lets human evaluators provide qualitative feedback, such as checking whether the dataset is adequately diverse and flagging data points that could mislead the model.
- Statistical analysis—involves establishing metrics for model diversity and similarity to the distribution of a comparable real dataset. It requires an evaluation of event sequences, distributions, entity relationships, features, and correlations between features. You compute these metrics to ensure the dataset meets all required criteria.
- Training runs—involves training a model in a non-production environment with a synthetic dataset and evaluating its performance on human-labeled or known data points. It helps determine if the synthetic dataset can truly teach the model to process real-world data accurately.
The feedback provided by the above techniques can help fine-tune and improve a synthetic dataset. However, this is a process rather than a single task. Effective synthetic datasets result from an iterative process incorporating several rounds of trial and error.
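The statistical-analysis step above can start very simply: compare basic per-feature statistics between the real and synthetic datasets before moving on to correlations and higher moments. The helper below is a minimal sketch of such a check; the function name and metric choices are my own.

```python
import numpy as np

def distribution_gap(real, synth):
    """Per-feature absolute gap in mean and standard deviation between
    a real dataset and a synthetic one. A minimal first-pass similarity
    check; production metric suites also compare correlations between
    features, higher moments, and full distributions."""
    return {
        "mean_gap": np.abs(real.mean(axis=0) - synth.mean(axis=0)),
        "std_gap": np.abs(real.std(axis=0) - synth.std(axis=0)),
    }
```

A team would typically set acceptance thresholds on these gaps and rerun the check each time the generator is fine-tuned, making the iteration loop described above measurable.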
Consider Compliance Requirements
Data related to living persons is covered by various compliance regulations, including the General Data Protection Regulation (GDPR), the California Consumer Privacy Act of 2018 (CCPA), and the Health Insurance Portability and Accountability Act of 1996 (HIPAA). When creating a synthetic dataset, it is important to consider compliance requirements.
Speaking to compliance teams in the organization can help understand the compliance requirements for a synthetic data project. The resulting synthetic dataset must not create any privacy risk. A synthetic dataset can be based on real data, which means initial data might contain sensitive data related to living persons. This data must be protected and handled carefully.
In this article, we explained the basics of synthetic data and how it can be applied to computer vision projects requiring large-scale image datasets. We covered three cutting-edge techniques for generating high quality synthetic image data:
- Variational autoencoders (VAE)—transforming a complex underlying distribution into a new latent distribution and back.
- Generative Adversarial Network (GAN)—generating new, realistic images via two neural networks competing against each other.
- Neural Radiance Field (NeRF)—generating new angles and poses based on an existing 3D scene.
We hope this will be useful as you evaluate novel approaches to feeding your computer vision models with high quality data.