This post presents the different synthetic data types that currently exist: text, media (video, image, sound), and tabular synthetic data. We start with a brief definition and an overview of the reasons behind the use of synthetic data. We then go over several real-life examples of applications for each type of synthetic data.
For a detailed intro to the concept of synthetic data, check our article “What is privacy-preserving synthetic data.”
To start, we could give the following definition of synthetic data:
Data that is algorithmically generated to approximate original data and that can be used for the same purposes as the original.
There are a few reasons behind the need for such assets. First, it can be a matter of availability: your organization or your team doesn’t have the data, or doesn’t have enough of it. For larger organizations, legacy infrastructures and siloed data systems are also often a cause of data unavailability. In today’s data protection regulatory landscape, it can also be a matter of legal compliance: the data exists, but its processing is strictly regulated. For instance, the General Data Protection Regulation (GDPR) restricts the use of personal data for purposes other than those for which it was collected, unless a legal basis such as the individual’s consent covers the new use.
Security concerns can also prevent data from flowing within an organization. The information might be too sensitive to be migrated to a cloud infrastructure, for example. Governance processes might also slow down or limit data access for similar reasons. Finally, it can come down to a matter of cost. A given data asset might be too expensive to buy or too time-consuming to access and prepare.
These reasons are why companies turn to synthetic data. Some produce partially synthetic datasets, in which only a selection of the original data is replaced with synthetic data. This approach is common when they want to complement an existing resource. Others use fully synthetic data, with datasets that don’t contain any of the original data. Fully synthetic data is often found where privacy concerns impede the use of the original data.
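To make the distinction concrete, here is a minimal Python sketch. The toy records, the column names, and the naive per-column Gaussian sampler are all assumptions made for illustration. A real generator would model the relationships between columns, and this sketch offers no privacy guarantee.

```python
import random
import statistics

# Hypothetical toy dataset; names and values are invented for illustration.
original = [
    {"customer_id": 1, "age": 34, "income": 52000.0},
    {"customer_id": 2, "age": 45, "income": 61000.0},
    {"customer_id": 3, "age": 29, "income": 48000.0},
    {"customer_id": 4, "age": 52, "income": 75000.0},
]

def sample_column(rows, column, rng):
    """Draw one value from a normal distribution fitted to a column.

    This is a naive stand-in for a real synthetic data generator.
    """
    values = [row[column] for row in rows]
    return rng.gauss(statistics.mean(values), statistics.stdev(values))

def partially_synthetic(rows, columns_to_replace, rng):
    """Keep the original rows but replace only the selected columns."""
    result = []
    for row in rows:
        new_row = dict(row)
        for column in columns_to_replace:
            new_row[column] = sample_column(rows, column, rng)
        result.append(new_row)
    return result

def fully_synthetic(rows, columns, n_rows, rng):
    """Generate brand-new rows; no original value is copied over."""
    return [
        {column: sample_column(rows, column, rng) for column in columns}
        for _ in range(n_rows)
    ]

rng = random.Random(0)
partial = partially_synthetic(original, ["income"], rng)
full = fully_synthetic(original, ["age", "income"], n_rows=6, rng=rng)
```

In `partial`, the `customer_id` and `age` values are still the original ones, so the dataset remains linked to real records. In `full`, every value is sampled and the number of rows is independent of the original.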
There are several types of synthetic data that serve different purposes. Synthetic data can be text, media (video, image, or sound), or tabular data.
Synthetic text is artificially generated text, produced by a model that you build and train for that purpose. Because of the complexity of natural languages, generating realistic synthetic text has always been challenging. However, the rise of new machine learning models has led to remarkably capable natural language generation systems.
In 2020, the OpenAI team introduced GPT-3, a language model able to generate human-like text. You can find numerous examples of text written by the GPT-3 model, with constraints or specific text inputs, such as the one depicted below.
Synthetic data can also be synthetic video, image, or sound. You artificially render media with properties close enough to real-life data. This similarity allows using the synthetic media as a drop-in replacement for the original data.
None of these individuals are real. These synthetic images were artificially generated by the Generative Adversarial Network, StyleGAN2 (Dec 2019) from the work of Karras et al. and Nvidia. The system learned properties of real-life people’s pictures in order to generate realistic images of human faces.
This method helps augment the datasets used to train machine learning algorithms. For example, when video training data is not available for privacy reasons, you can generate synthetic video data to fill the gap. Similarly, you can use synthetic data to increase datasets' size and diversity when training image recognition systems.
Tabular synthetic data refers to artificially generated data that mimics real-life data stored in tables. This data is structured in rows and columns. It could be anything ranging from a patient database to user behavior analytics or financial logs.
Data is at the core of today’s data science activities and business intelligence. As mentioned earlier, there are multiple scenarios in the enterprise in which data cannot circulate between departments, subsidiaries, or partners. Synthetic data can be used as a drop-in replacement for any type of behavioral, predictive, or transactional analysis.
At Statice, our focus is on privacy-preserving tabular synthetic data. The synthetic data we generate comes with privacy guarantees. These measures ensure no individual present in the original data can be re-identified from the synthetic data.
Privacy-preserving synthetic data holds opportunities for industries relying on customer data to innovate. Modern data protection regulations often prevent any extensive use of such data. Here, privacy-preserving synthetic data represents a safe and compliant alternative to traditional data protection methods. It also enables internal or external data sharing.
Synthetic data has applications in the field of natural language processing. Amazon’s Alexa AI team, for instance, uses synthetic data to complete the training data of its natural language understanding (NLU) system. It provides them with a solid base to train models for new languages when customer interaction data is missing or insufficient.
“When a new-language version of Alexa is under development, training data for its NLU systems is scarce. [...] The new bootstrapping tools, from Alexa AI’s Applied Modeling and Data Science group, treat the available sample utterances as templates and generate new data by combining and varying those templates.” Janet Slifka, director of research science in Alexa AI’s Natural Understanding group
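The template idea described in the quote can be sketched in a few lines of Python. This is an illustration only and not Alexa AI’s tooling: the templates, slot names, and slot values below are invented, and the expansion simply combines every template with every combination of slot values.

```python
import itertools
import string

# Hypothetical sample utterances turned into templates with named slots.
templates = [
    "play {song} by {artist}",
    "I want to hear {song} from {artist}",
]

# Hypothetical slot values used to vary the templates.
slot_values = {
    "song": ["Yesterday", "Imagine"],
    "artist": ["The Beatles", "John Lennon"],
}

def slots_in(template):
    """Return the slot names used in a template, in order of appearance."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]

def expand(template, values):
    """Yield one utterance per combination of slot values."""
    names = slots_in(template)
    for combo in itertools.product(*(values[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))

synthetic_utterances = [u for t in templates for u in expand(t, slot_values)]
```

Two templates with two values per slot already yield eight utterances. Note that this naive expansion also produces pairings that don’t match reality; a production system would presumably constrain which values can appear together.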
A popular use for synthetic media is the training of vision algorithms. For over a year now, the Waymo team has been generating realistic driving datasets from synthetic data. The Alphabet subsidiary uses these datasets to train its self-driving vehicle systems. It is an efficient way of including more complex and varied scenarios, as opposed to spending significant time and resources to obtain real-world observations of similar scenarios.
As its virtual cars drive through the same scenarios Waymo vehicles experience in the real world, engineers [...] manipulate those scenes by virtually adding new agents into the situation, such as cyclists, or by modulating the speed of oncoming traffic to gauge how the Waymo Driver would have reacted. VentureBeat
Waymo isn’t the only company relying on synthetic data for this use case: GM Cruise, Tesla Autopilot, Argo AI, and Aurora do so as well.
In the retail industry, Amazon also deployed similar techniques for the training of Just Walk Out, the system powering the Amazon Go cashier-less stores. The team generated a considerable amount and variety of synthetic customer behavior data to train its computer vision system.
“By using simulation to build a massive training set, the team was able to leverage the power of the cloud to train on months worth of data in a day, eliminating the time bottleneck and allowing rapid progress.” How the Amazon Go Store works.
The financial institution American Express has been investigating the use of tabular synthetic data. Their data science team is researching how to generate statistically accurate synthetic data from financial transactions to perform fraud detection. They were already able to use the synthetic data to help train the detection models.
To develop state of the art ML methods, including methods for anomaly detection and model interpretation, ML researchers and practitioners need to have access to data that is as close to the real one as possible. [...] we show that synthesized data follows the same distribution as the original data, and that ML models trained on synthesized data have the same performance as those trained on the original data. Efimov, Xu, Kong, Nefedov and Anandakrishnan (2020) in Using Generative Adversarial Networks to Synthesize Artificial Financial Datasets
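One simple way to check, for a single numeric column, whether synthetic data follows the same distribution as the original is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. The sketch below is not the evaluation method used in the paper. The samples are randomly generated stand-ins, and a real assessment would also look at relationships between columns and at the performance of models trained on the data.

```python
import random

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples.

    0.0 means the empirical CDFs coincide; 1.0 means the samples do not overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

rng = random.Random(42)
# Hypothetical "original" transaction amounts and two candidate synthetic samples.
original_amounts = [rng.lognormvariate(3.0, 0.5) for _ in range(2000)]
good_synthetic = [rng.lognormvariate(3.0, 0.5) for _ in range(2000)]
poor_synthetic = [rng.lognormvariate(3.5, 0.5) for _ in range(2000)]

ks_good = ks_statistic(original_amounts, good_synthetic)
ks_poor = ks_statistic(original_amounts, poor_synthetic)
```

A sample drawn from the same distribution as the original should give a statistic close to zero, while the shifted sample gives a clearly larger one.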
In the field of insurance, where customer data is both an essential and sensitive resource, Swiss company La Mobilière used synthetic data to train churn prediction models. The data science team modeled tabular synthetic data after real-life customer data. They trained their machine learning models without compromising on model performance or customer privacy.
“The Statice software protects the original data of our customers on the one hand, and on the other, enables us to work with the data across departments without compromising privacy or security issues.” Georg Russ, Data Scientist, Data & Analytics.
In general, all customer-facing industries can benefit from privacy-preserving synthetic data, as modern data protection laws regulate the processing of personal data.
For example, in the healthcare field, the use of patient data is heavily regulated. Roche validated with us the use of synthetic data as a replacement for patient data in clinical research. The German Charité Lab for Artificial Intelligence in Medicine is also working on synthetic data generation to support collaborative research and advance different medical use cases.
For an overview of industries and their use of privacy-preserving synthetic data, check our answer in this post about “Which industries have the strongest need for synthetic data?”