This post presents the different synthetic data types that currently exist: text, media (video, image, sound), and tabular synthetic data. We start with a brief definition and overview of the reasons behind the use of synthetic data. We then go over several real-life examples of applications for synthetic data:
For a detailed intro to the concept of synthetic data, check our article “What is privacy-preserving synthetic data.”
To start, we can define synthetic data as follows:
data generated algorithmically to approximate original data, which can be used for the same purposes as the original.
There are a few reasons behind the need for such assets. First, it can be a matter of availability. Your organization or your team doesn't have the data or enough of it. For larger organizations, legacy infrastructures and siloed data systems are also often a cause of data unavailability. In today's data protection regulatory landscape, it can also be a matter of legal compliance. The data exists, but its processing is strictly regulated. For instance, the General Data Protection Regulation (GDPR) forbids uses that weren't explicitly consented to when the organization collected the data.
Security concerns can also prevent data from flowing within an organization. The information is too sensitive to be migrated to a cloud infrastructure, for example. Governance processes might also slow down or limit data access for similar reasons. Finally, it can come down to a matter of cost. A given data asset might be too expensive to buy or time-consuming to access and prepare.
These reasons are why companies turn to synthetic data. They either produce partially synthetic datasets, in which only a selection of the original records or fields is replaced with synthetic data; this is common when they want to complement an existing resource. Or they use fully synthetic datasets, which don't contain any of the original data. Fully synthetic data is often found where privacy concerns impede the use of the original, real data.
There are several types of synthetic data that serve different purposes. Synthetic data can be:
Synthetic text is artificially generated text: you build and train a model to generate it. Because of the complexity of natural language, generating realistic synthetic text has always been challenging. However, the rise of new machine learning models has led to remarkably performant natural language generation systems.
Back in 2020, the OpenAI team introduced GPT-3 (Generative Pre-Trained Transformer 3), a language model able to generate human-like text.
The GPT-3 algorithm falls into a category of deep learning called large language models: neural networks trained on a colossal amount of text. GPT-3 used about 700 gigabytes of real-world data pulled from the Web, including Wikipedia and digitized books. While GPT-3 is the most famous and accessible large language model, Google, Meta, and DeepMind have all developed their own in the last few years.
You can find numerous examples of text written by the GPT-3 model, with constraints or specific text inputs, such as the one depicted below.
Earlier this year, OpenAI added functionality to allow users to directly instruct GPT-3 as opposed to simply asking it to elaborate on a passage of text.
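Models like GPT-3 are far too large to demonstrate here, but the core idea of language modeling, predicting the next token from the preceding context, can be illustrated with a deliberately tiny character-level Markov chain. This is a toy sketch of the principle, not how GPT-3 works internally:

```python
import random
from collections import defaultdict

def train_bigram_model(text):
    """Record, for each character, which characters follow it in the corpus."""
    model = defaultdict(list)
    for current, following in zip(text, text[1:]):
        model[current].append(following)
    return model

def generate(model, seed_char, length, rng):
    """Sample synthetic text one character at a time from the learned counts."""
    out = [seed_char]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return "".join(out)

# A tiny stand-in corpus; real language models train on terabytes of text.
corpus = "the quick brown fox jumps over the lazy dog. the dog sleeps."
model = train_bigram_model(corpus)
print(generate(model, "t", 40, random.Random(0)))
```

The generated string is gibberish that merely mimics local character statistics; the leap from this to coherent paragraphs is what transformer-based large language models achieved.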
Synthetic data can also be synthetic video, image, or sound: you artificially render media with properties close enough to those of real-world data. This similarity allows the synthetic media to serve as a drop-in replacement for the real data.
None of these individuals are real. These synthetic images were generated by StyleGAN2 (Dec 2019), a generative adversarial network from the work of Karras et al. at Nvidia. The system learned the properties of pictures of real people in order to generate realistic images of human faces. If you are curious to see more artificially generated faces, check out this generator.
This method is helpful to augment the databases used to train machine learning algorithms. For example, when training video data is not available for privacy reasons, you can generate synthetic video data to resolve that. Similarly, you can use synthetic data to increase datasets' size and diversity when training image recognition systems.
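As a minimal illustration of increasing a training set's size and diversity, synthetic variants of an existing image can be produced with simple transformations. Production synthetic-media pipelines use generative models rather than these basic operations, and the image below is just a random NumPy array standing in for a real photo:

```python
import numpy as np

def augment(image, rng):
    """Produce simple synthetic variants of one training image:
    mirror, rotations, and additive noise."""
    variants = [
        np.fliplr(image),        # horizontal mirror
        np.rot90(image),         # rotate 90 degrees
        np.rot90(image, k=2),    # rotate 180 degrees
    ]
    # Add pixel noise to simulate sensor variation, keeping values in [0, 255].
    noisy = np.clip(image + rng.normal(0, 10, image.shape), 0, 255)
    variants.append(noisy)
    return variants

rng = np.random.default_rng(0)
# Hypothetical 32x32 RGB image standing in for real training data.
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)
augmented = augment(image, rng)
print(len(augmented))  # 4 synthetic variants per original image
```

Each original image yields several plausible variants, which is one reason synthetic data can add so much diversity to vision training sets.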
Danny Lange, senior VP of AI and machine learning at Unity, the leading cross-platform game engine, says synthetic data has enabled Unity's customers to improve object recognition rates from 70% or 80% to almost 100%, because the synthetic data adds much more diversity to the training set.
Tabular synthetic data refers to artificially generated data that mimics real world data stored in tables. This data is structured in rows and columns. It could be anything ranging from a patient database to users' analytical behavior information or financial logs.
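Real tabular synthesizers (based on GANs, variational autoencoders, and similar models) learn the dependencies between columns. As a deliberately naive sketch of the idea, the example below resamples each column of a small hypothetical patient table independently, which preserves per-column distributions but ignores cross-column correlations:

```python
import random

# Hypothetical original table: rows of (age, blood_pressure, smoker).
original = [
    (34, 118, False),
    (51, 135, True),
    (29, 110, False),
    (62, 142, True),
    (45, 127, False),
]

def synthesize(rows, n, rng):
    """Generate n synthetic rows by resampling each column independently.
    Unlike real synthesizers, this ignores correlations between columns."""
    columns = list(zip(*rows))  # transpose rows into columns
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]

rng = random.Random(42)
for row in synthesize(original, 3, rng):
    print(row)
```

Every synthetic value comes from the original column's domain, yet no synthetic row needs to match any original row, which hints at why this family of techniques is attractive for privacy.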
Data is at the core of today's data science and business intelligence activities. As mentioned earlier, there are multiple enterprise scenarios in which real-world data cannot circulate between departments, subsidiaries, or partners. Synthetic data can be used as a drop-in replacement for any type of behavioral, predictive, or transactional analysis.
At Statice, our focus is on privacy-preserving tabular synthetic data. The synthetic data we generate comes with privacy guarantees. These measures ensure no individual present in the original data can be re-identified from the synthetic data.
Privacy-preserving synthetic data holds opportunities for industries that rely on customer data to innovate, where modern data protection regulations often prevent extensive use of such data. It represents a safe and compliant data protection method, and it enables internal and external data sharing.
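The source doesn't detail the exact mechanism behind these guarantees, but a standard building block for such privacy protection is differential privacy: adding calibrated noise to the statistics a generator learns from. A minimal sketch of the Laplace mechanism applied to a hypothetical count query:

```python
import numpy as np

def private_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale sensitivity/epsilon.
    A counting query has sensitivity 1: adding or removing one
    individual changes the result by at most 1."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(7)
true_count = 412  # hypothetical: number of smokers in a patient table
for epsilon in (0.1, 1.0, 10.0):  # smaller epsilon = stronger privacy, more noise
    print(epsilon, round(private_count(true_count, epsilon, rng), 1))
```

Because the noise masks any one individual's contribution, an attacker cannot tell from the released statistic whether a particular person was in the data, which is the formal sense in which re-identification is prevented.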
Synthetic data has applications in the field of natural language processing. Amazon's Alexa AI team, for instance, uses synthetic data to complete the training data of its natural language understanding (NLU) system. It gives them a solid basis for training new languages when there is no, or not enough, customer interaction data.
“When a new-language version of Alexa is under development, training data for its NLU systems is scarce. [...] The new bootstrapping tools, from Alexa AI’s Applied Modeling and Data Science group, treat the available sample utterances as templates and generate new data by combining and varying those templates.” Janet Slifka, director of research science in Alexa AI’s Natural Understanding group
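The bootstrapping idea the quote describes can be sketched as template expansion: abstract sample utterances into templates with slots, then recombine slot values to generate new training data. The slot names and values below are invented for illustration:

```python
from itertools import product

# Hypothetical utterance templates with named slots, plus slot vocabularies.
templates = [
    "play {song} by {artist}",
    "put {song} from {artist} on the speaker",
]
slots = {
    "song": ["thunderstruck", "yesterday"],
    "artist": ["ac/dc", "the beatles"],
}

def expand(templates, slots):
    """Generate synthetic training utterances by filling every template
    with every combination of slot values."""
    utterances = []
    names = sorted(slots)
    for template in templates:
        for values in product(*(slots[n] for n in names)):
            utterances.append(template.format(**dict(zip(names, values))))
    return utterances

data = expand(templates, slots)
print(len(data))  # 2 templates x 2 songs x 2 artists = 8 utterances
```

A handful of real sample utterances can thus be multiplied into a much larger, varied training set, which is exactly the leverage bootstrapping provides for a new language.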
According to Gartner's Alys Woodward, synthetic data allows organizations to "move faster and fill in the gaps in their actual data", which is crucial for building machine learning models. Her example was Amazon, which uses synthetic data to enable Alexa support for Brazilian Portuguese.
When it comes to synthetic media, a popular use for them is the training of vision algorithms. For over two years now, the Waymo team has been generating realistic driving datasets from synthetic data. Alphabet's subsidiary company uses these datasets to train its self-driving vehicle systems. It is an efficient way of including more complex and varied scenarios, as opposed to spending significant time and resources to obtain observations of similar scenarios.
Waymo is building The World's Most Experienced Driver™ and has launched fully autonomous rides in both Phoenix's East Valley and San Francisco. The company claims it can generate fully synthetic data at real-world scale, which dramatically speeds up data generation and, in turn, improves iteration times.
As its virtual cars drive through the same scenarios Waymo vehicles experience in the real world, engineers [...] manipulate those scenes by virtually adding new agents into the situation, such as cyclists, or by modulating the speed of oncoming traffic to gauge how the Waymo Driver would have reacted. Venturebeat
In the retail industry, Amazon also deployed similar techniques for the training of Just Walk Out, the system powering the Amazon Go cashier-less stores. The team generated a considerable amount and variety of synthetic customer behavior data to train its computer vision system.
“By using simulation to build a massive training set, the team was able to leverage the power of the cloud to train on months worth of data in a day, eliminating the time bottleneck and allowing rapid progress.” How the Amazon Go Store works.
The financial institution American Express has been investigating the use of tabular synthetic data. Their data science team is researching how to generate statistically accurate synthetic data from financial transactions to perform fraud detection. They were already able to use the synthetic data to help train the detection models.
To develop state of the art ML methods, including methods for anomaly detection and model interpretation, ML researchers and practitioners need to have access to data that is as close to the real one as possible. [...] we show that synthesized data follows the same distribution as the original data, and that ML models trained on synthesized data have the same performance as those trained on the original data. Efimov, Xu, Kong, Nefedov and Anandakrishnan (2020) in Using Generative Adversarial Networks to Synthesize Artificial Financial Datasets
J.P. Morgan is another large player in the financial industry that conducts research and develops algorithms to generate realistic synthetic datasets to accelerate financial AI research and development.
The Executive Director of AI Research, Rob Tillman, summarizes the challenges and benefits of synthetic data:
“In highly regulated industries which deal with sensitive data, such as finance, there are often significant barriers that impede or delay the ability of researchers and developers to use data to develop AI solutions that improve experiences or address important problems like fraud detection and anti-money laundering.”
In the field of insurance, where customer data is both an essential and sensitive resource, Swiss company La Mobilière used synthetic data to train churn prediction models. The data science team modelled tabular synthetic data after real-life customer data. They trained their machine learning models without compromising on the model performance or on their customer privacy.
“The Statice software protects the original data of our customers on the one hand, and on the other, enables us to work with the data across departments without compromising privacy or security issues.” Georg Russ, Data Scientist, Data & Analytics.
In Germany, the second largest public insurer, Provinzial, also explored the potential of this technology to aid predictive analytics, protect customer privacy, and significantly speed up data projects. They saved 3 months in evaluating data privacy risks, reduced time-to-data by 4 weeks, and achieved 97% performance effectiveness when training their 'next best offer' model on synthetic data.
In general, all customer-facing industries can benefit from privacy-preserving synthetic data, as modern data protection laws regulate personal data processing.
For example, in the healthcare field, the use of patient data is heavily regulated. Roche validated with us the use of synthetic data as a replacement for patient data in clinical research. The German Charité Lab for Artificial Intelligence in Medicine is also working on synthetic data generation for collaborative research and to facilitate the progression of different medical use cases.
Anthem, a health insurance company, recently announced that it would be working with Google Cloud to develop a synthetic data platform. The platform will generate 1.5-2 petabytes of synthetic data using algorithms and statistical models that include artificial intelligence-generated history of medical treatment, health insurance claims, and other medical information. The overall goal is to validate and train AI algorithms with massive amounts of personal health data, while minimizing privacy concerns.
Vice President of Alexa Speech, Manoj Sindhwani, says that the company is continuously investing in anonymization and synthetic data generation to protect customer privacy.
For an overview of industries and their use of privacy-preserving synthetic data, check our answer in this post about “Which industries have the strongest need for synthetic data?”
Contact us and get feedback instantly.