How to use synthetic data in Machine Learning and AI

From improving time to data to removing privacy constraints, learn how to use synthetic data in machine learning to enhance AI projects.

Machine Learning (ML) and Artificial Intelligence (AI) are helping industries worldwide grow. New technologies support companies in making informed decisions, automating processes, and cutting costs. But you can't run a successful AI project without a diverse, high-quality, and unbiased dataset. And here comes the challenge. Many companies:

Don't have enough real world data
Have data but the quality is not always good
Can't use data due to privacy regulations

These obstacles doom many projects to failure before they even start. But it doesn't have to be like that. Synthetic data can democratize ML algorithms and boost AI projects. It can help by improving the time to data, the data quality, and protecting data privacy. In this article, you'll discover:

3 problems with real world data in ML and AI
What is synthetic data in ML
How synthetic data can help ML and AI
How insurance, finance, and healthcare industries can use synthetic data in machine learning projects

Let’s get straight into it.

3 problems with real datasets in ML and AI

ML applications support industries such as finance, insurance, healthcare, and many more. But there are challenges data science teams face when working on AI projects. Let's look at the 3 most common obstacles.

Data locked up due to lengthy data access procedures

Machine Learning models need a lot of training data to provide viable outcomes. Even with a simple task, they need thousands of data points to produce results. To conduct more advanced operations (e.g., on text, image, or video recognition), the dataset has to contain millions of data points.

This can be problematic because, before using real data for ML purposes, your company has to go through lengthy data access procedures that can take up to 6 months. As a result, AI/ML projects either get postponed or are doomed to fail.

You might also give up on this time-consuming data access process altogether, because at this point you can't be sure whether the dataset is suitable for your project.

Data bias problem

Bias in Machine Learning is an error that results from wrong assumptions in the learning algorithm. For example, your company has data about thousands of your customers. You want to make use of this data and track the relationship between demographic and buying behavior data.

If the original, real data is skewed and doesn't hold a real relationship between demographic information and buying behavior, there is nothing to learn. To build an ML application, you'd need a strong pattern in the data that lets you separate it into clusters with specific characteristics.

According to Anthony Bradley of Gartner, it is not sufficient to look for bias only in the algorithm. He describes four stages relevant to AI bias: real world bias, data bias, algorithm bias, and business bias. When using synthetic data in ML and AI projects, these factors should also be taken into account.


The bias problem doesn't only result in AI inefficiencies; it can also reinforce discrimination. According to The Washington Post, Google displayed far fewer ads for high-paying executive jobs to women. That's why companies should feed Machine Learning algorithms with trustworthy and representative data.

Data locked due to privacy regulations

Real data could help ML algorithms solve many business problems. But Personally Identifiable Information (PII) and Protected Health Information (PHI) are subject to privacy regulations such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), or the Health Insurance Portability and Accountability Act (HIPAA).

An overview of EU data initiatives by IAPP (infographic).


Those regulations restrict how you can collect and use real-world data. They prevent data misuse, illegal transfers, or leaks. As a result, it can be hard or even impossible to use real data. First, asking for secondary consent for an ML project causes a lot of friction in the customer experience. Second, in the first phase of an AI project, it's difficult to estimate the data scope you'll need, for example how deep the data has to go and which edge cases it has to cover.

Data teams that use synthetic data can solve those obstacles and unlock the potential of Machine Learning projects. Let's find out what synthetic data is and how it can help ML and AI endeavors.

What is synthetic data in Machine Learning

Synthetic data is artificially created data that serves various purposes, including Machine Learning. Deep generative models (DGMs), neural networks that can replicate the data distribution they are given, learn the statistical properties of real data to produce synthetic data that mimics the original. GANs (Generative Adversarial Networks), VAEs (variational autoencoders), or a combination of both are used to generate synthetic data.

A synthetic dataset resembles the real-world sensitive data in quality and retains its statistical distribution. Synthetic data can be privacy-preserving, which means the risk of linking it back to real data is limited. Synthetic data can either replace or complement a real dataset, and it can serve as an alternative source for training ML applications and improving AI projects.
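To make the idea of "learn the distribution, then sample from it" concrete, here is a minimal sketch. It deliberately swaps the deep generative model for something much simpler, a two-column Gaussian fitted with the Python standard library, so it is an illustration of the principle rather than how a GAN or VAE works. The age and spend columns and all function names are assumptions made up for this example.

```python
import math
import random
import statistics

def fit_gaussian_2d(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(params, n, rng):
    """Draw n new (x, y) rows from the fitted distribution."""
    rho = params["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        # Mixing z1 into y reproduces the correlation between the columns.
        y = params["my"] + params["sy"] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        rows.append((x, y))
    return rows

rng = random.Random(0)
# Hypothetical "real" data: customer age and yearly spend, where spend rises with age.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 50 * age + rng.gauss(0, 300)))

params = fit_gaussian_2d(real)
synthetic = sample_synthetic(params, 2000, rng)
synthetic_params = fit_gaussian_2d(synthetic)
print("real correlation:     ", round(params["rho"], 3))
print("synthetic correlation:", round(synthetic_params["rho"], 3))
```

None of the synthetic rows is a copy of a real row, yet the means and the correlation between the two columns stay close to the original. Deep generative models aim for the same outcome on far more complex, high-dimensional data.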

The use of synthetic data generation depends on the use case and industry. Let's see the different benefits of making this method a part of your ML projects.

How synthetic data can help ML & AI

Synthetic data opens up various possibilities for AI projects that use ML algorithms. Let's dive into details.

Improve time to data with synthetic data

Although companies process hundreds of thousands of data points, they still face data access problems. Healthcare companies might deal with lengthy access procedures for rare disease data collection. A financial institution might struggle with accessing data about fraudulent transactions.

Generating synthetic data can solve the data access problem by significantly reducing the time to access data. Unlike sensitive datasets, properly anonymized synthetic data doesn't have to go through the long access request process.

With a dataset artificially created from the original one, your data science team gets rapid access to the data. They'll be able to understand its statistical patterns and verify whether it is relevant for your ML models.
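One way a team might check those statistical patterns is to compare summary statistics of a real column against its synthetic counterpart. The sketch below is one possible approach using only the Python standard library. The function names, the example transaction amounts, and the choice of statistics are assumptions for illustration, not a standard fidelity metric.

```python
import random
import statistics

def summarize(values):
    """Mean, standard deviation, and quartiles of a numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }

def fidelity_report(real, synthetic):
    """Relative gap between each summary statistic of the two columns."""
    r, s = summarize(real), summarize(synthetic)
    return {name: abs(r[name] - s[name]) / (abs(r[name]) or 1.0) for name in r}

rng = random.Random(0)
# Hypothetical transaction amounts; the "poor" synthetic column has a shifted mean.
real = [rng.gauss(1000, 200) for _ in range(5000)]
good_synthetic = [rng.gauss(1000, 200) for _ in range(5000)]
poor_synthetic = [rng.gauss(1300, 200) for _ in range(5000)]

good_report = fidelity_report(real, good_synthetic)
poor_report = fidelity_report(real, poor_synthetic)
print("good:", {k: round(v, 3) for k, v in good_report.items()})
print("poor:", {k: round(v, 3) for k, v in poor_report.items()})
```

Small gaps across the board suggest the synthetic column is a reasonable stand-in for exploration, while a large gap (like the shifted mean in the second report) is a signal to revisit how the data was generated.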

Also, you can aggregate synthetic data together, increasing your sample size. For example, you can create a synthetic data lake for exploration. Your data science team will have more flexibility in filtering out data for a specific use case. All this can be beneficial for your AI project.

Improve the quality of the data with synthetic data

Data science teams often spend time cleaning data before using it to fuel ML algorithms. This time-consuming process is crucial in determining the success of the AI project. Poor quality or biased data will negatively impact the results of Machine Learning.

The generation of synthetic data can help streamline the data cleaning process. For example, differentially-private synthetic data suppresses outliers, which helps reduce bias and improve the training data quality.

As a result, properly generated synthetic data can improve the quality of the original data and increase the success of your AI project. Synthetic data is also ready-to-use so you don't have to clean or format it.
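To see why differential privacy limits the influence of outliers, consider the sketch below. It is not a synthetic data generator. It applies the Laplace mechanism to a single statistic, a mean, after clipping every value to a fixed range. The income figures, the clipping bounds, and the epsilon value are all assumptions chosen for illustration, and the bounds are assumed to be picked without looking at the data.

```python
import random

def dp_mean(values, lower, upper, epsilon, rng):
    """Mean with clipping and Laplace noise (a sketch of the Laplace mechanism)."""
    # Clipping bounds how much any single record can move the result.
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    scale = sensitivity / epsilon
    # The difference of two exponential draws follows a Laplace distribution.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clipped) / len(clipped) + noise

rng = random.Random(42)
# Hypothetical incomes: 999 typical records plus one extreme outlier.
incomes = [rng.gauss(50_000, 8_000) for _ in range(999)] + [50_000_000]

naive = sum(incomes) / len(incomes)
private = dp_mean(incomes, lower=0, upper=200_000, epsilon=1.0, rng=rng)
print(f"naive mean:   {naive:,.0f}")
print(f"private mean: {private:,.0f}")
```

The single outlier roughly doubles the naive mean, while the clipped and noised mean stays close to the typical income. Differentially private generators rely on similar bounding of individual contributions, which is where the outlier suppression comes from.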

Remove privacy constraints with synthetic data

It takes months to go through compliance verification processes to open up real world data or get secondary consent to use it for Machine Learning purposes. In many cases, you either don't get consent, or the de-identified data quality is too low to make a successful ML application.

Creating synthetic data with the right privacy guarantees can streamline the compliance process. The legal constraints around data processing are much lighter because privacy-preserving synthetic data doesn't contain real world data or sensitive personal data. For example, you don't have to ask for secondary consent to use anonymized synthetic data for a new ML project.

Using synthetic data also means you safeguard the privacy of your customers, exposing them to less risk. As a result, you can experiment on a synthetic dataset, test different machine learning models, see what works and what doesn't, and process the data without risks related to privacy regulation breaches.

Lastly, using synthetic data opens up new cooperation possibilities and sets a new basis for the ML project's success. You can collaborate with a third party, e.g., use synthetic data in a proof of concept (POC) and test it out before implementing it on a wide scale.

Now, let's dive into 3 synthetic data use cases from industries that have to stay privacy- and security-compliant.

How insurance, finance, and healthcare industries can use synthetic data

Insurance: using synthetic data to develop ML projects

Insurance companies hold huge amounts of customer data. They can use those datasets for their AI projects only after a long compliance and governance process. This slows down data science teams and increases the risk of working with outdated data, which results in failed projects and financial losses.

To derive learnings, perform advanced analytics, or develop machine learning models, it is not necessary to access specific personal information. It is enough to develop insights based on aggregated data.

That's why, to speed up compliance and governance processes, insurance entities can create synthetic data. For example, they can fuel ML and AI models with artificially generated data and train them to predict the probability of customer churn.

With historical synthetic data, a data science team can build robust prediction models. These models will estimate the likelihood of customer churn.
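A common way to check whether such a model is useful is to train it on synthetic data and test it on real data. The sketch below illustrates that workflow with a small logistic regression written in plain Python. Everything in it is hypothetical: the two features (tenure and claims), the assumed relationship to churn, and the fact that the "synthetic" table is simply a second sample from the same generator, standing in for the output of a synthesizer that preserved the original patterns.

```python
import math
import random

def make_customers(n, rng):
    """Hypothetical customer table: (scaled tenure, scaled claims) -> churn flag."""
    rows = []
    for _ in range(n):
        tenure = rng.random()               # tenure scaled to [0, 1]
        claims = rng.randint(0, 5) / 5      # claims last year scaled to [0, 1]
        logit = 2.0 - 6.0 * tenure + 3.0 * claims  # assumed ground-truth relationship
        churned = 1 if rng.random() < 1 / (1 + math.exp(-logit)) else 0
        rows.append(((tenure, claims), churned))
    return rows

def train_logistic(rows, epochs=300, lr=1.0):
    """Fit a two-feature logistic regression with batch gradient descent."""
    w0 = w1 = b = 0.0
    n = len(rows)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for (x0, x1), y in rows:
            p = 1 / (1 + math.exp(-(w0 * x0 + w1 * x1 + b)))
            g0 += (p - y) * x0
            g1 += (p - y) * x1
            gb += p - y
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
        b -= lr * gb / n
    return w0, w1, b

def accuracy(model, rows):
    """Share of rows where the predicted churn flag matches the actual one."""
    w0, w1, b = model
    hits = sum((w0 * x0 + w1 * x1 + b > 0) == (y == 1) for (x0, x1), y in rows)
    return hits / len(rows)

real_train = make_customers(1500, random.Random(1))
synthetic_train = make_customers(1500, random.Random(2))  # stand-in for synthesizer output
real_holdout = make_customers(1500, random.Random(3))

acc_real = accuracy(train_logistic(real_train), real_holdout)
acc_synth = accuracy(train_logistic(synthetic_train), real_holdout)
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_synth:.3f}")
```

If the model trained on synthetic data scores close to the one trained on real data when both are evaluated on a real holdout set, that is a good sign the synthetic dataset kept the patterns that matter for churn prediction.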

Healthcare: using synthetic data in clinical trials

Healthcare entities operate on extremely sensitive real-world clinical trial data that is manually processed within the company. This is time-consuming and, in the case of de-identified patient data, doesn't provide maximum data utility. If the healthcare entity wanted to cooperate with an external data science expert, sharing the data wouldn't be possible.

By using synthetic datasets, healthcare entities can more freely use data to train Machine Learning models.

In Sweden, for example, researchers at the Sahlgrenska University Hospital are developing synthetic datasets of skin lesions to improve the early detection of skin cancer. In spite of the relatively large size of the original dataset, it was highly unbalanced. It is for this reason that researchers decided to use GANs to augment the amount of data and balance the datasets in order to improve the accuracy and robustness of classification networks used in diagnosis.

Their goal is to facilitate the sharing of data between institutes and complement and balance existing datasets to improve the performance of other AI tools.
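The balancing step can be illustrated without a GAN. The sketch below uses a much simpler stand-in technique: it creates new minority-class samples by interpolating between random pairs of existing ones, similar in spirit to SMOTE-style oversampling. The two numeric features, the class sizes, and the function name are assumptions for illustration and have nothing to do with the actual skin lesion images.

```python
import random
from collections import Counter

def balance_by_interpolation(features, labels, rng):
    """Add interpolated samples to each smaller class until all classes match the largest."""
    counts = Counter(labels)
    majority_size = max(counts.values())
    new_features, new_labels = list(features), list(labels)
    for label, count in counts.items():
        pool = [f for f, l in zip(features, labels) if l == label]
        for _ in range(majority_size - count):
            a, b = rng.choice(pool), rng.choice(pool)
            t = rng.random()
            # A new point somewhere on the line segment between two real samples.
            new_features.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
            new_labels.append(label)
    return new_features, new_labels

rng = random.Random(7)
# Hypothetical two-feature dataset: 950 benign samples and only 50 malignant ones.
features = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(950)]
features += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(50)]
labels = ["benign"] * 950 + ["malignant"] * 50

balanced_features, balanced_labels = balance_by_interpolation(features, labels, rng)
print(Counter(labels), "->", Counter(balanced_labels))
```

A GAN plays the same role for images, where simple interpolation would produce blurry or unrealistic samples: it generates additional plausible examples of the rare class so the classifier sees both classes equally often.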

As a result, they move past expensive and time-consuming procedures for:

Consent collection
Patients' sensitive data de-identification in new clinical trials

What's more, they remove the risks associated with non-compliance.

Finance: using synthetic data for machine learning project development

Financial companies also have a lot of data that could be useful for Machine Learning projects but are subject to data privacy regulations. Moreover, the banking industry is rapidly developing new ways of incorporating artificial intelligence in services, with cost savings estimated at $447 billion by 2023 from AI applications.

According to Business Insider, in order to gain key business benefits, and to respond to consumer demands, financial institutions are implementing AI algorithms across every branch of their business.

With synthetic data, financial institutions can operate on safe and compliant financial datasets and use them to build and validate ML and AI models. They can also move synthetic datasets to the cloud, which is a more cost-effective option than on-premises hosting. Synthetic data also lets companies in the financial sector engage in external cooperation and tailor datasets to the particular use case they want to analyze.

Synthetic data in your enterprise project

It's not too bold to say that synthetic data can revolutionize Machine Learning algorithms and speed up AI projects.

Companies that address their current challenges with synthetic data will gain a competitive edge. It will help them:

Operate more on autopilot
Contribute to new findings
Come up with more accurate, case-tailored predictions for the future.

If you want to explore the topic more, our team will be happy to talk about privacy-preserving data.
