Synthetic data tools: Open source or commercial? A guide to building vs. buying

By Evgeniya Panova

We created this post to answer a recurring question our team encounters in conversations with customers: what are the benefits, pitfalls, and implications of building synthetic data tools versus using off-the-shelf services? This article draws on our team's five years of experience supporting enterprises with data protection.

This article focuses only on tools for synthesizing structured data; unstructured data may require different considerations. Note that these are suggestions based on our team's experience. Various factors can influence them, including your use case, your industry, the size of your team, and your geography.

⬝⬝⬝

Synthetic data is expected to completely overshadow real data in AI models by 2030, according to a widely cited Gartner estimate. The market for synthetic data continues to grow: according to Cognilytica, it will reach $1.15B by 2027, up from $110M in 2021.

The use of synthetic data is growing across many industries, and you may wonder how to get started.

In this article, we compare key aspects of open source and commercial synthetic data solutions and analyze them based on a few important elements of a healthy data project. Here’s what you’ll learn: 

  1. Key differences between open source and commercial synthetic data solutions (for a detailed overview of commercial synthetic data companies, read this blog post).
  2. An overview of open source synthetic data tools.
  3. Building vs. buying: What you need to know beforehand.
  4. Synthetic data project cost estimates (available in the extended version, our build vs. buy guide).

Download the complete guide to access the full project cost estimates.

An overview of open source and commercial synthetic data solutions

Today, structured synthetic data generation software includes: 

  • Commercial vendors' software: platforms and frameworks that plug into your data pipeline and provide synthetic dataset generation and evaluation functionality out-of-the-box. 
  • Open source tools: offer code for the synthetic data generation that you can modify and enhance to build your own solution. 

There are several technological approaches to generating synthetic data: VAEs (variational autoencoders), GANs (generative adversarial networks), and other combinations of deep learning models.
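Whichever modeling approach a tool uses, it helps to understand the baseline they all improve on. The standard-library sketch below resamples each column independently from its empirical marginal distribution; copula- and GAN-based tools exist precisely because real data also has cross-column structure that this naive approach throws away. The column names and values are invented for illustration.

```python
import random

# Real rows (made up for this example). Note that age and income are
# correlated in realistic data -- a property the naive method below ignores.
real_rows = [
    {"age": 25, "income": 30_000},
    {"age": 40, "income": 60_000},
    {"age": 58, "income": 52_000},
]

def naive_synthesize(rows, n, seed=0):
    """Resample each column on its own, ignoring cross-column structure."""
    rng = random.Random(seed)
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    return [
        {key: rng.choice(values) for key, values in columns.items()}
        for _ in range(n)
    ]

synthetic = naive_synthesize(real_rows, n=5)
# Each value is plausible on its own, but an age/income pairing may appear
# that never co-occurred in the real data -- which is why copulas, GANs,
# and VAEs model the joint distribution instead.
print(synthetic)
```

This is a deliberately weak baseline, not a recommendation: the tools listed below earn their keep by capturing the correlations this sketch discards.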

In terms of functionality and services, most commercial vendors offer some form of privacy guarantee, meaning the generation mechanisms are designed to prevent the re-identification of individuals from the original data. Commercial vendors offer SaaS, professional services, support, and licensing based on monthly or annual fees. Some also offer free trials or plans.

Open source solutions are mostly free or low-cost, which makes them an attractive option for projects with a smaller budget, and many tools have communities and tutorials that help you get started.

Find the list of commercial vendors here.

A list of open source synthetic data tools

Let’s look at some of the open source solutions available on the market.

  1. Copulas: Python library for modeling multivariate distributions and sampling from them using copula functions.
  2. CTGAN: SDV’s collection of deep learning-based synthetic data generators for single table data.
  3. DataGene: Tool to train, test, and validate datasets, detect and compare dataset similarity between real and synthetic datasets.
  4. DoppelGANger: Synthetic data generation framework based on generative adversarial networks (GANs).
  5. DP_WGAN-UCLANESL: Trains a Wasserstein generative adversarial network (WGAN) on the real, private dataset.
  6. DPSyn: Algorithm for synthesizing microdata while satisfying differential privacy.
  7. Faker: Python package that generates fake data (Note: this tool does not generate synthetic data but offers dummy data).
  8. Generative adversarial nets for synthetic time series data: Repository that shows how to create synthetic time-series data using generative adversarial networks (GANs).
  9. Gretel.ai: Commercial synthetic data vendor that offers open source functionality.
  10. Mimesis: Fake data generator for Python.
  11. mirrorGen: Python tool that generates synthetic data based on user-specified causal relations among features in the data. 
  12. Open SDP (COMMUNITY): Community for sharing educational analytic tools and resources.
  13. Plait.py: Program for generating fake data from composable yaml templates.
  14. Pydbgen: Python package that generates a random database table based on the user's choice of data types. 
  15. Smart noise synthesizer: Differentially private open source synthesizer for tabular data.
  16. Synner: Tool to generate real-looking synthetic data by visually specifying the properties of the dataset.
  17. Synth: Data-as-code tool that provides a simple CLI workflow for generating consistent data in a scalable way. 
  18. Synthea: Synthetic patient generator that models the medical history of synthetic patients.
  19. Synthetic data vault (SDV): One of the first open source synthetic data solutions, SDV provides tools for generating synthetic data for tabular, relational, and time series data.  
  20. TGAN: Generative adversarial training for generating synthetic tabular data.
  21. Tofu: Python library for generating synthetic UK Biobank data. 
  22. Twinify: Software package for privacy-preserving generation of a synthetic twin to a given sensitive dataset.
  23. YData: Synthetic structured data generator by YData, a commercial vendor.

You can also find descriptions of all open source solutions in our guide or in this GitHub Awesome List.
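The list above distinguishes tools like Faker and Mimesis, which produce dummy data, from true synthesizers. To make that distinction concrete, here is a minimal standard-library sketch of what fake-data generators produce: structurally valid records that preserve no statistical properties of any real dataset. The field names and value pools are invented for illustration.

```python
import random
import string

# Illustrative value pools -- a real tool like Faker ships large locale-aware
# providers for names, addresses, companies, and so on.
FIRST_NAMES = ["Alice", "Bob", "Carla", "Dmitri"]
DOMAINS = ["example.com", "example.org"]

def fake_record(rng):
    """Generate one structurally valid but statistically meaningless record."""
    name = rng.choice(FIRST_NAMES)
    return {
        "name": name,
        "email": f"{name.lower()}@{rng.choice(DOMAINS)}",
        "customer_id": "".join(rng.choices(string.digits, k=8)),
    }

rng = random.Random(42)
rows = [fake_record(rng) for _ in range(3)]
print(rows[0])
```

Dummy data like this is useful for populating test environments, but because it is sampled from fixed value pools rather than learned from real data, it cannot support analytics or model training the way synthetic data can.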

Synthetic data project assessment: Criteria for building vs. buying

What aspects should you consider when deciding whether to build or to buy?

It is likely that your new synthetic data generator will become part of the data lifecycle of a project or use case you are developing. To pick the right tool, zoom in on the specific steps that synthetic data projects follow: the constraints on these steps will help you determine which tool is most appropriate.

Data access

Whether you acquire data externally, gather it internally, or plan to synthesize completely new datasets, access has a big impact on everything that follows. Quickly accessing and sharing data with stakeholders is often the difference between a successful project and a failure. Consider how you will handle potential data access issues now and in the future.

  • Choose a commercial vendor: Commercial vendors offer plug-and-play functionality for managing data access levels and roles. Big companies, whose typical data projects involve cross-departmental work and/or third parties, may find it expensive and time-consuming to build equivalent data access capabilities on top of open source.
  • Choose an open source solution: If you are developing or testing one specific use case that you do not intend to scale, open source lets you build fully custom data access functionality that precisely addresses your needs, without purchasing monthly or yearly commercial vendor licenses.

Synthetic data in a typical data pipeline.

Data preparation

 

Data preparation is among the most time-consuming and important phases of many data projects. To train machine learning models with synthetic test data, for example, data scientists have to prepare the original data for synthesization. 

Commercial vendors, who work directly with various types of customers and use cases, tend to offer out-of-the-box functionality for a wider range of use cases and data types. Having the necessary support and expertise in a wide range of issues is a big advantage of commercial solutions. 

While commercial vendors offer automated pre-processing features, some open source tools might require you to prepare your original data for synthesization manually. The list of pre-processing tasks can grow quickly the more complex your dataset is. 

  • Choose a commercial vendor: For complex datasets that require a lot of pre-processing and custom rules. Depending on the vendor, you'll get features to automate pre-processing. 
  • Choose an open source tool: If you are working with small and simple datasets, apply straightforward rules and do not plan to scale your project.
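As a concrete illustration, the kind of manual pre-processing an open source synthesizer may expect can be sketched in a few lines of standard-library Python. The column names and the mean-imputation strategy below are illustrative assumptions, not a prescription; real pipelines grow quickly with dataset complexity, as noted above.

```python
# Toy dataset with a missing numeric value and a categorical column.
raw = [
    {"country": "DE", "age": 34},
    {"country": "FR", "age": None},
    {"country": "DE", "age": 29},
]

def preprocess(rows, numeric_col, categorical_col):
    """Impute missing numerics with the observed mean; integer-encode categories."""
    observed = [r[numeric_col] for r in rows if r[numeric_col] is not None]
    mean = sum(observed) / len(observed)
    # Map each category to a stable integer code (sorted for determinism).
    codes = {c: i for i, c in enumerate(sorted({r[categorical_col] for r in rows}))}
    cleaned = [
        {
            numeric_col: r[numeric_col] if r[numeric_col] is not None else mean,
            categorical_col: codes[r[categorical_col]],
        }
        for r in rows
    ]
    return cleaned, codes

clean, codes = preprocess(raw, "age", "country")
print(clean)   # missing age imputed as 31.5; countries encoded as integers
```

Every such step (imputation, encoding, date parsing, outlier handling) is something a commercial platform may automate and an open source pipeline may require you to write and maintain yourself.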

Synthetic data quality and utility assessment

Data utility refers to the analytical completeness and validity of the data. Synthetic data utility requirements are closely tied to your use case: if you plan to use your synthetic data for machine learning model training or analytics, you should evaluate its quality and utility first. Most tools provide some utility metrics you can quickly test, and commercial tools usually offer a larger choice of out-of-the-box evaluations.

  • Choose a commercial vendor: Your data complexity will determine how long it takes to validate the results. Even if you currently have only one dataset or one use case, your approach might need to be validated on several datasets. If that's the case, commercial vendor tools may be more convenient: their utility evaluations can be run on datasets of varying complexity, and the output can be adjusted and controlled. Additionally, some commercial vendor solutions offer ways to assess the performance of machine learning models.
  • Choose an open source solution: Open source tools are a good solution for simple datasets or if you have the time and expertise to cover a wide range of situations encountered in different datasets. Open source can be a good option in situations where you require specific utility metrics and don't need privacy and utility guarantees. 
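As a hedged illustration of the simplest kind of utility metric such evaluators build on, the sketch below compares a single column's marginal distribution in real vs. synthetic data using total variation distance (0 means identical marginals, 1 means completely disjoint). Production evaluators combine many metrics, including cross-column and model-based ones; this single number is only a starting point.

```python
from collections import Counter

def total_variation(real_col, synth_col):
    """Total variation distance between two empirical categorical distributions."""
    real_freq = Counter(real_col)
    synth_freq = Counter(synth_col)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_col) - synth_freq[c] / len(synth_col))
        for c in categories
    )

real = ["A", "A", "B", "B", "C"]
synth = ["A", "B", "B", "B", "C"]
print(total_variation(real, synth))  # 0.2 -- small marginal drift
```

A low score here says nothing about correlations between columns or downstream model performance, which is exactly the gap the out-of-the-box evaluation suites of commercial tools aim to cover.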

Synthetic data privacy assessment

Another crucial aspect of data access is privacy. To share synthetic data derived from data that contains personal information, you need to ensure it can withstand re-identification attacks. When companies use synthetic data as an anonymization method, the biggest question is how to assess the privacy risks.

Privacy is an empirical field, and without experts, it is hard to assess the risks, run privacy attacks, comply with privacy laws, and get the approval of the DPO (Data Protection Officer). If you are building your own synthetic data solution based on open source tools and need a strong privacy guarantee, we recommend involving data privacy experts to develop and verify the privacy evaluations you need.

Keep in mind that building a privacy evaluation is also time-consuming. Depending on the complexity of the use case, it can take three to six months to research, develop, test, and approve synthetic data privacy with or for a DPO.

  • Choose a commercial vendor: Commercial solutions are recommended if you don't have the time, resources or specialized knowledge to develop complex privacy evaluations in-house. The commercial vendor is a better option when your synthetic datasets are generated based on sensitive original data (for example, customer data containing personally identifiable information) and need to be shared across multiple departments or external stakeholders. This is because privacy evaluators are already built-in and tested. 
  • Choose an open source tool: Some open source solutions offer privacy evaluators. However, those metrics might not be robust enough to provide comprehensive and legally meaningful evaluations of the privacy risks which compliance professionals can understand. Open source would be a good fit if you don’t need to involve DPOs or demonstrate compliance in your project.
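One privacy heuristic that evaluators sometimes expose is distance to closest record (DCR): a synthetic row that sits very close to a real row suggests the generator may have memorized an individual. The sketch below, with invented numeric data, shows the idea; as this section stresses, a single heuristic like this is not a substitute for expert-designed privacy attacks or a legally meaningful evaluation.

```python
import math

def closest_record_distance(synthetic_row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)

# Invented (age, income) records for illustration.
real_rows = [(34.0, 52_000.0), (29.0, 61_000.0)]
synthetic_rows = [(33.9, 52_010.0), (45.0, 80_000.0)]

for row in synthetic_rows:
    print(row, round(closest_record_distance(row, real_rows), 1))
# The first synthetic row lies very close to a real record, which a DCR
# check would flag for manual review before the data is shared.
```

In practice, features must be scaled before distances are meaningful, thresholds must be justified to compliance stakeholders, and DCR alone does not detect attribute-inference or membership-inference risks.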

Ease of use

Finally, think of who needs access to a synthetic data platform in your team or company. Sometimes, it is not just data scientists but DPOs, managers or even CEOs. 

For technical users, open source tools are relatively easy to use, and some have Discord or Slack channels where users can ask questions and solve issues collectively. Most open source tools, however, are only developer toolkits or libraries, which are not suited to non-technical users.

Commercial synthetic data companies typically offer ready-to-use platforms with GUIs (graphical user interfaces) and expert support. You don't have to be a technical user to take advantage of these platforms, and vendors can support custom data types and extend functionality if needed.

  • Choose a commercial vendor: When you have a specialized case, such as healthcare data, when you are not ready to configure everything yourself, and/or when other team members need access and must be able to understand and use the tool.
  • Choose an open source tool: When you want to have 100% control over the functionality and independence from third party software. Go for open source solutions that have communities around them to get support when needed. 

Taking a hybrid approach

Sometimes you don't have to choose between open source and commercial vendors because you can take advantage of both with a hybrid approach.

For instance, you may be happy with what you've built in-house, but then your project scales and you need to perform extensive privacy evaluations in order to share your synthetic data. In this case, you can run privacy assessments using a commercial vendor's tested evaluation functionality.

You can also use vendor expertise and services when you need expert help building your own tool.

Use this decision tree to evaluate your build vs. buy decision

Synthetic data costs

The cost of setting up and running your synthetic data project varies greatly. Creating a unit devoted to synthetic data can be very expensive. On the other hand, if your project is small, free open source software like SDV may suffice.

Let's consider the following example to estimate the potential budget needs of a synthetic data project. Suppose you need to run a small synthetic data project without the need of extensive (or any) privacy evaluations and DPO approvals. You plan to use it within your data science team and no external stakeholders will be involved.

Let’s take a look at the approximate costs of such a project.

Download our guide to access the full project cost estimates.

Summary

Commercial vendors and open source tools both have their strengths. The specifics and goals of your data project will determine which category suits you best. When making your decision, consider your project's use case, its complexity, your stakeholders' needs, your maintenance budget, and, of course, the security of your sensitive data.


Download the complete guide

