Unlike real data, which is derived from people's information, synthetic data is generated by machine learning algorithms.
Synthetic data is a collective term, and not all synthetic data has the same characteristics. A synthetic dataset is not simply a redesign of previously existing data but a set of completely new data points.
In this article, we talk about how to generate synthetic data. We present the logic behind synthetic tabular data generation and the role of deep learning in the process. We present two models to generate tabular synthetic data and explain which approach we decided to follow at Statice.
As previously explained in Types of synthetic data and real-life examples, there are different synthetic data types: structured and unstructured. In this post, we’ll focus on our field of expertise, the generation of synthetic tabular data. The techniques we mention, however, have also been studied and used for unstructured data generation.
The end goal of synthetic tabular data generation is to take a real data source and create from it a synthetic data source with similar statistical properties.
Having similar statistical properties means that we need to reproduce the distribution to the extent that we should ultimately be able to infer the same conclusion from both versions of the data - synthetic and real. We also need to maintain the structure of the real data.
To do so, we need to learn an approximated distribution or process compatible with the real data (i.e., a generative model) that can later be used to sample structurally and statistically comparable synthetic data.
The simple approach would be to count the occurrences of each value in each column independently. The resulting discrete distributions become our model. We could then sample from them to create new synthetic records, for example synthetic “employee” profiles.
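As a rough illustration of this counting approach, here is a minimal Python sketch. The toy table, its column names, and the helper functions `fit_marginals` and `sample_from_marginals` are invented for this example and are not taken from any real dataset or library.

```python
import random
from collections import Counter

# Toy "real" table; columns and values are invented for illustration.
real_rows = [
    {"department": "sales", "seniority": "junior"},
    {"department": "sales", "seniority": "senior"},
    {"department": "engineering", "seniority": "senior"},
    {"department": "engineering", "seniority": "senior"},
]


def fit_marginals(rows):
    """Count how often each value occurs in each column, independently."""
    columns = list(rows[0].keys())
    return {col: Counter(row[col] for row in rows) for col in columns}


def sample_from_marginals(marginals, n, rng):
    """Draw every column on its own, ignoring any link between columns."""
    samples = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        samples.append(row)
    return samples


marginals = fit_marginals(real_rows)
synthetic_rows = sample_from_marginals(marginals, n=5, rng=random.Random(0))
```

Because each column is sampled separately, this sketch can produce a junior engineering profile even though no such row exists in the toy table.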
However, this approach would miss possible connections between the columns. To include these patterns, one solution could be to count the occurring combinations. As a result, instead of several marginal distributions, you obtain a joint distribution that you can use to create the synthetic data table.
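A minimal sketch of the joint-counting idea could look as follows. The toy rows and the helper names `fit_joint` and `sample_from_joint` are assumptions made for this example.

```python
import random
from collections import Counter

# Toy "real" table as (department, seniority) tuples; values are invented.
real_rows = [
    ("sales", "junior"),
    ("sales", "senior"),
    ("engineering", "senior"),
    ("engineering", "senior"),
]


def fit_joint(rows):
    """Count each full combination of values, one count per distinct row."""
    return Counter(rows)


def sample_from_joint(joint, n, rng):
    """Draw whole rows in proportion to how often each combination occurred."""
    combinations = list(joint.keys())
    weights = list(joint.values())
    return rng.choices(combinations, weights=weights, k=n)


joint = fit_joint(real_rows)
synthetic_rows = sample_from_joint(joint, n=5, rng=random.Random(0))
```

Sampling whole combinations keeps the relationship between the columns, at the price of only ever producing combinations that were observed in the real data.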
Theoretically, this is a valid approach, but it does not scale as the dataset’s complexity increases. The more columns you add, the more combinations appear. At some point, you simply lack the data points to learn the distribution properly. This is why we need a more robust model to tackle the complexity of the data.
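The arithmetic behind this scaling problem can be sketched in a few lines. The dataset size of 10,000 rows and the ten distinct values per column are hypothetical numbers chosen only for illustration.

```python
from math import prod


def possible_combinations(cardinalities):
    """Number of distinct rows a table could contain, given the number
    of distinct values in each column."""
    return prod(cardinalities)


n_rows = 10_000  # hypothetical dataset size

for n_columns in (2, 5, 10):
    cells = possible_combinations([10] * n_columns)
    # Average number of real rows available per possible combination.
    print(n_columns, cells, n_rows / cells)
```

With ten columns of ten values each there are already 10^10 possible combinations, far more than the 10,000 rows available to estimate them.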
You can use such a simple approach when your data is basic and doesn’t contain overly complicated dependencies. For a simple table with very few columns and few or no dependencies, a very simplistic model can be a fast and easy way to generate synthetic data.
As your data grows in complexity, you need to upgrade the type of model used. Neural networks (NNs) are well suited to this kind of transformation problem because they are good at finding patterns in data. Their transformation functions map the data to a distribution that is easier to learn, without sacrificing information.
NNs are constructs of interconnected neurons arranged in layers that can display complicated behaviors. These networks are trainable functions that, when shown examples, can learn to generate patterns.
An NN is trained to reproduce data through prediction and correction. As it learns to generalize, it produces representations that can be used to generate synthetic data. For this reason, NNs are particularly suitable for creating synthetic data.
Remember that our revised problem is to find a transformation function that translates between the space where the original data lives and some latent space in which the data takes an easy-to-learn shape.
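To make the idea of a latent space concrete, here is a hand-written toy example in which the transformation is simply a logarithm. This is an assumption made for illustration only: a neural network would learn its transformation from the data, and the function names `to_latent` and `from_latent` are hypothetical.

```python
import math
import random
import statistics


def to_latent(x):
    """Map a skewed positive value to a space where it looks Gaussian."""
    return math.log(x)


def from_latent(z):
    """Inverse transformation, back to the original data space."""
    return math.exp(z)


rng = random.Random(0)

# Hypothetical skewed "real" column (for example, a salary-like quantity).
real_values = [rng.lognormvariate(10.0, 0.5) for _ in range(1000)]

# In latent space the data is roughly normal, so two numbers describe it.
latent = [to_latent(x) for x in real_values]
mu = statistics.mean(latent)
sigma = statistics.stdev(latent)

# Sample in the easy latent space, then translate back.
synthetic_values = [from_latent(rng.gauss(mu, sigma)) for _ in range(1000)]
```

The hard-to-describe skewed distribution becomes a plain normal distribution after the transformation, which is the kind of simplification the latent space is meant to provide.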
We can train a neural network to become such a transformation function. That is, train it to translate data between its original and the desired latent representation. For this purpose, two very prominent neural network architectures are generative adversarial networks (GANs) and variational autoencoders (VAEs).
In the last few years, advancements in machine learning and data science have put in our hands a variety of deep generative models that can learn a wide range of data types. VAEs and GANs are two commonly used architectures in the field of synthetic data generation. They follow different approaches to training, and both have pros and cons depending on your data types.
VAEs, the first type of NN, come from the field of unsupervised learning and the autoencoder family. As generative models, they are designed to learn the underlying distribution of the original data and are very efficient at modeling complex data.
VAEs function in two steps. First, an encoder network transforms the original complex distribution into a latent distribution. A decoder network then transforms this distribution back to the original space.
This double transformation, encoded-decoded, appears cumbersome at first glance but is necessary to formulate a quantifiable reconstruction error. Minimizing this error is the objective of the VAE training and what turns it into the desired transformation function, while an additional regularization objective controls the shape of the latent distribution.
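The two training objectives can be written down as a short sketch. It assumes a squared-error reconstruction term and a standard normal target for the latent distribution, which are common choices but are not specified in this article. The weighting parameter `beta` is likewise an assumption added for illustration.

```python
import math


def reconstruction_error(x, x_reconstructed):
    """Squared error between an input record and its decoded reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, exp(log_var)) to N(0, 1), summed over
    latent dimensions. This is the regularization term that shapes the
    latent distribution."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vae_loss(x, x_reconstructed, mu, log_var, beta=1.0):
    """Reconstruction error plus weighted regularization (beta is assumed)."""
    return reconstruction_error(x, x_reconstructed) + beta * kl_to_standard_normal(
        mu, log_var
    )
```

In this sketch, `mu` and `log_var` stand for the encoder's output for one record, and `x_reconstructed` for the decoder's output. The loss is zero only when the reconstruction is perfect and the encoded distribution matches the standard normal exactly.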
VAEs are a straightforward approach to solving the transformation problem. They are relatively easy to implement and to train. Their weak point, however, lies in their training objective. As your real data becomes more heterogeneous (e.g., a mix of categorical, binary, and continuous columns), it also becomes more difficult to formulate a reconstruction error that works well on all data components.
If, for example, the reconstruction error puts too much emphasis on getting the continuous parts of the data right, the quality of the categorical parts might suffer. When it becomes too difficult to formulate a good reconstruction error, it might be preferable to use a different approach to synthetic data generation, such as GANs.
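The balancing problem can be illustrated with a toy reconstruction error for one record that has a continuous and a categorical column. The column names, the choice of squared error and cross-entropy, and the weights `w_continuous` and `w_categorical` are all assumptions made for this sketch.

```python
import math


def continuous_error(x, x_hat):
    """Squared error for a continuous value."""
    return (x - x_hat) ** 2


def categorical_error(true_category, predicted_probs):
    """Cross-entropy: negative log-probability given to the true category."""
    return -math.log(predicted_probs[true_category])


def mixed_reconstruction_error(
    record, reconstruction, w_continuous=1.0, w_categorical=1.0
):
    """Weighted sum of per-type errors; the weights decide which part of
    the record the model is pushed hardest to get right."""
    cont = continuous_error(record["age"], reconstruction["age"])
    cat = categorical_error(
        record["department"], reconstruction["department_probs"]
    )
    return w_continuous * cont + w_categorical * cat


# Hypothetical record ("age" assumed scaled to [0, 1]) and its reconstruction.
record = {"age": 0.40, "department": "sales"}
reconstruction = {
    "age": 0.35,
    "department_probs": {"sales": 0.5, "engineering": 0.5},
}

balanced = mixed_reconstruction_error(record, reconstruction)
continuous_only = mixed_reconstruction_error(
    record, reconstruction, w_categorical=0.0
)
```

With `w_categorical=0.0` the error looks small even though the model is only guessing the department, which is the failure mode described above.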
GANs also come from the field of unsupervised learning and belong to the generative family. In terms of architecture, they simultaneously train two neural networks in an adversarial fashion.
There are two components in the GAN architecture, a generator and a discriminator, which compete against each other. The generator produces data as close to the input dataset as possible; the discriminator tries to tell this data apart from the input dataset.
The generator digests random input from some latent distribution and transforms these data points into some other shape without ever directly looking at the real data. The discriminator digests input from the original data or training data and the generator's output, aiming to predict where the input comes from. Both networks are connected in training so that the generator has access to the discriminator’s decision making.
When both networks are trained together, the discriminator learns from patterns in the training data to judge whether a sample looks realistic, while the generator learns to outsmart the discriminator by producing more realistic samples from its random input.
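The opposing goals of the two networks can be sketched as two loss functions. The sketch assumes the discriminator outputs a probability that a sample is real and uses the binary cross-entropy and non-saturating generator losses, which are common choices but only one of several GAN formulations. The networks themselves and the training loop are omitted.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score close to 1,
    generated samples close to 0.

    d_real: discriminator outputs (probabilities) on real samples.
    d_fake: discriminator outputs (probabilities) on generated samples.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator scores
    generated samples as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)
```

A discriminator that cannot tell the two sources apart outputs 0.5 everywhere, giving a discriminator loss of 2 log 2. The generator's loss falls as the discriminator's scores on generated samples rise, which is how the generator benefits from access to the discriminator's decisions.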
The advantage of using GANs for synthetic data generation is that you don’t need to provide a reconstruction error. The discriminator learns the characteristics of the real data.
This approach is particularly interesting for synthetic image generation, where it’s not clear how to express what makes an image realistic as a function. In general, GANs work well with unstructured data.
However, GANs are also more challenging to train than VAEs and require more expertise. For starters, it is not easy to see when to stop training. GANs are also prone to mode collapse, where the generator starts to produce only a small subset of the original data (think of a single image) instead of the full distribution.
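For tabular data, one simple and admittedly crude way to spot this kind of collapse in a categorical column is to check how many of the real categories ever appear in the generated data. The function `category_coverage` below is a hypothetical helper written for this sketch, not an established metric from a library.

```python
def category_coverage(real_values, synthetic_values):
    """Fraction of the categories seen in the real column that also
    appear at least once in the synthetic column."""
    real_categories = set(real_values)
    covered = real_categories & set(synthetic_values)
    return len(covered) / len(real_categories)


# Hypothetical columns: the generator has collapsed onto a single value.
real_column = ["a", "b", "c", "a"]
collapsed_column = ["a", "a", "a", "a"]

coverage = category_coverage(real_column, collapsed_column)
```

A coverage well below 1 on a large enough synthetic sample suggests that the generator is ignoring parts of the original distribution.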
At Statice, we tried a variety of different architectures, as well as methods outside of deep learning. We've realized that it is difficult to find a silver bullet solution that would work for many different cases.
Because we serve different use cases, each with its own data types and needs, we found that the most effective approach was a hybrid one.
We take a synthesis problem and partition it. We split up the data into groups and tackle each group with the most effective model. We then iterate over all the parts while explicitly keeping the conditional dependence on the other parts, which eventually allows us to reassemble the groups on the synthetic data side.
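The general idea of partitioning with conditional dependence can be sketched as follows. This is not Statice's implementation: the column groups are hypothetical, and both groups are modeled with simple counting as a stand-in for whatever model would suit each group best.

```python
import random
from collections import Counter, defaultdict

# Toy "real" table; columns and values are invented for illustration.
real_rows = [
    {"department": "sales", "seniority": "junior", "bonus": "low"},
    {"department": "sales", "seniority": "senior", "bonus": "high"},
    {"department": "engineering", "seniority": "senior", "bonus": "high"},
    {"department": "engineering", "seniority": "junior", "bonus": "low"},
]

GROUP_A = ("department", "seniority")  # hypothetical first partition
GROUP_B = ("bonus",)                   # hypothetical second partition


def project(row, columns):
    """Extract the values of one column group as a tuple."""
    return tuple(row[c] for c in columns)


def fit_partitioned(rows):
    """Model group A on its own and group B conditionally on group A."""
    model_a = Counter(project(r, GROUP_A) for r in rows)
    model_b = defaultdict(Counter)
    for r in rows:
        model_b[project(r, GROUP_A)][project(r, GROUP_B)] += 1
    return model_a, model_b


def draw(counter, rng):
    """Sample one key of a Counter in proportion to its count."""
    values = list(counter.keys())
    weights = list(counter.values())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_partitioned(model_a, model_b, n, rng):
    """Sample group A first, then group B given A, and reassemble the row."""
    rows = []
    for _ in range(n):
        a = draw(model_a, rng)
        b = draw(model_b[a], rng)
        rows.append(dict(zip(GROUP_A + GROUP_B, a + b)))
    return rows


model_a, model_b = fit_partitioned(real_rows)
synthetic_rows = sample_partitioned(model_a, model_b, n=10, rng=random.Random(0))
```

Because the second group is always sampled given the first, the link between seniority and bonus in the toy table carries over to the reassembled synthetic rows.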
This approach offers robust performance with structured data. As a result, our software maintains the statistical properties of the original data to a high degree of similarity.
In addition to producing synthetic data with good utility, partitioning the problem into subproblems allows us to address them in parallel. This helps to scale up to large datasets and makes better use of computational infrastructure.
To generate a synthetic dataset, you learn the joint probability distribution of the real data by means of a generative model, from which you then sample new data. While you can theoretically do this by counting the unique rows in a table, the task becomes more problematic with wider datasets and in cases where you need to capture more complex dependencies.
Deep learning models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) have proven efficient at such tasks. Depending on the type of data, the skills of the user, or the desired outcome, you might want to pick one or the other.
In the end, when generating synthetic data, the most important lesson we’ve learned is that context is key. That’s why at Statice, we opted for a hybrid approach that handles each type of data with the model best suited to its characteristics.