Generate high-fidelity synthetic data with the Statice SDK

In this blog post, we show how to generate synthetic data using the Statice SDK and how to evaluate its utility and its performance for machine learning.

Key take-aways:

With the Statice SDK, we generate synthetic data using deep learning models. We use preset profiles to prioritize speed, utility, or privacy in training.
We use a series of out-of-the-box evaluations to assess synthetic data utility by comparing the original and synthetic datasets on their marginal distributions, pairwise dependencies, and responses to SQL style counting queries.
For ML performance, cross-classification lets us verify that a model trained on our synthetic data performs similarly to one trained on the original dataset.

⬝⬝⬝

Generate synthetic data with the Statice SDK using profiles

Using our synthetic data software, you can process sensitive data to generate artificial data with statistical similarities to the original datasets.

We use deep learning models to generate synthetic data. In summary, models learn the joint probability distribution of an original dataset to sample synthetic data following a similar joint distribution.

After deploying the SDK in your preferred environment, you train a model to generate synthetic data. The software comes with a programmatic interface (API) and a command-line interface (CLI).

For this walkthrough, we use the Adult Data Set, which is available from the UCI Machine Learning Repository. With this dataset, the goal is to predict whether a person's income is higher or lower than $50k/year based on their attributes. The data set has 48,842 observations and 14 features.

The first thing is to create a Model, the high-level API object that orchestrates the synthesization. For that, we can use profiles.

Profiles are a ready-to-use set of parameters that let you quickly prioritize speed, utility, or privacy depending on your needs. There are 4 profiles:

`testing`: to produce very quick results, with less emphasis on the utility of the resulting synthetic data. This profile is well suited to a first synthesization of the data, for quick checks that everything works without hiccups.
`fast`: to obtain useful results in a shorter time.
`accurate`: to produce high-utility synthetic data. Keep in mind that this is slower than the `fast` profile.
`dp`: a ready-to-use profile for Differential Privacy (DP), to generate differentially private synthetic data. Well suited when the original data is sensitive and you need high privacy protection.

To train a model using the Statice SDK API, we use Model.from_dataframe and specify a profile. For this example, we use the fast profile.

With our dataset, it takes around 5 minutes to generate the synthetic data (3 minutes for training and 2 minutes for sampling, using a CPU and a single worker). You can also use a GPU machine to speed up training by up to 5 times, or train over multiple workers.

Once we have synthetic data, there are two aspects we want to check. First, we verify that the statistical properties of the real data are similar to those of the synthetic data. Second, we confirm that the conclusions drawn from statistical or machine learning models trained on synthetic datasets can be applied to real datasets.

How to evaluate your synthetic data utility

The Statice SDK ships with a series of ready-to-use utility evaluations. We first review to what extent the synthetic data maintained our original dataset's properties and statistical information.

We use the Utility Evaluator, which wraps multiple evaluators and provides a high-level view of the utility of our synthetic dataset without disclosing any of the statistical properties.

Utility Evaluator

The Utility Evaluator compares an original and a synthetic dataset on their marginal distributions, pairwise dependencies, and responses to SQL-style counting queries. For each of these three categories, it provides a quality score from 0% to 100%, where 100% denotes the highest utility. These scores measure how similar the original and synthetic data are with respect to the three categories.

The method score() then provides an overall utility score 0%–100% which is the arithmetic average of the three individual scores.
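
As a minimal illustration of that averaging, here is a plain Python sketch. The function name and the three example values are made up for this post and are not SDK output.

```python
from statistics import mean

def overall_utility(marginals: float, dependencies: float, counts: float) -> float:
    """Arithmetic mean of the three category scores, each on a 0-100 scale."""
    return mean([marginals, dependencies, counts])

# Hypothetical category scores, for illustration only.
print(overall_utility(96.0, 93.0, 90.0))
```

Because the three categories carry equal weight, a weak score in one of them pulls the overall score down by a third of the gap.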

To get a breakdown of the individual scores, we can call scores.

Let us see each of these Evaluator scores in detail.

Marginal Distributions

Next, we want to confirm that the marginal distributions are well preserved in the synthetic data. Statice provides a distribution comparison tool for each standard column type: categorical, continuous, discrete, and date.

We generate the marginal distribution evaluation.

The following plots allow us to compare the distribution of categorical features. They show that the data distributions from the original dataset are close to those of the generated dataset. The original label below refers to the real Adult data, while the synthetic label refers to the data generated using Statice’s software.

Fig. 1. Categorical columns: frequency distributions of some of the columns from the Adult dataset.
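
To put a number on how close two categorical distributions are, one common choice is the total variation distance between their relative frequencies. The sketch below uses plain Python and toy columns. The metric is our illustrative choice here and not necessarily what the SDK computes internally.

```python
from collections import Counter

def frequencies(values):
    """Relative frequency of each category in a column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def total_variation_distance(original, synthetic):
    """Half the sum of absolute frequency gaps (0 = identical, 1 = disjoint)."""
    p, q = frequencies(original), frequencies(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Toy columns standing in for a categorical feature (made-up proportions).
original = ["Male"] * 67 + ["Female"] * 33
synthetic = ["Male"] * 65 + ["Female"] * 35

tvd = total_variation_distance(original, synthetic)
print(f"total variation distance: {tvd:.3f}")
```

A value near 0 means the synthetic column reproduces the category frequencies of the original column.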

For continuous numerical data, we can compare the histogram approximations of the original and synthetic data distributions. The columns in the data set with continuous data (age, fnlwgt) are highly skewed. As we can see below, this aspect is nicely preserved in the synthetic data; all distributions are very close to those of the original data.

We generate continuous marginal evaluations as follows:

Fig. 2. Frequency distributions of some of the columns from the Adult data set.

Pairwise Dependency Evaluator

We then look at whether the synthesization preserved the dependencies between columns. For this, we use another evaluator, the Pairwise Dependency Evaluator.

It does so by computing the correlation among numerical columns, the mutual information among categorical columns, and the correlation ratio between categorical and numerical columns. All three methods are indicators of the amount of information that can be inferred from one column about the other.
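
For readers who want to see what these three measures look like, here is a minimal plain Python sketch using their textbook definitions. The function names and toy columns are ours, and the SDK's own implementation may differ (for example in normalization or binning).

```python
import math
from collections import Counter, defaultdict

def pearson(x, y):
    """Linear correlation between two numerical columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mutual_information(a, b):
    """Mutual information (in nats) between two categorical columns."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (pa[u] * pb[v]))
        for (u, v), c in pab.items()
    )

def correlation_ratio(categories, values):
    """Correlation ratio (eta): square root of the between-group share of variance."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    grand_mean = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total)

# Toy columns, made up for illustration.
age = [25, 32, 47, 51, 62, 38]
hours = [20, 40, 45, 50, 30, 40]
education = ["HS", "BSc", "BSc", "MSc", "HS", "BSc"]
label = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]

print(pearson(age, hours))
print(mutual_information(education, label))
print(correlation_ratio(label, age))
```

Computing each measure on the original and on the synthetic data, for the same column pairs, gives the two matrices that the heatmaps compare.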

To generate the correlation heatmap we run:

Fig. 3. Correlation matrices for some numerical columns in the Adult dataset (left; original) and the synthetic dataset (right; synthetic).

The heat map above shows that there is little correlation between the columns in either dataset.

To generate the mutual information heatmap we run:

Fig. 4. Pairwise mutual information matrices for categorical columns in the Adult data set (left; original) and the synthetic dataset (right; synthetic).

The colors in the heatmap already reveal how well the synthetic dataset mirrors the pairwise dependencies. Dark orange squares represent high pairwise mutual information (PMI), and light squares represent low PMI. The two matrices show closely matching patterns, so the mutual information between the features is preserved.

Of all the dependency measures we have to estimate, the correlation ratio is probably the trickiest. We present it in a heatmap, the same way we presented the correlations between numerical columns and the mutual information between categorical columns.

To generate the correlation ratio heatmap we run:

In the figure below, we can see that we kept the correlations between the different columns.

Fig. 5. Correlation ratio matrices for some categorical and numerical columns of the Adult data set (left; original) and the synthetic dataset (right; synthetic).

To get a more compact picture, we can plot the pairwise dependency values in the original and synthetic data against each other.

The closer the displayed points are to the diagonal, the closer the original and synthetic values are to each other. This is measured quantitatively by the mean absolute error, that is, the mean absolute difference between the original and the synthetic pairwise dependency values. The closer this value is to zero, the better the pairwise dependencies are preserved. We can consider anything below 0.03 a great fit.
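
As a small worked example of that error measure, the sketch below computes it for four hypothetical dependency values. The numbers are invented for illustration and are not results from the Adult dataset.

```python
def mean_absolute_error(original, synthetic):
    """Average absolute gap between matching pairwise dependency values."""
    if len(original) != len(synthetic):
        raise ValueError("both lists must cover the same column pairs")
    return sum(abs(o - s) for o, s in zip(original, synthetic)) / len(original)

# Hypothetical dependency values for the same four column pairs.
original_values = [0.08, 0.12, 0.45, 0.30]
synthetic_values = [0.07, 0.15, 0.43, 0.28]

mae = mean_absolute_error(original_values, synthetic_values)
print(f"mean absolute error: {mae:.3f}")
```

Here the individual gaps are 0.01, 0.03, 0.02, and 0.02, so the error lands below the 0.03 threshold mentioned above.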

Fig. 6. Scatter plots of pairwise dependency evaluations (correlation, mutual information, and correlation ratio).

Aggregated Statistics

These evaluations are a collection of tools that measure the data quality by comparing the outcome of business intelligence style queries as applied to the original and the synthetic data. The closer the query results are to each other, the more utility is preserved by the synthetic dataset.

Our Counts Evaluator measures the conservation of high dimensional relationships by comparing the outcome of SQL-like counting queries.

For a visual comparison, we can plot the original and synthetic query counts against each other. The utility is well preserved if the dots are close to the dashed diagonal line. The correlation between the synthetic and original counts is computed and displayed in the bottom right corner as r. The closer this number is to 1, the more closely the synthetic data resembles the original. Anything above 0.98 is a great fit.
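
The idea behind random counting queries can be sketched in plain Python. Everything below is an assumption made for illustration: the toy table, the way queries are drawn, and the helper names are ours, and the "synthetic" table is simply a second random draw rather than the output of a trained model.

```python
import math
import random

def make_rows(n, rng):
    """Toy stand-in for a table with three categorical columns."""
    return [
        {
            "sex": rng.choices(["Male", "Female"], weights=[67, 33])[0],
            "workclass": rng.choices(["Private", "Self-emp", "Gov"], weights=[70, 15, 15])[0],
            "label": rng.choices(["<=50K", ">50K"], weights=[76, 24])[0],
        }
        for _ in range(n)
    ]

def random_query(rows, n_columns, rng):
    """Pick a few columns and use one random row's values as the filter."""
    columns = rng.sample(sorted(rows[0]), n_columns)
    template = rng.choice(rows)
    return {column: template[column] for column in columns}

def count_matches(rows, query):
    """Equivalent of SELECT COUNT(*) ... WHERE col1 = v1 AND col2 = v2."""
    return sum(all(row[c] == v for c, v in query.items()) for row in rows)

def pearson(x, y):
    """Pearson correlation between two lists of counts."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(0)
original = make_rows(1000, rng)
synthetic = make_rows(1000, rng)  # stand-in for data sampled from a trained model

queries = [random_query(original, 2, rng) for _ in range(50)]
original_counts = [count_matches(original, q) for q in queries]
synthetic_counts = [count_matches(synthetic, q) for q in queries]

r = pearson(original_counts, synthetic_counts)
print(f"r = {r:.3f}")
```

Each query filters on several columns at once, which is why matching counts say something about higher-dimensional relationships and not only about single columns.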

Because we cannot see the content of our customers' data, the CubesEvaluator is very useful: customers know their data best and know what needs to be checked, so they can fit it to their use case, run queries to validate custom KPIs, and ensure the synthetic data utility. The CubesEvaluator is an extension of the CountsEvaluator. The queries are not generated randomly, but as the exhaustive combinations of the specified columns ("cubes"), as in a SQL query that counts rows grouped by those columns.

The original and synthetic query counts can be plotted against each other. The utility is well preserved if the dots are close to the dashed diagonal line. As a measure of fit, the correlation between the synthetic and original counts is computed and displayed in the bottom right corner as r.
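
To make the "exhaustive combinations" idea concrete, here is a small plain Python sketch. The helper name, the toy rows, and the SQL in the docstring (including the table name) are hypothetical and only illustrate the principle; they are not the CubesEvaluator's interface.

```python
from collections import Counter
from itertools import product

def cube_counts(rows, columns, domains):
    """Count rows for every value combination of the given columns.

    Roughly what this hypothetical SQL returns, plus explicit zero counts:
    SELECT sex, label, COUNT(*) FROM adult GROUP BY sex, label
    """
    observed = Counter(tuple(row[c] for c in columns) for row in rows)
    return {combo: observed[combo] for combo in product(*domains)}

columns = ["sex", "label"]
domains = [["Female", "Male"], ["<=50K", ">50K"]]

# Toy rows, made up for illustration.
original = [
    {"sex": "Male", "label": ">50K"},
    {"sex": "Male", "label": "<=50K"},
    {"sex": "Male", "label": "<=50K"},
    {"sex": "Female", "label": "<=50K"},
    {"sex": "Female", "label": ">50K"},
]
synthetic = [
    {"sex": "Male", "label": ">50K"},
    {"sex": "Male", "label": "<=50K"},
    {"sex": "Female", "label": "<=50K"},
    {"sex": "Female", "label": "<=50K"},
    {"sex": "Male", "label": "<=50K"},
]

original_cube = cube_counts(original, columns, domains)
synthetic_cube = cube_counts(synthetic, columns, domains)
for combo in original_cube:
    print(combo, original_cube[combo], synthetic_cube[combo])
```

Listing every combination, including those with a count of zero, makes it visible when the synthetic data misses a combination entirely or invents one that never occurs in the original.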

Conditional distributions

We can investigate more complex relationships in the data by looking at conditional distributions. The plot_conditional_distributions() function allows visualizing the distribution of a continuous column as conditioned on categorical columns by displaying a different distribution for each value combination of the condition columns.

The resulting figure shows a violin plot for each combination of “label” and “age”. Each violin shows both the original (blue) and the synthetic (orange) distribution. The more symmetric a violin is about its center line, the closer the synthetic distribution is to the original.
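
A numerical counterpart to such a plot is to compare a summary statistic of the continuous column within each group. The sketch below does this with medians on toy rows. The helper name and the data are ours and are unrelated to plot_conditional_distributions() itself.

```python
from collections import defaultdict
from statistics import median

def conditional_medians(rows, value_column, condition_column):
    """Median of a continuous column within each value of a categorical column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[condition_column]].append(row[value_column])
    return {key: median(values) for key, values in groups.items()}

# Toy rows, made up for illustration.
original = [
    {"age": 25, "label": "<=50K"}, {"age": 31, "label": "<=50K"}, {"age": 40, "label": "<=50K"},
    {"age": 42, "label": ">50K"}, {"age": 50, "label": ">50K"}, {"age": 58, "label": ">50K"},
]
synthetic = [
    {"age": 24, "label": "<=50K"}, {"age": 33, "label": "<=50K"}, {"age": 39, "label": "<=50K"},
    {"age": 44, "label": ">50K"}, {"age": 49, "label": ">50K"}, {"age": 60, "label": ">50K"},
]

original_medians = conditional_medians(original, "age", "label")
synthetic_medians = conditional_medians(synthetic, "age", "label")
for key in original_medians:
    print(key, original_medians[key], synthetic_medians[key])
```

A single statistic per group is much coarser than a full violin plot, but it is easy to track across many condition columns.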

Synthetic data and Machine Learning performance

As previously mentioned, training machine learning models is a great use case for synthetic data. Today, finding significant volumes of compliant data to train machine learning models is challenging in many industries. Synthetic data can be a useful asset to feed data-hungry systems.

However, this raises one important question: how well do ML models perform with synthetic data?

As we did for utility, we can assess the performance of models trained with synthetic data using a set of evaluations. A common technique for this is cross-classification. It helps to see how well a synthetic dataset captures the statistical dependence structures existing in the real data. It works in two phases:

First: you train your model on the real data and test on hold-out data from real and synthetic datasets.
Second: you train your model on the synthetic data and test on hold-out data from real and synthetic data.

To evaluate the suitability of our synthetic data for machine learning, we have done the following:

We split the original data into two datasets: an original training set and an original test set.
We generated synthetic data from the original training set without looking at the original test set.

We trained 2 models:

Synthetic model: a model trained on synthetic data and tested on real data (orange line in Fig. 11)
Original model: a model trained on original data and tested on another subset of the original data that was not included in training (blue line in Fig. 11)

We use decision-tree-based models (random forests). We evaluate both models using the area under the ROC curve (AUC), where ROC stands for receiver operating characteristic. The ROC curve is a graph showing the performance of a classification model at all classification thresholds.

The curve plots two quantities against each other:

True positive rate: the proportion of those who received a positive result on this test out of those who actually have the condition (e.g., income higher than 50K).
False positive rate: the proportion of those who received a positive result on this test out of those who do not actually have the condition.
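
Both rates follow from the counts of a confusion matrix at a single classification threshold. Here is a minimal plain Python sketch, with made-up labels and predictions where 1 stands for an income above 50K.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

# Made-up labels and predictions: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tpr, fpr = rates(y_true, y_pred)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Two of the three actual positives are predicted positive, and one of the five actual negatives is wrongly predicted positive. Sweeping the threshold and recording these two rates at each step traces out the ROC curve.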

In a ROC plot, the diagonal represents a classifier with a random performance level. It separates the space into two areas, for good and poor performance levels. Both of our curves lie above the diagonal, meaning that both models perform well. Likewise, the higher the AUC, the better the model. The figure below shows that the model trained on synthetic data and tested on real data performs similarly to the original model.
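
The AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which gives a short way to compute it without plotting. The sketch below uses that rank-based definition in plain Python. The labels and the two sets of scores are invented for illustration and are not our experiment's results.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical test labels and predicted scores from two models.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
original_model_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.2, 0.1]
synthetic_model_scores = [0.85, 0.7, 0.45, 0.5, 0.35, 0.3, 0.1, 0.2]

auc_original = roc_auc(y_true, original_model_scores)
auc_synthetic = roc_auc(y_true, synthetic_model_scores)
print(f"original model AUC:  {auc_original:.3f}")
print(f"synthetic model AUC: {auc_synthetic:.3f}")
```

A model that assigns the same score to every example gets an AUC of 0.5 under this definition, which corresponds to the random baseline, and a model that ranks every positive above every negative gets 1.0.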

Fig. 11. ROC curves for the original model and the synthetic-original model.

In our case, machine learning models can be trained on synthetic data with minimal loss in prediction accuracy.

⬝⬝⬝

In this post, we saw how to generate synthetic data with the Statice SDK and how to evaluate the quality of that synthetic data from the utility and ML performance points of view. The ability to generate data with high fidelity is crucial if we want our models to draw patterns, extract insights, and generate high-value predictions that can guide better decisions and smart actions.

We didn’t cover the privacy aspect in this post, which is important to keep in mind. Synthetic data is not inherently private by default. Users need to implement additional protection mechanisms to protect the data subjects. On that subject, we recommend reading our post on protecting the privacy of synthetic data and this great interview with privacy machine learning expert Franziska Boenisch.
