In this blog post, we show how to generate synthetic data using the Statice SDK and evaluate the synthetic data utility and its performance for Machine Learning.
Using our synthetic data software, you can process sensitive data to generate artificial data with statistical similarities to the original datasets.
We use deep learning models to generate synthetic data. In short, a model learns the joint probability distribution of an original dataset and then samples synthetic data that follows a similar joint distribution.
After deploying the SDK in your preferred environment, you train a model to generate synthetic data. The software comes with a programmatic interface (API) and a command-line interface (CLI).
For this walkthrough, we use the Adult Data Set, which is available on the UCI website. With this dataset, the goal is to predict whether a person's income is higher or lower than $50k/year based on their attributes. The data set has 48,842 observations and 14 features.
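To make the walkthrough concrete, here is a minimal sketch of loading the data with pandas. The column names follow the UCI documentation; the two inline rows are sample records in the Adult schema, and in practice you would point read_csv at the adult.data file downloaded from UCI.

```python
import io
import pandas as pd

# Column names as documented on the UCI page for the Adult Data Set.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "label",
]

# Two sample rows in the comma-separated adult.data format; in practice,
# pass the path of the downloaded UCI file to read_csv instead.
raw = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse,"
    " Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)
df = pd.read_csv(raw, names=columns, skipinitialspace=True)

# Binary target: is the income above $50k/year?
df["label"] = (df["label"] == ">50K").astype(int)
```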
The first thing is to create a Model, the high-level API object that orchestrates the synthesization. For that, we can use profiles.
Profiles are ready-to-use sets of parameters that let you quickly prioritize speed, utility, or privacy depending on your needs; four profiles are available.
To train a model using the Statice SDK API, we use Model.from_dataframe and specify a profile. For this example, we use the fast profile.
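As a rough sketch, the call looks like the following. Note that only Model.from_dataframe and the fast profile come from this walkthrough; the import, the sampling call, and its arguments are assumptions, not the SDK's documented surface.

```python
# Hypothetical sketch -- only Model.from_dataframe and profile="fast"
# are taken from this post; the rest of the call surface is assumed.
model = Model.from_dataframe(df, profile="fast")
synthetic_df = model.sample(...)  # sampling call and arguments are assumptions
```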
With our dataset, it takes us around 5 minutes (3 minutes for training and 2 minutes for sampling using CPU and a single worker) to generate the synthetic data. You can also use a GPU machine to speed up the training up to 5 times or train over multiple workers.
Once we have synthetic data, there are two aspects we want to check. First, we verify that the statistical properties of the real data are similar to those of the synthetic data. Secondly, we confirm that the scientific conclusions drawn from statistical/machine learning models trained on synthetic datasets can be applied to real datasets.
The Statice SDK ships with a series of ready-to-use utility evaluations. We first review to what extent the synthetic data maintained our original dataset's properties and statistical information.
We use the Utility Evaluator, which wraps multiple evaluators and provides a high-level view on the utility of our synthetic dataset without disclosing any of the statistical properties.
The Utility Evaluator compares an original and a synthetic dataset on their marginal distributions, pairwise dependencies, and responses to SQL-style counting queries. For each of these three categories, it provides a quality score from 0% to 100%, where 100% denotes the highest utility. These scores measure how similar the original and synthetic data are with respect to the three categories.
The score() method then provides an overall utility score from 0% to 100%, computed as the arithmetic average of the three individual scores.
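The averaging itself is simple to illustrate. The three category scores below are made-up example values, not output from the SDK:

```python
# Example values only -- stand-ins for the three category scores
# (marginals, pairwise dependencies, counting queries).
category_scores = {"marginals": 0.92, "pairwise": 0.88, "counts": 0.96}

# The overall utility score is their arithmetic average.
overall = sum(category_scores.values()) / len(category_scores)
print(f"overall utility: {overall:.0%}")  # → overall utility: 92%
```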
For a more detailed breakdown, we can call scores to see the individual scores.
Let us see each of these Evaluator scores in detail.
Next, we want to confirm that the marginal distributions are well preserved in the synthetic data. Statice provides a distribution comparison tool for each standard column type: categorical, continuous, discrete, and date.
We generate the marginal distribution evaluation.
The following plots allow us to compare the distribution of categorical features. They show that the data distributions from the original dataset are close to those of the generated dataset. The original label below refers to the real Adult data, while the synthetic label refers to the data generated using Statice’s software.
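Under the hood, comparing categorical marginals amounts to comparing normalized value counts. The following is a generic pandas sketch, not the SDK call; the two series are made-up stand-ins for an original and a synthetic "sex" column:

```python
import pandas as pd

# Made-up stand-ins for the original and synthetic "sex" column.
original = pd.Series(["Male"] * 670 + ["Female"] * 330)
synthetic = pd.Series(["Male"] * 655 + ["Female"] * 345)

# Normalized frequencies, side by side.
p = original.value_counts(normalize=True)
q = synthetic.value_counts(normalize=True)
comparison = pd.concat({"original": p, "synthetic": q}, axis=1).fillna(0)

# Total variation distance: 0 means identical marginals, 1 means disjoint.
tv = 0.5 * (comparison["original"] - comparison["synthetic"]).abs().sum()
```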
For continuous numerical data, we can compare the histogram approximations of the original and synthetic data distributions. The columns in the data set with continuous data (age, fnlwgt) are highly skewed. As we can see below, this aspect is nicely preserved in the synthetic data; all distributions are very close to those of the original data.
We generate continuous marginal evaluations as follows:
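The SDK call itself is not reproduced here. As a stand-in, the comparison for a continuous column boils down to building histograms over shared bins and measuring how far apart they are; the lognormal samples below are made-up, skewed data resembling an age column:

```python
import numpy as np

def histogram_tv(original, synthetic, bins=20):
    """Total variation distance between two samples, using shared histogram bins."""
    lo = min(original.min(), synthetic.min())
    hi = max(original.max(), synthetic.max())
    p, _ = np.histogram(original, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synthetic, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
ages = rng.lognormal(mean=3.6, sigma=0.3, size=5000)      # skewed, like "age"
ages_syn = rng.lognormal(mean=3.6, sigma=0.3, size=5000)  # stand-in synthetic sample
```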
We then look at whether the synthesization preserved the dependencies between columns, using another evaluator, the Pairwise Dependency Evaluator.
It does so by computing the correlation among numerical columns, the mutual information among categorical columns, and the correlation ratio between categorical and numerical columns. All three methods are indicators of the amount of information that can be inferred from one column about the other.
To generate the correlation heatmap we run:
The heat map above shows that there is little correlation between the columns in either dataset.
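Conceptually, the numerical comparison behind this heatmap can be sketched with plain pandas (this is not the SDK call, and the two frames below are made-up stand-ins for the real and generated data):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the original and synthetic numerical columns.
rng = np.random.default_rng(0)
df_original = pd.DataFrame({"age": rng.normal(40, 12, 1000)})
df_original["hours-per-week"] = df_original["age"] * 0.2 + rng.normal(0, 5, 1000)
df_synthetic = df_original + rng.normal(0, 0.5, df_original.shape)

# Correlation matrices of both frames, and how far apart they are on average.
corr_orig = df_original.corr()
corr_syn = df_synthetic.corr()
mae = (corr_orig - corr_syn).abs().to_numpy().mean()
```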
To generate the mutual information heatmap we run:
The colors in the heatmap already reveal how well the synthetic dataset mirrors the pairwise dependencies: dark orange squares represent high mutual information, and light squares represent low mutual information. It is clear that the mutual information between the features is preserved.
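For reference, the mutual information between two categorical columns can be computed with scikit-learn and normalized to [0, 1] so it is comparable across column pairs. This is a generic sketch, not the SDK's implementation, and the two columns are made-up examples:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def normalized_mi(x, y):
    """Mutual information scaled to [0, 1] via the geometric mean of the entropies."""
    mi = mutual_info_score(x, y)
    hx = mutual_info_score(x, x)  # MI of a column with itself is its entropy
    hy = mutual_info_score(y, y)
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

# Made-up categorical columns in the spirit of the Adult dataset.
education = ["HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors", "HS-grad"]
occupation = ["Craft", "Exec", "Craft", "Exec", "Exec", "Craft"]
```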
Of all the dependency measures we have to estimate, the correlation ratio is probably the trickiest. We present it in a heatmap, the same way we presented the numerical correlations and the categorical mutual information.
To generate the correlation ratio heatmap we run:
In the figure below, we can see that we kept the correlations between the different columns.
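The correlation ratio between a categorical and a numerical column measures how much of the numerical column's variance is explained by the category. A small self-contained implementation (not the SDK's) looks like this:

```python
import numpy as np

def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical column.

    0 means the category tells us nothing about the value;
    1 means the value is fully determined by the category.
    """
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    # Between-group sum of squares: variance explained by the category.
    ss_between = sum(
        (categories == c).sum() * (values[categories == c].mean() - grand_mean) ** 2
        for c in np.unique(categories)
    )
    ss_total = ((values - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0
```

For example, `correlation_ratio(["a", "a", "b", "b"], [1, 1, 5, 5])` is 1 (the value is determined by the category), while `correlation_ratio(["a", "a", "b", "b"], [1, 5, 1, 5])` is 0.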
To get a more compact picture, we can plot the pairwise dependency values in the original and synthetic data against each other.
The closer the displayed points are to the diagonal, the closer the original and synthetic values are to each other. This is measured quantitatively by the mean absolute error (MAE), that is, the mean absolute difference between the original and the synthetic pairwise dependency values. The closer this value is to zero, the better the pairwise dependencies are preserved. We can consider anything below 0.03 a great fit.
These evaluations are a collection of tools that measure the data quality by comparing the outcome of business intelligence style queries as applied to the original and the synthetic data. The closer the query results are to each other, the more utility is preserved by the synthetic dataset.
Our Counts Evaluator measures the conservation of high dimensional relationships by comparing the outcome of SQL-like counting queries.
For a visual comparison, we can plot the original and synthetic query counts against each other. The utility is well preserved if the dots are close to the dashed diagonal line. The correlation between the synthetic and original counts is computed and displayed in the bottom right corner as r. The closer this number is to 1, the better the synthetic data matches the original. Anything above 98% is a great fit.
Because we cannot see the content of our customers' data, the CubesEvaluator is especially useful for fitting their use case: customers know their data best, and they can run queries to validate custom KPIs and ensure the synthetic data's utility. The CubesEvaluator extends the CountsEvaluator: instead of generating queries randomly, it generates them as the exhaustive value combinations of the specified columns ("cubes"), just like a SQL GROUP BY query over those columns.
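The SQL query from the original post is not reproduced here, but the pandas equivalent of such an exhaustive GROUP BY, along with the count correlation r, can be sketched as follows (the two tiny frames are made-up examples):

```python
import numpy as np
import pandas as pd

def cube_counts(df, columns):
    """Counts per value combination of the given columns --
    the pandas equivalent of SELECT col1, col2, COUNT(*) ... GROUP BY col1, col2."""
    return df.groupby(columns).size()

# Made-up original and synthetic frames.
original = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "label": [1, 0, 0, 0, 1, 1],
})
synthetic = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male", "Male"],
    "label": [1, 0, 0, 0, 1, 0],
})

# Align the two count vectors on the same combinations (missing ones count as 0).
counts = pd.concat(
    {"original": cube_counts(original, ["sex", "label"]),
     "synthetic": cube_counts(synthetic, ["sex", "label"])},
    axis=1,
).fillna(0)

# Pearson correlation between original and synthetic query counts ("r" in the plots).
r = np.corrcoef(counts["original"], counts["synthetic"])[0, 1]
```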
The original and synthetic query counts can be plotted against each other. The utility is well preserved if the dots are close to the dashed diagonal line. As a measure of fit, the correlation between the synthetic and original counts is computed and displayed in the bottom right corner as r.
We can investigate more complex relationships in the data by looking at conditional distributions. The plot_conditional_distributions() function visualizes the distribution of a continuous column conditioned on categorical columns, displaying a separate distribution for each value combination of the condition columns.
The figure shows a violin plot for each combination of “label” and “age”. Each violin overlays the original (blue) and synthetic (orange) distributions; the more symmetric a violin is about its center line, the closer the synthetic distribution is to the original.
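A generic matplotlib version of such a side-by-side violin comparison (not the SDK's plot_conditional_distributions(); the two samples are made-up stand-ins for one condition value) can be sketched as:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins: ages for one condition value (e.g. label == ">50K")
# in the original and synthetic data.
rng = np.random.default_rng(0)
original_ages = rng.normal(44, 10, 500)
synthetic_ages = rng.normal(44, 11, 500)

fig, ax = plt.subplots()
parts = ax.violinplot([original_ages, synthetic_ages], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["original", "synthetic"])
ax.set_ylabel("age")
```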
As previously mentioned, the training of machine learning models is a great use-case candidate for synthetic data. Today, finding significant volumes of compliant data to train machine learning models is challenging in many industries. Synthetic data can be a useful asset to feed data-hungry systems.
However, this raises one important question: how well do ML models perform with synthetic data?
As we did for utility, we can assess the performance of models trained with synthetic data using a set of evaluations. A common technique for this is cross-classification, which shows how well a synthetic dataset captures the statistical dependence structures present in the real data. It works in two phases: first, a model is trained on the synthetic data; second, that model is evaluated on real, held-out data and compared against a model trained on the real data.
To evaluate the suitability of our synthetic data for machine learning, we trained two models: one on the original data and one on the synthetic data, and evaluated both on the same real test set.
We use decision-tree-based models (random forests) and evaluate both models using the area under the ROC curve (AUC-ROC). The ROC (receiver operating characteristic) curve is a graph showing the performance of a classification model at all classification thresholds.
The curve plots two axes: the true positive rate on the y-axis against the false positive rate on the x-axis.
The diagonal of a ROC plot represents a classifier with random performance; it separates the space into areas of good (above) and poor (below) performance. Both of our curves lie on the upper side, meaning that both models perform well, and the higher the AUC, the better the model. The figure below shows that the model trained on synthetic data and tested on real data performs similarly to the model trained on the original data.
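The comparison can be sketched with scikit-learn. Everything below is a stand-in: make_classification replaces the Adult data, and a noisy copy of the training set replaces the SDK-generated synthetic data, so the numbers only illustrate the mechanics of train-on-synthetic, test-on-real:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# "Real" data stand-in for the Adult dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in for the synthetic training set: a noisy copy of the real training
# data (a real run would use the generated synthetic data instead).
rng = np.random.default_rng(0)
X_syn = X_train + rng.normal(scale=0.1, size=X_train.shape)

def auc_on_real_test(X_tr, y_tr):
    """Train a random forest and score it on the held-out real test set."""
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

auc_real = auc_on_real_test(X_train, y_train)  # train on real, test on real
auc_syn = auc_on_real_test(X_syn, y_train)     # train on "synthetic", test on real
```

If the synthetic data preserved the dependence structure, auc_syn should land close to auc_real.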
In our case, machine learning models can be trained on synthetic data with minimal loss in prediction accuracy.
In this post, we saw how we could generate synthetic data with the Statice SDK and evaluate the quality of that synthetic data from the utility and the ML performance point of view. The ability to generate data with high fidelity is crucial if we want our models to draw patterns, extract insights, and generate high-value predictions that can guide better decisions and smart actions.
We didn’t cover the privacy aspect in this post, but it is important to keep in mind: synthetic data is not private by default, and users need to implement additional protection mechanisms to protect the data subjects. On that subject, we recommend reading our post on protecting the privacy of synthetic data and this great interview with privacy machine learning expert Franziska Boenisch.
Contact us and get feedback instantly.