How do you generate statistically accurate synthetic data and 3 other questions from our webinar

October 2, 2020

Our team recently held a webinar on synthetic data in the financial industry, together with Sven Guhr from Sopra Steria. We discussed the challenges and costs associated with data inertia for financial institutions. Additionally, we presented how synthetic data can help data teams tackle internal data restrictions, along with real-life applications for synthetic financial data. You can access the on-demand version in the webinar section of our website. Below, we answer some of the most interesting questions we received from the audience. 

  • How long would it take to create synthetic data, for a dataset of one million customer data records, for example?
  • From the perspective of complying with the GDPR, which tests must a data officer run before releasing a synthetic dataset? 
  • When generating synthetic data, how can we keep the statistical characteristics present in the original data?
  • What are the pitfalls to consider while using synthetic data?


How long would it take to create synthetic data, for a dataset of one million customer data records, for example?


The answer is that it depends. Multiple factors play a role, but it mostly comes down to the following two points: 

  • The dimensionality of the data, i.e., the number of attributes each customer has and how many possible values these attributes can take, and the complexity of the relationships between attributes.
  • The computational resources available: the Statice software supports parallelization and can leverage multiple CPUs and GPUs. 

But to give you an idea, a customer from the finance industry recently installed Statice and generated synthetic data in less than half a day. Kicking off a full project usually takes between one and four weeks. Generating synthetic data in itself can be swift. 


From the perspective of complying with the GDPR, which tests must a data officer run before releasing a synthetic dataset? 


The GDPR states that data guaranteed to be anonymous isn't subject to personal data processing requirements. This means that processing synthetic data does not require the same legal basis the original data would if it contained personal information. Synthetic data generated by Statice comes with anonymity guarantees and can thus be used more freely for various purposes.


Enterprise-ready synthetic data solutions should provide data officers and CISOs with ways of assessing the generated data’s privacy. It’s important to understand both the logic and the privacy guarantees that a technological approach offers, so it’s common to look at how the artificial data is generated and to review the various mechanisms that guarantee its anonymity.  


At Statice, we do this first by having our technology satisfy the definition of differential privacy. This mathematical framework provides provable guarantees about the anonymity of the data. On top of this, we run privacy attacks on the generated data to further verify its safety.
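To give an intuition for differential privacy (this is a generic textbook illustration, not Statice's actual mechanism): a common building block is the Laplace mechanism, which adds calibrated noise to a query so that any single individual's presence or absence barely changes the result. The `dp_count` helper below is a hypothetical name for this sketch.

```python
import numpy as np

def dp_count(data, predicate, epsilon):
    """Differentially private count: true count plus Laplace noise.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon satisfies epsilon-differential privacy.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy example: how many customers are 40 or older?
ages = [23, 37, 45, 29, 61, 52, 33, 47]
noisy_count = dp_count(ages, lambda a: a >= 40, epsilon=1.0)
```

A smaller `epsilon` means more noise and stronger privacy; the true count here is 4, and the released value fluctuates around it.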

When generating synthetic data, how can we keep the statistical characteristics present in the original data?


Every dataset can be thought of as a ‘realization’ of an underlying probability distribution. This mathematical entity tells you how probable (or improbable) any given set of data attributes is, and therefore how likely it is to appear in the data. All the data's statistical properties are defined and can be derived from this distribution.
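A minimal illustration of this idea (the distribution and attribute name are assumptions chosen for the example): if account balances follow a log-normal distribution, then any concrete dataset is just one sample from it, and statistics such as the mean can be recovered from a sufficiently large realization.

```python
import numpy as np

rng = np.random.default_rng(42)

# Treat a dataset of "account balances" as one realization of an
# underlying log-normal distribution (an assumption for illustration).
balances = rng.lognormal(mean=7.0, sigma=0.5, size=10_000)

# Statistical properties (mean, quantiles, ...) are defined by the
# distribution; a large enough sample approximates them well.
# Theoretical mean of lognormal(7, 0.5) is exp(7 + 0.5**2 / 2) ~ 1243.
sample_mean = float(np.mean(balances))
```

Any other dataset drawn from the same distribution, synthetic or not, would exhibit very similar statistics.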


Statice’s solution uses deep-learning algorithms to model the probability distribution that generated the original dataset. In this learning process, several techniques are employed to make the model capture the underlying relationships between all the attributes. Differential privacy is also employed to make sure that the learning is not influenced by information specific to any individual.


At the end of the training process, our model has learned a good approximation of the probability distribution underlying the original data. From this, arbitrarily many synthetic samples can be generated. Because the synthetic data is drawn from a distribution very similar to the original one, it preserves, to a high degree, the statistical properties of the original data.
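The fit-then-sample idea can be sketched with a deliberately simple stand-in for a deep generative model: here we fit a Gaussian (mean and covariance) to toy data and then draw as many synthetic records as we like. The attribute names and parameters are made up for illustration; Statice's actual models are far more expressive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "original" data: two correlated attributes, e.g. age and income.
original = rng.multivariate_normal(
    mean=[40, 55_000],
    cov=[[100, 50_000], [50_000, 1e8]],  # correlation ~0.5
    size=5_000,
)

# Stand-in for model training: estimate the distribution's parameters.
mu = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# Once fitted, arbitrarily many synthetic records can be drawn.
synthetic = rng.multivariate_normal(mu, cov, size=20_000)
```

Note that the synthetic sample can be larger than the original one, and it reproduces the correlation between the two attributes because that relationship is part of the fitted distribution.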


Once you have the synthetic data, Statice's tool embeds evaluations to assess its utility. You can compare the attribute distributions and dependencies between the original and the synthetic data to verify that the relevant statistical characteristics are preserved.
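One common way to compare two attribute distributions (a generic technique, not necessarily the metric Statice's evaluations use) is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distribution functions of the two samples.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    gap between the empirical CDFs of samples a and b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(1)
original = rng.normal(50, 10, size=5_000)
synthetic = rng.normal(50, 10, size=5_000)  # stands in for generated data

distance = ks_statistic(original, synthetic)  # near 0 -> very similar
```

A value close to 0 indicates the synthetic attribute closely matches the original one; repeating this per attribute (plus pairwise dependency checks) gives a simple utility report.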

Statice's deep generative models learn the original dataset's statistical properties to produce synthetic data that mimics the original. Privacy protection is ensured by training the models with differential privacy.   


What are the pitfalls to consider while using synthetic data?


There are a couple. First, there are many ways to generate synthetic data, with different approaches and models to choose from. Having tried several of them, we learned that it's challenging to find a model that works with multiple data types and generates use-case-agnostic data. In an enterprise context, you need something that works consistently across different data types and structures, so it's essential to consider which information you will be working with in order to pick the right approach.


Another element to consider is the size of the original dataset. The more records a dataset contains, the better the synthetic data can reflect its statistical properties. The opposite also holds: with a very sparse dataset, it's more challenging to generate synthetic data that both respects privacy and retains the original data's statistical value. In that case, you usually end up sacrificing much of that statistical value.
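The size effect is easy to demonstrate with a toy experiment (purely illustrative numbers): estimating the same statistic from a small and a large dataset drawn from one distribution, the larger dataset tracks the true value much more closely, which is exactly what a generative model relies on.

```python
import numpy as np

rng = np.random.default_rng(7)
true_mean = 100.0

# Two "datasets" from the same distribution, differing only in size.
small = rng.normal(true_mean, 20, size=50)
large = rng.normal(true_mean, 20, size=50_000)

# Estimation error shrinks roughly like 1/sqrt(n) with dataset size.
err_small = abs(float(small.mean()) - true_mean)
err_large = abs(float(large.mean()) - true_mean)
```

With 1,000 times more records, the expected error is smaller by a factor of about sqrt(1000) ~ 32; a sparse dataset gives a model correspondingly less to learn from.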

  

Access the complete webinar on-demand


Wondering if this is for you?

Book a demo