Generating privacy-preserving synthetic data with the Statice software

This post presents the Statice software, with more details about its technical features and the properties of the synthetic data you can generate with it.

Key take-aways:

Data access is a strategic topic for enterprises. We built our software so teams could be more agile and process data in a compliant way.
Our software's core is an SDK that lets you generate privacy-preserving synthetic data, usable as a drop-in replacement in your data processing activities.
You generate privacy-preserving synthetic data that has high utility and robust guarantees against privacy attacks.
The Statice software is made to fit your environment, save you time, and adjust to your data team's needs.

Data access: a strategic priority for enterprises

In today's data-rich world, data access is a strategic topic for most, if not all, businesses. Data powers our day-to-day operations, fuels the development of AI applications, and allows analysis that helps enterprises to build personalized customer experiences and develop competitive products. Thus, enterprises must improve their data access strategies to answer the challenges posed by privacy and security constraints, data silos, and governance processes.

The field of privacy-enhancing technologies has attracted much interest in the last two decades. Much of its success comes down to the dual benefit it provides. On the one hand, these technologies equip enterprises with the tools to comply with data regulations and privacy constraints. On the other hand, they open up significant potential for data usage.

It’s at this intersection that we stand at Statice. Our privacy solutions bring enterprises more agility and freedom to process data in a compliant way. We build technology so teams can regain the ability to process data for secondary purposes while respecting individual privacy and mitigating corporate risk.

Our technology and promise

The Statice software is an enterprise-ready platform for the generation of privacy-preserving synthetic data. At the core of the platform lies a software development kit (SDK) responsible for the platform capabilities, from identifying, evaluating and preprocessing common enterprise data types to the generation of privacy-preserving synthetic data.

Our software gives data teams the ability to artificially generate synthetic data from structured data types, supporting simple and complex use cases, starting from single tables with primitive types to multi-table, transactional, sequential and geographical data.

The result is data with high utility, available as a drop-in replacement for behavioral, predictive, or transactional analysis. The uses of this alternative data asset stretch along the enterprise data lifecycle, from integration to dissemination.

Examples of possible use-cases with privacy-preserving synthetic data

Statice privacy-preserving synthetic data

The generated synthetic data mimics the structure and statistical properties of source data that might be too sensitive to move off premises or to use for AI development. With the Statice software, you get synthetic data with high statistical accuracy. Additionally, privacy mechanisms guarantee data anonymity and privacy compliance, meaning you can process the data under current privacy regulations like GDPR and CCPA.

Utility

Our software uses state-of-the-art deep generative models to generate synthetic data with high statistical accuracy. These models learn the underlying distributions of the original data and use them to sample new data.

Trials of different machine learning and statistical models led us to conclude that a one-model-fits-all approach wouldn't be optimal for supporting several data types in the same dataset.

Therefore, we opted for a hybrid approach to synthetic data generation in which we partition the synthesis problem. Our software splits the data into groups and chooses the most effective option for each data type from a collection of models.

The Statice software partitions the synthesis problem by splitting up the data into groups and tackling each group with the most effective model
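
The partitioning idea can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the function names (`infer_kind`, `fit_partitioned`, `sample_rows`) are hypothetical, the two "models" are simple per-column samplers rather than the deep generative models the software uses, and each column is sampled independently, so dependencies between columns are ignored. It only shows the dispatch pattern: group columns by type, then pick a model per group.

```python
import random
import statistics
from collections import Counter


def infer_kind(values):
    """Crude, illustrative rule: a column is numeric only if every value is an int or float."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numeric" if is_numeric else "categorical"


def fit_numeric(values):
    """Fit a Gaussian to the column and return a sampler."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return lambda rng: rng.gauss(mu, sigma)


def fit_categorical(values):
    """Return a sampler that draws labels with their empirical frequencies."""
    counts = Counter(values)
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return lambda rng: rng.choices(labels, weights=weights, k=1)[0]


# Hypothetical model collection: one fitting function per data type.
MODELS = {"numeric": fit_numeric, "categorical": fit_categorical}


def fit_partitioned(table):
    """table maps column name -> list of values; returns column -> (kind, sampler)."""
    fitted = {}
    for name, values in table.items():
        kind = infer_kind(values)
        fitted[name] = (kind, MODELS[kind](values))
    return fitted


def sample_rows(fitted, n, seed=0):
    """Draw n synthetic rows, sampling each column independently of the others."""
    rng = random.Random(seed)
    return [
        {name: sampler(rng) for name, (_, sampler) in fitted.items()}
        for _ in range(n)
    ]


if __name__ == "__main__":
    table = {"age": [31, 45, 27, 52], "city": ["Berlin", "Paris", "Berlin", "Rome"]}
    for row in sample_rows(fit_partitioned(table), 3):
        print(row)
```

A real system would also have to model relationships between the groups, which this sketch leaves out on purpose.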

This approach offers robust performance on structured data of various types. As a result, the synthetic data preserves, to a high degree, statistical properties such as:

Marginal distributions
Conditional distributions
Aggregated statistics
Date and time distributions
Pairwise dependencies, e.g., correlations and mutual information
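
To make two of these properties concrete, here is a minimal sketch of how one might compare an original and a synthetic column. It is not the evaluation suite shipped with the software. The function names are hypothetical, and the metrics chosen (total variation distance for a categorical marginal, the gap between Pearson correlations for a pairwise dependency) are common textbook choices assumed here for illustration.

```python
import math
from collections import Counter


def total_variation_distance(original, synthetic):
    """Distance between two empirical marginals: 0 means identical, 1 means disjoint."""
    p, q = Counter(original), Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def correlation_gap(orig_x, orig_y, syn_x, syn_y):
    """Absolute difference between the original and synthetic pairwise correlation."""
    return abs(pearson(orig_x, orig_y) - pearson(syn_x, syn_y))


if __name__ == "__main__":
    print(total_variation_distance(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
    print(correlation_gap([1, 2, 3], [2, 4, 6], [1, 2, 3], [2, 4, 7]))
```

Values close to zero for both metrics indicate that the synthetic column or column pair behaves like the original one on that particular property.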

In the example below, classifiers from the SDGym benchmark library were scored on an original test set, using predictions from classifiers trained on a synthetic data set. The results show that the synthetic data performs very well in machine learning tasks, in this particular case a classification task.

The software integrates a battery of evaluations to check the utility of the synthetic data. These statistical evaluations are easy to use, and their results can be shared to report on the data's quality. See the complete list of available evaluations in our technical white paper (available on demand).

Privacy

A common misconception is that synthetic data is inherently private. Even with synthetic data, it can be possible to infer information about a real event or individual in the original dataset.

The Statice software embeds privacy mechanisms to counter this privacy risk. You can choose to train the model using algorithms that satisfy the definition of differential privacy. Differential privacy mathematically bounds how much any single individual's record can influence the trained model, so the resulting privacy-preserving synthetic data is robust against privacy attacks.
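
For readers new to differential privacy, the following sketch shows the definition at work on the simplest possible case. It is not how the Statice software trains its models. It applies the textbook Laplace mechanism to a counting query, and the names `laplace_noise` and `dp_count` are hypothetical. A count changes by at most 1 when one record is added or removed, so adding Laplace noise with scale 1/epsilon yields an epsilon-differentially-private answer.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Epsilon-DP count of the records matching the predicate.

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(7)
    ages = [23, 35, 41, 29, 62, 57, 33]
    # A smaller epsilon means more noise and stronger privacy.
    for epsilon in (0.1, 1.0, 10.0):
        print(epsilon, dp_count(ages, lambda age: age >= 40, epsilon, rng))
```

The same trade-off applies to model training: a smaller epsilon gives stronger protection at the cost of noisier, less accurate results.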

A set of privacy evaluations is available for you to monitor the probability of information leakage of the original data from the synthetic data.
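
As an illustration of what such a check can look like, the sketch below computes two simple leakage indicators: the distance from each synthetic record to its closest original record, and the share of synthetic records that are exact copies of an original one. These are common heuristics assumed here for illustration, not the evaluations implemented in the Statice software, and the function names are hypothetical. They flag obvious memorization but do not constitute a formal privacy guarantee.

```python
import math


def distance_to_closest_record(synthetic, original):
    """For each synthetic row (a tuple of numbers), the Euclidean distance
    to the nearest original row."""
    return [min(math.dist(s, o) for o in original) for s in synthetic]


def exact_copy_rate(synthetic, original):
    """Fraction of synthetic rows that also appear verbatim in the original data."""
    seen = set(original)
    return sum(1 for s in synthetic if s in seen) / len(synthetic)


if __name__ == "__main__":
    original = [(0.0, 0.0), (3.0, 4.0)]
    synthetic = [(0.0, 0.0), (3.0, 0.0)]
    print(distance_to_closest_record(synthetic, original))
    print(exact_copy_rate(synthetic, original))
```

Many very small distances, or a high copy rate, would suggest that the generator has reproduced original records rather than learned their distribution.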

The Statice platform in the enterprise

Data science and innovation teams are the primary users of the Statice data anonymization software. We built the platform with data teams and enterprise environments in mind and spent time ensuring that deploying and using the solution is straightforward.

Made to fit your environment

The Statice software is flexible and can be deployed on almost any common infrastructure used for data science tasks. This makes it possible to train the synthetic data generating models where the data is, removing any requirement to move the sensitive data anywhere.

You can deploy on premises or in a private cloud. You can also deploy on any major public cloud provider, such as Google Cloud, AWS, or Azure, and on data analytics platforms like Databricks or JupyterHub.

The software supports data in tabular form, from CSV files to database exports (for example from Postgres, MySQL, or MongoDB). Custom data formats can be supported on request. Notably, for a stronger security guarantee, the data never leaves your infrastructure and is never shared with us.

The software offers two interfaces: a command line interface (CLI) and an application programming interface (API). Both produce the same results and benefit from the automatic customization of the software.

Made to save you time

"Time-to-data" is essential for data teams, so it is also central to the Statice software. Deployment is quick: it typically takes two hours for a client to install the software and run their first dataset synthesization.

The synthetic data generation process is also a one-command process. Using the CLI, a simple “generate” will launch the synthesis. Using the Python API, your team can generate synthetic data in three API calls.

An important feature of the software is automatic customization. Unlike open-source solutions that require trial-and-error tuning of machine learning parameters, the Statice software automatically learns the particulars of the original data and tunes the synthesization to get the best results. It customizes itself to each dataset, saving you precious time.

There is, of course, the possibility for manual tuning, and you get full control over the utility and privacy of the synthetic data.

Made for enterprises

The Statice software supports many data types out of the box and does automatic preprocessing to enhance the synthesization results. However, some businesses have particular data processing requirements. For those cases, the Statice software offers a set of extension mechanisms to accommodate tailored data processing and business logic support.

The solution also scales to accommodate the needs of modern enterprises and can handle large datasets. Clients have successfully processed datasets with tens of millions of entries and over 500 columns.

If your team wants to assess how the Statice software would answer a specific data challenge or test it on real data, we’d be happy to hear from you. Real-world use tends to bring out all the challenges of a particular use case, and so we try to enable our clients to test the software on their projects with support from our experts. For more information on how this works, contact us!
