Data privacy by Statice

Statice generates synthetic data - just like real data, but privacy-compliant

Why care about data privacy

Data privacy is becoming increasingly important for businesses. Digitalization means that more customer data is entrusted to corporations, which makes them responsible for protecting it. Data privacy is also a legal necessity, as new regulations require companies to make privacy a priority in their data strategy.

According to Recital 26 of the General Data Protection Regulation (GDPR), data that is genuinely anonymous falls outside the scope of the GDPR. Recital 26 states that “this Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes” (General Data Protection Regulation 2018).

At Statice, we have developed software that generates guaranteed anonymous data using the latest research in machine learning and data privacy. For more information about how it works, please watch our talk at PyData. Below, we briefly explain what synthetic data is and how it compares to traditional anonymization methods.

The shortcomings of traditional privacy-enhancing methods

Pseudonymization

One of the first approaches to protecting an individual’s privacy was pseudonymization. Pseudonymization removes direct identifiers such as names and telephone numbers, or replaces them with artificial ones. In the pseudonymized dataset below, the phone numbers are removed.

Original dataset
Pseudonymized dataset

The release of pseudonymized data carries a high risk of re-identification, because it can easily be linked to additional data sources. This is known as a linkage attack. A well-known example is the re-identification of the health records of the Governor of Massachusetts, which were found in a released, supposedly de-identified medical dataset by linking it to the public electoral register.

Based on our simplified example above, imagine that only one woman lives in the area with the zip code "75090". We can find this information in public records and link it to the individual in the third row. By combining this public information with the pseudonymized dataset, we can conclude that this person earns $120,000.
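The linkage described above can be sketched in a few lines of Python. This is a minimal illustration, not a real attack tool: the rows, the names, and the `link` helper are hypothetical, and the values only mirror the example (zip code "75090", a salary of $120,000). It assumes the public register covers everyone living in the listed zip codes.

```python
# Hypothetical pseudonymized release: names and phone numbers are removed,
# but zip code and sex remain as indirect identifiers.
pseudonymized = [
    {"zip": "75001", "sex": "M", "salary": 60000},
    {"zip": "75002", "sex": "F", "salary": 85000},
    {"zip": "75090", "sex": "F", "salary": 120000},
]

# Hypothetical public record (for example an electoral register) with names.
public_register = [
    {"name": "Alice Example", "zip": "75090", "sex": "F"},
    {"name": "Bob Example", "zip": "75001", "sex": "M"},
    {"name": "Carl Example", "zip": "75001", "sex": "M"},
]

INDIRECT_IDENTIFIERS = ("zip", "sex")


def link(released, register, keys=INDIRECT_IDENTIFIERS):
    """Return (name, record) pairs where exactly one register entry matches."""
    matches = []
    for record in released:
        candidates = [
            person for person in register
            if all(person[k] == record[k] for k in keys)
        ]
        # A single candidate means the record is re-identified.
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], record))
    return matches


reidentified = link(pseudonymized, public_register)
for name, record in reidentified:
    print(name, "earns", record["salary"])
```

Only the third row is re-identified here: the first row matches two register entries and the second matches none, so neither can be attributed to a single person.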


K-Anonymization

K-anonymization is a method that aims to counteract linkage attacks. It generalizes or deletes attributes in data records that could re-identify an individual, so that each record shares its remaining indirect identifiers with at least k-1 other records.

Original dataset
K-anonymized dataset

This technique makes it possible to prevent links to external data records based on indirect identifiers. However, it compromises the granularity of the original data, since the values are highly aggregated or completely removed.

Another issue with this approach is that it does not preserve privacy if the sensitive attributes are homogeneous or have skewed distributions. With external knowledge, an attacker may still be able to infer an individual's sensitive values, or partially or fully re-identify them.
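The homogeneity problem can be checked by counting how many distinct sensitive values appear within each group of records that share the same indirect identifiers. The sketch below is illustrative only: the rows, the `salary_band` column, and the `distinct_sensitive_values` helper are hypothetical.

```python
from collections import defaultdict


def distinct_sensitive_values(rows, indirect_identifiers, sensitive):
    """Count distinct sensitive values per group of identical indirect identifiers."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in indirect_identifiers)
        groups[key].add(row[sensitive])
    return {key: len(values) for key, values in groups.items()}


# Hypothetical table in which every group contains two records (k = 2).
rows = [
    {"zip": "750**", "sex": "F", "salary_band": "100k+"},
    {"zip": "750**", "sex": "F", "salary_band": "100k+"},
    {"zip": "750**", "sex": "M", "salary_band": "50-100k"},
    {"zip": "750**", "sex": "M", "salary_band": "100k+"},
]

diversity = distinct_sensitive_values(rows, ("zip", "sex"), "salary_band")

# Groups with a single sensitive value reveal it for every member.
homogeneous = [key for key, count in diversity.items() if count == 1]
print(homogeneous)
```

In this hypothetical table both women fall into the same salary band, so anyone who knows that a woman from this area is in the dataset learns her salary band without needing to single out her row.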

Going back to our example, we can no longer identify the person in the third row, because she is indistinguishable from the person in the second row, who has similar attributes. Note that the person in the first row is still unprotected against a linkage attack. We can prevent this by masking the "sex" attribute, but this leads to a drastic loss of utility for the dataset.
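The effect of masking can be made concrete by measuring k, the size of the smallest group of records that share the same indirect identifiers. The following sketch uses hypothetical rows that mirror the example; the `k_anonymity` helper and the generalized zip code "750**" are assumptions for illustration.

```python
from collections import Counter


def k_anonymity(rows, indirect_identifiers):
    """Return the size of the smallest group sharing the same indirect identifiers."""
    groups = Counter(tuple(row[q] for q in indirect_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical generalized table: zip codes are truncated, sex is kept.
generalized = [
    {"zip": "750**", "sex": "M", "salary": 60000},
    {"zip": "750**", "sex": "F", "salary": 85000},
    {"zip": "750**", "sex": "F", "salary": 120000},
]

k_before = k_anonymity(generalized, ("zip", "sex"))

# Masking the "sex" attribute merges all three rows into one group.
masked = [{**row, "sex": "*"} for row in generalized]
k_after = k_anonymity(masked, ("zip", "sex"))

print(k_before, k_after)
```

Before masking, the first row forms a group of its own, so k is 1. After masking, all three rows share the same indirect identifiers and k rises to 3, at the cost of losing the "sex" attribute entirely.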

Synthetic data is much more than just fake data

Synthetic data is data generated by an algorithm, as opposed to original data, which is based on real people’s information. It is important to note that "synthetic data" is a collective term, and not all synthetic data has the same properties. The general idea is that synthetic data consists of new data points and is not simply a modification of an existing dataset.

Not all synthetic data is anonymous. Synthetic data generated by Statice is privacy-preserving: it comes with a data protection guarantee and is considered fully anonymous.

Original dataset
Synthetic dataset

Synthetic data works just like original data

What initially looks like an arbitrary randomization of data points is in fact a complex technological process in which Statice generates new anonymized data points that reflect the statistical properties of the original data. To find out how Statice works, click here.
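To illustrate what "reflecting the statistical properties" means, here is a deliberately naive sketch. It is not Statice's method and it carries no privacy guarantee: it simply fits a normal distribution to a hypothetical salary column and samples new values from it. The data and the `fit_and_sample` helper are assumptions made for this example.

```python
import random
import statistics


def fit_and_sample(values, n, seed=0):
    """Fit a normal distribution to `values` and draw `n` new points from it."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical original column.
original_salaries = [60000, 85000, 120000, 72000, 95000, 101000]

# New data points that are sampled from the fitted model, not copied.
synthetic_salaries = fit_and_sample(original_salaries, n=10000)

print(round(statistics.mean(original_salaries)),
      round(statistics.mean(synthetic_salaries)))
```

The synthetic column has roughly the same mean and spread as the original, yet no synthetic value is a modified copy of an original row. A real generator must also capture relationships between columns and, unlike this toy model, limit how much the fitted parameters reveal about any individual.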

Privacy-preserving synthetic data generated by Statice has the following properties:
