Data privacy is becoming increasingly important for businesses. Digitalization means that ever more customer data is entrusted to corporations, and it is their responsibility to protect it. Data privacy is also a necessity: new regulations require companies to make privacy a priority in their data strategy.
According to Recital 26 of the General Data Protection Regulation (GDPR), data that is guaranteed to be anonymous falls outside the scope of the regulation. Recital 26 states that “this Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes” (General Data Protection Regulation 2018).
At Statice, we have developed software that generates guaranteed anonymous data using the latest research in machine learning and data privacy. For more information about how it works, please watch our talk at PyData. Below, we briefly explain what synthetic data is and how it compares to traditional anonymization methods.
One of the first approaches to protecting an individual’s privacy was pseudonymization: the removal or replacement of direct identifiers such as names and telephone numbers. In the pseudonymized dataset below, the phone numbers have been removed.
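Conceptually, pseudonymization can be as simple as dropping or hashing the direct-identifier columns of a table. The sketch below is a minimal, hypothetical illustration in Python with pandas; the column names and values are invented to mirror the example discussed in this post and do not represent Statice's actual data or tooling.

```python
import hashlib

import pandas as pd

# Hypothetical data invented to mirror the example in this post: the third
# row is the only woman in zip code 75090 and earns $120,000.
df = pd.DataFrame({
    "name":   ["Adam", "Dana", "Carol"],
    "phone":  ["555-0101", "555-0102", "555-0103"],
    "sex":    ["M", "F", "F"],
    "zip":    ["10115", "75091", "75090"],
    "salary": [85_000, 97_000, 120_000],
})

# Pseudonymization: drop the telephone numbers entirely ...
pseudonymized = df.drop(columns=["phone"])

# ... and replace the names with an irreversible pseudonym (a salted hash).
SALT = "change-me"  # hypothetical salt, for illustration only
pseudonymized["name"] = pseudonymized["name"].apply(
    lambda v: hashlib.sha256((SALT + v).encode()).hexdigest()[:12]
)

print(pseudonymized)
```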
Releasing pseudonymized data carries a high risk of re-identification, because it can easily be linked to additional data sources - a so-called linkage attack. A well-known example of such a violation is the re-identification of the Governor of Massachusetts, whose supposedly anonymized health records were linked to the public electoral register.
Based on our simplified example above, imagine that there is only one woman living in the area with the zip code "75090". We can find this information in public records and link it to the individual in the third row. By combining this public information with the pseudonymized dataset, we can conclude that this person earns $120,000.
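To make the linkage attack concrete, the sketch below continues the hypothetical pandas example: an attacker who knows a person's sex and zip code from a public register (again, the names and values are invented) joins that knowledge to the pseudonymized release.

```python
# Hypothetical public record: the attacker knows Carol's sex and zip code
# from the electoral register.
public_register = pd.DataFrame({
    "name": ["Carol"],
    "sex":  ["F"],
    "zip":  ["75090"],
})

# Linkage attack: join the public record to the pseudonymized release on the
# shared quasi-identifiers. Because only one released row matches, the
# "anonymous" salary is now attached to a named person.
reidentified = public_register.merge(
    pseudonymized.drop(columns=["name"]), on=["sex", "zip"], how="inner"
)
print(reidentified)  # Carol | F | 75090 | 120000
```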
Another method that aims to counteract linkage attacks is k-anonymization. It works by generalizing or suppressing attributes in data records that could re-identify an individual.
This makes it much harder to link records to external data sources through indirect identifiers (quasi-identifiers). However, it compromises the granularity of the original data, since values are heavily aggregated or removed entirely.
Another issue with this approach is that it is not privacy-preserving when the sensitive attributes within a group are homogeneous or have skewed distributions: if, for example, everyone in a group earns the same salary, linking someone to that group already reveals their salary. With such external knowledge, an attacker can partially or fully re-identify an individual.
Going back to our example, we can no longer single out the person in the third row, because she is protected by another person with similar attributes in the second row. Note, however, that the person in the first row is still unprotected against the linkage attack. We could prevent this by masking the "sex" attribute as well, but that would lead to a drastic loss of utility in this dataset.
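The following sketch continues the hypothetical pandas example and shows the two ingredients of k-anonymization in miniature: generalizing a quasi-identifier and checking the resulting group sizes. It illustrates the general technique under our made-up data, not any specific product.

```python
# Generalize a quasi-identifier: truncate the zip code so that records fall
# into larger groups.
k_anon = pseudonymized.drop(columns=["name"]).copy()
k_anon["zip"] = k_anon["zip"].str[:3] + "**"

# A release is k-anonymous if every quasi-identifier combination occurs at
# least k times.
group_sizes = k_anon.groupby(["sex", "zip"]).size()
print(group_sizes)
# (F, 750**) -> 2   the second and third rows now protect each other
# (M, 101**) -> 1   the first row is still unique, hence still linkable
print("k =", group_sizes.min())  # k = 1: the release is not even 2-anonymous
```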
Synthetic data is data generated by an algorithm, as opposed to original data, which is based on real people’s information. It is important to note that “synthetic data” is an umbrella term, and by no means does all synthetic data have the same properties. The general idea is that synthetic data consists of new data points rather than a mere modification of an existing dataset.
Not all synthetic data is anonymous. Synthetic data generated by Statice is privacy-preserving synthetic data as it comes with a data protection guarantee and is considered fully anonymous.
What initially looks like an arbitrary randomization of data points is in fact a complex technological process in which Statice generates new, anonymized data points that reflect the statistical properties of the original data. To find out how Statice works, click here.
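As a purely illustrative toy (and emphatically not Statice's method, which builds on machine-learning research and comes with a data protection guarantee), the sketch below samples brand-new records from simple statistics learned from the original columns. It only shows, in the most basic sense, what "new data points that reflect the statistical properties of the original data" means; on its own it carries no privacy guarantee.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_synthetic = 1000

# Categorical column: sample new values from the observed category frequencies.
sex_probs = df["sex"].value_counts(normalize=True)
synthetic_sex = rng.choice(
    sex_probs.index.to_numpy(), size=n_synthetic, p=sex_probs.to_numpy()
)

# Numerical column: sample new values from a fitted normal distribution.
salary_mean, salary_std = df["salary"].mean(), df["salary"].std()
synthetic_salary = rng.normal(salary_mean, salary_std, size=n_synthetic)

# Every row below is a brand-new data point, not a masked original record.
synthetic = pd.DataFrame({"sex": synthetic_sex, "salary": synthetic_salary})
print(synthetic.head())
```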
Privacy-preserving synthetic data generated by Statice has the following properties: