The truth about anonymous data revealed

August 7, 2019
The cyber threat is real

Given the recent news about cybersecurity and data breaches, it is no wonder that alarm bells are ringing over the safety of our data in the hands of corporations. In July, a hacker gained access to the personal information of 106 million Capital One customers and applicants through a misconfigured firewall. In the same month, a massive data breach affected much of the population of Bulgaria. Shortly before that, the media reported on the penalties that followed the massive data breaches at the Marriott hotel chain and British Airways. The cyber threat is real, and corporations must protect their customers’ data before more people get hurt.

The myths and facts about data anonymization

A study titled “Estimating the success of re-identifications in incomplete datasets using generative models” is making noise about the safety of data anonymization. The study, published in Nature Communications, points out the shortcomings of so-called traditional, or older, data anonymization methods. It sparked concern in mainstream media, with headlines in the New York Times, The Guardian and CNBC suggesting that anonymous data is doomed. But those who read beyond the headlines will learn that fully anonymizing data is the only way to preserve data privacy today, the only alternative being no data at all. There is no turning back.

The study asks whether anonymous data truly is anonymous, which is stirring up public concern about data protection and the safety of our personal data. The truth comes in threes. First, there is no single solution to data anonymization. Second, the older, traditional anonymization techniques often fall short on either security or utility. Finally, privacy researchers have known this for 20 years and have been actively working towards better privacy-preserving methods.

There are cases where individuals have been identified after supposedly anonymized data was de-anonymized. One such example is the Netflix Prize: in 2006, the company published roughly 100 million movie ratings as part of a public challenge to come up with better recommendation systems. The data was anonymized by removing personal details and replacing names with random numbers, to protect the privacy of the roughly 500,000 customers who had submitted the ratings. Two researchers then showed that they could de-anonymize some of the Netflix data by comparing ratings and timestamps with public information from the IMDb website.
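To make the idea concrete, here is a minimal sketch of how ratings and timestamps can act as a fingerprint. It is not the researchers' actual algorithm: the data, the user IDs, the function names and the three-day tolerance are all made up for illustration.

```python
from datetime import date

# Hypothetical "anonymized" release: random user IDs mapped to
# (movie, rating, date) triples. All values are invented.
anonymized = {
    "u_4821": [
        ("Movie A", 5, date(2005, 3, 1)),
        ("Movie B", 2, date(2005, 3, 9)),
        ("Movie C", 4, date(2005, 4, 2)),
    ],
    "u_1377": [
        ("Movie A", 3, date(2005, 1, 15)),
        ("Movie D", 5, date(2005, 2, 20)),
    ],
}

# Hypothetical public reviews posted under a recognizable profile name.
public_reviews = {
    "jane_doe": [
        ("Movie B", 2, date(2005, 3, 10)),
        ("Movie C", 4, date(2005, 4, 1)),
    ],
}


def match_score(private, public, max_days=3):
    """Count public reviews that line up with a private rating:
    same movie, same rating, and dates at most max_days apart."""
    score = 0
    for movie, rating, day in public:
        if any(
            m == movie and r == rating and abs((d - day).days) <= max_days
            for m, r, d in private
        ):
            score += 1
    return score


def best_match(public, released):
    """Return the released user ID whose history best fits the public reviews."""
    scores = {uid: match_score(history, public) for uid, history in released.items()}
    return max(scores, key=scores.get), scores


candidate, scores = best_match(public_reviews["jane_doe"], anonymized)
print(candidate, scores)
```

Even two overlapping reviews are enough to single out one record in this toy release, which is why removing names alone does not make such data anonymous.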

Companies that intend to either share data with external partners or use it for purposes other than those initially intended should be wary of the legal restrictions imposed by laws such as the GDPR and CCPA. Consider the following scenario: a bank wants to share its clients’ data with insurance firms, relying on the same contract that covers a bank account and personal loan. Under the GDPR, this secondary processing would not be permitted without the explicit consent of the client, because the purpose differs from the original purpose for which the data was processed. This is one of the biggest challenges for companies today: the resulting datasets are incomplete, and opportunities for explorative analysis are lost.

Why pseudonymization is not enough

Many companies still rely on pseudonymization for data protection, believing that they are protecting their customers’ data and perhaps even that they are legally safe. Pseudonymization is a traditional method of protecting data that amounts to removing or replacing direct identifiers, such as names and social security numbers, while leaving the remaining attributes in place. It does not remove all information that can be traced back to an individual; it merely reduces the most obvious linkage risk. Pseudonymized data can be re-identified through record linkage, so pseudonymization is not in itself a sufficient anonymization method under the GDPR and other legal frameworks. Pseudonymized datasets carry a high risk of re-identification because they can easily be linked to other data sources that reveal information about an individual.
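The following sketch illustrates the point under stated assumptions: the records, field names and salt are invented, and the hashing scheme is just one common way to pseudonymize, not a recommendation. After the direct identifiers are gone, it measures how many records are still unique on the attributes that remain.

```python
import hashlib
from collections import Counter

# Invented customer records; the SSN-like values are deliberately invalid.
records = [
    {"name": "Alice Example", "ssn": "000-00-0001", "zip": "10115", "birth_year": 1984, "sex": "F", "balance": 5200},
    {"name": "Bob Example", "ssn": "000-00-0002", "zip": "10115", "birth_year": 1984, "sex": "M", "balance": 310},
    {"name": "Carla Example", "ssn": "000-00-0003", "zip": "10117", "birth_year": 1990, "sex": "F", "balance": 12800},
    {"name": "Dana Example", "ssn": "000-00-0004", "zip": "10117", "birth_year": 1990, "sex": "F", "balance": 75},
]


def pseudonymize(record, salt="demo-salt"):
    """Drop name and SSN and replace them with a salted hash token.
    Every other attribute is left untouched."""
    token = hashlib.sha256((salt + record["ssn"]).encode()).hexdigest()[:12]
    out = {k: v for k, v in record.items() if k not in ("name", "ssn")}
    out["pseudonym"] = token
    return out


def unique_fraction(rows, quasi_identifiers):
    """Share of rows whose combination of quasi-identifiers occurs exactly once."""
    key = lambda row: tuple(row[q] for q in quasi_identifiers)
    counts = Counter(key(row) for row in rows)
    return sum(1 for row in rows if counts[key(row)] == 1) / len(rows)


pseudonymized = [pseudonymize(r) for r in records]
print(unique_fraction(pseudonymized, ("zip", "birth_year", "sex")))
```

Any record that is unique on ZIP code, birth year and sex can be matched to a named person by anyone who holds those three attributes from another source, even though the name and social security number are no longer in the dataset.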

Although pseudonymization is recommended to minimize risk, the GDPR recommends that companies dealing with personal data of EU citizens anonymize customer data where possible, using appropriate de-identification methods so that no individual can be re-identified. Modern anonymization, in contrast to pseudonymization, ensures that the original personal data can never be recovered by any means. In sum, pseudonymization is not enough under the law.

We have previously written about the misconceptions around pseudonymous versus anonymous data. Modern laws such as the GDPR are setting new standards for data anonymization methods, which is a step in the right direction in terms of privacy and utility.

The GDPR is clear that pseudonymization is not a sufficient method of anonymization and that pseudonymized data therefore requires the same protection as raw data. Recital 28 states that the introduction of pseudonymization is “not intended to preclude any other measures of data protection”. Recital 26 states that “[p]ersonal data which have undergone pseudonymization, which could be attributed to a natural person by the use of additional information, should be considered to be information on an identifiable natural person”. European data protection law asks corporations to seek out appropriate anonymization measures or support to protect the customer data they have access to. Appropriate, in this regard, refers to modern anonymization methods.

Privacy-preserving data sharing is thriving

Poor methods of data anonymization, which are what the Nature Communications study refers to, carry risks of re-identification and are vulnerable to attacks. Since the 1990s, privacy researchers have repeatedly shown that de-identified data can often be re-identified, for instance via record linkage. In the late 1990s, Dr. Latanya Sweeney famously carried out a linkage attack on a dataset of de-identified hospital records. By comparing the ZIP code, date of birth and sex of patients with publicly available voter records, which contained names and addresses alongside those same three attributes, she could re-identify individuals in the hospital records. What the authors of the study rightly conclude is that data processed with such traditional techniques remains subject to modern legislation such as the GDPR. Companies must therefore take modern measures to guarantee data anonymity.
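A linkage attack of this kind is little more than a join on the shared attributes. The sketch below uses invented records and hypothetical field names; it shows the general technique, not the data or code used in the original attack.

```python
# Invented "de-identified" hospital records: no names, but ZIP code,
# date of birth and sex are still present.
hospital = [
    {"zip": "10001", "dob": "1950-02-14", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "10002", "dob": "1962-01-12", "sex": "F", "diagnosis": "asthma"},
    {"zip": "10002", "dob": "1971-11-03", "sex": "F", "diagnosis": "diabetes"},
]

# Invented public voter list: names and addresses plus the same three attributes.
voters = [
    {"name": "John Sample", "address": "1 Main St", "zip": "10001", "dob": "1950-02-14", "sex": "M"},
    {"name": "Mary Placeholder", "address": "2 Oak Ave", "zip": "10002", "dob": "1962-01-12", "sex": "F"},
    {"name": "Ann Other", "address": "3 Elm Rd", "zip": "10003", "dob": "1980-05-05", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def link(deidentified, identified, keys=QUASI_IDENTIFIERS):
    """Join two datasets on their shared quasi-identifiers and return
    (name, diagnosis) pairs wherever the match is unambiguous."""
    index = {}
    for person in identified:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for row in deidentified:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one named person fits this record
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches


for name, diagnosis in link(hospital, voters):
    print(name, "->", diagnosis)
```

Two of the three hospital records are re-identified here without any name ever appearing in the hospital data.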

In this video, we review the modern techniques for data sharing and their limitations. The video explains why it is hard to share useful data while preserving privacy and points to promising approaches available today.

The good news is that there are approaches to anonymizing data safely that are in line with modern data protection laws such as the GDPR. First of all, it is important to recall that anonymizing data takes more than deleting names or other identifying information. Modern anonymization is irreversible, and data anonymized this way is therefore not subject to laws such as the GDPR. In contrast to the traditional anonymization methods described above, modern anonymization works.

The Statice solution

Statice, along with companies like Apple and Google, applies a modern concept called differential privacy: a mathematical standard for analysis that allows insights to be drawn from data that concerns more than one person without exposing specifics about individuals, which would otherwise compromise their privacy. A differentially private computation guarantees that its output is statistically almost the same whether or not any specific individual’s data is included. This is the privacy guarantee.
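As an illustration of the guarantee, here is a minimal sketch of the Laplace mechanism, a textbook way to make a counting query differentially private. It is not a description of how Statice, Apple or Google implement differential privacy; the data, the function names and the choice of epsilon are assumptions made for the example. With privacy parameter epsilon, the probability of any given output changes by at most a factor of e^epsilon when one person is added or removed.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon yields
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def is_40_or_older(age):
    return age >= 40


# Invented toy data: customer ages, with and without one individual.
ages = [23, 35, 41, 47, 52, 58, 64, 29, 44, 71]
ages_without_one = ages[:-1]

rng = random.Random(42)  # fixed seed only to make the demo repeatable
print(private_count(ages, is_40_or_older, epsilon=0.5, rng=rng))
print(private_count(ages_without_one, is_40_or_older, epsilon=0.5, rng=rng))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means more accurate answers and a weaker guarantee. The noise averages out over many people, which is why aggregate insights survive while the contribution of any single individual is masked.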

At Statice, we go even further by generating privacy-preserving synthetic data: new, artificial data that can be used and released without compromising the privacy of any individual. Rather than treating only so-called sensitive data, Statice assumes that all data attributes are sensitive and can be traced back to an individual. Using machine learning, a new dataset is generated that preserves the statistical properties and structure of the original data, so the synthetic data retains its value for sharing and analysis.

To keep up with developments in the field, we not only apply the concept of differential privacy when generating synthetic data but also continuously apply recent data privacy research to improve our product and the end results for our clients at Statice.

The Statice solution offers privacy by design and guaranteed data privacy. Synthetic data that fulfills a mathematical guarantee such as differential privacy can be regarded as anonymized data and therefore falls outside the scope of the GDPR. Truth be told, there is no easy way to anonymize data. But there is a better one. To learn more about our product, request a free demo.
