How “anonymous” is anonymized data?

Elise Devaux

What constitutes data anonymization is often debated within the scientific community. Regulators also frequently revise the definitions of data anonymization within legal frameworks. So it is safe to wonder, what exactly is anonymized data under the GDPR? And does it provide true anonymity?

In the beginning, there was data

The intent of anonymization is to ensure the privacy of data. Companies use it to protect sensitive data. This category encompasses:

  • personal data,
  • business information such as financial information or trade secrets,
  • classified information such as military secrets or governmental information.

Anonymization is, for instance, a way of complying with privacy regulations on personal data. Personal and business data can overlap: customer information sits in that overlap. But not all business data falls under these regulations. We focus here on the protection of personal data.

Example of sensitive data types

In Europe, regulators define as “personal data” any information that relates to someone (your name, for example). Information linking to a person in any way also falls under that description.

According to the National Institute of Standards and Technology (NIST):

Personally Identifiable Information (PII): Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.

As the collection of personal data became widespread over the previous century, the question of anonymization grew in importance. The regulations coming into effect around the world have made the matter pressing.

What is data anonymization and why should we care?

Let’s begin with the classic definition. Recital 26 of the EU’s General Data Protection Regulation (GDPR) describes anonymous information as follows:

information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.

The “identifiable” and “no longer” parts are essential. They don’t only mean that your name shouldn’t appear in the data anymore. They also mean that nobody should be able to figure out who you are from the rest of the data. Working out someone’s identity this way is called re-identification (sometimes de-anonymization).

The same GDPR recital also states a very important fact:

“[…] data protection should therefore not apply to anonymous information”.

Data anonymization example

For instance, one of the methods of anonymization is synthetic data. Synthetic data is algorithmically created but it looks and behaves like real data. Generative models learn the statistical distribution in the original data and draw artificial samples from it to generate synthetic data.

The synthetic data generation process completely breaks the 1-1 relation between the original and synthetic records. Thus, synthetic data does not contain PII, and with the right handling, can be freely used for sharing, monetization, research, machine learning and more.
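To make the idea of "learn the distribution, then sample from it" concrete, here is a deliberately minimal sketch. It is not how production synthetic data generators work: it fits each column independently (a mean and standard deviation for numbers, observed values for categories), so it loses correlations between columns and carries no privacy guarantee on its own. The function names `fit_marginals` and `sample_synthetic` and the toy records are assumptions made up for this illustration.

```python
import random
import statistics


def fit_marginals(records):
    """Learn a simple per-column summary of the original data."""
    model = {}
    for column in records[0]:
        values = [r[column] for r in records]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: summarize with mean and standard deviation.
            model[column] = ("numeric", statistics.mean(values), statistics.stdev(values))
        else:
            # Categorical column: keep the observed values to sample by frequency.
            model[column] = ("categorical", values)
    return model


def sample_synthetic(model, n, seed=0):
    """Draw n artificial rows from the fitted per-column summaries."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, spec in model.items():
            if spec[0] == "numeric":
                row[column] = rng.gauss(spec[1], spec[2])
            else:
                row[column] = rng.choice(spec[1])
        rows.append(row)
    return rows


# Toy "original" data, invented for the example.
original = [
    {"age": 34, "income": 52000, "city": "Berlin"},
    {"age": 45, "income": 61000, "city": "Paris"},
    {"age": 29, "income": 48000, "city": "Berlin"},
    {"age": 52, "income": 75000, "city": "Madrid"},
]

model = fit_marginals(original)
synthetic = sample_synthetic(model, 5)
```

Each synthetic row is drawn from the fitted summaries rather than copied from one original record, which is the sense in which the one-to-one link is broken. Real generators model the joint distribution and need their own privacy evaluation.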

How is anonymized data used?

So, if you manage to truly anonymize your data, that data is no longer subject to the GDPR.

Anonymization can help you perform processing operations such as analysis or monetization. This opens up several opportunities:

  • Selling data is an obvious first opportunity. Around the world, privacy regulations are restricting the trade of personal data. Anonymization offers an alternative for businesses.
  • It represents an opportunity for collaborative work. Many enterprises share data for innovation or research purposes. They can limit risks by using anonymization.
  • It also creates opportunities for analysis and Machine Learning. Running operations with sensitive data while remaining compliant is getting increasingly complex. Anonymization represents a safe raw material for statistical analysis and model training.

The opportunities are clear. But truly anonymized data is often not what we think.

The spectrum of privacy mechanisms

Privacy preservation of data is a spectrum. Over the years, experts developed a collection of methods, mechanisms, and tools. These techniques produce data with varying levels of anonymity and varying risks of re-identification. We could say it ranges from personally identifiable data to truly anonymized data.

A spectrum of data privacy

On one end, you have data that contains direct personal identifiers. Those are elements from which we can identify you, like your name, address, or telephone number. On the other end, you have anonymous data that is no longer subject to the GDPR.

But there is an intermediary category. It lives between identifiable and anonymized data: pseudonymized and de-identified data. Note that this delimitation is subject to debate. Some presentations make pseudonymization a part of de-identification, others exclude it.

For instance, the data is considered de-identified when direct or indirect identifiers linking it to a particular individual are removed, or when identification cannot be made through the use of known characteristics of that individual (deductive disclosure).

However, as pointed out by the International Association of Privacy Professionals (IAPP), de-identification doesn’t usually succeed in anonymizing data, since there are so many sources of data that still hold identifying information.

In itself, there is nothing wrong with pseudonymization or de-identification. They are efficient data minimization techniques. Depending on the requirements of one’s use-cases, they will be relevant and useful.

What we need to keep in mind is that they don’t produce truly anonymous data. Their mechanisms don’t guarantee that re-identification is prevented. And referring to the data they produce as “anonymous” is misleading.


What is the difference between de-identified and anonymized data?

Pseudonymization and de-identification are indeed ways of preserving certain aspects of privacy. But they don’t produce anonymized data as the GDPR describes it.

Pseudonymization techniques remove or replace the direct personal identifiers from the data. For instance, you delete name, email, and address from a dataset. You can’t identify someone directly from pseudonymized data.

But you can do it indirectly. Indeed, the rest of the information often retains indirect identifiers. These are pieces of information that, combined, can act as a direct identifier. They could be dates of birth, zip codes, or gender, for example.
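A small sketch can show how indirect identifiers enable re-identification. It assumes an attacker holds a pseudonymized dataset and a separate public list that still contains names, and joins the two on date of birth, zip code, and gender. All records, field names, and the `link` function are hypothetical and invented for this illustration.

```python
from collections import defaultdict

# Indirect identifiers shared by both datasets (an assumption of this example).
QUASI_IDENTIFIERS = ("birth_date", "zip_code", "gender")

# Pseudonymized data: names removed, indirect identifiers kept.
pseudonymized = [
    {"pid": "p-001", "birth_date": "1945-07-31", "zip_code": "02138", "gender": "M", "diagnosis": "hypertension"},
    {"pid": "p-002", "birth_date": "1982-11-03", "zip_code": "02139", "gender": "F", "diagnosis": "asthma"},
]

# A public list that still carries names.
public_register = [
    {"name": "Alex Example", "birth_date": "1945-07-31", "zip_code": "02138", "gender": "M"},
    {"name": "Robin Sample", "birth_date": "1982-11-03", "zip_code": "02139", "gender": "F"},
    {"name": "Jamie Sample", "birth_date": "1982-11-03", "zip_code": "02139", "gender": "F"},
]


def link(records, register, keys=QUASI_IDENTIFIERS):
    """Return {pid: name} for records whose indirect identifiers match exactly one person."""
    index = defaultdict(list)
    for person in register:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for record in records:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match singles the person out
            matches[record["pid"]] = candidates[0]
    return matches


matches = link(pseudonymized, public_register)
print(matches)  # {'p-001': 'Alex Example'}
```

Record `p-002` stays unlinked only because two people in the register share the same combination. The rarer a combination is, the more likely the join returns a single name.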

For that matter, pseudonymization has a separate definition within the GDPR framework.

“[…] the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information”.

Unlike anonymous data, pseudonymous data falls under the GDPR.

De-identification techniques remove both direct and indirect personal identifiers from the dataset. On paper, the frontier between de-identified and anonymized data is simple. De-identification is supposed to offer technical safeguards that prevent an individual from ever being re-identified. In practice, it’s a “true until proven false” kind of situation.

And experts are pushing the line further every time they manage to re-identify data that was initially de-identified.

The difference between data masking and anonymization

Masking data is the standard way of pseudonymizing data. Masking, like pseudonymization, seeks to protect data by removing or altering its direct, sometimes indirect, identifiers. Masking is reversible, as is pseudonymization.
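The reversibility is easy to see in a sketch. The example below assumes a very simple form of masking: direct identifiers are swapped for a random token, and a separate lookup table (the "additional information" in the GDPR wording) maps tokens back to identities. The function names, field names, and records are hypothetical, and real masking tools offer many other transformations.

```python
import secrets


def pseudonymize(records, direct_identifiers=("name", "email")):
    """Replace direct identifiers with a random token; return masked rows and the lookup table."""
    lookup = {}  # token -> original direct identifiers, to be stored separately
    masked_rows = []
    for record in records:
        token = secrets.token_hex(8)
        lookup[token] = {k: record[k] for k in direct_identifiers}
        masked = {k: v for k, v in record.items() if k not in direct_identifiers}
        masked["token"] = token
        masked_rows.append(masked)
    return masked_rows, lookup


def reverse(masked_rows, lookup):
    """Restore the original rows for anyone who holds the lookup table."""
    restored = []
    for masked in masked_rows:
        record = {k: v for k, v in masked.items() if k != "token"}
        record.update(lookup[masked["token"]])
        restored.append(record)
    return restored


# Toy customer records, invented for the example.
customers = [
    {"name": "Alex Example", "email": "alex@example.com", "zip_code": "02138", "plan": "premium"},
    {"name": "Robin Sample", "email": "robin@example.com", "zip_code": "02139", "plan": "basic"},
]

masked_rows, lookup = pseudonymize(customers)
restored = reverse(masked_rows, lookup)
```

Whoever holds `lookup` can undo the masking completely, and the zip code left in the masked rows remains an indirect identifier.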

The term "anonymization" is often used in reference to masking. But masking is completely different from anonymization, and the main differences are:

  • Masking falls under the scope of GDPR, but true anonymization does not.
  • In contrast to masking, anonymization protects against re-identification (single out, link, infer) attacks.

De-anonymization attacks: what are they?

The mechanisms described above are not equally effective at preserving privacy. Hence, what you intend to do with the data matters. Companies regularly release or sell data that they claim is “anonymous”. It becomes a problem when the methods they used don’t guarantee that.

Many incidents have shown that pseudonymization is a poor privacy preservation mechanism. The indirect identifiers create a strong risk of re-identification. And as available data volumes grow, so do the opportunities to cross-reference datasets:

  • In 1997, an MIT graduate student re-identified the Massachusetts Governor in a de-identified medical dataset. She cross-referenced the information with publicly available voter registration records to identify patients.
  • In 2006, AOL shared de-identified search data as part of a research initiative. Researchers were able to link search queries to the individuals behind them.
  • In 2006, Netflix released an anonymized movie rating dataset as part of a contest. University of Texas researchers successfully re-identified some of the users.
  • In 2009, researchers were able to predict an individual’s Social Security Number using only publicly available information.

Studies have shown that de-identified data is, in fact, also re-identifiable. Researchers at UCLouvain in Belgium and Imperial College London found that:

“99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes”.
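The intuition behind such figures is that every extra attribute splits the population into smaller groups, until most people sit in a group of one. The sketch below is not the researchers' method, which relies on a statistical model. It simply counts, on a made-up six-person dataset, the share of records that are unique on a given combination of attributes. The function name `unique_fraction` and the data are assumptions for illustration.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose combination of the given attributes appears exactly once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


# Toy population, invented for the example.
people = [
    {"gender": "F", "zip_code": "1000", "birth_year": 1980},
    {"gender": "F", "zip_code": "1000", "birth_year": 1980},
    {"gender": "F", "zip_code": "1000", "birth_year": 1991},
    {"gender": "M", "zip_code": "1000", "birth_year": 1980},
    {"gender": "M", "zip_code": "2000", "birth_year": 1980},
    {"gender": "M", "zip_code": "2000", "birth_year": 1975},
]

by_gender = unique_fraction(people, ["gender"])
by_gender_zip = unique_fraction(people, ["gender", "zip_code"])
by_all = unique_fraction(people, ["gender", "zip_code", "birth_year"])
```

In this toy dataset nobody is unique on gender alone, one person out of six is unique on gender and zip code, and four out of six are unique once birth year is added.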

Another study, conducted on anonymized cell phone data, showed that:

“four spatio-temporal points are enough to uniquely identify 95% of the individuals”.

Technology is improving. More data is being created. As a result, researchers keep moving the boundary between de-identified and anonymous data. In 2017, researchers released a study stating that:

“web-browsing histories can be linked to social media profiles using only publicly available data.”

In 2021, several researchers investigated the risk of individuals' re-identification in genomic datasets “in the wild” (think direct-to-consumer DNA testing like 23andMe) based on linkage with publicly posted photos. Using the public OpenSNP dataset, researchers identified 126 individual genotypes for which they were able to successfully find publicly posted photographs.

Taken separately, some datasets aren’t re-identifiable. But combined with leaked data, they represent a larger threat. Students from Harvard University were able to re-identify de-identified data using leaked data.

Another alarming point is the exposure of personal data through breaches. The amount of personal information leaked keeps on growing. The ForgeRock Consumer Identity Breach Report predicted that 2020 would top the previous year’s number of breaches, with over 1.6 billion customer records already exposed in Q1 2020 in the US alone. According to the 2021 edition of the ForgeRock report, attacks involving usernames and passwords increased by 450%, adding up to 1.48 billion breached records.

The healthcare industry was the biggest target by a wide margin, accounting for 34% of all breaches, followed by financial services with 12%. It should come as no surprise to see those two verticals at the top of the ranking, as they handle some of the most confidential and valuable personal data.

In Europe, the number of data leaks reported to Belgium’s Data Protection Authority, APD, also increased significantly last year. “Over the past twelve months, reported cases of breached data have gone up to 1,529. The year before, there were 1,232,” the Brussels Times newspaper quotes APD spokesperson Aurélie Waeterlinckx as saying.

“Such leaks include, for example, a firm hacked by information pirates who install a malware, so its data is held hostage, or an organization that accidentally sends data to the wrong person,” Ms. Waeterlinckx added. “Sometimes an employee also loses a laptop containing files with personal data.”

“The increase in the number of reported cases is linked to the accelerated digitization of our society during the pandemic,” Waeterlinckx explained.

What we consider “anonymous data” is often not anonymous. Not all data sanitization methods achieve true anonymization. Each has its own advantages, but none offers the same level of privacy as anonymization.

Can anonymized data be de-anonymized?

Almost everything in our lives today is recorded and stored digitally. Every interaction with technology leaves a digital trail. After a dataset is publicly released, it can never be strengthened, only 'weakened' by future releases of information that may lead to the re-identification of individuals in the dataset.

As we produce more data, it becomes harder and harder to create truly anonymized data. And the risk of companies releasing potentially re-identifiable personal data grows.

Publicly available information online, combined with powerful computer hardware, has made it possible to re-identify seemingly anonymized data.

One way for companies to prevent data re-identification is to implement dedicated frameworks and tools such as Privacy Enhancing Technologies (PETs), a category of tools designed to guarantee data protection.
