Pseudonymization and anonymization have played an important role in data processing, data security, and data access since the General Data Protection Regulation (GDPR) came into force. Companies rely on these protection methods to comply with the regulation while still being able to use data for business projects.
Both data protection methods are closely related but they fall under different categories according to the GDPR. If you want to use sensitive data in projects while meeting all data protection obligations, make sure you understand all the nuances of those methods.
In this article, we are going to focus on personal data processing within organizations so you get a grasp of the topic. You’ll learn:
Disclaimer: Remember, this article is an educational one. It’s not legal advice and should not be treated as such.
According to the National Institute of Standards and Technology (NIST):
Personally Identifiable Information (PII): Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.
PII can be any information that lets you trace and identify an individual. So this can be full name, address, passport number, email, credit card numbers, date of birth, telephone number, login details, and many more.
Personal Identifiers (PID) are a subset of PII data elements that identify a unique individual and can permit another person to “assume” an individual's identity without their knowledge or consent.
Now that you understand what PII and PID are, let’s jump to the topic of anonymization and pseudonymization.
Pseudonymization is one of the methods in the family of data masking, a set of techniques that aim at hiding the identity of a data subject, a person, in a dataset. To pseudonymize personal data, you replace personal identifiers (PID) with random numbers or codes in a dataset.
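As an illustrative sketch (the records and field names here are invented), pseudonymization can be as simple as swapping direct identifiers for random codes while keeping the code-to-identity mapping in a separate, access-controlled table:

```python
import secrets

# Hypothetical records; "name" and "email" are direct identifiers (PID).
records = [
    {"name": "Alice Smith", "email": "alice@example.com", "city": "Berlin", "age": 34},
    {"name": "Bob Jones", "email": "bob@example.com", "city": "Munich", "age": 41},
]

def pseudonymize(rows, pid_fields):
    """Replace direct identifiers with random codes; the code-to-identity
    mapping must be stored separately, under access control."""
    lookup = {}
    pseudonymized = []
    for row in rows:
        code = secrets.token_hex(8)
        lookup[code] = {f: row[f] for f in pid_fields}
        masked = {k: v for k, v in row.items() if k not in pid_fields}
        masked["subject_id"] = code
        pseudonymized.append(masked)
    return pseudonymized, lookup

masked_rows, key_table = pseudonymize(records, {"name", "email"})
```

Note that whoever holds `key_table` can reverse the process, which is exactly why the GDPR requires that this additional information be kept separately.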
But don’t confuse pseudonymization with data encryption, which is an entirely different method of personal data protection.
The encryption method converts data into a code, so that only people with a decryption key can read it. Data encryption allows only authorized users to access data, whereas pseudonymization allows a wider audience to access some of the data while concealing the "key" fields.
Let’s take a look at how the GDPR defines pseudonymization.
According to art. 4 GDPR:
‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject (individual person who can be identified) without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;
In other words, pseudonymous data shouldn’t allow you to re-identify individuals unless it’s combined with the additional information that resolves the identifiers, or “pseudonyms,” back to real identities.
What’s important: the process is reversible, so there is a real likelihood of data subject re-identification. Pseudonymization lessens the chances of re-identification but does not prevent it in the event of a data leak. On its own, this method does not adequately protect individuals from misuse of their data and should be treated as an internal risk mitigation measure.
Under the General Data Protection Regulation, pseudonymized data is still considered personal data.
This is because merely removing unique identifiers is not enough to protect against re-identification. Quasi-identifiers, the “less sensitive” values that are left untouched, can be used by attackers to link the pseudonymized data with other data in their possession.
A job title is an example of a quasi-identifier. There is no direct link between a job title and an individual, especially since a data set might have a lot of similar job titles. However, more senior or 'unique' positions like CEO, Director, or Data Evangelist are generally held by a smaller group of people, making it easier to link a quasi-identifier to a real person.
Even if each quasi-identifier is very common, their combination is not. That can lead to revealing the personal data and, therefore, the identity of individuals.
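To make the linkage risk concrete, here is a toy sketch (all names and values are invented) of how an attacker could join a pseudonymized table with public data on a combination of quasi-identifiers that happens to be unique:

```python
# Hypothetical pseudonymized dataset: direct identifiers replaced by codes,
# but quasi-identifiers (job title, city) left untouched.
pseudonymized = [
    {"subject_id": "a1f3", "job_title": "CEO", "city": "Berlin", "salary": 180_000},
    {"subject_id": "9c2e", "job_title": "Engineer", "city": "Berlin", "salary": 75_000},
]

# Hypothetical public data, e.g. scraped from a professional network.
public_profiles = [
    {"name": "Carol King", "job_title": "CEO", "city": "Berlin"},
    {"name": "Dan Lee", "job_title": "Engineer", "city": "Hamburg"},
]

def link(masked, public, quasi_ids):
    """Re-identify rows whose quasi-identifier combination matches
    exactly one record in the public dataset."""
    matches = []
    for m in masked:
        key = tuple(m[q] for q in quasi_ids)
        hits = [p for p in public if tuple(p[q] for q in quasi_ids) == key]
        if len(hits) == 1:  # a unique combination enables re-identification
            matches.append((hits[0]["name"], m["subject_id"], m["salary"]))
    return matches

matches = link(pseudonymized, public_profiles, ("job_title", "city"))
# The unique ("CEO", "Berlin") combination links one row back to a name,
# exposing that person's salary despite the pseudonymization.
```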
Thus, you can’t freely transfer pseudonymized data to a third party or use it for machine learning or big data analysis without obtaining secondary consent from the data subject and complying with all GDPR requirements for personal data protection (purpose limitation, storage limitation, integrity and confidentiality, etc.).
And if you want to process data for analytics purposes, the data anonymization method is a more effective way to safeguard personal data privacy and derive insights from it.
Data anonymization is the process of irreversible transformation of personal data. The goal of anonymizing data is not only to remove personal identifiers but also to ensure that it’s impossible to determine who an individual is from the rest of the data.
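As a minimal sketch of the idea (toy data, deliberately crude rules), anonymization discards direct identifiers outright, keeping no lookup table, and coarsens the remaining values so they no longer single anyone out:

```python
def generalize_age(age):
    """Coarsen an exact age into a 10-year band."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

def anonymize(rows):
    """Drop direct identifiers entirely (no mapping is retained, so the
    transformation is one-way) and generalize quasi-identifiers."""
    out = []
    for row in rows:
        out.append({
            "age_band": generalize_age(row["age"]),
            "city": row["city"][0] + "***",  # crude suppression, for illustration only
        })
    return out

anonymous = anonymize([{"name": "Alice", "age": 34, "city": "Berlin"}])
```

In practice, robust anonymization relies on formal privacy models (e.g., k-anonymity) rather than ad-hoc rules like these; the sketch only shows the one-way, identifier-discarding character of the process.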
Let’s see how the GDPR defines data anonymization.
According to recital 26 EU GDPR:
The principles of data protection should therefore not apply to anonymous information, namely, information that does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
This regulation does not, therefore, concern the processing of such anonymous information, including for statistical or research purposes.
Since anonymous data doesn’t contain PII and the process is irreversible, it is no longer subject to the GDPR.
It has many business advantages for you. Anonymous data:
Don’t confuse anonymized data with de-identified data. The data is considered de-identified when direct or indirect identifiers linking it to a particular individual are removed, or when identification cannot be made through the use of known characteristics of that individual (deductive disclosure).
However, as pointed out by the International Association of Privacy Professionals (IAPP), de-identification doesn’t usually succeed in anonymizing data, since there are so many sources of data that still hold identifying information.
Anonymous data, on the other hand, doesn’t offer an opportunity for anyone, including the data handler, researcher, or a third party, to link the information to a particular individual.
Although anonymization sounds like a great plan to unlock the potential of your data, be mindful of its limitations:
A good example is the Netflix case of 2007. The streaming company published 10 million movie rankings as part of a public challenge aimed at improving its recommendation system. Though Netflix anonymized some of the data by removing personal details and replacing names with random numbers, two researchers de-anonymized part of it. By comparing rankings and timestamps with publicly available information on the IMDb website, they revealed personal data.
Anonymization reduces the risk of re-identification, but the residual risk depends on other factors too, for example, what controls are put in place to prevent the anonymous dataset from being linked back to personal data. A good source of recommendations on the accountability and governance measures needed for anonymization is ICO’s Chapter 4.
What’s also important: data anonymization is itself a form of data processing. Therefore, companies must comply with the GDPR processing requirements, including consent and purpose limitation (see Article 5(1)(b)). In other words, companies must have the user’s agreement to anonymize their personal data.
And lastly, companies that want to use anonymous data freely have to prove that data subjects are no longer identifiable. Those rules apply to all anonymization methods.
So far, you understand that both pseudonymization and anonymization involve hiding or masking personal data in some way.
The key difference between pseudonymization and anonymization lies in the likelihood of re-identification of the PII. Pseudonymized data carries a higher likelihood of data subject re-identification than anonymous data.
This leads to a key difference in the GDPR. Pseudonymous data is still considered personal data under the GDPR while anonymized data isn’t.
IAPP summarizes the obligations under the GDPR for both pseudonymous and anonymous data.
Before you decide to employ anonymous data in your analytics project, know the three risks of data re-identification.
According to the Article 29 Working Party, anonymous data is robust if it’s protected against the following attacks:
After reading this, you might be wondering to what extent anonymous data is really anonymous. If you’re curious, read this article.
Is there a safer method than the traditional method of data anonymization? Yes, it’s synthetic data generation.
Synthetic data is an outcome of artificial data generation. The new dataset resembles the quality of the original data and retains the statistical distribution. This means synthetic data looks like and behaves like real personal data.
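As a deliberately simplistic illustration of the idea (production generators model the joint distribution across all columns, not one column at a time), one can fit a distribution to an original column and sample entirely new values from it:

```python
import random
import statistics

def fit_and_sample(values, n):
    """Fit a normal distribution to a numeric column and draw fresh
    synthetic values from it. This toy generator preserves the column's
    mean and spread but none of the original records."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [random.gauss(mu, sigma) for _ in range(n)]

# Hypothetical original column.
real_salaries = [52_000, 61_000, 58_000, 75_000, 49_000]
synthetic_salaries = fit_and_sample(real_salaries, 1000)
```

The synthetic values follow the same statistical distribution as the originals without reproducing any individual record, which is the core property the article describes.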
To maintain a safe level of privacy compliance, aim to generate privacy-preserving synthetic data. Privacy-preserving synthetic data generation breaks the relationship between the original data subject and the synthetic data, increasing the level of data protection.
However, machine learning models can memorize patterns of the original dataset, which increases the privacy risk. To keep synthetic data utility high while minimizing privacy risk, you can add additional privacy layers to synthetic data, e.g., differential privacy.
For context, differential privacy (DP) is a mathematically sound definition of privacy for statistical and machine learning purposes. By looking at the output of a differentially private algorithm, one cannot determine whether a given individual's data was included in the original dataset or not. To put it another way, a differentially private algorithm is guaranteed to remain stable regardless of whether an individual joins or leaves the dataset.
DP algorithms are used to create differentially-private synthetic records based on the original distribution of data. As a result, the synthetic data benefits from the theoretical guarantees that DP provides.
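For intuition, here is a minimal sketch (not a production implementation) of the Laplace mechanism, a classic building block behind many DP algorithms, applied to a simple counting query:

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value with Laplace noise of scale sensitivity/epsilon,
    making the released statistic epsilon-differentially private."""
    scale = sensitivity / epsilon
    # The difference of two exponential draws follows a Laplace distribution.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_value + noise

# Hypothetical dataset: ages of five individuals.
ages = [34, 41, 29, 57, 45]

# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
noisy_count = laplace_mechanism(len(ages), sensitivity=1, epsilon=0.5)
```

The `epsilon` parameter controls the privacy/utility trade-off: smaller epsilon means more noise and stronger privacy, so no single person's presence or absence noticeably shifts the output.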
These layers of protection significantly enhance the privacy of the synthetic data. However, no method can ensure perfect privacy while maintaining some utility. The GDPR requires that companies assess the residual risks of re-identification.
For example, at Statice, we developed a set of evaluations to assess the utility and quantify the privacy risk of the synthetic data our software produces.
As it turns out, synthetic data is safer than traditional anonymization methods, and incomparably safer than pseudonymization.
What’s key, privacy-preserving synthetic data doesn’t fall under the scope of the GDPR so you can freely use this data for your business projects. For example, use synthetic data in AI/ML model training, big data analysis, internal data sharing, data testing, or product development.
Contact us and get feedback instantly.