How to manage re-identification risks with synthetic data

data privacy
By Omar Ali Fdal · July 27, 2021 · 6 minutes read

This article originates from our presentation at the IPEN webinar 2021, an event organized by the European Data Protection Supervisor. The conference gathered speakers from the public and private sectors around the topic of synthetic data. We presented the challenges of data privacy and the protection mechanisms of synthetic data.


This article covers:


  • the main privacy risks associated with anonymized data;
  • the levels of protection against re-identification offered by data protection techniques;
  • the privacy-preservation properties of synthetic data.

Privacy risks and re-identification mechanisms of anonymized data


Companies are bound to protect the personal data they process from unauthorized access and disclosure. Commonly known as hacking, unauthorized access can stem from broken or misconfigured authentication mechanisms, or from stolen credentials.

Additionally, organizations are vulnerable to unintentional leaks. 94% of the enterprises surveyed for Egress’ Insider Data Breach Survey 2021 reported insider data breaches last year. Phishing or, more commonly, human mistakes lead to personal data leaks when the data isn’t sufficiently protected.

Confidentiality breaches can also take the form of voluntary data releases with insufficient privacy protection mechanisms.

Today, a company can freely process or release personal data once it has been effectively anonymized. Under data protection laws, that means that the re-identification of a data subject in that dataset is no longer possible. And this applies whether the dataset stands alone or is combined with other data sources.

There is no universally accepted legal definition of what “no longer possible” re-identification technically means. The closest to a definition is the work of the Article 29 Working Party (WP29). In 2014, it evaluated the re-identification risks of several techniques in the framework of European regulations and identified three criteria strongly associated with re-identification risks:

  • Possibility to single out an individual in the dataset, meaning some of the records that identify an individual can be isolated from the data. 
  • Possibility to link at least two records that concern the same data subject within a dataset or between two separate datasets. A linkage attack is an attempt at such an event.
  • Possibility to infer information in the dataset, meaning the values of a set of attributes can be deduced from the data. An attribute inference attack is an attempt at such an event. 

Researchers contested the robustness of the criteria, stressing for example the different types of inference that this approach didn’t take into account.

Organizations traditionally used several methods to address these risks and protect personal data. For example, data masking techniques or generalization, such as pseudonymization or k-anonymity, transform the original data until the data subject is no longer identifiable. But how safe are these methods faced with re-identification risks?



How effective are pseudonymization and k-anonymity against re-identification?


If the re-identification of an individual takes place, it can lead to information disclosure. Thus it’s crucial to understand and assess the residual privacy risks inherent to each data protection method.

Pseudonymization

The first type of protection mechanism, pseudonymization or masking techniques, focuses on removing or hiding Personally Identifiable Information (PII), such as names or phone numbers. The other “less sensitive” values are left untouched. These types of attributes, like the demographic information shown in the table below, are generally referred to as quasi-identifiers.

Figure 1 - Data masking techniques remove PII from the data and leave quasi-identifiers untouched



Pseudonymization is a step in the right direction, but merely removing unique identifiers is not enough to protect against the re-identification risks presented above. In particular, an attacker can use the quasi-identifiers to link the pseudonymized dataset with other data in their possession, leading to re-identification. This is possible because, even if each quasi-identifier is very common, their combination is not. Researchers have repeatedly demonstrated that given enough quasi-identifiers, it's possible to uniquely identify anyone.
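To make this concrete, here is a minimal sketch of a linkage attack in Python with pandas. The records, names, and attributes are entirely hypothetical; a join on the quasi-identifiers is all an attacker needs:

```python
import pandas as pd

# Pseudonymized release: names removed, quasi-identifiers kept (hypothetical data).
released = pd.DataFrame({
    "zip": ["10115", "10115", "80331"],
    "birth_year": [1985, 1990, 1985],
    "sex": ["F", "M", "F"],
    "diagnosis": ["flu", "asthma", "diabetes"],  # sensitive attribute
})

# Auxiliary data the attacker already holds (e.g. a public voter roll).
auxiliary = pd.DataFrame({
    "name": ["Alice", "Carol"],
    "zip": ["10115", "80331"],
    "birth_year": [1985, 1985],
    "sex": ["F", "F"],
})

# A linkage attack is just a join on the quasi-identifiers.
linked = auxiliary.merge(released, on=["zip", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])
```

Even though ZIP code, birth year, and sex look harmless on their own, their combination singles out each individual, so the join recovers the sensitive diagnosis for every matched name.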

On The Observatory of Anonymity, an interactive online project from researchers of the Computational Privacy Group (CPG), individuals can assess the likelihood of being re-identified in a dataset from several demographics not considered unique identifiers, such as gender or location.

Figure 2 - With the above values on the 6 attributes Country, Home region, Birthdate, Sex, Marital Status and Education level, an individual in Austria would have a 100% chance of being re-identified within an “anonymized dataset”. Source: The Observatory of Anonymity

 

The vulnerability to linkage attacks of pseudonymized data explains why privacy regulations chose to consider it as personal data. Under the GDPR, organizations must handle pseudonymized data with the same care and constraints as personal data.

“Generally speaking removing directly identifying elements in itself is not enough to ensure that identification of the data subject is no longer possible.” Article 29 Data Protection Working Party  


k-anonymity, l-diversity, and t-closeness


Another type of protection technique acknowledges the limits of pseudonymization and goes a step further to protect quasi-identifiers. k-anonymity maintains privacy by editing quasi-identifiers, via suppression and generalization, so that each combination of them is present at least k times. Since the same quasi-identifiers are shared between different rows, k-anonymity prevents unique joins that expose sensitive attributes.
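As an illustration, the k of a dataset can be computed by counting how often each quasi-identifier combination occurs; the records below are made up for the example:

```python
from collections import Counter

# Hypothetical records after generalization:
# (age band, truncated ZIP, sex) are the quasi-identifiers.
records = [
    ("30-40", "101**", "F"),
    ("30-40", "101**", "F"),
    ("30-40", "101**", "M"),
    ("30-40", "101**", "M"),
]

def k_anonymity(rows):
    """Smallest equivalence-class size over the quasi-identifier combinations."""
    return min(Counter(rows).values())

print(k_anonymity(records))  # each combination appears twice, so k = 2
```

A single record with a unique combination would drop k back to 1, which is why suppression and generalization are applied until every combination is shared.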

Figure 3 - The k-anonymity method hides the individual record in a group of similar records to reduce the vulnerability to linkage attacks.


However, research showed that k-anonymity is subject to attribute inference attacks.

Other techniques, such as l-diversity or t-closeness, partially address the shortcomings of k-anonymity. However, they increase the complexity, reduce the utility of the data, and still assume that some attributes are more special than others.

In summary, traditional methods almost systematically present re-identification risks. To completely remove privacy risks, one would need to remove most, if not all, of the data, reducing its utility to zero.

Figure 4 - A “zero privacy risk” approach would most likely reduce data utility to zero as well.


These privacy limitations are the reason why organizations started to look beyond traditional protection methods to anonymize data. And it's where synthetic data as a privacy mechanism comes into play. 


Protecting privacy with synthetic data


Is synthetic data anonymous?


In our context, synthetic data is algorithmically created data that looks and behaves like real data. Generative models learn the statistical distribution in the original data and draw artificial samples from it to generate synthetic data.


Figure 5 - To generate synthetic data, models learn the joint probability distribution in an original dataset and generate a new one with the same distribution.
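A toy sketch of this idea, assuming purely numeric data and a simple Gaussian model rather than the deep generative models typically used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "original" data: two correlated numeric columns (e.g. age, income).
original = rng.multivariate_normal(
    [40, 50_000], [[100, 30_000], [30_000, 4e8]], size=1000
)

# A minimal generative model: estimate the joint (Gaussian) distribution...
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# ...and draw brand-new samples from it. No synthetic row maps back
# to any particular original row.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(np.allclose(mean, synthetic.mean(axis=0), rtol=0.1))
```

The synthetic rows are drawn from the fitted distribution, not copied or transformed from original rows, which is why there is no key mapping one back to the other.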


The synthetic data generation process completely breaks the one-to-one relation between original and synthetic records. Contrary to other techniques, like pseudonymization, there is no key to go back from the synthetic records to the original ones. This process is irreversible. This is already a desirable property of synthetic data from a privacy-preservation point of view, and it’s not the only one.

The deep learning models used for synthetic data generation can solve complex tasks due to their large computational capacity. However, they might memorize features in the training data. Ultimately, memorized patterns can be reproduced in the synthetic data, leading to privacy leaks.

Luckily, we can add additional layers of privacy to the synthesization mechanisms, such as differential privacy.


Differentially-private synthetic data


Differential Privacy (DP) is a mathematical definition of privacy widely accepted by the community. Loosely speaking, DP uses noise to mask the presence of any particular individual in the input data. 

To generate differentially-private synthetic records, the models learn the original data distribution with a DP algorithm. This way, the synthetic data benefits from the theoretical guarantees that DP provides.
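To illustrate the core idea behind DP (not the actual DP training algorithms used for generative models), here is a sketch of the Laplace mechanism applied to a simple count query; the epsilon value and counts are illustrative:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, rng):
    # A counting query has sensitivity 1, so Laplace noise with scale
    # 1/epsilon makes the released count epsilon-differentially private.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
print(round(dp_count(1000, epsilon=0.5, rng=rng), 1))
```

The noise masks whether any single individual contributed to the count: adding or removing one person changes the true count by at most 1, which the noise makes statistically indistinguishable.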

These layers of privacy protection greatly enhance the privacy of the synthetic data. However, no technique can guarantee perfect privacy while keeping some utility. Regulations such as the GDPR require that companies assess the residual risks of re-identification. Currently, this assessment is left to the appreciation of each company. At Statice, we developed evaluations that let you measure the privacy risk of synthetic data.


 Assessing re-identification risk: linkage potential 


The Statice software comes with a set of evaluations to assess the utility and privacy of the synthetic data it generates. Two of these evaluations let you measure the linkage and attribute inference risks on your synthetic data. 

The Linkage Potential evaluation is designed to detect synthetic records presenting a high probability of linkage to unique or rare original records. It does so by measuring how close synthetic records are to the original data points.

Figure 6 - Within the Statice SDK, the Linkage Potential Evaluator searches for compromising synthetic records that unambiguously link to an original record and would be liable to leak information.


We can also evaluate this risk overall at the dataset level by looking at the distribution of the linkage potential of synthetic records with original records and comparing it with the distribution of linkage potential of original records with other original records. Ultimately, synthetic records should not be closer to the original ones than original records are to other original records. 
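This comparison can be sketched with nearest-neighbour distances; the data, metric, and threshold below are illustrative stand-ins, not the actual Statice evaluator:

```python
import numpy as np

rng = np.random.default_rng(1)
original = rng.normal(size=(200, 3))
synthetic = rng.normal(size=(200, 3))  # stand-in for generated records

def nn_distances(queries, reference, skip_self=False):
    """Distance from each query record to its nearest reference record."""
    d = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=-1)
    if skip_self:
        np.fill_diagonal(d, np.inf)  # an original record is not its own neighbour
    return d.min(axis=1)

syn_to_orig = nn_distances(synthetic, original)
orig_to_orig = nn_distances(original, original, skip_self=True)

# Red flag if synthetic records sit much closer to originals
# than originals do to each other.
print(np.median(syn_to_orig) >= 0.5 * np.median(orig_to_orig))
```

Here the original-to-original distances play the role of the baseline distribution: a synthetic record far below that baseline is a candidate for an unambiguous link.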

Figure 7 - Within the Statice evaluation, the similarity between original records establishes the baseline for deciding when a synthetic record is suspicious



Privacy risk assessments: Attribute inference


The Attribute Inference evaluation detects specific information leaks, i.e., information that can only be learned from the original data. In a scenario where an attacker has partial knowledge of the original data, we want to know how much knowledge about specific records they could gain by seeing the synthetic data. If the synthetic data provides more information about the training data than it does about held-out test data, there is a privacy leak.
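A simplified version of this test, with made-up data and a 1-nearest-neighbour attacker standing in for a real inference model:

```python
import random

random.seed(0)

# Hypothetical setup: the attacker knows quasi-identifier x and wants the
# sensitive label y. y depends only weakly on x, so a leak-free synthesis
# reveals the general pattern but nothing record-specific.
def make_records(n):
    rows = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = 1 if x + random.gauss(0, 1) > 0 else 0
        rows.append((x, y))
    return rows

train = make_records(500)      # records the generator was fitted on
holdout = make_records(500)    # records it never saw
synthetic = make_records(500)  # stand-in for leak-free synthetic data

def infer(x, reference):
    """1-nearest-neighbour attack: predict the label of the closest record."""
    return min(reference, key=lambda r: abs(r[0] - x))[1]

def attack_accuracy(targets, reference):
    return sum(infer(x, reference) == y for x, y in targets) / len(targets)

gap = attack_accuracy(train, synthetic) - attack_accuracy(holdout, synthetic)
print(abs(gap) < 0.15)  # similar accuracy on train and holdout: no specific leak
```

The attack succeeding equally well on training and held-out records means the synthetic data only exposes the general x-to-y relationship, which is exactly the utility we want to preserve.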

Figure 8 - Within the Statice SDK, we analyze the robustness of the synthetic data to a linkage attack, an attack based on linking a target database with a source of auxiliary information to acquire knowledge on sensitive attributes.


 

Below, the left picture shows a privacy-preserving synthesization where the synthetic data allows inferring general information (better than the random-guess baseline) while not providing more information than the original data.

Figure 9 - Comparing the inference risk between synthetic and original records.

Managing privacy risks when working with personal data

Using and sharing data in a privacy-preserving way is challenging. Understanding the risk is crucial no matter what technology is used. Privacy risks exist, and they must be taken into account when working with data, whether pseudonymized or synthetic.

Some techniques like differentially private synthetic data provide stronger protection against re-identification than traditional masking methods. Thus, they are an excellent opportunity to combine utility and privacy for data processing. Additionally, the right tools also allow enterprises to assess residual risks and comply with data protection regulations.
