This article originates from our presentation at the IPEN webinar 2021, an event organized by the European Data Protection Supervisor. The conference gathered speakers from the public and private sectors around the topic of synthetic data. We presented the challenges of data privacy and the protection mechanisms of synthetic data.
Companies are bound to protect the personal data they process from unauthorized access and disclosure. Commonly known as hacking, unauthorized access can stem from broken or misconfigured authentication mechanisms, or from stolen credentials.
Additionally, organizations are vulnerable to unintentional leaks. 94% of the enterprises surveyed for Egress’ Insider Data Breach Survey 2021 reported insider data breaches last year. Phishing, or most commonly, human mistakes lead to personal data leaks when the data isn’t sufficiently protected.
Confidentiality breaches can also take the form of voluntary data releases with insufficient privacy protection mechanisms.
Today, a company can freely process or release personal data once it has been effectively anonymized. Under data protection laws, that means that the re-identification of a data subject in that dataset is no longer possible. And this applies whether the dataset stands alone or is combined with other data sources.
There is no universally accepted legal definition of what “re-identification is no longer possible” technically means. The closest to a definition is the work of the Article 29 Working Party (WP29). In 2014, they evaluated the re-identification risks of several techniques in the framework of European regulations. They identified three criteria strongly associated with re-identification risks: singling out (isolating the records of an individual), linkability (linking records concerning the same individual across datasets), and inference (deducing the value of an attribute from other attributes).
Researchers contested the robustness of the criteria, stressing for example the different types of inference that this approach didn’t take into account.
Organizations traditionally used several methods to address these risks and protect personal data. For example, data masking techniques or generalization, such as pseudonymization or k-anonymity, transform the original data until the data subject is no longer identifiable. But how safe are these methods faced with re-identification risks?
If the re-identification of an individual takes place, it can lead to the disclosure of sensitive information. Thus it’s crucial to understand and assess the residual privacy risks inherent to each data protection method.
The first type of protection mechanisms, pseudonymization or masking techniques, focuses on removing or hiding Personally Identifying Information (PII), such as names or phone numbers. The other “less sensitive” values are left untouched. These types of attributes, like the demographic information shown in the table below, are generally referred to as quasi-identifiers.
Pseudonymization is a step in the right direction, but merely removing unique identifiers is not enough to protect from the re-identification risks presented above. In particular, an attacker can use the quasi-identifiers to link the pseudonymized dataset with other data in their possession, leading to re-identification. This is possible because, even if each quasi-identifier is very common, their combination is not. Researchers repeatedly demonstrated that given enough quasi-identifiers, it's possible to uniquely identify anyone.
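To make the linkage risk concrete, here is a minimal sketch in Python with pandas, using entirely made-up records (the column names and values are illustrative, not from any real dataset). Removing the name is not enough: an attacker holding auxiliary data can join on the untouched quasi-identifiers.

```python
import pandas as pd

# Hypothetical pseudonymized dataset: direct identifiers replaced by tokens,
# but quasi-identifiers (zip_code, age, gender) left untouched.
pseudonymized = pd.DataFrame({
    "token": ["u1", "u2", "u3"],
    "zip_code": ["10115", "10115", "20095"],
    "age": [34, 51, 34],
    "gender": ["F", "M", "F"],
    "diagnosis": ["flu", "diabetes", "asthma"],  # sensitive attribute
})

# Auxiliary data an attacker might hold (e.g. a public voter list).
auxiliary = pd.DataFrame({
    "name": ["Alice", "Carol"],
    "zip_code": ["10115", "20095"],
    "age": [34, 34],
    "gender": ["F", "F"],
})

# Linkage attack: join the two datasets on the quasi-identifiers alone.
linked = auxiliary.merge(pseudonymized, on=["zip_code", "age", "gender"])
print(linked[["name", "diagnosis"]])
# Each "anonymous" diagnosis is now attached to a name.
```

Even though zip code, age, and gender are each common on their own, their combination is unique here, so the join re-identifies every record.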
On The Observatory of Anonymity, an interactive online project from researchers of the Computational Privacy Group (CPG), individuals can assess their likelihood of being re-identified in a dataset from a handful of demographic attributes that are not considered unique identifiers, such as gender or location.
The vulnerability of pseudonymized data to linkage attacks explains why privacy regulations still treat it as personal data. Under the GDPR, organizations must handle pseudonymized data with the same care and constraints as personal data.
“Generally speaking removing directly identifying elements in itself is not enough to ensure that identification of the data subject is no longer possible.” Article 29 Data Protection Working Party
Another type of protection technique acknowledges the limits of pseudonymization and goes a step further to protect quasi-identifiers. k-anonymity maintains privacy by editing quasi-identifiers, via suppression and generalization, so that each combination of them is present at least k times. Since the same quasi-identifiers are shared between at least k rows, k-anonymity prevents the unique joins that expose sensitive attributes.
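As an illustration, here is a small, hypothetical check of the k-anonymity property in Python (the generalized columns and the `k_anonymity` helper are made up for this sketch): the value of k is simply the size of the smallest group of rows sharing the same quasi-identifier combination.

```python
import pandas as pd

# A hypothetical generalized dataset: ages bucketed, zip codes truncated.
df = pd.DataFrame({
    "age_range": ["30-40", "30-40", "30-40", "50-60", "50-60"],
    "zip_prefix": ["101**", "101**", "101**", "200**", "200**"],
    "diagnosis": ["flu", "asthma", "flu", "diabetes", "flu"],
})

def k_anonymity(df, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    return df.groupby(quasi_identifiers).size().min()

k = k_anonymity(df, ["age_range", "zip_prefix"])
print(k)  # 2: every quasi-identifier combination appears at least twice
```

Here the dataset is 2-anonymous: an attacker who knows a target's age range and zip prefix can narrow the target down to a group, but not to a single row.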
However, research showed that k-anonymity is subject to attribute inference attacks: if every record in an equivalence class shares the same sensitive value, an attacker who knows a target's quasi-identifiers learns that value without having to single the target out.
Other techniques, such as l-diversity or t-closeness, partially address the shortcomings of k-anonymity. However, they increase complexity, reduce the utility of the data, and still assume that only a predefined subset of attributes could identify an individual.
In summary, traditional methods almost systematically present re-identification risks. To completely remove these risks, one would need to remove most, if not all, of the data, reducing its utility to zero.
These privacy limitations are the reason why organizations started to look beyond traditional protection methods to anonymize data. And it's where synthetic data as a privacy mechanism comes into play.
In our context, synthetic data is algorithmically created data that looks and behaves like real data. Generative models learn the statistical distribution in the original data and draw artificial samples from it to generate synthetic data.
The synthetic data generation process completely breaks the 1-to-1 relation between the original and synthetic records. Contrary to other techniques, like pseudonymization, there is no key to go back from the synthetic records to the original ones. The process is irreversible. This is already a desirable property of synthetic data from a privacy preservation point of view, and it’s not the only one.
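A toy illustration of this idea, using a deliberately simple generative model (a multivariate Gaussian fitted with NumPy, standing in for the deep generative models used in practice): the synthetic sample preserves statistical structure, such as the correlation between columns, while no synthetic row is derived from any particular original row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "original" data: two correlated numeric columns (think age, income).
original = rng.multivariate_normal(
    mean=[40, 50_000], cov=[[100, 20_000], [20_000, 1e8]], size=1_000
)

# A minimal generative model: estimate the joint distribution's parameters...
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# ...and draw fresh samples from it. Every synthetic row is drawn from the
# learned distribution, not transformed from an original row.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

# The column correlation survives in the synthetic sample.
print(round(float(np.corrcoef(synthetic[:, 0], synthetic[:, 1])[0, 1]), 2))
```

Real tabular synthesizers model far richer distributions (categorical columns, non-linear dependencies), but the privacy-relevant property is the same: sampling, not record-by-record transformation.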
The deep learning models used for synthetic data generation can solve complex tasks due to their large computational capacity. However, they might memorize features in the training data. Ultimately, memorized patterns can be reproduced in the synthetic data, leading to privacy leaks.
Luckily, we can add additional layers of privacy to the synthesization mechanisms, such as differential privacy.
Differential Privacy (DP) is a mathematical definition of privacy widely accepted by the community. Loosely speaking, DP uses noise to mask the presence of any particular individual in the input data.
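As a minimal sketch of the idea (not Statice's implementation), here is the classic Laplace mechanism applied to a counting query in Python: because adding or removing one person changes a count by at most 1, Laplace noise with scale 1/ε yields an ε-differentially-private answer.

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_count(data, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person
    changes the result by at most 1), so Laplace noise with scale
    1/epsilon is enough for epsilon-DP.
    """
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 51, 34, 29, 62, 45]
noisy = dp_count(ages, lambda a: a > 40, epsilon=1.0)
print(noisy)  # the true count is 3, plus calibrated noise
```

The noise masks any single individual's contribution: the answer distribution barely changes whether or not a given person is in the data, which is exactly the DP guarantee.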
To generate differentially private synthetic records, the models learn the original data distribution with a DP training algorithm. This way, the synthetic data inherits the theoretical guarantees that DP provides.
These layers of privacy protection greatly enhance the privacy of the synthetic data. However, no technique can guarantee perfect privacy while keeping some utility. Regulations such as the GDPR require that companies assess the residual risks of re-identification. Currently, this assessment is left to the appreciation of each company. At Statice, we developed evaluations that let you measure the privacy risk of synthetic data.
The Statice software comes with a set of evaluations to assess the utility and privacy of the synthetic data it generates. Two of these evaluations let you measure the linkage and attribute inference risks on your synthetic data.
The Linkage Potential evaluation is designed to detect synthetic records presenting a high probability of linkage to unique or rare original records. It does so by measuring how close synthetic records are to the original data points.
We can also evaluate this risk at the dataset level by comparing two distributions: the linkage potential of synthetic records with respect to the original records, and the linkage potential of original records with respect to other original records. Ultimately, synthetic records should be no closer to the original records than the original records are to each other.
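The following sketch illustrates this comparison with NumPy on random toy data (the `nearest_neighbor_distances` helper is invented for this example and is not the Statice implementation): we compare synthetic-to-original nearest-neighbor distances against the original-to-original baseline.

```python
import numpy as np

rng = np.random.default_rng(7)

def nearest_neighbor_distances(queries, references, exclude_self=False):
    """Distance from each query point to its closest reference point."""
    dists = np.linalg.norm(queries[:, None, :] - references[None, :, :], axis=-1)
    if exclude_self:
        np.fill_diagonal(dists, np.inf)  # ignore a point's distance to itself
    return dists.min(axis=1)

original = rng.normal(size=(200, 3))
synthetic = rng.normal(size=(200, 3))  # stand-in for model-generated data

# Baseline: how close originals sit to other originals.
d_oo = nearest_neighbor_distances(original, original, exclude_self=True)
# Under test: how close synthetic records get to the originals.
d_so = nearest_neighbor_distances(synthetic, original)

# A red flag would be synthetic points sitting systematically closer to
# the originals than originals sit to each other (possible memorization).
print(round(float(np.median(d_oo)), 2), round(float(np.median(d_so)), 2))
```

Here both samples come from the same distribution, so the two medians are comparable; a memorizing generator would pull the synthetic-to-original distances sharply below the baseline.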
The Attribute Inference evaluation detects leaks of specific information, that is, information that can only be learned from the original data. In a scenario where an attacker has partial knowledge of the original data, we want to know how much knowledge about specific records they could gain by seeing the synthetic data. If the synthetic data provides more information about the training data than it provides about held-out test data, there is a privacy leak.
Below, the left picture shows a privacy-preserving synthesization: the synthetic data allows an attacker to infer general information (better than the random-guess baseline) while not providing more information about the training records than about held-out records.
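The logic of such an evaluation can be sketched as follows, with made-up data and a simple nearest-neighbor attacker standing in for whatever inference model an evaluator might use (an illustration, not the Statice evaluation itself): we compare how well the synthetic data lets an attacker infer a sensitive attribute for training records versus held-out records.

```python
import numpy as np

rng = np.random.default_rng(1)

def infer_sensitive(known, synthetic_known, synthetic_sensitive):
    """1-NN attack: for each target, copy the sensitive value of the
    closest synthetic record, based on the attacker's partial knowledge."""
    dists = np.linalg.norm(known[:, None, :] - synthetic_known[None, :, :], axis=-1)
    return synthetic_sensitive[dists.argmin(axis=1)]

def make_records(n):
    """Toy records: 2 quasi-identifiers plus a correlated binary sensitive attribute."""
    qi = rng.normal(size=(n, 2))
    sensitive = (qi[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return qi, sensitive

train_qi, train_s = make_records(300)  # records used to fit the generator
test_qi, test_s = make_records(300)    # held-out records
syn_qi, syn_s = make_records(300)      # stand-in for the synthetic output

acc_train = (infer_sensitive(train_qi, syn_qi, syn_s) == train_s).mean()
acc_test = (infer_sensitive(test_qi, syn_qi, syn_s) == test_s).mean()

# Both accuracies beat the 0.5 random-guess baseline (general patterns are
# learnable); a large gap acc_train >> acc_test would signal a record-specific leak.
print(round(float(acc_train), 2), round(float(acc_test), 2))
```

Because the "synthetic" stand-in here is independent of the training set, the two accuracies are close: the attacker learns the general correlation but nothing specific to the training records.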
Using and sharing data in a privacy-preserving way is challenging. Understanding the risk is crucial no matter what technology is being used. Privacy risks exist, and they are to be taken into account when working with data, whether pseudonymized or synthetic data.
Some techniques like differentially private synthetic data provide stronger protection against re-identification than traditional masking methods. Thus, they are an excellent opportunity to combine utility and privacy for data processing. Additionally, the right tools also allow enterprises to assess residual risks and comply with data protection regulations.
Contact us and get feedback instantly.