The market for synthetic data continues to expand, and according to Gartner, synthetic data is expected to overtake real data in AI models by 2030. However, it is still unclear how synthetic data fits into data privacy compliance, primarily because the approach is so new.
This article sets out an overview of synthetic data from a data privacy perspective. It provides insight into how privacy technologies are used for GDPR compliance, how synthetic data works, and how to evaluate synthetic data compliance.
The General Data Protection Regulation (GDPR) establishes two different protected states for data: pseudonymization and anonymization. Both terms were well known prior to the GDPR, but the law established new terminology that defines them for the purposes of compliance.
Pseudonymization had no clear definition before the GDPR and was commonly understood to mean the application of masking and tokenization. When the GDPR came into force, a new definition and legal standard were established. According to Article 4 of the GDPR:
‘pseudonymization’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject (individual person who can be identified) without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person;
In other words, the GDPR standard of pseudonymization requires you to protect the entire dataset. You cannot re-identify individuals unless you are authorized to use the additional information (stored separately and securely) necessary to relink tokens or “pseudonyms” to the identity of individuals. This is a much higher standard than simple masking and tokenization.
Anonymization was widely used prior to the GDPR and still is. However, privacy protection is a spectrum, and whether something is truly “anonymous” is often up for debate. The aim is to remove identifiers to the point where the person is no longer identifiable.
According to recital 26 EU GDPR, anonymization is:
…information that does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
Anonymization doesn't just mean that a person's name no longer appears in the data. It also means that the person cannot be identified from the rest of the data, and that other datasets cannot be combined with it to identify the person either.
The key difference between pseudonymization and anonymization is that the latter irreversibly transforms the data and eliminates the possibility of re-identifying individuals within the dataset.
Both pseudonymization and anonymization are states that can apply to a dataset (or not), and both can be accomplished with a wide range of technologies.
The problem is that the GDPR does not specify which technologies or approaches should be used to meet these standards. Organizations use a wide variety of privacy-protection technologies either to comply with the GDPR, or to step outside of it by anonymizing data. Synthetic data compliance approaches attempt the latter: making data anonymous so that it is no longer subject to the GDPR.
Synthetic data is artificially generated, realistic data. Because it removes the relationship to the real data, it is usually considered to be "anonymized" data. This means that in many cases it can be used without being subject to the rules of the GDPR, as it is no longer considered personal data. Synthetic data is typically created in one of two ways: with rule-based or statistical models that mimic the properties of real data, or with machine-learning generative models trained on a real dataset.
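To make the statistical approach concrete, here is a deliberately minimal sketch (not a production generator, and not any vendor's actual method): it fits each column of a toy dataset independently and samples new values, whereas real synthetic data tools model the joint distribution across columns.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(real: dict, n: int) -> dict:
    """Toy column-wise synthesizer: numeric columns are resampled from a
    fitted normal distribution; categorical columns from empirical
    frequencies. Real generators model the *joint* distribution instead,
    so correlations between columns survive."""
    out = {}
    for name, col in real.items():
        col = np.asarray(col)
        if np.issubdtype(col.dtype, np.number):
            out[name] = rng.normal(col.mean(), col.std(), size=n)
        else:
            values, counts = np.unique(col, return_counts=True)
            out[name] = rng.choice(values, size=n, p=counts / counts.sum())
    return out

# Hypothetical five-person dataset used purely for illustration.
real = {"age": [34, 45, 29, 52, 41],
        "city": ["Berlin", "Paris", "Berlin", "Rome", "Paris"]}
fake = synthesize(real, n=1000)
```

Note how the synthetic rows are not copies of real individuals, which is the intuition behind treating synthetic data as anonymized; whether that intuition holds in practice is exactly what the compliance debate below is about.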
The question of whether or not synthetic data is compliant with the GDPR, or sits outside of the GDPR, is up for debate.
Synthetic data, like other datasets, is still subject to re-identification attacks and data breaches. There is no universally accepted legal definition of when re-identification is "no longer possible." The closest thing to a definition is the work of the Article 29 Working Party (WP29).
Synthetic data compliance approaches should therefore take into account the robustness of the synthetic dataset, and follow WP29 recommendations. This includes checking how the data stands up to testing against re-identification attacks such as singling out, linkage attacks, and inference attacks. If a dataset appears to be robust in the face of these attacks, it is more likely to meet the GDPR standard of anonymity, preventing the re-identification of original data subjects.
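One simple, illustrative robustness check in the spirit of the WP29 tests (a crude sketch, not a complete attack model) is to measure how many synthetic records exactly reproduce a real individual's combination of quasi-identifiers; the field names below are hypothetical:

```python
def exact_match_rate(real_rows, synthetic_rows, quasi_identifiers):
    """Fraction of synthetic rows that exactly reproduce a real person's
    combination of quasi-identifiers. A crude proxy for linkage and
    singling-out risk: a high rate means synthetic records could still be
    linked back to real individuals."""
    real_keys = {tuple(row[q] for q in quasi_identifiers) for row in real_rows}
    hits = sum(tuple(row[q] for q in quasi_identifiers) in real_keys
               for row in synthetic_rows)
    return hits / len(synthetic_rows)

# Hypothetical example: one of two synthetic rows matches a real person.
real = [{"zip": "10115", "age": 34, "sex": "F"},
        {"zip": "75001", "age": 52, "sex": "M"}]
synth = [{"zip": "10115", "age": 34, "sex": "F"},
         {"zip": "20095", "age": 40, "sex": "M"}]
risk = exact_match_rate(real, synth, ["zip", "age", "sex"])  # 0.5
```

Real privacy evaluators go well beyond exact matches (nearest-neighbor distances, inference tests, and more), but a non-trivial match rate on a check this simple would already be a red flag.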
In addition, applying further privacy protection approaches such as differential privacy can help. Differential privacy uses noise to mask the presence of any particular individual in the input data. However, the further application of privacy techniques can decrease the utility of the data, and a balance must always be struck.
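The noise-adding idea behind differential privacy can be sketched with the classic Laplace mechanism for a counting query (a minimal illustration, assuming a simple count with sensitivity 1, not any specific vendor's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy via the Laplace
    mechanism. A counting query has sensitivity 1 (adding or removing one
    person changes the count by at most 1), so the noise scale is
    1/epsilon: smaller epsilon means more noise and stronger privacy."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

noisy = laplace_count(1000, epsilon=0.5)  # roughly 1000, give or take a few
```

The utility trade-off mentioned above is visible in the `epsilon` parameter: halving it doubles the typical noise, making any one individual's presence harder to infer but the released statistic less accurate.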
Synthetic data vendors often offer built-in privacy evaluators to aid Data Protection Officers (DPOs) with privacy assessment. For example, at Anonos we offer evaluators that let you measure the privacy risk of synthetic data based on WP29 recommendations.
Many organizations use synthetic data, particularly for use cases such as collaborating with third parties. Organizations do not want to risk sharing real data, so instead they share synthetic data with their partners and collaborators. Others use synthetic data for advanced analytics, by replacing sensitive real data with synthetic data to speed up time-to-data. In addition, for machine learning projects synthetic data can be used to train applications, as these require large amounts of high quality data that is usually not available in the real world. With synthetic data, small data sets can be extended and expanded to provide more information to train the model.
For example, synthetic data can be applied to brain imaging for healthcare purposes. Machine learning engineers at Charité created synthetic neuroimaging data (artificially generated images of brains that do not belong to real people), as original brain scans are sensitive personal information that could re-identify a patient. To conduct research on brain vessels, they built synthetic MRI images.
These are just some of the ways synthetic data can serve a variety of business use cases. However, in all situations, organizations must consider whether their data use case is compliant with the law. Synthetic data, like other privacy-enhancing technologies, can be done well or badly.
The UK's Information Commissioner's Office (ICO) notes that synthetic data can be useful for complying with the GDPR and other privacy regulations in a number of ways. First, a large amount of synthetic data can be created from a small real dataset, which aligns with the principle of data minimization (using as little real data as possible). The ICO cautions that synthetic data should be checked to ensure it does not contain discriminatory biases and that it is representative. In addition, organizations should consider whether their use of synthetic data could have "consequences (i.e. legal or health consequences) for individuals."
Synthetic data is an emerging topic and adoption is still in its infancy. While synthetic data compliance approaches provide significant benefits for organizations and individuals, each case should be carefully evaluated.
Using data in a privacy-respectful way while preserving utility can be difficult. Privacy risks exist no matter what technology is used, and they should always be taken into consideration when working with sensitive data. Even if data is pseudonymized or synthetic, organizations should evaluate their tools for robustness. The use of synthetic data can provide strong protection against re-identification, and can allow organizations to achieve both utility and privacy for data processing. Synthetic data may be new, but it should be considered by organizations as “yet another valid tool in the ever-growing privacy tool belt.”
Contact us and get feedback instantly.