The validity and impact of synthetic data in healthcare

By Elise Devaux, November 24, 2021 - 7 minutes read

Last week, our team had the opportunity to attend Medica, an international trade fair for healthcare and life sciences held in Düsseldorf.

Our CEO, Omar Ali Fdal, moderated and participated in a panel discussion around the validity and future impact of synthetic data in healthcare. Below, we share the highlights from that discussion with healthcare and data experts. You'll find all the links at the bottom of the article.

Key takeaways:

  • Alexander Pimperl from AstraZeneca presented how data plays a role at each step of a drug's lifecycle and how synthetic data could be an alternative to real patient data in many applications, from risk prediction to product development.
  • Tabea Kossen from Charité explained how the sensitive nature of health data creates barriers to data access that protect patient privacy but hinder research. She also presented how synthetic brain images, generated with the proper privacy mechanism in place, could be used as an alternative to real imaging to fuel research on brain vessel segmentation.
  • Josephine Jacob from InGef showed how restricted access to national claims data in Germany slows down and sometimes prevents collaboration between healthcare research institutions. She presented InGef's most recent project on claims data anonymization.

The panelists

Tabea Kossen is a Machine Learning Engineer at ai4medicine and a researcher at Charité Berlin. Her work and research focus on overcoming the scarcity of medical data by training Generative Adversarial Networks for stroke imaging.

Josephine Jacob is the Head of Health Services Research at InGef, Berlin Institute for Applied Health Research, which supports the development of health care in Germany with practical research. 

Alexander Pimperl is Director of the Data Insights & Business Intelligence unit at AstraZeneca, the pharmaceutical and biotechnology company. 

Omar Ali Fdal is CEO and co-founder of Statice.


AI and the impact of data in the medicine lifecycle


Alexander opened the panel with an overview of the role of data throughout the medicine lifecycle. From the research and development phases to the launch phases, data plays a crucial role in the drug development process.

When launching new medicines, data supports risk-prediction efforts. It also complements digital solutions that enhance patient engagement or provide a personalized patient experience with real-time health insights. In general, the need for data is constant over the medicine lifecycle and grows as we develop more personalized and intelligent, yet data-hungry, systems.

“When we redefine clinical trials, we now see digital tools and data generated from the real world with all kinds of devices. That could make clinical trials faster and, at the same time, improve safety and the outcomes of the patient experience.” A. Pimperl


© Alexander Pimperl, presentation at the Medica Health IT Forum conference 2021


As Alexander explained, synthetic data could play a role in these data applications. At the heart of that potential are data privacy and data access challenges.

Privacy and data access in healthcare  


Regarding privacy, Tabea pointed out that the most important aspect of the relationship between patients and healthcare organizations is respecting and protecting patients' privacy.

“In the medical field, privacy comes down to the risk of re-identification. If a patient is re-identified, then the privacy is not guaranteed. Nobody wants their brain scan to end up on the desk of their employer. And that's something that we don't want to happen. The patients trust us to handle the data with care.” T.Kossen

While some information, like a brain scan, might uniquely and directly identify a patient, other less unique information can be just as sensitive, as our third panelist explained.

Josephine works alongside a 30-person team to evaluate medical interventions and investigate epidemiological research questions, for which they notably use claims data provided by German statutory health insurances.

“Re-identification is also a major challenge. Although when using claim data, we are not talking about a uniquely fine structure such as the brain, which might be unique for each person. But claim data contains so much information on one single individual that just due to the scope of the data it might be possible to re-identify your neighbor or your colleague.” J. Jacob

Given enough non-unique attributes, it can become possible to re-identify an individual, even in a dataset of 8 million people. Healthcare actors are therefore constantly collecting and processing highly sensitive data.
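To make this concrete, here is a minimal sketch of how combining attributes makes records stand out. The records, field choices, and the `unique_fraction` helper are invented for illustration and are not taken from InGef's data or tooling.

```python
from collections import Counter

# Hypothetical toy records: (birth year, postal code, sex, diagnosis code).
# None of these fields identifies a person on its own.
records = [
    (1980, "10115", "F", "I63"),
    (1980, "10115", "F", "J45"),
    (1980, "10117", "M", "I63"),
    (1975, "10115", "F", "I63"),
    (1975, "10115", "F", "I63"),
    (1990, "10119", "M", "E11"),
]

def unique_fraction(rows, columns):
    """Share of rows whose combination of values on `columns` occurs exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

# The more attributes are combined, the more records become unique.
for cols in [(0,), (0, 1), (0, 1, 2, 3)]:
    print(cols, round(unique_fraction(records, cols), 2))
```

In this toy table, birth year alone singles out one record in six, while all four attributes together single out four in six. Claims data follows the same logic with far more attributes per person.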

These privacy concerns translate into constraints on data access.

“For data protection reasons, the data analysis currently can only be conducted at InGef, by employees of InGef because claim data is regulated. And regulatory bodies have to approve the use of data for research purposes which often leads to a time delay or even disapproval by the regulatory body. And consequently, access to claims data is very limited for institutions which do not have a standing cooperation with statutory health insurances, or even for the private industry.” J. Jacob

© Josephine Jacob,  presentation at the Medica Health IT Forum 2021


As Josephine described, lack of access can directly hinder research and development.

“During the current COVID-19 pandemic, there are only a few working groups which could ad-hoc use claim data for relevant research questions during the pandemic, such as background incidences on thromboembolisms after vaccinations. These were research questions that needed to be urgently addressed and for which data were not really available in Germany, or only available through very small groups of researchers.” J. Jacob

By 2022, the new ordinance on data transparency (Datentransparenzverordnung) is expected to improve the availability of health data for research in Germany.

Anonymizing claim data for research collaboration is one example of how synthetic data could help address health data access issues. Tabea and Alexander shared other exciting examples and concrete cases for organizations in the rest of the discussion.

Possible applications of synthetic data

Following the lifecycle he presented, Alexander illustrated multiple possible uses for synthetic health data. 

“In medical developments, this kind of data could help increase the robustness of the machine learning models and the AI models that are used there. It could also help to scale up too small sample sizes. Synthetic data could help facilitate the reusability of clinical studies and it could help address imbalances in data.
You maybe could also use it to augment small sample datasets, for example for rare diseases, or if you have underrepresented patient subgroups in clinical trials. [...] and for the customer experience we could also have it especially for the fast testing of digital products and to bring them to a minimal viable product stage so you can look at the data structure, data flows that are lying around it.” A. Pimperl

As a concrete example, Tabea presented a real-life application with her research project from Charité around the synthesization of brain imaging data.

The starting point is a similar difficulty: neuroimaging data is sensitive personal information, and we could re-identify a patient from a brain scan, which depicts a unique structure.

To conduct privacy-preserving research on brain vessel segmentation, Tabea's team needs data that protects patient privacy. So they generated synthetic MRI images.

“What we can do is basically synthesize artificial data. And one way to do that is with generative adversarial networks (GANs).” T. Kossen

© Tabea Kossen, presentation at the Medica Health IT Forum 2021

“Each generative adversarial network consists of two different networks, the generator and discriminator, which are both getting input and outputs. The generator gets a noise vector as input and outputs the synthetic data sample, in our case an image. This is then fed into the discriminator together with real data so that the discriminator also knows what realistic data should look like. The discriminator's job is to give a score to the data on how realistic it looks, information which is then fed back to the generator. It's basically a game between two different neural networks that play against each other. And eventually after the training, the synthetic data should become so good that it looks really realistic.” T. Kossen
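The game described above can be sketched with a deliberately tiny example. This is not the Charité team's model: instead of images and deep networks, it assumes one-dimensional data, a two-parameter generator, and a logistic discriminator, and every name in it (`ToyGAN`, `train_step`, the target mean of 4.0) is made up for illustration.

```python
import math
import random

def sigmoid(s):
    # Numerically safe logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

class ToyGAN:
    """1D GAN: generator x = a*z + b, discriminator D(x) = sigmoid(w*x + c)."""

    def __init__(self):
        self.a, self.b = 1.0, 0.0   # generator parameters
        self.w, self.c = 0.0, 0.0   # discriminator parameters

    def generate(self, z):
        # Turn a noise value z into a synthetic sample.
        return self.a * z + self.b

    def score(self, x):
        # Discriminator's estimate that x is real, in (0, 1).
        return sigmoid(self.w * x + self.c)

    def train_step(self, real_batch, rng, lr=0.05):
        n = len(real_batch)
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        fake_batch = [self.generate(z) for z in noise]

        # Discriminator update: raise scores on real data, lower them on fakes.
        grad_w = grad_c = 0.0
        for xr, xf in zip(real_batch, fake_batch):
            d_real, d_fake = self.score(xr), self.score(xf)
            grad_w += -(1.0 - d_real) * xr + d_fake * xf
            grad_c += -(1.0 - d_real) + d_fake
        self.w -= lr * grad_w / n
        self.c -= lr * grad_c / n

        # Generator update: the discriminator's score is the feedback signal.
        grad_a = grad_b = 0.0
        for z in noise:
            d_fake = self.score(self.generate(z))
            dx = -(1.0 - d_fake) * self.w   # derivative of -log D(x) w.r.t. x
            grad_a += dx * z
            grad_b += dx
        self.a -= lr * grad_a / n
        self.b -= lr * grad_b / n

rng = random.Random(0)
gan = ToyGAN()
for _ in range(2000):
    real = [rng.gauss(4.0, 0.5) for _ in range(64)]  # stand-in for real data
    gan.train_step(real, rng)

samples = [gan.generate(rng.gauss(0.0, 1.0)) for _ in range(1000)]
print("mean of synthetic samples:", sum(samples) / len(samples))
```

Because this discriminator is linear, it can only push the generator toward the right average value. Image GANs rely on much richer networks, but the alternating update pattern is the same.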

Because neural networks are computationally heavy to train, Tabea’s team focuses on 2D patches and segmentation labels. To provide additional privacy guarantees and prevent information from leaking, they add noise during the training: the more noise, the greater the privacy.
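The talk did not detail how the noise is injected, so the following is only a sketch of one common approach, assumed here for illustration: clip each example's gradient and add Gaussian noise before the update, in the style of differentially private SGD. The function name and parameters (`noisy_mean_gradient`, `clip_norm`, `noise_multiplier`) are hypothetical.

```python
import math
import random

def noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(v * v for v in grad))
        # Bound the influence any single example can have on the update.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, v in enumerate(grad):
            total[i] += v * scale
    # Noise scale grows with the multiplier: more noise, more privacy.
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
for multiplier in (0.0, 0.5, 2.0):
    print(multiplier, noisy_mean_gradient(grads, 1.0, multiplier, rng))
```

With a multiplier of zero, the result is simply the clipped average. As the multiplier grows, the update carries less information about any individual example, which is also why utility tends to drop.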

2D patches (top row) and their corresponding segmentations (bottom row) for the real data (left) and for synthetic data with progressively more noise introduced.
© Tabea Kossen, presentation at the Medica Health IT Forum 2021

From there, they can use the 2D patches and their generated labels to train a U-Net, another neural network, to perform a vessel segmentation task.

“We can now train this network on the synthetic data and then evaluate it on real data to have a better estimate on how useful our data is. We can do that with our synthetic data, but also with our real data to have a performance comparison, and again without putting any noise inside the discriminator.” T. Kossen
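This "train on synthetic, test on real" comparison can be sketched on toy data. The example below swaps the segmentation network for a one-dimensional threshold classifier and uses made-up Gaussian data, so it only illustrates the evaluation protocol, not the team's actual experiment.

```python
import random

def make_data(rng, n, mean0, mean1, std):
    """Toy 1D data: n samples per class, labelled 0 and 1."""
    xs = [rng.gauss(mean0, std) for _ in range(n)]
    xs += [rng.gauss(mean1, std) for _ in range(n)]
    ys = [0] * n + [1] * n
    return xs, ys

def fit_threshold(xs, ys):
    """Nearest-centroid classifier in 1D: threshold halfway between class means."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (m0 + m1) / 2.0

def accuracy(threshold, xs, ys):
    """Fraction of samples where 'x above threshold' matches label 1."""
    hits = sum(1 for x, y in zip(xs, ys) if (x > threshold) == (y == 1))
    return hits / len(xs)

rng = random.Random(0)
real_train = make_data(rng, 500, 0.0, 4.0, 1.0)
real_test = make_data(rng, 500, 0.0, 4.0, 1.0)
# Assumed imperfect synthetic data: slightly shifted means, wider spread.
synthetic = make_data(rng, 500, 0.5, 3.5, 1.2)

# Both models are evaluated on the same held-out real data.
acc_real = accuracy(fit_threshold(*real_train), *real_test)
acc_synth = accuracy(fit_threshold(*synthetic), *real_test)
print("trained on real:     ", acc_real)
print("trained on synthetic:", acc_synth)
```

The gap between the two scores is the quantity of interest: the closer the synthetic-trained model gets to the real-trained one on real test data, the more useful the synthetic data is for that task.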

The results show that it's possible to achieve good brain vessel segmentation performance with synthetic data. However, as the emphasis on privacy grows, task performance diminishes, so researchers have to find a trade-off between synthetic data utility and privacy. Still, the ability to synthesize brain patches is very promising.

The full replay is available on our YouTube channel, and the other talks from the event are available on this page until December 2021.
