Together with Roche's data team, we worked to demonstrate the utility and privacy of synthetic clinical trial data for training machine learning applications. Using a Harvard Dataverse data sample with over 50 thousand observations on 123 patients, we successfully produced synthetic data that was realistic in terms of statistical distribution but without any possible connection to real patients. Synthetic data proved to be a reliable alternative to real patient data for healthcare companies.
“We worked with Statice on a collaboration investigating the power of synthetic data. This area has been rapidly growing in recent years, and we're working to see how to utilize this power in a clinical data setting. We've identified several areas we are interested in, including software and testing, medical data insights, and data sharing.”
Alex Hughes, Data Sharing Consultant at Roche
Research and innovation in healthcare rely on accessing and analyzing large volumes of data. Data fuels the machine learning models that help discover new diseases. It powers personalized medicine and fosters research on drug efficacy. But unfortunately, the road to data-driven healthcare is paved with obstacles.
Organizations have to comply with legal requirements for processing personal medical data. The GDPR, the European Medicines Agency's guidelines for the publication of clinical data, and the Health Insurance Portability and Accountability Act (HIPAA) in the US all regulate the storing and processing of personally identifiable medical data.
Given these challenges, Roche's team needed to find a technological solution that could protect patients' privacy and comply with data processing regulations. More specifically, to comply with the legal requirement for data anonymization, the team needed a guarantee that no patient could be re-identified after data transformation. Additionally, the technology had to preserve the statistical integrity of the data and capture its high-level structure. That is why Roche's team decided to assess synthetic data.
Together with Roche, we evaluated the feasibility of producing synthetic medical data according to these requirements using the Harvard Dataverse dataset. This public data sample provides four datasets of clinical trial data, containing 33 variables and 55,660 observations on 123 patients.
A high utility level was of the utmost importance to Roche when generating new patient data. The correlations and statistical patterns of the original data were well preserved in the synthetic data, indicating that the synthetic data would be just as useful for analysis as the original.
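One simple way to picture a utility check of this kind is to compare pairwise correlations in the real and synthetic tables. The snippet below is a minimal sketch under stated assumptions, not Statice's evaluation method: the function names, column names, and toy values are all hypothetical, and a real assessment would look at many more statistics than correlations.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def max_correlation_gap(real, synthetic):
    """Largest absolute difference in pairwise correlations between two tables.

    Each table is a dict mapping a column name to a list of numeric values;
    both tables are assumed to share the same column names.
    """
    gaps = [
        abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in combinations(sorted(real), 2)
    ]
    return max(gaps)


# Toy, made-up values for illustration only (not the clinical trial data).
real = {
    "age": [34, 51, 47, 62, 29],
    "dose_mg": [10, 20, 20, 30, 10],
    "response": [1.1, 2.3, 2.0, 3.2, 0.9],
}
synthetic = {
    "age": [36, 49, 45, 60, 31],
    "dose_mg": [10, 20, 20, 30, 10],
    "response": [1.0, 2.2, 2.1, 3.0, 1.0],
}

gap = max_correlation_gap(real, synthetic)
print(f"Largest pairwise correlation gap: {gap:.3f}")
```

A gap close to zero suggests that the relationships between variables carried over to the synthetic table. What counts as close enough depends on the downstream analysis.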
Roche's goal was to protect patient privacy and adhere to GDPR requirements for anonymous data. Using the evaluators built into Statice's software, we tested the dataset against the three known re-identification attacks on synthetic data: singling out, linkage, and attribute inference. The evaluation showed no one-to-one connection between the original and new data, which meant it wasn't possible to tell whether an individual was part of the original dataset or not.
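As a rough illustration of what "no one-to-one connection" can mean in practice, the sketch below counts synthetic records that exactly copy a real record and measures how close each synthetic record sits to its nearest real one. It is a simplified proxy under stated assumptions, not an implementation of the three attacks and not the evaluators in Statice's software. The function names and toy rows are hypothetical.

```python
import math


def exact_copies(real_rows, synthetic_rows):
    """Count synthetic rows that are identical to some real row."""
    real_set = set(real_rows)
    return sum(1 for row in synthetic_rows if row in real_set)


def closest_distances(real_rows, synthetic_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row.

    Rows are tuples of numbers with the same length. In practice the columns
    would need to be scaled first so that no single column dominates.
    """
    return [
        min(math.dist(s, r) for r in real_rows)
        for s in synthetic_rows
    ]


# Toy, made-up rows of (age, dose_mg, response); not the clinical trial data.
real_rows = [(34, 10, 1.1), (51, 20, 2.3), (47, 20, 2.0)]
synthetic_rows = [(36, 10, 1.0), (49, 20, 2.2)]

copies = exact_copies(real_rows, synthetic_rows)
distances = closest_distances(real_rows, synthetic_rows)
print(f"Exact copies of real records: {copies}")
print(f"Smallest distance to a real record: {min(distances):.3f}")
```

Zero exact copies and distances that are not suspiciously small are only a first sanity check. Dedicated singling-out, linkage, and attribute inference evaluations are still needed to support an anonymization claim.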
In the pharmaceutical industry, synthetic data has numerous potential applications. By analyzing synthetic data, we can discover diseases and drugs faster, provide a more personalized approach to patient treatment, and improve patient outcomes. In the case of clinical trials, data science teams can use synthetic data as a foundation for studies where they can't operate on real data or such data is too scarce.
And in cases where privacy regulations, legacy infrastructure, and governance processes restrict the availability of data, synthetic data could dramatically increase data agility among researchers.