Which industries have the strongest need for synthetic data?

By Elise Devaux, December 11, 2020 - 5 minutes read

Our team recently hosted a webinar presenting the technical landscape of synthetic data. Our Product & Engineering Lead, Daniel Perez, and Senior Data Scientist, Dr. Christoph Wehmeyer, discussed the techniques and challenges of synthetic data generation, along with tips for evaluating the quality of a solution. You will find the on-demand version in our webinar section.

During the Q&A session, we received some great questions from the audience, which we present in this post.

  • Are you training the model on-premise to avoid shipping PII to the cloud, or do you have an anonymization solution to move the training data to the cloud before you train the model?

  • For structured data, how do you solve the joint issues amongst different relational tables?

  • Which industries have already expressed the strongest need for synthetic data? Are there anonymized examples?

  • I’m a data scientist, and I see value in my organization for synthetic data. To whom should I present this? Should I talk to the CTO or the compliance/legal team?



Are you training the model on-premise to avoid shipping PII to the cloud, or do you have an anonymization solution to move the training data to the cloud before you train the model?


This is an important point. From a general perspective, we always train the model where the data is. We know that most of our clients store their data in secured environments. Sometimes that’s cloud, and sometimes it’s on-premise.

So it is something we had to consider from the get-go while building our software. It needed to be flexible about infrastructure and able to run on-premise or in the cloud.


The Statice software ships in the form of an SDK, so you can deploy it on-premise or in a secured cloud, wherever you want the model to be trained. The data never leaves the client infrastructure, nor is it shared with us, which gives a stronger security guarantee.

This does mean that the infrastructure where the data is synthesized needs to have enough compute power for the task, but we spent quite a lot of time making sure that Statice can run in a range of enterprise environments. For additional information, you can read this post that presents 9 facts about the Statice software.



For structured data, how do you solve the joint issues amongst different relational tables?


This is a good question. When dealing with real-life data, you should expect relationships between different tables, and you will want to preserve them. It's also important to know what kind of relationships you are dealing with, e.g., many-to-many or one-to-many.


How you then deal with it depends on the use case. For example, suppose you have a two-table situation with static customer data and sequential data like transactions. Our API provides a tool to transform these into suitable representations where this connection is maintained.

We use a sequential model for the sequential data, which is conditionally dependent on a static table. The latter is learned by a different, static model which doesn’t need a sequential component.
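To make the two-model idea concrete, here is a minimal sketch in plain Python. It is not the Statice API or the models we actually use: the function names (`fit`, `synthesize`), the column names (`customer_id`, `segment`, `kind`), and the toy models (an empirical distribution for the static table, a first-order Markov chain for the sequences) are all assumptions made for illustration. What it shows is the structure described above: static rows are sampled first, and each sequence is then generated conditionally on its static row, so the one-to-many link between the tables is preserved by construction.

```python
import random
from collections import defaultdict

START, END = "<start>", "<end>"  # hypothetical sequence boundary tokens


def fit(customers, transactions):
    """Fit a toy static model and a toy sequential model conditioned on it."""
    # Static model: the empirical distribution of customer segments.
    segments = [c["segment"] for c in customers]

    # Group transaction kinds per customer (assumed already ordered in time).
    history = defaultdict(list)
    for t in transactions:
        history[t["customer_id"]].append(t["kind"])

    # Sequential model: (segment, previous kind) -> observed next kinds.
    transitions = defaultdict(list)
    for c in customers:
        path = [START] + history[c["customer_id"]] + [END]
        for prev, nxt in zip(path, path[1:]):
            transitions[(c["segment"], prev)].append(nxt)
    return segments, transitions


def synthesize(model, n_customers, seed=0, max_len=50):
    """Sample static rows first, then one sequence per sampled row."""
    segments, transitions = model
    rng = random.Random(seed)
    syn_customers, syn_transactions = [], []
    for cid in range(n_customers):
        segment = rng.choice(segments)
        syn_customers.append({"customer_id": cid, "segment": segment})
        prev = START
        for step in range(max_len):  # cap the length of generated sequences
            nxt = rng.choice(transitions[(segment, prev)])
            if nxt == END:
                break
            syn_transactions.append(
                {"customer_id": cid, "step": step, "kind": nxt}
            )
            prev = nxt
    return syn_customers, syn_transactions


if __name__ == "__main__":
    real_customers = [
        {"customer_id": "a", "segment": "retail"},
        {"customer_id": "b", "segment": "business"},
    ]
    real_transactions = [
        {"customer_id": "a", "kind": "deposit"},
        {"customer_id": "a", "kind": "withdrawal"},
        {"customer_id": "b", "kind": "invoice"},
    ]
    model = fit(real_customers, real_transactions)
    customers, transactions = synthesize(model, n_customers=3)
    print(customers)
    print(transactions)
```

Every synthetic transaction carries the key of a synthetic customer, and its content depends on that customer's attributes. A toy sampler like this offers no privacy guarantee on its own; it only illustrates how the static and sequential parts fit together.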



Which industries have already expressed the strongest need for synthetic data? Are there anonymized examples? 


In today’s world, all customer-facing industries must process personal data, and a growing number of countries are adopting stricter data protection laws. Virtually all industries relying on personal or sensitive data to innovate or function could benefit from synthetic data.


In our experience, banks, insurance companies, and healthcare organizations express a strong need for synthetic data, as the inability to share and process data strongly impedes their ability to innovate.


Banks are trying to leverage massive historical datasets but face legal and security concerns. More broadly, we see a heavy need from:

  • healthcare companies, which hold sensitive data and have a high need for sharing,
  • finance and insurance companies, which usually collect, and could process, large amounts of customer data,
  • and companies that collect high-frequency data from consumers, such as automotive and telecommunications.


Below is a table summarizing recurring scenarios. You can also read our post presenting 10 use-cases for synthetic data.

Examples of industries that can leverage synthetic data

I’m a data scientist, and I see value in my organization for synthetic data. To whom should I present this? Should I talk to the CTO or the compliance/legal team?


We see a great variety of profiles and situations with our clients. Sometimes the CTO will be the one deciding on such projects, while other times it’s the Compliance team’s responsibility. In our experience, the most successful projects are those where you have a concrete problem to solve, one that is known to the stakeholder you choose to approach and that affects their business responsibilities and targets.


For example, you know it’s taking your team three months to get access to a data set and that is slowing your work down. Or you are developing a tool but can’t get access to proper training data. That’s a way of quickly demonstrating where and how synthetic data brings value. For more on the value and ROI of synthetic data, you can read our guide on the subject.


We hope you found these answers as insightful as we found the questions interesting! You can always access the webinar on-demand. Our team provided an extensive overview of the current technical approaches to synthetic data generation and what differentiates them.

