Our team recently hosted a webinar on the technical landscape of synthetic data. Our Product & Engineering Lead, Daniel Perez, and Senior Data Scientist, Dr. Christoph Wehmeyer, discussed the techniques and challenges of synthetic data generation, along with tips for evaluating the quality of a solution. You will find the on-demand version in our webinar section.
During the Q&A session, we received some great questions from the audience, which we answer in this post.
Where does the data need to be for training, and how do you keep it secure?

This is an important point. As a rule, we always train the model where the data is. Most of our clients store their data in secured environments, sometimes in the cloud and sometimes on-premise.
This is something we had to consider from the get-go when building our software: it needed to be flexible across infrastructures and able to run on-premise or in the cloud.
The Statice software ships in the form of an SDK, so you can deploy it on-premise or in a secured cloud, wherever you want the data to be trained. The data never leaves the client infrastructure, nor is it shared with us, which provides a stronger security guarantee.
This does mean that the infrastructure where the data is synthesized needs enough compute power for the task, but we spent quite a lot of time making sure that Statice can run in a range of enterprise environments. For additional information, you can read our post presenting 9 facts about the Statice software.
How do you handle relationships between multiple tables?

This is a good question. When dealing with real-life data, you should expect relationships between tables that you want to preserve, and it's important to know what kind of relationships you are dealing with, e.g., one-to-many or many-to-many.
How you deal with them then depends on the use case. For example, suppose you have a two-table setup with static customer data and sequential data such as transactions. Our API provides tools to transform these into suitable representations in which the connection is maintained.
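To make this concrete, here is a minimal sketch of how such a one-to-many link can be preserved. This is illustrative only, not the Statice API: the table layout and column names (`customer_id`, `amount`, `timestamp`) are invented for the example. The key idea is to group each customer's transactions into an ordered sequence keyed by the same identifier as the static table:

```python
import pandas as pd

# Hypothetical example data: a static customer table (one row per customer)
# and a sequential transaction table (many rows per customer).
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "age": [34, 58],
    "country": ["DE", "FR"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "amount": [20.0, 35.5, 12.0, 7.9, 99.0],
    "timestamp": pd.to_datetime([
        "2021-01-03", "2021-01-10", "2021-01-04", "2021-02-01", "2021-01-20",
    ]),
})

# One-to-many link: collect each customer's transactions into a
# time-ordered sequence keyed by customer_id.
sequences = (
    transactions.sort_values("timestamp")
    .groupby("customer_id")[["amount", "timestamp"]]
    .apply(lambda g: list(zip(g["amount"], g["timestamp"])))
)

# Each training example pairs the static attributes with that
# customer's transaction sequence, so the connection is maintained.
examples = customers.set_index("customer_id").join(sequences.rename("sequence"))
print(examples)
```

In this representation, a generative model can treat one customer (static attributes plus their sequence) as a single training example rather than losing the link between the two tables.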
We use a sequential model for the sequential data, conditionally dependent on the static table. The static table itself is learned by a separate static model that doesn't need a sequential component.
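As an illustration of that two-model setup, here is a toy sketch. These are not the actual Statice models; the distributions and the AR(1)-style update are invented stand-ins. It shows the generation order: first sample a profile from the static model, then generate a sequence conditioned on that profile:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_static(n):
    """Toy static model: draws customer profiles (age, income)."""
    return [{"age": int(rng.integers(18, 80)),
             "income": float(rng.normal(50_000, 15_000))} for _ in range(n)]

def sample_sequence(profile, length=5):
    """Toy sequential model, conditionally dependent on the static profile:
    transaction amounts scale with income, and each step depends on the
    previous one (a simple autoregressive rule standing in for a real
    sequence model)."""
    base = max(profile["income"] / 1_000, 1.0)  # conditioning on static data
    amounts = [base]
    for _ in range(length - 1):
        # Next amount depends on the previous amount plus noise.
        amounts.append(0.8 * amounts[-1] + 0.2 * base + rng.normal(0, 2))
    return amounts

# Generate synthetic customers: static profile first, then its sequence.
synthetic = [{"profile": p, "transactions": sample_sequence(p)}
             for p in sample_static(3)]
```

The design point is the factorization: the static model needs no notion of time, while the sequential model only has to learn sequence dynamics given a profile, which keeps both models simpler than one joint model over everything.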
Which industries benefit most from synthetic data?

In today's world, every customer-facing industry must process personal data, and a growing number of countries are adopting stricter data protection laws. Virtually all industries relying on personal or sensitive data to innovate or function could benefit from synthetic data.
From our experience, banking, insurance, and healthcare organizations show a strong need for synthetic data, as the inability to share and process data significantly hampers their ability to innovate.
Banks, for instance, are trying to leverage massive historical datasets but face legal and security concerns, and we see similarly heavy demand from the other industries mentioned above.
Below is a table summarizing recurring scenarios. You can also read our post presenting 10 use cases for synthetic data.
Who in an organization should drive a synthetic data project?

We see a great variety of profiles and situations among our clients. Sometimes the CTO decides on such projects; other times it's the compliance team's responsibility. In our experience, the projects that commonly succeed are those with a concrete problem to solve, one that is known to the stakeholder you approach and that affects their business responsibilities and targets.
For example, you know it takes your team three months to get access to a dataset, and that slows your work down. Or you are developing a tool but can't get access to proper training data. These are ways of quickly demonstrating where and how synthetic data brings value. For more on the value and ROI of synthetic data, you can read our guide on the subject.
We hope you found these answers as insightful as we found the questions interesting! You can always access the webinar on-demand, where our team provides an extensive overview of the current technical approaches to synthetic data generation and what differentiates them.
Contact us and get feedback instantly.