We continue our series of interviews with the Statice team with Borbála and José, both members of our product team. After discussing enterprise data access and privacy challenges with Beatrice and Ben, we turn to synthetic structured data generation in this post.
Borbála is one of our data scientists. Within the product team, she works on further developing the Statice synthesization engine and evaluation suite. Before joining Statice, she worked on financial risk assessment software and modeling user behavior at an ad-tech company.
José is also one of our data scientists. In the product team, he's responsible for research on generative models. A physicist by training, José's research career changed focus over the years from materials science to statistics, data analysis, and modeling of neuroimaging data.
You can discover the rest of team Statice on our company page.
Jose: I have had the opportunity to work on various aspects of synthetic data generation as part of the product team at Statice. I started by looking into how to evaluate the utility of the data from statistical and machine learning perspectives. Later, I helped test the product and develop support for the different data types to be synthesized.
Borbala: Lately, I have been working on advanced features around personally identifiable information (PII) detection. PII is any information that can be used to identify an individual, e.g., names, company names, addresses, and all kinds of IDs. Identifying these is crucial to creating *safe* synthetic data, as they carry an elevated privacy risk.
Apart from that, I am also involved in research on generative models to further improve the synthetic data quality and the synthesization speed. As we previously explained, different models have different pros and cons depending on the data types and use cases. This is why we decided on a hybrid approach: the Statice software picks the model best suited to the data characteristics.
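To make the PII detection step Borbala describes a little more concrete, here is a minimal sketch of how columns that look like PII could be flagged before synthesis. It is not how the Statice software does it: the patterns, the `flag_pii_columns` helper, the `threshold` parameter, and the sample table are all hypothetical, and real-world detection of names or addresses needs far more than regular expressions.

```python
import re

# Hypothetical, deliberately simple patterns for two PII types.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def flag_pii_columns(table, threshold=0.8):
    """Return {column_name: pii_type} for columns where at least
    `threshold` of the non-empty values match one of the patterns."""
    flagged = {}
    for name, values in table.items():
        # Ignore missing entries so they do not dilute the match rate.
        present = [str(v).strip() for v in values if v not in (None, "")]
        if not present:
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in present if pattern.match(v))
            if hits / len(present) >= threshold:
                flagged[name] = pii_type
                break
    return flagged


# Made-up example data, column name -> list of values.
table = {
    "contact": ["ana@example.com", "li@example.org", None],
    "phone": ["+49 30 1234567", "030-7654321", "+1 (555) 010-9999"],
    "age": [34, 51, 29],
}
print(flag_pii_columns(table))
```

Flagged columns could then be dropped, masked, or handled separately so that direct identifiers never reach the generative model.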
Borbala: At Statice, I am lucky to be surrounded by very smart people. We have experts from a wide variety of backgrounds, so there is a lot we can learn from each other.
My colleagues are very helpful, and they face each challenge with a great deal of scientific curiosity!
Jose: Working as part of the product team at Statice is very exciting because every day there's a huge chance to learn something new and incredibly interesting.
The whole Statice team is friendly, incredibly intelligent, and knowledgeable. This means that there is always someone willing to help and support the product development, whatever the task at hand. This is important in any company, but particularly so at Statice, where we are always at the frontier of development and need to find new answers to unsolved problems ourselves.
Jose: When doing data science, I work mainly with the typical data science Python stack.
As for current developments, I find some ideas from deep learning very interesting, so I am always keeping an eye on new research in the field of deep generative models, for example. I also follow new frameworks, like JAX and the whole ecosystem around it.
Borbala: In my day-to-day work, I use Python and its standard libraries.
I am excited about the current developments in privacy-preserving technologies. For instance, it is nice to see how machine learning with differential privacy is getting popular, and more and more libraries support it.
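For readers new to differential privacy, the following is a minimal sketch of its most basic building block, the Laplace mechanism applied to a count query. It is not differentially private machine learning itself and not part of the Statice software; the helper names `laplace_noise` and `private_count` and the sample ages are made up for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean
    # `scale` follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values satisfying `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1 / epsilon gives
    epsilon-differential privacy for this single query.
    """
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


ages = [23, 35, 41, 29, 62, 57, 38]  # made-up data
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0)
print(f"noisy count of people aged 40+: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy. Libraries that support differentially private training apply the same idea of calibrated noise at a much finer level, for example to gradients.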
Borbala: We test our software on a wide range of open-source datasets to ensure that the Statice SDK generally performs well, but real-life datasets are always more interesting.
They all have their peculiarities, and understanding these helps a lot in developing robust software. So the more real-life use cases we are able to work on through our customers, the more we can improve our technology.
Jose: I am going to deviate a bit from the tools. I think that understanding which aspects of the data are most important to the customer is crucial to our product development. That domain knowledge allows us to understand better what "real-life data" means for different businesses and helps us improve the generated synthetic data.
Jose: Following on from the last question, the biggest hurdle is that the data out there is "real-life data," and it is by talking with customers that we gather the information we need to better understand what is expected of the generated synthetic data.
One particular point is supporting missing data, for example. That's not easy because the lack of data can also mean something depending on the use case.
So it's with a good understanding of a customer's needs that we can adapt the product to new use cases and deliver a complete solution for the customer.
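José's point about missing data can be illustrated with a small sketch. When the rate of missing values depends on another column, the gaps themselves carry information that a synthetic dataset should preserve. The helper `missing_rate_by_group`, the column names, and the rows below are hypothetical and only serve as an example.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key, value_key):
    """Return the share of missing (None) values of `value_key`
    within each group of `group_key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for row in rows:
        group_counts = counts[row[group_key]]
        group_counts[0] += row[value_key] is None
        group_counts[1] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}


# Made-up rows: income is mostly missing for one group only.
rows = [
    {"status": "employed", "income": 52000},
    {"status": "employed", "income": 48000},
    {"status": "employed", "income": None},
    {"status": "employed", "income": 61000},
    {"status": "unemployed", "income": None},
    {"status": "unemployed", "income": None},
    {"status": "unemployed", "income": 12000},
    {"status": "unemployed", "income": None},
]
print(missing_rate_by_group(rows, "status", "income"))
```

If a synthesizer dropped these rows or filled the gaps at a uniform rate, the difference between the two groups would disappear. Running the same check on the original and the synthetic data is one simple way to see whether the missingness pattern survived.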
Borbala: The more complex the database at hand, the more challenging it gets to create high-quality synthetic data within an acceptable time frame. Our team is doing intensive research to continuously improve our models' performance.
Borbala: Go for it! It’s a nice opportunity. I’d definitely recommend applying if you want to work with interesting people on exciting topics. :)