Statice's synthetic data technology is now part of Anonos Data Embassy, the award-winning data security and privacy solution.
Last month, our CEO Omar had the chance to talk with Jonas Piela, host of the Digital Insurance Podcast.
You can listen to the recording on the platforms above or read the episode highlights in this post.
Let's talk about what privacy is.
What's privacy in your words?
Privacy can be thought about from multiple angles, and usually it's not a very easy concept to define. There have been many definitions throughout history in different legal cases.
Historically, privacy meant things like protecting one's home from intruders. With information technology, things have obviously changed: what we're talking about today is not necessarily physical privacy.
It's rather about information privacy or sometimes communication privacy, and while regulations have done quite a good job at protecting physical privacy, laws for protecting digital information privacy are rather new.
So when talking about privacy, you brought up the example of someone invading your home.
How does this translate to data privacy?
That's a good question. When you upload your photos on Facebook, for example, you do it for a purpose. You want to share it with your friends; you want to collaborate on a shared album. This is a contract that you have between you and, say, Facebook.
Now, this is fine, and we do this all the time. I share very sensitive information with my doctor because I want them to provide service to me. Same with my tax advisor. We share sensitive data all the time, but we trust the parties with whom we share that sensitive data to use it to fulfill a certain contract that we have together. And this is the trust relationship that could be easily broken.
What does breaking this trust mean?
If I went to my doctor and told him about whatever illness or problem I have, and he then forwarded this information to someone who's trying to sell me medication for that issue, this would be a breach of trust, right?
Exactly. In the data world, if Facebook were selling my pictures to a stock photo company, or analyzing those pictures to see what's in them in order to target me with very specific ads, that would be a use that I didn't think of in the first place, and one that was definitely not part of the relationship I had with the service provider.
So there's privacy, and in this privacy relationship, I trust someone with my data for a certain reason, a certain purpose, to fulfill a certain specific task. And repurposing my data is sort of where the red line of privacy is.
Let's talk about this data. It could be phone numbers, names, dates of birth. What else is there that could be part of this data?
In insurance, personal data in customer datasets could include names and phone numbers and, for health insurance, visits to the doctor and the medicines taken afterwards. For car insurance, it would be the brand of the car, its size, whether it's made for children, whether it's a utility car, and how many cars someone owns.
So you can imagine how much information you could collect about a single person.
What could I derive from this information?
Why is it sensitive?
If I see in the data, for instance, that you applied for life insurance and were denied, this could suggest that you have a chronic disease or something else you don't really want to share. So it could be a very innocuous line in the data file, "this person's application has been denied", but the reasons behind that denial could be extremely sensitive and something you don't want to expose.
This can harm people. Once a leak happens and your data gets out in the wild, it is extremely difficult to undo. There are secondary markets for this sort of data, and sometimes a leak has serious consequences for specific people.
There's the famous case of Netflix, where movie ratings were used to identify people in a seemingly anonymous dataset. As a result, some people found that their movie-watching habits were out there in the wild.
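To make this kind of re-identification more concrete, here is a minimal sketch of a linkage attack: records without names are matched against a small amount of publicly known information about a person. All names, titles, and ratings below are made up for illustration, and the matching rule (`min_overlap`) is an assumption of this sketch, not a description of how the Netflix data was actually analyzed.

```python
# Toy linkage attack: names were removed from the ratings, but the
# (title, rating, date) combinations are still distinctive.
# All data below is invented for illustration.

anonymous_ratings = {
    "user_001": {("Movie A", 5, "2005-03-01"),
                 ("Movie B", 2, "2005-03-04"),
                 ("Movie C", 4, "2005-04-10")},
    "user_002": {("Movie A", 3, "2005-02-11"),
                 ("Movie D", 5, "2005-05-20")},
}

# Auxiliary information an attacker might find elsewhere,
# for example reviews posted publicly under a real name.
public_reviews = {
    "alice": {("Movie A", 5, "2005-03-01"),
              ("Movie C", 4, "2005-04-10")},
}


def link_records(anonymous, public, min_overlap=2):
    """Return {name: anonymous_id} for every name that matches exactly one record."""
    matches = {}
    for name, known in public.items():
        candidates = [
            user_id
            for user_id, ratings in anonymous.items()
            if len(known & ratings) >= min_overlap
        ]
        # Only a unique candidate counts as a re-identification.
        if len(candidates) == 1:
            matches[name] = candidates[0]
    return matches


matches = link_records(anonymous_ratings, public_reviews)
print(matches)
```

Once a match is found, every other rating in that record becomes attributable to the named person, even though no name was ever stored in the dataset.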
So how can we still do AI and data analysis without data? Do privacy rules create some limitation of access to data that hinders innovation?
There is some truth to that. I think this is due to two things in particular. The first one is that after the GDPR came into effect in Europe, organizations, in general, had to react quite quickly because most of them were not well prepared. What happened is that most data-driven innovation projects came almost to a complete stop. This was due to the introduction of a series of processes and controls.
These processes mean that in a project, instead of getting access to your data in days, you now have to go through a process that would take up to several weeks, several months, or sometimes even more.
We had this conversation with several customers. For every single use case that requires data, there's a discussion about which data the use case is going to use, what exactly the scope is, and what value we expect it to bring. And sometimes those questions cannot be answered without actually having the data in the first place, which means not knowing whether the use case is worth pursuing or not. So in this sense, the reaction to the privacy regulations has definitely introduced some friction to innovation.
Does this also create innovation?
Does it enable people to be more creative when it comes to accessing data?
This is seeing the opportunity in the challenge, and it's definitely something that we went through at Statice. Without personal data regulations, we would not be talking today, and we would not have started thinking about this problem and actually solving it.
One of our very first projects was a hackathon with a very small startup that had data and wanted to get external data scientists to help with their problem. Their data was extremely sensitive, so they obviously couldn't do that. We were looking for ways to decouple the learnings, the patterns, and the interesting insights from the sensitive, identifying, and personal information. At the time, we were considering several different approaches.
And what we were getting at is that, in general, what we're interested in when we analyze data is very rarely the specifics of the people. If I'm looking at your data as part of a dataset, I'm not interested in what you're doing at a specific moment at a specific place. What I'm interested in is the insights in this data on an aggregate basis. This is really the general mindset when it comes to data analysis.
That was the first realization: there is no need to access specific personal information in order to derive learnings, including advanced analytics and machine learning models. This is why we say at Statice that you do not need the real data to develop some of those analysis cases.
What is it specifically that you guys are doing? We talked about the fact that the problem with data and privacy is that there's often too much information in this data that you cannot share with other people.
Synthetic data is completely artificial, newly created data that is nonetheless based on the learnings from the real data. So there is software, an algorithm, that analyzes the original data and learns the relationships between the columns, attributes, or different entities of your data.
Once you have that, you can generate artificial people. So these are people that do not exist in the real world. These are people that are not your customers but, taken as a whole, they behave similarly to how your customers would behave. So at this step, we lose the individual specificities of your real customers and replace them with artificially created customers that preserve these patterns but don't leak the individual secrets. This is the whole point.
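As a rough illustration of the two steps described above (learn the relationships between columns, then generate artificial records), here is a minimal sketch. It assumes a deliberately simple model, a two-column Gaussian fitted to made-up customer data, and it is not Statice's algorithm. A model this simple also gives no formal privacy guarantee on its own. The column names and the functions `fit_two_column_model` and `sample_synthetic` are hypothetical.

```python
import math
import random
import statistics

# Invented "real" customers: (age, yearly insurance premium).
real_customers = [
    (23, 310.0), (35, 420.0), (41, 480.0), (52, 610.0),
    (29, 350.0), (60, 700.0), (47, 500.0), (33, 400.0),
]


def fit_two_column_model(rows):
    """Learn the means, spreads, and correlation of two numeric columns."""
    xs = [row[0] for row in rows]
    ys = [row[1] for row in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / len(rows)
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "std_x": std_x, "std_y": std_y,
        "rho": cov / (std_x * std_y),
    }


def sample_synthetic(model, n, seed=0):
    """Draw n artificial records that follow the learned pattern."""
    rng = random.Random(seed)
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point error.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mean_x"] + model["std_x"] * z1
        y = model["mean_y"] + model["std_y"] * (rho * z1 + residual * z2)
        records.append((x, y))
    return records


model = fit_two_column_model(real_customers)
synthetic_customers = sample_synthetic(model, 5000)

# The aggregate pattern survives, but no row is a real customer.
refit = fit_two_column_model(synthetic_customers)
print("correlation, real vs synthetic:", model["rho"], refit["rho"])
```

The synthetic records are drawn from the fitted model rather than copied from the input, which is the sense in which the individual specifics are replaced while the overall pattern is kept.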
So imagine I'm in an IT department with all this very sensitive data in my database, and I get tons of requests from innovation departments throughout the entire company asking me, "Can I have some data, please?" In the past, I'd always have said, "I can't give you that because of data regulation laws."
But now, with this data synthesized, probably through your software, I could have a sort of experimental dataset in my drawer, and whenever someone comes around and wants some data, I'm like, "Oh yeah, sure, take this and go ahead, do whatever you want with this, because this is non-sensitive data yet behaves statistically just like our real data."
This is exactly the metaphor. You'd have a synthetic data generator in your drawer, and you can generate records on demand for your internal stakeholders or, sometimes, for external cases. However, you'd still need to assess the impact from a data protection point of view, to ensure that re-identification risks are properly mitigated. This is also something that we develop.
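One simple way to picture such an assessment is to check whether any synthetic record sits suspiciously close to a real one. The sketch below uses a distance-to-closest-record heuristic on made-up data. The function names and the threshold are assumptions of this example, and this is only one possible check, not the evaluation that Statice performs.

```python
import math

# Invented data: (age, yearly insurance premium).
real_rows = [(34, 420.0), (51, 610.0), (27, 330.0)]
synthetic_rows = [(34, 420.0), (45, 520.0), (29, 365.0)]


def distance_to_closest_record(synthetic_row, real):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real)


def flag_risky_records(synthetic, real, threshold):
    """Return the synthetic rows that lie closer than threshold to a real row."""
    return [
        row for row in synthetic
        if distance_to_closest_record(row, real) < threshold
    ]


# The columns are left unscaled here for brevity; in practice they would
# usually be normalized so that no single column dominates the distance.
risky = flag_risky_records(synthetic_rows, real_rows, threshold=5.0)
print(risky)
```

A synthetic row that exactly reproduces a real customer, like the first one here, would be flagged and could be dropped or regenerated before the data is shared.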