Synthetic data for predictive analytics in insurance: The case of Provinzial

By
Evgeniya Panova


Data is one of the most valuable assets for insurers, and predictive analytics help make the most of it. Over two-thirds of insurers state that predictive analytics helped them reduce underwriting expenses (costs incurred when determining whether to accept or decline a risk), and 60% say the data helps them increase sales and profitability. 

How insurers use predictive analytics. Source: www.wtwco.com

However, you can only achieve the best use of your data with excellent data management. Customer data in the insurance industry is sensitive and cannot be freely shared between departments or external partners.

Through testing Statice's synthetic data solution, the data science team at Provinzial, the second largest public insurance group in Germany, aimed to revamp how they put their customer data to work. 

Facing the challenges of using and sharing sensitive data, staying compliant, and working with data faster in a competitive market, Provinzial sought an advanced data anonymization solution for insurance data analytics.

Provinzial used synthetic data for a predictive analytics recommendation engine to identify the needs of over a million customers and predict what services and products they will buy next. 

In this article, you will learn how the Provinzial data science team:

  • Streamlined the data usage approval process with the privacy team
  • Achieved over 80% usability of synthetic data while maintaining anonymity 
  • Trained a machine learning model on synthetic data that reached 97% of the performance of the model trained on the original data
  • Reduced the time-to-data by 4 weeks without having to adjust the internal sharing workflow
  • Saved up to 3 months in evaluating data privacy risks.

"Statice's solution helped us conduct predictive analytics and test our hypotheses while keeping customer data secure. We have found it to be a useful solution for our data science team to simplify data access and focus on our data projects, machine learning model optimizations, and testing new ideas."
Dr. Sören Erdweg, Artificial Intelligence & Data Development at Provinzial

Synthetic data as a new perspective on internal privacy management

Easily accessible, high-quality data is crucial for identifying, improving, and optimizing predictive analytics projects. However, internal and external privacy restrictions place strict limits on data usage. In insurance data analytics, teams cannot foresee every possible data application at the very beginning of a project; what they need is to be able to explore the dataset freely first.

Data silos limit the ability to make informed business decisions and contribute to the lack of a holistic view of data across the company. Loss of insight comes at a high cost. Gartner research has found that poor data quality costs organizations an average of $15 million per year. Moreover, many ideas never make it to the point of becoming a real data project.

As a data-driven company, Provinzial wished to make the most of its insurance data analytics to boost growth and increase customer satisfaction.

Throughout its data projects, Provinzial's data science and AI team encountered data access challenges. Before teams can begin using data, they must go through a data privacy evaluation. This back-and-forth process of fleshing out all possible uses, determining which sensitive data can be accessed and why, and evaluating the potential risk of leakage can take several weeks.

“At the beginning of the project, it is extremely difficult to have strong arguments for and against different variations (of data use). We do not know which data we intend to use at the outset. Our goal is to use as much data as possible, which is, of course, contrary to data privacy rules, and that is also one reason for our desire to make this process more streamlined, to have a clear understanding of what we require for our model.” 
Dr. Sören Erdweg, Artificial Intelligence & Data Development at Provinzial

The team took these regulatory challenges as the starting point for adopting synthetic data, building a dataset for testing, evaluating, and quantifying ideas. Statice's solution helped streamline the internal data privacy evaluation, since privacy-preserving synthetic data doesn’t fall under the GDPR.

Finding the right balance between privacy and utility

“One of the most important requirements was that the method should provide greater utility than other methods without compromising privacy. Either we can have usable data that is not private or we can have private data that is not usable. There does not seem to be a middle ground between these two extremes, which is why the classical solutions were out of the question.”
Dr. Sören Erdweg, Artificial Intelligence & Data Development at Provinzial

Provinzial’s data science team presented synthetic data to the data privacy team as a solution for current access and evaluation challenges after examining other protection methods. 

Since their goal was to train a machine learning model for predictive analytics, they needed to work with a large volume of high-quality data. De-identification methods like masking or k-anonymity can increase privacy, but at the expense of utility.
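To make the privacy-versus-utility trade-off concrete, here is a minimal sketch of k-anonymity on toy data. The records, field names, and the 10-year age buckets are assumptions made for illustration; they do not come from Provinzial's dataset or from the Statice software.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def generalize_age(record, bucket=10):
    """Replace an exact age with a coarse range, e.g. 34 -> '30-39'."""
    low = (record["age"] // bucket) * bucket
    return {**record, "age": f"{low}-{low + bucket - 1}"}

# Hypothetical toy records, invented for this example.
customers = [
    {"age": 31, "zip": "48143", "product": "car"},
    {"age": 34, "zip": "48143", "product": "home"},
    {"age": 38, "zip": "48143", "product": "car"},
    {"age": 52, "zip": "48145", "product": "life"},
    {"age": 57, "zip": "48145", "product": "home"},
]

# With exact ages, every (age, zip) combination is unique.
k_before = k_anonymity(customers, ["age", "zip"])

# Generalizing ages into ranges makes records indistinguishable within a group.
generalized = [generalize_age(c) for c in customers]
k_after = k_anonymity(generalized, ["age", "zip"])

print(k_before, k_after)
```

Generalization raises k, but the exact ages a model might have learned from are gone. That loss of detail is the utility cost described above.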

Because Provinzial's customer data was highly detailed and extremely sensitive, they needed an anonymization solution that would not compromise the usefulness of this data. The conventional means of anonymization were not appropriate for this purpose, and data masking or pseudonymization posed a greater risk of re-identification.

Synthetic data, on the other hand, turned out to be a great fit because it preserves the statistical properties of the original data and therefore its utility.

The Statice solution ships with a series of ready-to-use utility evaluations. The Utility Evaluator wraps multiple evaluations and provides a high-level view of the utility of the synthetic dataset without disclosing any of its statistical properties.
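The article does not describe how the Utility Evaluator computes its scores. As a generic illustration of what a single utility check can look like, the sketch below compares the distribution of one categorical column in the original and synthetic data using total variation distance. The function name and the toy columns are assumptions, not part of the Statice API.

```python
from collections import Counter

def marginal_similarity(original, synthetic):
    """Return 1 - total variation distance between two categorical columns.

    1.0 means the category frequencies are identical; 0.0 means no overlap.
    """
    p, q = Counter(original), Counter(synthetic)
    n_p, n_q = len(original), len(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)
    return 1.0 - tvd

# Hypothetical product columns, invented for this example.
original_col = ["car"] * 50 + ["home"] * 30 + ["life"] * 20
synthetic_col = ["car"] * 45 + ["home"] * 35 + ["life"] * 20

score = marginal_similarity(original_col, synthetic_col)
print(round(score, 2))
```

A real evaluation would combine many such checks, for example across all columns and across pairs of columns, into one high-level summary.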

Ensuring GDPR compliance

Provinzial's data team had to provide a robust privacy-preserving solution to meet the GDPR requirements as well as the company's internal privacy regulations to obtain approval.

Synthetic data proved valuable here as well. The process of generating synthetic data completely breaks the one-to-one relationships between original and synthetic records, minimizing the risk of re-identification.

The Statice solution adds further privacy protection layers to the synthesis mechanisms, such as differential privacy (DP). To generate differentially private synthetic records, the models used to create synthetic data learn the original data distribution with a DP algorithm. This way, the synthetic data benefits from the mathematical guarantees that DP provides.
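The idea of learning a distribution with a DP algorithm and then sampling from it can be sketched with the simplest possible model: a histogram with Laplace noise. This is a textbook illustration, not the models Statice uses. The function names, the epsilon value, and the toy data are all assumptions.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_histogram(values, categories, epsilon, rng):
    """Return category counts with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes one count by 1, so the sensitivity is 1.
    """
    counts = Counter(values)
    noisy = {c: counts[c] + laplace_noise(1 / epsilon, rng) for c in categories}
    # Clipping negatives is post-processing and does not weaken the guarantee.
    return {c: max(v, 0.0) for c, v in noisy.items()}

def sample_synthetic(noisy_counts, n, rng):
    """Draw n synthetic values in proportion to the noisy counts."""
    cats = list(noisy_counts)
    weights = [noisy_counts[c] for c in cats]
    return rng.choices(cats, weights=weights, k=n)

# Hypothetical data; the category list is fixed in advance, not derived from the data.
categories = ["car", "home", "life"]
real = ["car"] * 500 + ["home"] * 300 + ["life"] * 200

rng = random.Random(0)
noisy_counts = dp_histogram(real, categories, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(noisy_counts, n=1000, rng=rng)
print(Counter(synthetic))
```

Because the synthetic values are drawn only from the noisy counts, they inherit the DP guarantee of the histogram. A smaller epsilon adds more noise, giving stronger privacy at the cost of fidelity.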

These layers of privacy protection reduce the risks of re-identification to a threshold considered "reasonable" by the GDPR. 

Reducing time-to-data by integrating with the internal workflow

Data access and sharing come with several risks for individuals and organizations. These include the risks of confidentiality and privacy breaches and the violation of other legitimate private interests, such as commercial interests. That's why corporate data sharing and usage systems are often complex to navigate.

For the Provinzial data team, it was essential to reduce time-to-data without changing the internal system. The solution had to fit into the existing data workflow without disrupting it, because changes to that workflow would have been even more difficult to implement.

The team could also build a data architecture using anonymized synthetic data and perform specific tests on it without needing original data anymore, accelerating time-to-data by 4 weeks. 

“The on-premise software was one of the reasons we selected Statice. It was imperative to us to keep all data in-house. Furthermore, the Python package was easily integrated into our data preparation and pipelines, which was a pertinent technical requirement for us.” 
Dr. Sören Erdweg, Artificial Intelligence & Data Development at Provinzial

Statice's software architecture

The combination of utility and privacy evaluators and an on-premise integration of Statice software allowed Provinzial to save time on data quality assurance and compliance approval without disrupting internal data workflows. 

Results: 'Next best offer' model trained on synthetic insurance data

Provinzial's test scenario was to use an existing “next best offer system” model, train it on synthetic data and compare the result to the model trained on real data. “Next best offer” is a form of personalized marketing based on predictive analytics. The next best offer model predicts consumers' needs and shows them offers and products based on their habits. 

The "next best offer" model had the advantage of already being backed by a large dataset with 380 features and about one million customers, covering customer type, address, product usage, interactions, as well as insurance claims and payment history.

The team got a firsthand look at training machine learning models on synthetic data and experienced how easy it is to generate the dataset as well as how training affects the performance of the model. By testing all these relevant steps, they could then use them in a production environment.

They were able to:

  • test synthetic data integration into their internal data workflow
  • deploy the tool on-premise
  • train a machine learning model on synthetic data
  • get a high-quality synthetic version of their customer data for predictive analytics

The Provinzial team performed a three-fold evaluation, focusing on data usability, model usage, and privacy regulations.

Result 1: Privacy

The synthetic dataset was found to be anonymous: the privacy evaluation performed on the synthetic data showed that re-identification was highly unlikely. Although the dataset contained many variables, its large volume acted as an additional shield, minimizing the risk of re-identification.

With this ready-to-use privacy evaluation in the Statice software, the Provinzial team saved about three months' worth of time.
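The article does not detail which checks the privacy evaluation runs. One common heuristic for synthetic data, shown here only as an illustration, is the distance to the closest record: for each synthetic record, how far is the nearest original record? The toy records and the choice of Hamming distance are assumptions.

```python
def hamming(a, b):
    """Count the attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_to_closest_record(synthetic, original):
    """For each synthetic record, return the smallest Hamming distance to any original record."""
    return [min(hamming(s, o) for o in original) for s in synthetic]

# Hypothetical records: (customer type, zip code, product).
original = [
    ("private", "48143", "car"),
    ("private", "48145", "home"),
    ("business", "48147", "liability"),
]
synthetic = [
    ("private", "48143", "home"),
    ("business", "48145", "liability"),
    ("private", "48145", "home"),  # identical to an original record
]

dcr = distance_to_closest_record(synthetic, original)
exact_copies = sum(d == 0 for d in dcr)
print(dcr, exact_copies)
```

A distance of 0 flags a synthetic record that reproduces an original one exactly. On its own this heuristic is not proof of anonymity, which is why evaluations typically combine several risk measures.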

Provinzial's & Statice's synthetic data project results

Result 2: Utility

By comparing the two datasets (original and synthetic), the Provinzial data team found that over 80% of the synthetic data was similar to the original data and usable for the use case. Using Statice's utility evaluations, the team was able to quickly assess the usefulness of the synthetic data and adjust it as needed, saving about a month of manual work.

An example of Statice's utility evaluation

Result 3: ML model performance

The second evaluation phase focused on ML model usage, comparing training on synthetic data with training on real data. The model trained on synthetic data reached 97% of the performance of the model trained on the original dataset. The synthetic dataset passed all tests, performed well, and could be used to train the ML model.
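A figure like this is typically obtained by scoring both models on the same held-out real data and taking the ratio. The sketch below assumes that reading. The article does not name the metric, so accuracy is used as a stand-in, and the labels and predictions are invented toy values, not Provinzial's results.

```python
def accuracy(y_true, y_pred):
    """Return the share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_performance(y_true, pred_real_model, pred_synth_model, metric=accuracy):
    """Return the synthetic-trained model's score as a fraction of the real-trained model's score."""
    return metric(y_true, pred_synth_model) / metric(y_true, pred_real_model)

# Hypothetical held-out real labels and the predictions of both models.
y_true           = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
pred_real_model  = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # trained on original data
pred_synth_model = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0]  # trained on synthetic data

ratio = relative_performance(y_true, pred_real_model, pred_synth_model)
print(f"{ratio:.0%} of the real-data model's performance")
```

Evaluating both models on real held-out data matters: a model trained and tested only on synthetic data could look good while failing on actual customers.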

Additionally, synthetic data has proven useful not just for the use case they tested it for, but also for other applications, slightly different models, and use cases with minimal adaptations.

Ultimately, Provinzial concluded that synthetic data could in fact be used in place of the original data to train their recommender engine 'next best offer' models. With synthetic data, getting approvals for the project would no longer be a problem, and the project could even go ahead in full force. 

Synthetic data benefits

  • Synthetic data would benefit team members and departments that work with data daily, for instance when sharing user data with external companies for analysis. Such projects would not be possible otherwise.
  • Furthermore, it gives the heavily regulated company the flexibility to use synthetic datasets for different internal projects, enhancing them in a way that was not possible before.
  • Data scientists and ML specialists would find synthetic data useful in daily operations, simplifying data access, and allowing them to focus on model optimization, be creative, and test new ideas and hypotheses.

Predictive analytics help insurers gain actionable insights into every aspect of their business, look into the future, and increase customer satisfaction. In order to deploy predictive analytics in a compliant and privacy-preserving manner, organizations will need to utilize data anonymization methods. For Provinzial, synthetic data proved to be the ideal solution. After all, data is only a strategic asset when you can put it to work.
