What the new European data protection regulation has in store for the future of AI and data-driven innovation.
In 1947 the transistor was invented by John Bardeen, Walter Brattain, and William Shockley. Many scientists and partners contributed along the way, but in the end it was three men who really brought the transistor to life: three men who brought vastly different talents and strengths to the table. John Bardeen was the thinker, a man who could look at an event no one else comprehended and go beyond common understanding to explain it. Walter Brattain was the tinkerer, a builder who could put together any contraption asked of him. William Shockley was the visionary, a seer who predicted how important the transistor would be long before anyone else.
Their unique skills, brought together in one laboratory, created the perfect ground for a truly world-changing invention.
While this invention is a perfect example of many different innovation paradigms, it was the combined expertise of all three men that made the transistor possible.
If we fast-forward to today, this holds true more than ever. Nowadays, the knowledge possessed by a single individual is rarely enough to spark groundbreaking inventions.
In our increasingly customer-centric economy, innovation begins with understanding people. Companies do this in part by measuring, tracking, and storing personal data on individuals in order to quantify personal preferences and use this knowledge to tailor experiences and products to each customer individually. Personal data is the core resource of modern services and products and underpins the majority of modern technological advances and discoveries. This holds not only for scientific settings but also for corporate R&D.
So far, companies have relied heavily on data they collected themselves. Now, with our behavior becoming more complex and digital services more diverse, companies increasingly leverage multi-faceted data from various outside sources in order to understand their consumers.
This also holds for the supply side of data. According to a study by AIG, 75% of companies globally would share data if doing so provided them with benefits such as better products and services for their customers.
A recent example of this is an agreement between the city of Washington, DC and Uber. Both entities collaborate on pooling ride data from numerous sources to improve the city's infrastructure, establish a curb data standard, and ultimately enable better data-driven services. In the future, we expect to see more and more companies sharing and leveraging multiple sources of relevant data for their product innovations.
At least in theory.
Since the 25th of May 2018, the GDPR has changed this notion. The General Data Protection Regulation came into force with the main goal of giving consumers back control over their data. This has been long overdue, with recent cases in the media, such as Facebook, showing the immense dangers of companies not taking user privacy seriously enough.
At the same time, the GDPR makes it increasingly difficult for companies to leverage personal customer data for secondary use cases, that is, all cases where data is used for purposes outside the one it was initially collected for. This often includes testing and developing new data-driven products such as AI and machine learning applications.
One of the most important articles in the GDPR is Article 6, which governs the legal bases for compliant data processing. Besides processing data for compliance with a legal obligation, performance of a contract, the vital interests of a data subject (the consumer), or a task carried out in the public interest, the legal bases for collecting and processing data fall into two big groups. On the one hand, companies are allowed to collect customer data if they obtain explicit and informed consent from their customers to do so. On the other hand, companies may process customer data if it is necessary for the purposes of the legitimate interests pursued by the controller (the company).
The big problem is that the concept of "legitimate interest" is very broadly defined. Without clear guidance from the courts, what constitutes a legitimate interest is currently up for interpretation. This means that, for now, the safest way for companies to use data for product innovation is to obtain consent from data subjects to do so.
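In practice, relying on consent means every processing step has to be gated on a purpose-specific consent flag. A minimal sketch of what that could look like, using an entirely hypothetical data model (the record fields and purpose names are illustrative, not prescribed by the GDPR):

```python
from dataclasses import dataclass, field

@dataclass
class CustomerRecord:
    """Hypothetical customer record carrying purpose-specific consent flags."""
    customer_id: str
    data: dict
    consents: set = field(default_factory=set)  # purposes the customer agreed to

def records_usable_for(records, purpose):
    """Keep only records whose owners explicitly consented to this purpose."""
    return [r for r in records if purpose in r.consents]

records = [
    CustomerRecord("a1", {"age": 34}, {"billing", "product_development"}),
    CustomerRecord("b2", {"age": 29}, {"billing"}),
]

# Only "a1" consented to secondary use for product development.
usable = records_usable_for(records, "product_development")
print([r.customer_id for r in usable])  # ['a1']
```

The point the sketch makes concrete: a secondary use case such as product development can only ever draw on the subset of customers who opted in to exactly that purpose, which can shrink the available dataset dramatically.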
Beyond the obvious problems of unrestricted tracking and sharing of data exposed in high-profile media cases such as Facebook, there has been a general resentment towards the quantification of our behavior.
In particular, the advertising industry's thirst for data has led to general frustration with omnipresent data tracking.
This has resulted in the current negative stigma around other data-driven use cases such as AI. Given the current lack of education around the importance of data in building personalized products, it is highly unlikely that the majority of customers will consent to the use of their data for product tests and innovation, especially if companies share this data with innovation partners.
Another big change comes with Articles 13 to 17 of the GDPR. They require companies to provide a whole new degree of transparency when it comes to using personal data, for example in machine learning models and AI products.
In a nutshell, Articles 13 through 17 give the data subject, in this case the customer, complete transparency on where their data is stored and how it is used, as well as the right to request that their data be permanently deleted.
Article 15 is worth some extra attention. It mandates that personalization resulting from automated data processing requires explanation by the data controller. This does not necessarily mean that companies need to fear opening up their data-driven algorithms to external parties, but they do need to be able to explain the basic functionality of their algorithms to a data subject. This might also hold for testing and internal product development before such products are rolled out and commercialized. Moreover, the ability for consumers to opt out will require more granular data and model management, so that personal data can easily be replaced or deleted and data-driven models can be retrained on the updated dataset.
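The "granular data and model management" point can be made concrete with a toy sketch (not any company's actual pipeline): honoring a right-to-erasure request means removing the subject's rows and retraining every model that was fit on the old dataset. The record layout and the mean-based "model" are deliberately simplistic stand-ins:

```python
# Toy sketch of granular data/model management under the right to erasure:
# delete the subject's rows, then retrain anything fit on the old data.

def erase_subject(dataset, subject_id):
    """Return a copy of the dataset without the given data subject's rows."""
    return [row for row in dataset if row["subject_id"] != subject_id]

def train_mean_model(dataset, feature):
    """Toy 'model': the mean of one feature over the training data."""
    values = [row[feature] for row in dataset]
    return sum(values) / len(values)

dataset = [
    {"subject_id": "a1", "spend": 100.0},
    {"subject_id": "b2", "spend": 300.0},
    {"subject_id": "c3", "spend": 200.0},
]

model = train_mean_model(dataset, "spend")   # trained on all subjects: 200.0
dataset = erase_subject(dataset, "b2")       # erasure request from "b2"
model = train_mean_model(dataset, "spend")   # retrained without "b2": 150.0
```

Even in this trivial case, deletion changes the model's output, which is why erasure requests cannot be handled at the data layer alone: downstream models carry traces of the deleted records until they are retrained.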
What is more difficult, though, is the apparent perception among lawmakers of how machine learning algorithms function. Algorithmic transparency means that a company can show how a decision is reached. Since machine learning algorithms are not rule-based software, this becomes increasingly problematic. Once again, the first court cases will need to show how much detail companies will be required to provide when describing and demonstrating their use of data for machine learning.
With a focus on leveraging data for innovation and product development, the GDPR in its current state results in two major problems: obtaining valid consent for secondary data use is difficult, and the transparency requirements around automated processing are hard to meet for machine learning systems.
This means that leveraging customer data for data-driven innovation is becoming increasingly difficult. Companies often used to gather data with the idea of using it in future projects or developments without a clear idea of what these would look like. The GDPR brings a lot of uncertainty around collecting the necessary data for this in the first place.
Since the exploration of data and the subsequent uncovering of trends and insights lie at the heart of data science, this all raises two big questions: How can we support the use of data for product innovation? How can we enable data-driven collaboration while protecting consumers?
The answer to both is data anonymization — because truly anonymous data is not subject to data privacy regulations.
The GDPR clearly states that the use of truly anonymous data is exempt from the regulation.
Truly anonymous data is any data processed in such a way that the privacy of individuals is preserved, in the sense that they do not incur any harm from the data being released.
Therefore, the use of anonymous data renders the aforementioned problems moot and, furthermore, allows companies to freely share data with one another, all while protecting their customers, as anonymous data by definition means that the re-identification of an individual data subject is impossible.
And here lies the problem: truly anonymizing data is difficult. Anonymizing data in a privacy-preserving manner takes time, resources, and significant domain expertise. Additionally, even when the generation of truly anonymous data succeeds, it often comes at a significant loss in data utility, potentially rendering the anonymous dataset useless.
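The privacy/utility tension can be illustrated with one standard technique from the privacy literature, a differentially private count using Laplace noise. This is a generic textbook sketch, not Statice's method, and the records and epsilon values are illustrative:

```python
import math
import random

def dp_count(records, predicate, epsilon):
    """Differentially private count: true count plus Laplace(1/epsilon) noise.

    Smaller epsilon means stronger privacy but a noisier, less useful answer,
    which is exactly the privacy/utility trade-off described above.
    """
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(0, 1/epsilon) via the inverse CDF.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

records = [{"age": a} for a in (23, 35, 41, 29, 52, 38)]
over_30 = lambda r: r["age"] > 30

print(dp_count(records, over_30, epsilon=0.1))  # very noisy answer
print(dp_count(records, over_30, epsilon=5.0))  # usually near the true count of 4
```

Tuning epsilon is the whole game: strong privacy guarantees drown the signal in noise, while high-utility answers leak more information, and doing this well across a full dataset is what takes the time and expertise described above.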
This is why we built Statice. Statice makes anonymizing data easy while maintaining data utility and data granularity. By leveraging recent advances in machine learning and state-of-the-art privacy techniques, Statice enables companies to release highly granular datasets with no risk of identifying a single individual.
We empower companies to open up their new synthetic data in a GDPR-compliant manner for product development, training new machine learning algorithms, and unlocking industry-wide insights — internally, or collaboratively with partners.
Get in touch with us to learn how privacy-preserving synthetic data can be of help to you and the data-driven future of your business.
Contact us and get feedback instantly.