Learning the ropes of data businesses and privacy-preserving machine learning

April 16, 2020

Today’s blog post comes from Statice’s newest hire, Peter. Peter joined Statice with a diverse background in several B2C businesses, and wanted to share some of his insights on getting up to speed in the very specific niches of digital transformation, privacy-preserving data science and machine learning. 


To be honest, years ago I never thought I would be involved in data-related subjects. I'm creative by nature, and always considered data a very dry and complex topic, one associated with data scientists sitting in labs somewhere, working on deep research. But that is not true (most of the time). All businesses are becoming more and more dependent on data, as data is the key to innovation, optimization of business processes, and building business-to-business partnerships.

Therefore, from an entrepreneurial point of view, we all need to become aware of data and grow our understanding of the topic. According to Harvard Business Review, 53 percent of C-suite executives don’t treat data as a business asset, and 69 percent admit they haven’t created a data-driven organization, so I guess this topic is not new only to me. In my first month at Statice, I’ve already had to dive deep into data-driven business use cases in finance, insurance, and healthcare, and wanted to share some of what I’ve learned so far.


Why it’s not as easy as just creating data lakes

We produce over 2.5 quintillion bytes of data every day. Most of it is stored just in case an organization someday finds a use case for it. This makes sense for businesses that are not yet data-driven. Organizations that already are data-driven need to optimize their data lakes, as sorting through poorly organized data can incur huge costs and a lot of unnecessary working time.

If you collect data in your organization, you certainly know that it needs to be stored somewhere. It’s extremely important that these huge volumes of data are managed properly, and the infrastructure must be planned in advance. There are two popular ways of storing data: data lakes and data warehouses. A data lake stores raw data as it arrives, with no hierarchy or organization imposed on the individual data pieces. The data that lands in a data lake is neither processed nor analyzed. A data warehouse, by contrast, stores only structured data, so before it can start storing anything it requires an initial preparation stage of analyzing data sources and understanding business processes.
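To make the difference concrete, here is a minimal Python sketch of the two storage styles. The events, the column names, and the `load_into_warehouse` helper are all made up for illustration and are not taken from any real system: the "lake" is simply a list of raw records kept as they arrived, while the "warehouse" only accepts records that fit a schema defined up front.

```python
import json

# Hypothetical raw events, as they might land in a data lake: stored as-is,
# with inconsistent fields and no schema enforced at write time.
raw_events = [
    '{"customer": "a17", "amount": "19.90", "currency": "EUR"}',
    '{"customer": "b42", "amount": 5}',
    '{"note": "free-text entry with no amount"}',
]

# Hypothetical warehouse schema: column name -> converter, agreed on in advance.
WAREHOUSE_SCHEMA = {"customer": str, "amount": float, "currency": str}


def load_into_warehouse(lines, schema, defaults=None):
    """Parse raw records and keep only those that fit the schema."""
    defaults = defaults or {}
    accepted, rejected = [], []
    for line in lines:
        # Fill in agreed defaults, then overlay whatever the record provides.
        record = {**defaults, **json.loads(line)}
        try:
            row = {col: convert(record[col]) for col, convert in schema.items()}
        except (KeyError, TypeError, ValueError):
            rejected.append(line)  # does not fit the schema; it stays in the lake only
        else:
            accepted.append(row)
    return accepted, rejected


rows, leftovers = load_into_warehouse(
    raw_events, WAREHOUSE_SCHEMA, defaults={"currency": "EUR"}
)
print(rows)
print(leftovers)
```

The preparation work described above shows up here as the schema and the defaults: someone has to decide them before the first row is stored, whereas the lake accepts everything and postpones that decision.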

It’s therefore important to make sense of collected data and label it correctly from the beginning. Otherwise, you run the risk of creating a data lake that works against you rather than for you. As Adam Wray argues, “companies should view their data in a data supply chain perspective that has a beginning, middle, and end, with an organized plan for how data is brought in, found, explored, and transformed. This approach allows them to maximize the value they extract from their data.”

Having said that, every data-driven company needs to see data and its management as a long-term investment. One possible reason that many C-level managers don’t see data as an asset is that data management is costly, and its output is not always measurable in terms of short-term ROI. It takes quite a bit of forward thinking to fully understand this and to build a concrete strategy around it.

What makes data so innovative

Just as a successful innovation process generally starts with a well-defined challenge, successful data innovation is made significantly easier by well-structured and labeled data. Understanding what information a data set contains, where the data comes from, how sensitive it is, what it may be used for, and so on allows for more efficient data discovery and further usage. In turn, this creates a competitive advantage and helps to support agile processes.
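As a rough illustration of what such labeling can look like, the Python sketch below keeps a small catalog of data set descriptions and searches it. Everything in it is hypothetical: the `DatasetEntry` fields, the sensitivity levels, and the example entries are invented for this post rather than taken from a real catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical label attached to a data set when it is first stored."""
    name: str
    source: str
    sensitivity: str  # assumed levels: "public", "internal", "personal"
    allowed_purposes: set = field(default_factory=set)
    columns: list = field(default_factory=list)


catalog = [
    DatasetEntry("web_traffic", "website logs", "internal",
                 {"analytics"}, ["page", "timestamp"]),
    DatasetEntry("claims_2019", "claims system", "personal",
                 {"fraud_detection"}, ["customer_id", "amount", "timestamp"]),
    DatasetEntry("price_list", "product team", "public",
                 {"analytics", "fraud_detection"}, ["product", "amount"]),
]


def discover(entries, purpose, column):
    """Return names of data sets that may be used for `purpose` and contain `column`."""
    return [
        entry.name
        for entry in entries
        if purpose in entry.allowed_purposes and column in entry.columns
    ]


usable = discover(catalog, "fraud_detection", "amount")
needs_protection = [e.name for e in catalog if e.sensitivity == "personal"]
print(usable)
print(needs_protection)
```

With labels like these in place, a question such as "which data sets with an amount column may we use for fraud detection?" becomes a one-line lookup instead of a manual search.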

While business optimization has always been driven by data, what has changed is the volume of data that can be processed effectively, thanks to the low cost of computing power and the rise of data science tools and processes. These tools, combined with the right knowledge, help to understand data better and unlock its value for product building, business optimization, and monetization.

The use of data allows problems to be solved in different ways. For example, Visa is actively investing in machine learning systems that use data to detect suspicious financial activity and prevent fraud. Last year Visa prevented approximately $25 billion in fraud using these systems. Another example is smart navigation, which feeds crowd-sourced driver data into machine learning and applies predictive analytics to optimize routes in real time, as seen with Waze and other connected mapping solutions.

We have been talking about the immeasurable value and importance of data management; data has effectively become the new oil. As is often the case, new technology brings not only solutions but also new challenges: as with oil, one does not want data spilled anywhere. Data privacy laws help prevent data from being “spilled” everywhere, though they also limit what data can be processed and how.


The impact of data privacy on businesses

Data privacy, GDPR, CCPA, and other related regulatory frameworks have been hot topics over the last few years, as organizations increasingly leverage data to optimize products and processes and to create innovative business models. As data collection has grown exponentially, regulations have had to adapt accordingly to protect consumer privacy and to ensure the responsible use of data.

Stricter data regulations needed to be established in order to protect end users by stopping the indiscriminate exploitation of personal data. The aim of these regulations was to ensure consumer privacy and, as a byproduct, restore trust in the digital economy. From an organizational perspective, it can be argued that such regulations place a compliance burden on firms, compared to firms in countries with more relaxed regulations. The effects of data protection regulation are not uniform, since the types of data firms employ, the ways they use it and their relations to end users vary substantially. Swiss privacy regulations, for example, restrict the transfer of personal data outside of Switzerland. This can lead to inefficiencies for Swiss companies collaborating with partners in other jurisdictions.

In general, privacy regulations can prohibit or obstruct the deployment of certain technologies, and can make collaboration challenging. However, such laws are necessary to ensure consumer safety. Further, regulations can also provide additional incentives for innovation, leading to the creation of new technologies, products and markets, and the discovery of overlooked efficiencies. This leads us to the topic of data anonymization.

How one can manage data privacy while maintaining data value

So that brings us to my biggest takeaway so far: despite onerous regulation and potential challenges to data usage, it is key to be able to use data safely, and to do that, companies need to guarantee data privacy. Data anonymization is the first thing that comes to mind when thinking of a method to access sensitive data that contains personal information. Unfortunately, it is not as simple as it sounds. Anonymous data is defined by the GDPR as information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. That is a high bar, so rendering data anonymous is not a trivial matter.

There are many approaches to anonymization, including pseudonymization, tokenization, k-anonymity, and generalization. They are too complex to cover in detail here (if you’d like to dig into this topic, our CDO Omar did a presentation on it here), but each has pros and cons and works differently depending on a company’s data, resources and use cases. The approach we focus on at Statice is known as privacy-preserving synthetic data: we created a data anonymization engine that produces new, synthetic data records which closely reproduce the statistical characteristics of the sensitive source data. Our engine uses machine learning to learn the structural and statistical characteristics of sensitive data sets, and then uses what it has learned to generate new, privacy-preserving synthetic data, which can be safely used for product building, business optimization, and monetization. So it is possible to maintain privacy while at the same time unlocking the value of data.
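For readers who want a feel for two of the techniques named above, here is a small Python sketch of generalization and a k-anonymity check. It is a toy example on made-up records, with hypothetical column names and a hand-picked generalization rule; it is not how the Statice engine works, and it has nothing to do with synthetic data.

```python
from collections import Counter

# Made-up records; "age" and "zip" are treated as quasi-identifiers,
# "diagnosis" as the sensitive attribute.
records = [
    {"age": 34, "zip": "10115", "diagnosis": "flu"},
    {"age": 36, "zip": "10117", "diagnosis": "asthma"},
    {"age": 38, "zip": "10119", "diagnosis": "flu"},
    {"age": 52, "zip": "20095", "diagnosis": "diabetes"},
    {"age": 57, "zip": "20099", "diagnosis": "flu"},
]

QUASI_IDENTIFIERS = ("age", "zip")


def generalize(record):
    """Coarsen quasi-identifiers: age becomes a decade range, zip keeps 3 digits."""
    decade = record["age"] // 10 * 10
    return {
        **record,
        "age": f"{decade}-{decade + 9}",
        "zip": record["zip"][:3] + "**",
    }


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


k_before = k_anonymity(records, QUASI_IDENTIFIERS)
generalized = [generalize(r) for r in records]
k_after = k_anonymity(generalized, QUASI_IDENTIFIERS)
print(k_before, k_after)
```

In the raw records every person is unique on age and zip code, so k is 1. After generalization each person shares those values with at least one other record, at the cost of less precise data. Note that k only describes the quasi-identifiers: it says nothing about the sensitive column itself, which is one reason the pros and cons differ from case to case.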

It’s been a hectic first month, but I’ve already had the opportunity to learn a lot and get some great insights into digital transformations going on in multiple industries. I hope my learnings were interesting and valuable for you, and look forward to sharing more in the future! 

If you want to stay up to date with more news and insights on data privacy, the Statice team, and our community, sign up to our newsletter below. If you have a concrete use-case you’d like to discuss with us, you can book a call directly here.
