How to make the most out of time-series data with synthetic data

By Beatrice Milik, November 23, 2020 - 4 minutes read

A definition of time-series data and why we use it 


Have you ever wondered what self-driving Teslas, autonomous Wall Street trading algorithms, or smart homes have in common?


For one thing, they are examples of how our world is changing at astonishing speed due to our ability to capture and analyze more data faster than ever before.


And if we take a closer look, we notice that each of the above technologies requires a special kind of data:

  • Self-driving cars continuously collect data about how their local environment and surroundings are changing. 
  • Autonomous trading algorithms continuously collect data on how the markets are varying.
  • Smart homes continuously monitor activity inside and outside of them to regulate temperature, identify intruders, and respond to our every beck and call (“Alexa, play some workout music!”).

That special kind of data is time-series data. Simply put, time-series data is a set of data points collected and ordered in time. Think of a sequence of events where one happens after the other: if you change their order, you alter the meaning of the data. In a time series, history influences future development. If the Wall Street stock market crashes, it is certainly going to affect the future!
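To see why order matters, here is a minimal Python sketch. The transaction amounts are made up for illustration. The same five values produce the same final balance in any order, but the history of the account looks very different once the order changes.

```python
from itertools import accumulate

# Hypothetical account transactions in the order they happened
# (positive = deposit, negative = spending).
transactions = [100, -30, -50, 40, -20]


def running_balance(amounts):
    """Return the balance after each transaction, in the given order."""
    return list(accumulate(amounts))


original = running_balance(transactions)
# Same values, but sorted instead of kept in chronological order.
reordered = running_balance(sorted(transactions))

print(original)   # [100, 70, 20, 60, 40]
print(reordered)  # [-50, -80, -100, -60, 40]
```

In the original order the account never goes below zero. In the reordered version it is overdrawn most of the time, even though both end at the same balance.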

Example of time-series data: chart comparing the Dow Jones Industrial Average during the 2020 coronavirus crash and during the Wall Street Crash


Though it is not always obvious, time-series data is practically everywhere: look at the sequence of transactions in your bank account for 2020. The data you have there is time-series data, and it is full of patterns relating to your life. In this particular example, these patterns carry information on your spending habits. Measurements in healthcare or the physical sciences are often time-series data too: a series of blood pressure readings over a month, or daily pollution levels.

Time-series data influences all industries and forms the basis for accurate forecasting and decision making. Whether it is finance, insurance, healthcare, mobility, or marketing, you find time series wherever you can ask the question, "How will a change made now influence my system later?"



Why aren’t businesses making the most of time-series data?


Today, a business's success does not lie in the amount of data it collects but rather in how it uses this data to innovate and outperform the competition. However, the costs and implications of a privacy breach most often outweigh the benefits of innovation. As a result, organizations such as banks and hospitals store vast amounts of data in silos that they don’t access or leverage due to data privacy concerns.


Time-series data falls into the category of data that businesses need to protect. This type of data can carry a lot of information about individuals’ behaviors. The example of bank transactions shows that it can contain a lot of information unique to a person and quickly lead to their identification.

In general, while a single data point should not be enough to identify an individual, a sequence of time-series data points can highlight patterns that would expose a specific person's identity. Therefore, time-series data is subject to privacy protection measures that prevent teams from accessing and processing it.
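The following Python sketch illustrates this point on a toy scale. The customers and purchase histories are invented, and the "share of unique customers" measure is only an illustrative assumption, not a formal re-identification metric.

```python
from collections import Counter

# Hypothetical purchase histories: one ordered list of merchant
# categories per customer.
histories = {
    "customer_a": ["grocery", "fuel", "pharmacy"],
    "customer_b": ["grocery", "pharmacy", "fuel"],
    "customer_c": ["fuel", "grocery", "pharmacy"],
    "customer_d": ["grocery", "fuel", "grocery"],
}


def unique_share(keys):
    """Share of customers whose key is shared with no one else."""
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# A single data point (the first purchase) vs. the full ordered sequence.
single_point = [history[0] for history in histories.values()]
full_sequence = [tuple(history) for history in histories.values()]

print(unique_share(single_point))   # 0.25
print(unique_share(full_sequence))  # 1.0
```

With one data point, only one of the four customers stands out. With the full sequence, every customer has a distinct pattern, even though no names appear anywhere in the data.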


At the same time, businesses need such data about their customers to offer better services, improve decision-making at an organizational level and increase revenue opportunities. But it should not come at the expense of data protection and security. There are ways of analyzing sensitive time-series data and drawing valuable insights while protecting customers' privacy and complying with current and future data protection regulations. 



How can businesses overcome these challenges?


Businesses can protect the privacy of their customers by applying data protection methods.

They can, for instance, use pseudonymized or anonymized data. Pseudonymization refers to the process of replacing or removing direct identifiers, such as names or account numbers, so that the data can no longer be attributed to a specific person without additional information that is kept separately. Anonymization refers to irreversibly transforming sensitive data to prevent the identification of individuals.

Anonymization vs. pseudonymization

While anonymization irreversibly transforms data to prevent the re-identification of individuals, pseudonymization only alters direct, and sometimes indirect, identifiers.
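As a rough illustration of what pseudonymization leaves behind, here is a minimal Python sketch that replaces a name with a keyed hash. The key, the records, and the `pseudonymize` helper are all hypothetical, and a keyed hash is only one of several possible pseudonymization techniques.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately
# from the pseudonymized data.
SECRET_KEY = b"example-key-kept-separately"


def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed-hash token."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


# Made-up transaction records.
records = [
    {"name": "Alice Example", "day": 1, "amount": -30},
    {"name": "Alice Example", "day": 2, "amount": -50},
    {"name": "Bob Example", "day": 1, "amount": -20},
]

pseudonymized = [
    {"id": pseudonymize(r["name"]), "day": r["day"], "amount": r["amount"]}
    for r in records
]

for row in pseudonymized:
    print(row)
```

The names are gone, but the same person always maps to the same token, so their sequence of transactions stays linked together. That remaining behavioral pattern is why pseudonymized time-series data can still lead back to an individual.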

Modern data privacy laws like the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA) differentiate between pseudonymized and anonymized data. The GDPR also states that anonymous data isn’t subject to personal data protection laws anymore.

From a legal perspective, this distinction is crucial. Organizations can only process pseudonymized data under the conditions stipulated by data protection laws. On the other hand, the processing of anonymized data is not as heavily regulated.

So, if they resort to the anonymization of time-series data, companies can be compliant with data protection regulations at all times and efficiently leverage data for various use cases.

A popular anonymization technology available now is differentially-private synthetic data generation. Synthetic data is artificially generated data with statistical properties similar to those of the original sensitive data. The generation method can be designed to satisfy differential privacy, a mathematical definition of privacy that provably limits how much any single individual's data can influence the output. In this case, the synthetic data is labelled differentially-private.
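To give a feel for what differential privacy means in practice, here is a minimal Python sketch of the Laplace mechanism applied to a simple count. This is a textbook illustration, not the method used to generate synthetic data. The data, the `private_count` helper, and the choice of epsilon are assumptions made for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of matching records.

    Assumes each individual contributes exactly one record, so adding or
    removing one person changes the count by at most 1 (sensitivity 1).
    """
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)
# Hypothetical spending amounts, one per individual.
daily_spend = [12, 250, 40, 310, 95, 500]
noisy = private_count(daily_spend, lambda x: x > 200, epsilon=1.0, rng=rng)
print(noisy)  # the true count is 3; the released value is 3 plus random noise
```

A smaller epsilon means more noise and stronger privacy. Note the one-record-per-person assumption: in time-series data a single individual contributes many records, which is part of what makes privacy guarantees harder to achieve there. This sketch is for intuition only and is not hardened for production use.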

Differentially-private synthetic data is a high-value resource for data analysis, with privacy guarantees strong enough to meet the regulations’ definition of anonymized data. It allows organizations to achieve both high data utility and strong privacy, and it offers many benefits for customer-centric industries. Thus, synthetic time-series data generation provides a way to analyze time-series data that would otherwise be too sensitive.


The value of differentially-private synthetic time-series data generation


Businesses draw a lot of benefits from analyzing time-series data. If it were not for time-series data analysis, electric energy suppliers, for example, could not forecast prices in the electricity markets based on peaks of electricity consumption and electricity production. While this is a relatively simple example, it illustrates how valuable time-series data is. High-value decisions are made by analyzing time-series data.


Multiple real-life applications in various industries are producing time-series data. To take advantage of it without compromising privacy, businesses can generate and use differentially-private synthetic time-series data.


At Statice, we enable companies in the healthcare, automotive, insurance, and finance sectors to make the most out of their data while remaining compliant. To do so, we provide a synthetic data engine that builds on differentially-private deep learning models to generate privacy-preserving synthetic data. The models can learn the statistical properties of original datasets and generate new artificial data points with similar statistical utility. Privacy mechanisms guarantee the full anonymity and privacy-compliance of the synthetic data.

Generating privacy-preserving synthetic data that mimics time-series data's statistical properties but doesn't contain any of its personal information.


Time-series synthetic data generation is, however, a lot more complicated than generating synthetic tabular data. The main difference is that tabular data generation assumes that one individual's information is stored in a single row of the data. In contrast, time-series data contains time-sensitive information spread across several columns and rows.
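The difference in layout can be sketched in a few lines of Python. The column names and values below are hypothetical and only meant to show the two shapes of data.

```python
from collections import defaultdict

# Tabular data: one row holds everything known about one individual.
tabular = [
    {"customer": "a", "age": 34, "city": "Berlin"},
    {"customer": "b", "age": 51, "city": "Munich"},
]

# Time-series data: one individual's information spans many time-ordered rows.
events = [
    {"customer": "a", "day": 1, "amount": -30},
    {"customer": "b", "day": 1, "amount": -20},
    {"customer": "a", "day": 2, "amount": -50},
    {"customer": "a", "day": 3, "amount": 40},
]

# Rebuild each individual's sequence by grouping rows in time order.
sequences = defaultdict(list)
for row in sorted(events, key=lambda r: r["day"]):
    sequences[row["customer"]].append(row["amount"])

print(dict(sequences))  # {'a': [-30, -50, 40], 'b': [-20]}
```

A generator for the first table can treat each row independently. A generator for the second has to reproduce whole sequences, of varying length, whose rows depend on each other.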

The length of the time series also influences the difficulty of this problem: the more history there is, the harder it is to learn the original data's characteristics and replicate them in the generated synthetic data. With time-series data, the deep learning model does not generate an isolated data point but rather a data point that depends on, for example, 500 other previous and future data points.
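One common way to express this dependency is to pair each value with the window of values that precede it. The Python sketch below uses a window of 3 instead of 500 to keep the output readable. The `make_windows` helper is hypothetical, looks only at past values, and is not a description of any particular model's input format.

```python
def make_windows(series, window):
    """Pair each value with the `window` values that precede it."""
    return [
        (series[i - window:i], series[i])
        for i in range(window, len(series))
    ]


# Made-up sensor readings in chronological order.
readings = [3, 5, 4, 6, 8, 7]
pairs = make_windows(readings, window=3)

print(pairs[0])    # ([3, 5, 4], 6)
print(len(pairs))  # 3
```

Each generated value has to be consistent with its whole context, so the longer the window, the more structure the model must learn and reproduce.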

The ability to generate and use synthetic time-series data, however, opens a series of doors for enterprises worldwide: unlocking collaboration at an organizational or industry level, sharing data safely and efficiently, complying with data privacy regulations, and facilitating innovation through new applications such as churn modelling in insurance, identifying money-laundering patterns in finance, or detecting cancer in healthcare.

