Differential privacy has become a cornerstone of privacy-preserving data analysis. It provides computer scientists and data scientists with a way to prevent individual records from being identified, by adding noise to data in a controlled way while still allowing valuable insights to be extracted from the data.
In this blog post, we explore the origins and developments of differential privacy and provide a high-level explanation of its primary mechanisms. We also discuss differentially private machine learning and synthetic data and how enterprises can use them.
Differential privacy is a mathematical framework for ensuring the privacy of individuals in datasets. It can provide a strong guarantee of privacy by allowing data to be analyzed without revealing sensitive information about any individual in the dataset.
The concept was first introduced in 2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith in the paper "Calibrating Noise to Sensitivity in Private Data Analysis", and by Dwork in "Differential Privacy". In these papers, the authors proposed a mathematical framework for formally defining and achieving privacy in data analysis, which they called "differential privacy."
According to their definition, the presence or absence of any individual record in the dataset should not significantly affect the outcome of the mechanism.
A "mechanism" is any computation that can be performed on the data. Differential privacy deals with randomized mechanisms: analyses whose output varies probabilistically for a given input.
Thus, a mechanism is considered differentially private if the probability of any outcome occurring is nearly the same for any two datasets that differ in only one record.
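This intuition has a precise formulation. A randomized mechanism M satisfies ε-differential privacy if, for every pair of datasets D and D′ differing in a single record, and for every set S of possible outcomes:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

The parameter ε (epsilon) quantifies "nearly the same": the smaller ε is, the closer the two output distributions must be, and the stronger the privacy guarantee.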
This definition of differential privacy has since become the standard for measuring privacy in data analysis. Researchers and practitioners have widely adopted it in the field, and enterprises are starting to, as we’ll see below.
One of the key features of differential privacy is that it provides a privacy guarantee that holds regardless of what an adversary knows or does when performing attacks on the data. In this context, an adversary is a person or entity that is trying to learn sensitive information about individuals from the output of a data analysis.
The privacy guarantee holds even if the adversary has unlimited computing power and complete knowledge of the algorithm and system used to collect and analyze the data. So even if the adversary develops new and sophisticated methods for extracting sensitive information from the data, or if additional information becomes available, differential privacy still provides the same privacy guarantee, making it future-proof.
Differential privacy is a flexible concept that can be applied to various statistical analysis tasks, including those that may not yet have been invented. As new statistical analysis methods are developed, differential privacy can be applied to them to provide strong privacy guarantees.
In the next section, we look at how differential privacy achieves this through the addition of noise that ensures that any individual's information is not disclosed while still allowing for the extraction of insights from the data.
Several mechanisms are commonly used in differential privacy to ensure the privacy of individuals in datasets.
One of the most commonly used mechanisms to answer numerical questions is the addition of calibrated noise: adding enough noise to the output to mask the contribution of any possible individual in the data while still preserving the overall accuracy of the analysis. One concrete example of adding noise for differential privacy is the Laplace mechanism.
In this mechanism, noise is added to the output of a function. The amount of noise depends on the sensitivity of the function and is drawn from a Laplace distribution.
The sensitivity of a function reflects the amount the output can vary when the input changes. More accurately, it is the maximum change that can occur in the output if a single person is added to or removed from any possible input dataset.
The concept of sensitivity is important because it helps to determine the amount of noise that needs to be added to the function's output to protect individuals' privacy in any possible input dataset of that function. The larger the sensitivity is, the more noise must be added.
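Written out, the (global) sensitivity of a function f is:

```latex
\Delta f = \max_{D \sim D'} \bigl| f(D) - f(D') \bigr|
```

where D ∼ D′ ranges over all pairs of datasets that differ in a single record. For a counting query, for instance, Δf = 1.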
For example, suppose we have this database containing people with a particular medical condition. We want to release the number of people in a city with that condition while preserving their privacy.
If only a few patients in the city have the condition, and someone knows that a particular person is in the database, releasing the exact count could reveal that person's medical status. We can use the Laplace mechanism to add noise to the count of people with the condition and prevent any individual from being identified.
The amount of noise added to the data would be related to the sensitivity of our function. Since each patient's contribution can change the result of the count by a maximum of one, our sensitivity is equal to one, and we would add noise accordingly. By adding this noise, we can ensure that the released count is differentially private.
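This counting example can be sketched in a few lines of Python; the dataset size and ε value below are purely illustrative, and the function name is our own:

```python
import numpy as np

rng = np.random.default_rng()

def laplace_count(true_count, sensitivity=1.0, epsilon=1.0):
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    # The noise scale is sensitivity / epsilon: a larger sensitivity or a
    # smaller epsilon (stronger privacy) both mean more noise.
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: 42 patients in the city have the condition.
noisy_count = laplace_count(42, sensitivity=1.0, epsilon=0.5)
```

Each run returns a different noisy value, but over many releases the noise averages out to zero, so aggregate accuracy is preserved.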
Another common mechanism of differential privacy is known as randomized response. It involves asking individuals to respond to a "yes" or "no" question in a randomized manner, with a certain probability of giving a truthful answer and a certain probability of giving a random response.
For example, suppose we want to collect data on sensitive topics such as criminal behavior or political views. To protect the privacy of individuals in the dataset, we could use the randomized response mechanism by asking them to respond to a "yes" or "no" question with a 50% probability of giving a truthful answer and a 50% probability of giving a random response.
Randomized response allows for collecting data while still protecting the privacy of individuals by ensuring that an individual's responses can be claimed to be the product of chance rather than their true response (that is, this technique introduces plausible deniability: individuals may always claim that the mechanism forced them to lie).
One limitation of this mechanism is that it can introduce bias into the data if not used carefully. For example, if the probability of giving a truthful answer is too low, the raw responses may not be representative of the population. However, because the randomization probabilities are known, this bias can be corrected when estimating population-level statistics.
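The scheme, together with the standard bias correction, can be sketched as follows; the 30% true "yes" rate and the sample size are illustrative, and the function names are our own:

```python
import random

def randomized_response(true_answer, p_truth=0.5):
    """Report the true yes/no answer with probability p_truth, otherwise a coin flip."""
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5  # fair coin

def estimate_true_rate(responses, p_truth=0.5):
    """Invert the known randomization to estimate the true 'yes' rate."""
    observed = sum(responses) / len(responses)
    # E[observed] = p_truth * true_rate + (1 - p_truth) * 0.5
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Simulate 10,000 respondents, 30% of whom would truthfully answer "yes".
responses = [randomized_response(random.random() < 0.3) for _ in range(10_000)]
```

No single response reveals anything definitive about the respondent, yet the corrected aggregate estimate lands close to the true 30% rate.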
A very powerful and distinguishing feature of differential privacy is the ability to quantify the maximum amount of information that can be disclosed. This upper bound on “information leak” is referred to as the privacy budget.
The privacy budget, conventionally denoted by the Greek letter ε (epsilon), determines the amount of noise that needs to be added to the data to achieve a certain level of privacy: the smaller the budget, the stronger the guarantee and the more noise required.
Alternatively, the privacy loss can be computed a posteriori, after noise has been added to the data, to assess the level of privacy achieved.
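Under basic sequential composition, the ε values spent on successive analyses simply add up, so the remaining budget can be tracked explicitly. A toy sketch, where the class and its interface are our own invention:

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic (additive) composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        # Refuse any query that would push total loss past the budget.
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)  # first query
budget.spend(0.4)  # second query
# budget.spend(0.4) would now raise: only 0.2 of the budget remains.
```

Real deployments often use tighter accounting methods than simple addition, but the principle of a finite, explicitly tracked budget is the same.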
In the next section, we present some examples of the applications of differential privacy.
Here is a concrete example of using differential privacy in a medical dataset. Suppose we have a dataset containing individuals' medical records, and we want to release the number of individuals in the dataset with a certain medical condition while preserving their privacy.
Releasing the information without any noise or perturbation could lead to identifying individual patients in the dataset. If the dataset is small or the medical condition is unevenly distributed, it may be possible to identify individual records from the released count, with or without additional information or assumptions.
To achieve differential privacy in this scenario, we could follow these steps:
1. Compute the true count of individuals with the condition.
2. Determine the sensitivity of the count, which is one, since adding or removing a single record changes the count by at most one.
3. Choose a privacy budget ε according to the desired level of privacy.
4. Add noise drawn from a Laplace distribution with scale equal to the sensitivity divided by ε.
5. Release the noisy count.
Differential privacy has found several applications as a tool to protect the privacy of individuals while still allowing for the extraction of insights from data.
The U.S. Census Bureau uses differential privacy to protect the privacy of individuals while still allowing for the release of aggregate statistics about the population.
Differential privacy can also support the collection of data about how users interact with a product or service, such as which features are used most often, without revealing personal information about individual users. Apple, for example, uses this method to collect data about how users interact with its devices.
The University of California, Berkeley uses differential privacy to study the spread of infectious diseases, such as influenza and COVID-19, without revealing the identities of individual patients.
The Healthcare Cost and Utilization Project (HCUP) uses differential privacy to study healthcare utilization and costs across the United States while still protecting the privacy of individual patients.
Differential privacy can also support the generation of synthetic data for use in data-driven decision-making, such as in public policy or business planning, without revealing sensitive information about individuals.
In the next section, we explore the topics of differentially private algorithms, machine learning and synthetic data.
Differentially private machine learning algorithms are designed to protect the privacy of individuals in the training data. They use techniques from differential privacy to add noise while still allowing the algorithm to learn from the data and make accurate predictions or decisions.
You can apply differential privacy to machine learning algorithms in several ways. One common approach is to add noise to the data during the training process. Other approaches involve using differential privacy to protect the outputs of the algorithm, such as the predictions or decisions made by the model, or partitioning the data and aggregating the response of a set of models each trained on a single data partition.
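The first of these approaches, popularized as DP-SGD, clips each example's gradient and adds calibrated Gaussian noise before the model update. A minimal NumPy sketch of that step, with illustrative clip_norm and noise_multiplier values and a function name of our own:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def private_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """Clip each per-example gradient to clip_norm, sum, add Gaussian noise, average."""
    clipped = [
        # Scale down any gradient whose L2 norm exceeds clip_norm,
        # bounding each example's influence on the update.
        g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
        for g in per_example_grads
    ]
    total = np.sum(clipped, axis=0)
    # The noise scale follows the clipping norm, masking any one example.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

grads = [np.array([3.0, 4.0]), np.array([0.3, -0.1])]
noisy_grad = private_gradient(grads)
```

Clipping is what makes the sensitivity of the sum bounded, which in turn lets the Gaussian noise deliver a formal privacy guarantee across training steps.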
Training with differential privacy can prevent a machine learning model from revealing sensitive information. For example, it can prevent an algorithm trained to predict the likelihood of a patient developing a certain medical condition from leaking information about the records of patients treated for that condition in the past.
Differentially-private synthetic data is a type of synthetic data that is generated using differential privacy techniques. Synthetic data, which is generated by a computer algorithm instead of being collected from real-world sources, has many applications, such as in testing machine learning algorithms or privacy-preserving data analysis.
To generate synthetic data with differential privacy guarantees, a computer algorithm creates data similar to the original dataset, with noise added while training the generative model.
After setting a privacy budget, the algorithm can be used to generate the synthetic data. The resulting data has the property of differential privacy: it is extremely difficult to infer individual records of the original dataset from the newly generated data.
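As a toy illustration of the idea, one of the simplest differentially private generative models is a Laplace-noised histogram from which synthetic records are sampled. Production systems use far more sophisticated models, but the sketch below, whose function name and parameters are our own, shows the principle for a single categorical attribute:

```python
import numpy as np

rng = np.random.default_rng()

def dp_synthetic_sample(records, categories, epsilon=1.0, n_samples=1000):
    """Generate synthetic categorical data from a Laplace-noised histogram.

    Each record falls into exactly one histogram bin, so the
    sensitivity of the histogram is 1.
    """
    counts = np.array([sum(r == c for r in records) for c in categories], dtype=float)
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=len(categories))
    probs = np.clip(noisy, 0.0, None)  # negative noisy counts become zero
    probs /= probs.sum()               # assumes at least one positive noisy count
    return rng.choice(categories, size=n_samples, p=probs)
```

The synthetic records preserve the overall distribution of the attribute while the noise hides each individual's contribution to the histogram.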
Synthetic data generated with differential privacy has certain advantages over standard synthetic data. For example, it enables privacy-preserving data analysis and data sharing between organizations without risking the privacy of individuals.
This layer of protection is critical at a time when synthetic data generation is gaining popularity, driven by the variety of its uses: training machine learning algorithms, testing algorithms, and sharing data with third parties.
Revealing sensitive information through a privacy breach may expose an organization to regulatory consequences. Many data privacy laws and regulations require organizations to protect individuals' privacy when using their personal data, and differential privacy is one way to do this.
As for any technology or framework, differential privacy has its limitations.
The main one is the trade-off between privacy and utility. Differential privacy adds noise to data to protect the privacy of individuals, but this noise can also reduce the utility of the data, making it less accurate or useful for certain types of analysis. This trade-off can be difficult to manage and requires careful calibration to balance privacy and utility.
The lack of standardization and agreement on best practices also remains a challenge. Differential privacy is a relatively new field, and there is currently no standardized approach to implementing it. This can make it difficult to compare and evaluate differentially private algorithms and limits the ability to develop a common framework for differential privacy.
As we've seen, differential privacy is about providing a solid guarantee of privacy while allowing data to be used for analysis without revealing sensitive information about any individual in the dataset.
This is achieved through, for example, the use of randomized mechanisms that add noise to ensure that any individual's information is not disclosed, while still allowing for the extraction of useful insights from the data.
This is a valuable opportunity in healthcare, where data is often sensitive and personal, but it is also valuable for improving public policy, healthcare outcomes, and data-driven decision-making.
Differential privacy has undergone rapid developments over the years. Currently, the main research tracks in the field include the development of algorithms for different types of differentially-private data analysis tasks and improving the utility of differentially-private algorithms.
Additionally, there is ongoing research into the theoretical foundations of differential privacy and the development of tools and frameworks for implementing and deploying differentially private algorithms in practical settings.
Many resources are available online for those interested in learning more about differential privacy. The community-managed DifferentialPrivacy.org is a great place to start.
There are also many research papers and articles on the topic and tutorials and courses that provide a more in-depth understanding of the concepts and techniques involved.
For an excellent hands-on guide, see also Programming Differential Privacy, and for more examples of real-life applications, see "A list of real-world uses of differential privacy".
Disclaimer: This blog post was written with input from ChatGPT, the large language model trained by OpenAI. For an overview of the writing process, see this Twitter thread. The privacy researchers of Statice, Matteo Giomi and Nicola Vitacolonna reviewed the content. Thank you to Ricardo Carvalho, Computer Science PhD student at Simon Fraser University (SFU) and expert on Machine Learning, Generative Adversarial Networks (GANs), and Differential Privacy, for his input and feedback on the post.