8 types of data bias that can wreck your machine learning models

Biased data: chances are you know the term well. Maybe you are skeptical about using skewed data for business purposes. In fact, biased data is everywhere. Humans are biased: people filter large amounts of data every day to pull a few signals out of the noise, and they make quick decisions based on fragments of data.

But the problem gets bigger when we talk about biased data in machine learning (ML) and AI, because skewed data hinders innovation in artificial intelligence and machine learning projects.

In this article, you will learn about 8 common data biases that can harm your machine learning model. It covers:

What biases are in machine learning and AI systems.
8 common types of bias in data.
Fundamentals of the tradeoff between data bias and variance.
How synthetic data can address bias.
Suggestions for further reading on the topic.

What are biases in machine learning and artificial intelligence?

Bias in data is an error that occurs when certain elements of a dataset are overweighted or overrepresented. Biased datasets don't accurately represent an ML model's use case, which leads to skewed outcomes, systematic prejudice, and low accuracy.

Often, the erroneous result discriminates against a specific group or groups of people. For example, data bias can reflect prejudice based on age, race, culture, or sexual orientation. In a world where AI systems are increasingly used everywhere, the danger of bias lies in amplifying discrimination.

It takes a lot of training data for machine learning models to produce viable results. If you want to perform advanced operations (such as text, image, or video recognition), you may need millions of data points. Poor or incomplete data, as well as biased data collection and analysis methods, will result in inaccurate predictions because the quality of the outputs is determined by the quality of the inputs.

If you think data bias is a recent problem, it's not. Back in 1988, the British Medical Journal described a case from St George's Hospital Medical School.

The school developed a program that aimed to reduce the amount of work when selecting candidates for interviews. The school believed it would eliminate human error and inconsistencies.

*The admission procedure at St George's Hospital Medical School. Source:* *www.atlaslab.org*

But the Commission for Racial Equality found the school guilty of discrimination against women and people with non-European-sounding names. In fact, the program laid bare the bias that already existed in the system.

Since then, technology has advanced, but the types of bias have remained the same.

Here are some examples of these types of biases.

Data bias types in machine learning, including examples

If you want to build a fair AI project and use data ethically, you have to know the types of data bias in machine learning to spot them before they wreck your ML model.

However, data bias in machine learning doesn't only result from skewed data. Bias can arise for many other reasons. Let's take a deeper look.

Systemic biases

Systemic bias occurs when certain social groups are favored and others are devalued.

The National Institute of Standards and Technology (NIST) provides a good example of such bias in its recent special publication on managing bias in AI: in most cases, the infrastructure for daily living isn't adjusted to the needs of disabled people.

The reason behind it is institutional and stems from the underrepresentation of disabled people in studies. So, the biggest problem with systemic bias is that it is often hidden in plain sight and thus overlooked.

Another example of systemic bias comes from St George's Hospital Medical School. This type of bias results from flawed procedures and practices in an organization. Although engineers don't want to discriminate against any group of people, the bias is already rooted in the system.

Automation bias

Have you ever used a digital tool that, based on artificial intelligence, suggested an action? Automation bias occurs when you take this AI-based recommendation and use it before verifying if the information was right.

For instance, a data scientist might rely on an analytics platform's suggestions to remove or modify specific data even though those changes worsen the quality of the data.

Selection bias

Randomization is the process that balances out the effects of uncontrollable factors, that is, variables in a data set that are not specifically measured and can compromise results. In data science, selection bias occurs when your data isn't properly randomized. As a result, the sample isn't representative: it doesn't truly reflect the analyzed population.

*xkcd comic 1999, "Selection Effect". Source:* *https://www.explainxkcd.com/wiki/index.php/1999:_Selection_Effect*

For instance, when building models for healthcare research, a data scientist uses data only from white patients. This sample doesn't reflect the entire population because it doesn't take into account BIPOC (Black, Indigenous, and people of color) patients.

This case also overlaps with racial bias, which occurs when data skews in favor of a particular demographic group.
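One simple way to spot this kind of skew is to compare each group's share of your sample with its share of the population you want to model. The sketch below is only an illustration: the function name `representation_gaps`, the group labels, and all the numbers are hypothetical, and in practice the reference shares would come from a trusted source such as census data.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Return (sample share - population share) for every reference group."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical reference shares and a hypothetical, skewed sample.
population_shares = {"group_a": 0.6, "group_b": 0.3, "group_c": 0.1}
sample = ["group_a"] * 90 + ["group_b"] * 10

gaps = representation_gaps(sample, population_shares)
for group, gap in gaps.items():
    # Positive gap: overrepresented; negative gap: underrepresented.
    print(f"{group}: {gap:+.2f}")
```

A group that is missing from the sample entirely, like `group_c` here, shows up with a negative gap equal to its full population share.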

Overfitting and underfitting the data

In machine learning, overfitting occurs when a model fits its training data so closely that it begins to learn from the noise and inaccurate entries in the data set. Such a model has trouble predicting new data, because the noise it memorized does not carry over to data it has not seen.

Underfitting occurs when a machine learning model fails to capture the underlying trend of the data, usually because it is too simple. In this case, the model or algorithm does not fit the data well enough.
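To make the contrast concrete, here is a minimal sketch that compares three toy models on made-up data. Everything in it is an assumption chosen for illustration: the linear trend, the noise level, and the choice of a constant predictor as the underfitting model and a 1-nearest-neighbor lookup as the overfitting one.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def make_data(n):
    """Noisy samples from a hypothetical linear trend y = 2x + 1."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + random.gauss(0, 2.0) for x in xs]
    return xs, ys


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(100)
test_x, test_y = make_data(1000)

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def constant_model(x):
    # Too simple: ignores x entirely, so it underfits.
    return mean_y


# Ordinary least squares fit of a straight line.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)
intercept = mean_y - slope * mean_x


def linear_model(x):
    # Matches the form of the underlying trend.
    return slope * x + intercept


def nearest_neighbor_model(x):
    # Memorizes every training point, noise included, so it overfits.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [
    ("constant", constant_model),
    ("linear", linear_model),
    ("1-nearest-neighbor", nearest_neighbor_model),
]:
    print(
        f"{name}: train MSE {mse(model, train_x, train_y):.2f}, "
        f"test MSE {mse(model, test_x, test_y):.2f}"
    )
```

The memorizing model scores perfectly on the data it was trained on but worse than the straight line on new data, while the constant model is poor on both. That gap between training and test error is the usual warning sign of overfitting.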

Reporting biases

A reporting bias is the inclusion of only a subset of results in an analysis, which typically covers only a small fraction of the evidence. Reporting bias can take many forms. Examples include analyzing data based on studies found in the citations of other studies (citation bias), excluding reports not written in the scientist's native language (language bias), and choosing studies with positive findings rather than negative findings (publication bias).

*Comic strip. Source:* *https://dilbert.com/*

As an example, a sentiment analysis model can be trained to predict whether a book review on a popular website is positive or negative. The vast majority of reviews in the training data set reflect extreme opinions (reviewers who either adored or despised a book), because people are less likely to review a book they do not feel strongly about. As a result, the model is less likely to accurately predict the sentiment of reviews that use more subtle language to describe a book.
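A quick look at the label distribution can reveal this pattern before training. The snippet below is a rough sketch with invented star ratings; the 1 to 5 scale and the function name `extreme_share` are assumptions made for the example.

```python
from collections import Counter


def extreme_share(ratings, low=1, high=5):
    """Share of ratings that sit at either end of the scale."""
    counts = Counter(ratings)
    return (counts[low] + counts[high]) / len(ratings)


# Hypothetical star ratings attached to 100 book reviews.
ratings = [5] * 45 + [1] * 35 + [4] * 8 + [3] * 7 + [2] * 5

share = extreme_share(ratings)
print(f"{share:.0%} of reviews sit at the extremes")
```

If most labels pile up at the ends of the scale, the model sees few examples of moderate opinions and will likely struggle with them.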

Overgeneralization bias

Overgeneralization occurs when a person applies a conclusion drawn from one event to all future events. In the field of data science, whenever you assume that what you see in your dataset is also what would be seen in another dataset, you are overgeneralizing.
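Before assuming that one dataset stands in for another, it can help to compare them directly. The sketch below computes a standardized mean difference for a single numeric feature. The age values are invented, and the rule of thumb that a large absolute value signals a meaningful shift is an assumption for illustration, not a fixed threshold.

```python
from statistics import mean, stdev


def standardized_mean_difference(reference, other):
    """Difference in means, scaled by the pooled standard deviation."""
    pooled = ((stdev(reference) ** 2 + stdev(other) ** 2) / 2) ** 0.5
    return (mean(other) - mean(reference)) / pooled


# Hypothetical ages in the training data and in the data seen after deployment.
training_ages = [22, 25, 27, 30, 31, 33, 35, 38]
deployment_ages = [45, 48, 52, 55, 58, 60, 63, 67]

smd = standardized_mean_difference(training_ages, deployment_ages)
print(f"standardized mean difference: {smd:.2f}")
```

A value near zero suggests the two samples look alike on this feature; a large value suggests that conclusions drawn from the first dataset may not transfer to the second.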

Group attribution biases

Group attribution biases refer to the human tendency to assume that an individual's characteristics are always determined by the beliefs of the group, or that a group's decisions are influenced by the feelings of all its members. The group attribution bias manifests itself when you give preference to your own group (in-group bias) or when you stereotype members of groups you don't belong to (out-group bias).

For example, engineers training a résumé-screening model for software developers might be predisposed to believe that applicants who attended the same school as they did are better qualified for the job.

Implicit biases

Implicit biases occur when we make assumptions based on our personal experiences. Implicit bias manifests itself as attitudes and stereotypes we hold about others, even when we are unaware of it.

You might look for information that supports your beliefs and hypotheses and disregard information that doesn't. For example, you may keep testing machine learning models until you obtain results that support your hypothesis (confirmation bias).

When there is not enough data or the data is not representative, you end up with sample bias. For instance, if your training data only features male doctors, the system may conclude that all doctors are male.
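A lightweight check for this situation is to look at how a sensitive attribute is distributed within each label. The sketch below uses invented records; the field names `occupation` and `gender` and the helper `attribute_shares_by_label` are hypothetical.

```python
from collections import defaultdict


def attribute_shares_by_label(records, label_key, attribute_key):
    """For each label, return the share of every attribute value."""
    counts = defaultdict(lambda: defaultdict(int))
    for record in records:
        counts[record[label_key]][record[attribute_key]] += 1
    shares = {}
    for label, attribute_counts in counts.items():
        total = sum(attribute_counts.values())
        shares[label] = {
            value: count / total for value, count in attribute_counts.items()
        }
    return shares


# Hypothetical training records.
records = (
    [{"occupation": "doctor", "gender": "male"}] * 40
    + [{"occupation": "nurse", "gender": "female"}] * 35
    + [{"occupation": "nurse", "gender": "male"}] * 5
)

shares = attribute_shares_by_label(records, "occupation", "gender")

# Labels for which only a single attribute value ever appears.
one_sided = [label for label, s in shares.items() if max(s.values()) == 1.0]
print(one_sided)
```

Any label that appears in `one_sided` gives the model no evidence that other combinations exist, which is how a system comes to associate doctors only with men.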

Existing stereotypes and poor measurement can creep into data at the collection stage. Diverse and representative datasets are crucial for machine learning. When you don't have enough original training data, which is often the case, synthetic data can be used to supplement it.

Fundamentals of the tradeoff between data bias and variance

It’s impossible to create a perfect machine learning model that doesn’t contain any bias or any variance.

Bias and variance are interconnected and work in the following way:

Bias is the difference between the average prediction of your model and the correct value you try to predict. High-bias models are oversimplified, and so their results are skewed.
Variance is the variability of the model's prediction across different training sets. Unlike high-bias models, high-variance models pay too much attention to the training data and don't generalize to data they haven't seen before. This leads to good results on training data but high error rates on test data.

In practice, you can rarely obtain a machine learning model that has both low bias and low variance. Such a model would be correct on average and would change very little from one training set to another. What you can do is make a tradeoff between bias and variance.

For instance, you want to reduce bias, so you build a more complex model. The more complex the model, the higher its variance tends to be. You have to strike a delicate balance between the two.
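The tradeoff can be observed empirically by training the same kind of model on many different training sets and looking at its predictions for one fixed input. The sketch below does this for a very simple model and a very flexible one. The quadratic ground truth, the noise level, the evaluation point `x0=2.5`, and the two toy models are all assumptions made for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable


def true_fn(x):
    return x * x  # hypothetical ground truth


def sample_training_set(n=20, noise=1.0):
    xs = [random.uniform(-3, 3) for _ in range(n)]
    ys = [true_fn(x) + random.gauss(0, noise) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    # Simple model: always predicts the mean of the training targets.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_nearest_neighbor(xs, ys):
    # Flexible model: returns the target of the closest training point.
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]


def bias_and_variance(fit, x0, runs=500):
    """Estimate squared bias and variance of the prediction at x0."""
    preds = []
    for _ in range(runs):
        xs, ys = sample_training_set()
        preds.append(fit(xs, ys)(x0))
    mean_pred = sum(preds) / runs
    bias_sq = (mean_pred - true_fn(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / runs
    return bias_sq, variance


simple_bias, simple_var = bias_and_variance(fit_constant, x0=2.5)
flexible_bias, flexible_var = bias_and_variance(fit_nearest_neighbor, x0=2.5)

print(f"constant model:   bias^2 {simple_bias:.2f}, variance {simple_var:.2f}")
print(f"nearest neighbor: bias^2 {flexible_bias:.2f}, variance {flexible_var:.2f}")
```

The constant model is wrong on average but stable from one training set to the next, while the nearest-neighbor model is close on average but its individual predictions swing with the noise in each training set.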

Can synthetic data address bias?

The short answer is yes, synthetic data can help address data bias. The quality of raw synthetic data depends on the quality of the real data it is generated from. However, synthetic data gives you control over the output, which lets you produce a more balanced, clean, and useful dataset. Real datasets do not provide this level of control.

Moreover, a good synthetic data generator should be able to detect errors in the real data and suggest how to fix them.

Synthetic data can help mitigate bias by complementing your real data with examples you haven't seen. When data bias occurs because there is not enough data, because collecting it is too costly, or because there is no consent to use it in ML projects, synthetic data can help fill the gap.

Also, if your dataset is imbalanced - for example, the sample is mostly composed of people of a specific social group - synthetic data can help balance this data.
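As a rough illustration of the balancing idea, the sketch below tops up smaller groups with jittered copies of their existing rows. This is a deliberately naive stand-in, not how a real synthetic data generator works: dedicated tools learn the joint distribution of the data rather than perturbing single records. The function name, the field names, and the jitter scale are all hypothetical.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable


def oversample_with_jitter(rows, group_key, numeric_keys, scale=0.05):
    """Add perturbed copies of rows until every group matches the largest one."""
    by_group = {}
    for row in rows:
        by_group.setdefault(row[group_key], []).append(row)
    target = max(len(group_rows) for group_rows in by_group.values())

    balanced = list(rows)
    for group_rows in by_group.values():
        for _ in range(target - len(group_rows)):
            base = random.choice(group_rows)
            synthetic = dict(base)  # copy, so the original row is untouched
            for key in numeric_keys:
                synthetic[key] = base[key] * (1 + random.gauss(0, scale))
            synthetic["synthetic"] = True  # mark generated rows explicitly
            balanced.append(synthetic)
    return balanced


# Hypothetical imbalanced dataset: 8 rows in group "a", 2 rows in group "b".
rows = [{"group": "a", "age": 30 + i} for i in range(8)] + [
    {"group": "b", "age": 50 + i} for i in range(2)
]

balanced = oversample_with_jitter(rows, "group", ["age"])
print(Counter(row["group"] for row in balanced))
```

Marking generated rows, as the `synthetic` flag does here, keeps it easy to audit later which records were observed and which were created.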

Having enough quality data is key for ML models. Many times, at the beginning of a project, your team can be unsure of what data they will need to train the model. Synthetic data can provide insight into what kind of data went into the model. And such transparency reduces the possibility of bias creeping into machine learning models.

But in any case, make sure to reduce bias in your original dataset first. By properly labeling, cleaning, and maintaining your data, you decrease the risk of data bias. Include bias testing in your development process. This practice will help you prepare original datasets that better serve as a basis for synthetic data generation.
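Bias testing can start small. The sketch below checks one common fairness measure, the gap in positive prediction rates between groups (often called demographic parity difference), against a threshold. The predictions, the group labels, and the `MAX_GAP` value are invented; which metric and threshold are appropriate depends on your use case.

```python
def positive_rate_by_group(predictions, groups):
    """Share of positive (1) predictions within each group."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive rates between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs for eight people in two groups.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

MAX_GAP = 0.2  # hypothetical tolerance chosen for this example

gap = demographic_parity_gap(predictions, groups)
passes = gap <= MAX_GAP
print(f"gap: {gap:.2f}, passes: {passes}")
```

A check like this can run alongside your other automated tests, so that a model whose outcomes drift apart between groups is flagged before release.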

After all, AI systems can only be as good as the quality of their input data.

Hundreds of biases interfere with how we process data, think critically, and perceive reality. If you'd like to explore this further, we'd suggest taking a look at these resources:
