4 reasons why your data is biased

May 6, 2022


Nathan Cavaglione

Our ultimate mission at Fairgen is to make AI fair. Since AI is built on data, we believe this will happen by making data fair. To this end, we have created a platform that can debias datasets. But why does biased data even exist, and why should we be concerned about it? After all, data is supposed to reflect reality. Well, not exactly.

1. Data mirrors human behaviour.

Most datasets today come from historical data produced by human decisions. Humans can be discriminatory, and those discriminatory patterns are engraved in the data.

Example: women receiving lower salaries than men with equal skills. If we train AI models on this data, the models will discriminate too. This is a clear problem: you do not want to teach such a sexist pattern to an intelligent machine that makes decisions at scale.

So should one look for other sources of data? Should one stop using AI? We believe the better option is to apply fairness-constrained data generation to those datasets. In a loan dataset, for example, this can ensure that equal percentages of men and women are accepted for a loan.
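To make "an equal percentage of men and women getting accepted" measurable, here is a minimal sketch that computes the approval rate per group and the gap between the highest and lowest rate. The function names and the toy records are assumptions made for illustration; this only measures the gap and is not Fairgen's data generation method.

```python
from collections import defaultdict

def approval_rates(records):
    """Return the share of approved applications for each group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, is_approved in records:
        totals[group] += 1
        approved[group] += int(is_approved)
    return {group: approved[group] / totals[group] for group in totals}

def parity_gap(records):
    """Difference between the highest and lowest group approval rate."""
    rates = approval_rates(records)
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: 10 loan applications per group.
records = (
    [("men", True)] * 6 + [("men", False)] * 4
    + [("women", True)] * 3 + [("women", False)] * 7
)

print(approval_rates(records))
print(parity_gap(records))
```

A gap of zero means both groups are approved at the same rate. A debiased dataset, however it is produced, should bring this number close to zero.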

2. Data reflects the past, not the future.

Any dataset used to train an ML model has been built through years of data collection. Over those years, the data distribution may have drifted to a new reality, meaning the AI model in production makes decisions using patterns of the past on input data from the present.

Example: an insurance AI model assigns a price X to a health policy. Then COVID-19 hits, people get sick more often, and the price should now be X+100. But the model was trained on pre-pandemic data, so it does not adapt to current conditions and still prices the policy much lower than it should. This could bankrupt the company.

One solution is to use data augmentation: starting from the limited data available for the new period, generate enough data to train your model on that period alone.
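Before retraining on a new period, it helps to check whether the data has actually drifted. The sketch below is one simple way to do that, assuming a single numeric feature: it expresses the shift of the recent mean in units of the historical standard deviation. The function names, the threshold of 2.0 and the cost figures are all hypothetical.

```python
from statistics import mean, stdev

def mean_shift(reference, recent):
    """Shift of the recent mean, in units of the reference standard deviation."""
    return (mean(recent) - mean(reference)) / stdev(reference)

def has_drifted(reference, recent, threshold=2.0):
    """Flag drift when the standardized mean shift exceeds the threshold."""
    return abs(mean_shift(reference, recent)) > threshold

# Hypothetical yearly claim costs per policy holder.
pre_pandemic = [95, 100, 105, 98, 102, 100]
post_pandemic = [190, 205, 200, 198, 207, 200]

print(has_drifted(pre_pandemic, post_pandemic))
```

A check on the mean alone misses changes in the shape of the distribution, so in practice it would be one signal among several.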

3. Data can create a self-fulfilling prophecy.

As the world modernizes itself with AI, the biggest danger by far is getting stuck in a bias loop.

Example: imagine a world in the near future where AI models decide who should get a loan using data from 1970 to 2020, which shows that 30% of women applicants get a loan against 60% of men. What will happen? From 2020 to 2040, those AI models will reproduce the same sexist pattern. Then the AI models trained in 2040 on data from the previous 20 years will be trained on identically sexist data. We are in a bias loop.

This can be changed at the source with fairness-constrained data generation. On a positive note, if we get stuck in a loop that treats different subgroups equally, then this is good news.
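The loop described above can be sketched with a deliberately simplistic toy model, one that just memorises each group's historical approval rate and reapplies it. Everything here is an assumption made for illustration, including the function names, the 1,000 applicants per group and the 45% rate in the balanced dataset. Real models are far more complex, but the feedback mechanism is the same.

```python
def train(data):
    """'Train' a toy model: memorise each group's historical approval rate."""
    return {group: approved / total for group, (approved, total) in data.items()}

def decide(model, applicants_per_group):
    """Apply the model and return new (approved, total) counts per group."""
    return {
        group: (round(rate * applicants_per_group), applicants_per_group)
        for group, rate in model.items()
    }

def run_loop(data, generations, applicants_per_group=1000):
    """Retrain each generation on the decisions made by the previous one."""
    history = []
    for _ in range(generations):
        model = train(data)
        history.append(model)
        data = decide(model, applicants_per_group)
    return history

# Historical data from the example: 30% of women approved against 60% of men.
biased = {"women": (300, 1000), "men": (600, 1000)}
# Hypothetical dataset corrected at the source so both groups share one rate.
balanced = {"women": (450, 1000), "men": (450, 1000)}

print(run_loop(biased, generations=3))
print(run_loop(balanced, generations=3))
```

In both runs the rates never change from one generation to the next. The loop preserves whatever pattern it is given, which is why correcting the data at the source matters.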

4. Data is unbalanced.

It often happens that a collected dataset fails to reflect reality because it contains too few data points for a particular subgroup.

Example: a bank's branding has mainly attracted male customers, and the bank now has too few female data points to build a robust loan model for women.

A solution to this is data rebalancing, which increases the number of data points for women to match the number for men.
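As a rough illustration of rebalancing, the sketch below uses random oversampling: it duplicates randomly chosen rows from the smaller groups until every group is as large as the biggest one. This is the simplest possible technique and only an assumption about how rebalancing could be done. Generating new synthetic data points, rather than copying existing ones, is a different approach. The function name and the customer records are hypothetical.

```python
import random
from collections import Counter

def oversample(records, group_of, seed=0):
    """Duplicate random rows of smaller groups until all groups match the largest."""
    rng = random.Random(seed)
    by_group = {}
    for record in records:
        by_group.setdefault(group_of(record), []).append(record)
    target = max(len(rows) for rows in by_group.values())
    balanced = []
    for rows in by_group.values():
        balanced.extend(rows)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced

# Hypothetical customer records: 8 men and 2 women.
customers = [{"id": i, "gender": "male"} for i in range(8)]
customers += [{"id": 100 + i, "gender": "female"} for i in range(2)]

balanced_customers = oversample(customers, group_of=lambda c: c["gender"])
print(Counter(c["gender"] for c in balanced_customers))
```

Because oversampling only repeats existing rows, it adds no new information about the underrepresented group, which is a known limitation of this approach.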