Guest Authors: Utpal Mangla, VP & Senior Partner; Global Leader: IBM’s Telecom Media Entertainment Industry Center of Competency at IBM, & Luca Marchi, AI Innovation, Center of Competence for Telco, Media and Entertainment, IBM, & Kush Varshney, Distinguished Research Staff Member, Manager at IBM Thomas J. Watson Research Center, & Shikhar Kwatra, Data&AI Architect, AI/ML Operationalization Leader at IBM
Artificial Intelligence (AI) is becoming a key cog in how the world works and how it lives. But the reality is that AI is not as widespread in critical enterprise workflows as it could be because it is not perceived to be safe, reliable, fair, and trustworthy. With increasing regulation, concern about brand reputation, burgeoning complexity, and a renewed focus on social justice, companies are not ready and willing to deploy a “science experiment” at scale in their operations. As Thomas J. Watson, Sr., an early chief executive of IBM said, “The toughest thing about the power of trust is that it’s very difficult to build.”
We’ve seen many newsworthy examples of AI producing unfair outcomes: Black defendants being discriminated against in criminal recidivism predictions, low-income students systematically receiving low “predicted” exam scores when the coronavirus pandemic cancelled the real exams, men and women receiving different lending decisions despite having exactly the same assets, and many more. Why is this happening, and what can we do about it?
Lessons from Commercial Aviation
It is instructive to look at the history of commercial aviation to understand what is happening with AI today. The period from the first flights by the Wright brothers and Santos-Dumont during 1903-1906 to the introduction of the commercial jetliner, the Boeing 707, in 1957 can be considered the first 50 years of aviation. This period was all about understanding how to make planes fly, with limited commercialization. In the second 50 years of aviation that followed, the fundamental nature of airplanes did not change (today’s commercial jets are basically the same as the Boeing 707), but there was a heavy emphasis on safety, efficiency, and automation. Now commercial airlines operate almost everywhere with safety records hundreds of times better than fifty years ago.
What is the lesson for AI? We are just at the beginning of the second 50 years of AI. We can trace the beginnings of AI to a 1956 conference at Dartmouth, and we can say that the first 50 years concluded when deep learning won the ImageNet competition in 2012. Just like in aviation, the first 50 years were spent on getting AI to simply work (to be competent and accurate at narrow tasks) with limited commercialization. Now our job is to make AI safer, more reliable, fairer, more trustworthy, more efficient, and more automated, and to bring commercialization everywhere.
Accuracy Isn’t All You Need
To make AI trustworthy, we need it to be more than accurate. We need it to be fair so that it doesn’t discriminate against certain groups and individuals based on their race, gender, or other protected social attributes. We need it to be reliable and robust so that it can be used in different settings and contexts without spectacularly falling apart. We need it to be explainable or interpretable so that people can understand how AI makes its predictions. We need it to realize when it is unsure.
The LF AI & Data Foundation’s three open source toolkits (AI Fairness 360, AI Explainability 360, and Adversarial Robustness 360 Toolbox) give practicing data scientists the means to address these needs and make AI more trustworthy. Let’s dig into fairness in more detail. Where do the problems come from and how can we mitigate them?
Where Do Fairness Issues Come From?
AI, specifically machine learning, tends to reflect back and sometimes amplify unwanted biases that are already present in society. There are four main reasons why there can be unfairness in AI:
- Problem misspecification – when the problem owner and data scientist pose the problem they are going to be creating a solution for, they may make choices that introduce unwanted behaviors. For example, if they want to predict whether someone will commit a crime in the future, but they design an AI system to predict whether someone will be arrested in the future, they can introduce unfairness. First, being arrested does not imply that a person is guilty of a crime. Second, there are more arrests made in neighborhoods where police patrol more often, and that is not done equally.
- Features containing social biases – some attributes in a dataset already contain traces of structural biases that provide systematic disadvantage to certain groups. For example, the SAT score may be used as a feature for predicting an applicant’s success in college, but it is known to already contain biases so that some minority groups do worse because of cultural knowledge embedded in the questions.
- Sampling biases – sometimes datasets overrepresent privileged groups and underrepresent unprivileged groups. For example, face attribute classification datasets are known to be skewed towards white males.
- Data preparation – one key step in AI development pipelines is feature engineering, where raw data is transformed before being fed to the AI. There are several subjective choices made in this process, some of which can lead to unfairness.
Measuring and Mitigating Unfairness
Just as there are many reasons why AI can yield unfairness, there are many ways to measure it and mitigate it. Choosing how to measure unfairness is not as easy as it sounds because different fairness metrics encode different worldviews and politics. As one option, you can measure the difference in selection rates of an AI, say the difference between the fraction of Black applicants who got accepted to a college and the fraction of white applicants who got accepted. As a different option, you can measure the difference in accuracy rates between the same two groups. The two options sound about the same at face value but are actually quite different. In the first option, you are implicitly assuming that features have social biases (like the SAT score), but in the second option, you assume that all the unfairness is due to other reasons like sampling biases.
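To make the contrast between the two measurements concrete, here is a minimal sketch in plain Python. The function names, the group labels "unprivileged" and "privileged", and the toy admissions data are all made up for illustration; they are not taken from any toolkit's API. The selection-rate difference computed here is commonly called statistical parity difference.

```python
def selection_rate(y_pred, groups, group):
    """Fraction of members of `group` who received the favorable prediction (1)."""
    picked = [p for p, g in zip(y_pred, groups) if g == group]
    return sum(picked) / len(picked)


def accuracy(y_true, y_pred, groups, group):
    """Fraction of members of `group` whose prediction matches the true label."""
    pairs = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
    return sum(t == p for t, p in pairs) / len(pairs)


# Hypothetical toy data: 1 = admitted (or would succeed), 0 = not.
groups = ["privileged"] * 4 + ["unprivileged"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# Option 1: difference in selection rates (unprivileged minus privileged).
selection_rate_difference = (
    selection_rate(y_pred, groups, "unprivileged")
    - selection_rate(y_pred, groups, "privileged")
)

# Option 2: difference in accuracy rates between the same two groups.
accuracy_difference = (
    accuracy(y_true, y_pred, groups, "unprivileged")
    - accuracy(y_true, y_pred, groups, "privileged")
)

print(selection_rate_difference)  # -0.5
print(accuracy_difference)        # 0.0
```

On this toy data the model is equally accurate for both groups, yet it selects the privileged group three times as often. The accuracy-based metric reports no problem while the selection-rate metric reports a large one, which is why the choice of metric matters.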
If you measure that an AI system is behaving unfairly, what can you do about it? You can apply one of many possible bias mitigation algorithms. The basic idea of bias mitigation is that you want a sort of statistical independence between protected attributes like ethnicity or gender and the predicted outcome like success in college. Statistical independence is the notion that two dimensions are unlinked and have no relationship with each other. There are many statistical methods that encourage independence, but that is a longer discussion for another day. Feel free to check out the AI Fairness 360 documentation for more details about bias mitigation if you can’t wait!
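As one illustration of how independence can be encouraged, the sketch below hand-rolls a reweighing-style pre-processing step: each training example gets a weight equal to the ratio between the frequency its (group, label) combination would have under independence and its observed frequency. This is only one of many possible mitigation approaches, it is not the AI Fairness 360 API, and the function names and toy data are assumptions made for the example. Note that it balances the training labels; a model trained with these weights is nudged toward independent predictions but is not guaranteed to achieve them.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight for each example: P(group) * P(label) / P(group, label).

    Under these weights, group and label are statistically independent
    in the weighted training data.
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted fraction of favorable labels (1) within `group`."""
    favorable = sum(
        w for g, y, w in zip(groups, labels, weights) if g == group and y == 1
    )
    total = sum(w for g, w in zip(groups, weights) if g == group)
    return favorable / total


# Hypothetical toy training data: group "a" has mostly favorable labels,
# group "b" mostly unfavorable ones.
groups = ["a"] * 4 + ["b"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)

rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(round(rate_a, 6), round(rate_b, 6))  # 0.5 0.5
```

Before reweighing, the favorable-label rates are 0.75 for group "a" and 0.25 for group "b". After reweighing, both weighted rates equal the overall rate of 0.5. The weights can then be passed to any learning algorithm that accepts per-example sample weights.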
LF AI & Data Resources