Experts say that bias is one of the biggest problems facing the development of artificial intelligence. When a data set reflects systemic discrimination and bias in the real world, that bias gets encoded into an automated system, which can have dire consequences, determining who gets a job, who gets a loan, and even who goes to prison.
Yet it can be hard to tell when a data set is biased, especially when these systems are built by homogeneous teams consisting mostly of white men. Even the existing tools that are meant to test algorithms can be biased. Take what’s known as a “benchmark data set”: a collection of data used to assess an AI’s accuracy. Two common benchmark data sets used to test facial recognition systems, known as IJB-A and Adience, are composed of 79.6% and 86.2% light-skinned faces, respectively, which means these benchmarks don’t test an algorithm’s accuracy on all kinds of faces with the same rigor.
A new project called Gender Shades creates a new benchmark data set that takes both biological sex and race into account to measure three commercial face classification AI algorithms from IBM, Microsoft, and the Chinese startup Face++ (whose facial recognition technology is used by Alibaba). These types of algorithms are widely used to read faces on security cameras, during immigration, in criminal justice, and even in products like glasses for visually impaired people.
The resulting study shows that all of these real-world algorithms have significantly lower accuracy when evaluating dark female faces than any other type of face. It’s troubling proof that the AI already at work in our daily lives is deeply biased, and that we need to demand greater diversity in the people who build these algorithms and more transparency about how they work.
Gender Shades is the work of Joy Buolamwini, a researcher at the MIT Media Lab and the founder of the Algorithmic Justice League. She was able to test these major commercial systems by creating a new benchmark face data set rather than a whole new algorithm. It’s a clever way of targeting bias, which can often remain hidden. “Benchmark data sets are used to assess progress on specific tasks like machine translation and pedestrian detection,” Buolamwini writes on the project’s website, explaining its significance. “Unrepresentative benchmark data sets and aggregate accuracy metrics can provide a false sense of universal progress on these tasks.”
Buolamwini says the benchmark data set, composed of 1,270 images of people’s faces labeled by gender as well as skin type, is the first of its kind designed to test gender classifiers while also taking skin tone into account. The people in the data set are from the national parliaments of the African countries of Rwanda, Senegal, and South Africa, and of the European countries of Iceland, Finland, and Sweden. The researchers chose these countries because they have the greatest gender equity in their parliaments, and because members of parliament have widely accessible images available for use.
While the algorithms from IBM, Microsoft, and Face++ boast overall accuracies of between 87% and 93%, these numbers don’t reveal the discrepancies between light-skinned men, light-skinned women, dark-skinned men, and dark-skinned women. The study found that the algorithms are 8.1% to 20.6% less accurate when detecting female faces than male faces, 11.8% to 19.8% less accurate when detecting dark-skinned faces versus light-skinned faces, and, most shockingly, 20.8% to 34.4% less accurate when detecting dark-skinned female faces than light-skinned male faces. IBM had the largest gap: a 34.4% difference in accuracy when detecting dark-skinned females versus light-skinned males.
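The core idea behind this kind of audit is simple: instead of reporting one aggregate accuracy number, compute accuracy separately for each subgroup and compare the gaps. Here is a minimal sketch of that disaggregated analysis in Python. The function and the toy data are invented for illustration, not the Gender Shades methodology or its results:

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Compute classification accuracy per subgroup.

    records: iterable of (subgroup, predicted_label, true_label) tuples.
    Returns a dict mapping each subgroup (plus an "overall" key) to its
    accuracy, showing how a single aggregate number can hide gaps.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, predicted, actual in records:
        total[subgroup] += 1
        total["overall"] += 1
        if predicted == actual:
            correct[subgroup] += 1
            correct["overall"] += 1
    return {group: correct[group] / total[group] for group in total}

# Toy data (invented): the classifier is right 95/100 times on one
# subgroup and only 65/100 times on another. The aggregate accuracy
# of 80% masks a 30-point gap between the two subgroups.
records = (
    [("lighter male", "male", "male")] * 95
    + [("lighter male", "female", "male")] * 5
    + [("darker female", "female", "female")] * 65
    + [("darker female", "male", "female")] * 35
)
accuracies = subgroup_accuracy(records)
# accuracies["overall"] is 0.80; the per-subgroup values are 0.95 and 0.65.
```

This is the same logic behind the study’s headline numbers: the benchmark’s subgroup labels make it possible to compute these per-group accuracies, which a benchmark dominated by light-skinned faces cannot meaningfully do.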
Why does something like this happen? Typically because the data set the AI was trained on had far more light-skinned male faces, and light-skinned faces in general. It’s bias at work.
IBM responded to the research by running a similar study to replicate the results on a new version of its software. The company reports that it found far smaller differences in accuracy with the new version, which has yet to be released, and says it has several projects underway to address bias in its algorithms. Microsoft says it is working to improve the accuracy of its systems. Face++ did not respond to the research.
The implications of bias in facial recognition systems are particularly potent for people of color. As police departments begin to use more facial analysis algorithms, the discrepancy in accuracy for darker-skinned people is a serious threat to civil liberties. When these systems can’t recognize darker faces as accurately as lighter faces, there’s a higher likelihood that innocent people will be targeted by law enforcement. In other words, this kind of automation encodes the same kind of bias that results in police officers arresting innocent people.
Buolamwini has some ideas about what true algorithmic justice looks like. “Facial analysis systems that have not been publicly audited for subgroup accuracy should not be used by law enforcement,” she writes. “Citizens should be given an opportunity to decide if this kind of technology should be used in their municipalities, and, if they are adopted, ongoing reports must be provided about their use and if the use has in fact contributed to specific goals for community safety.”
Only then can we start to move toward a more equitable algorithmic future.