If artificial intelligence has any hope of expanding our understanding of the world, rather than simply reflecting old biases, it needs radical transparency. What does that mean? In part, it’s about giving consumers access to an algorithm’s data set, so they can make informed decisions about the technology that increasingly shapes their lives.
Data and algorithms are symbiotic: in product design, the data fed to an algorithm determines the characteristics of the product. Have an algorithm designed for chatbots and want to use it in an e-commerce setting? The data you feed it determines what kind of chatbot it becomes. Feed it pizza-ordering data, and the chatbot will be trained to take pizza orders. Can it take orders for other food? Sure, but it will probably get the contents of those orders wrong, because it has only learned to understand the data set of “pizza.” What’s in a data set matters as much as how the algorithm was designed. But how can we determine or understand what’s in every data set? What technology needs is data ethnography, and data ethnographers to practice it.
Ethnography is the study of people and cultures, and ethnographic research is essential to design research: How does a group relate to or understand a product? What are the group’s needs? What are the tech trends within that group? I argue we need data ethnography, a term I define as the study of the data that feeds technology, examined from a cultural perspective as well as a data science one. Data ethnography is a narrower, but no less crucial, field: Data is a reflection of society, and it is not neutral; it is as complex as the people who make it.
The job of a data ethnographer, then, would be to ask questions like: What is the culture of a data set? How old is it? Who made it? Who put it together? When was it updated, if ever? The ethnographer could then test and label data, much the way food labels break down nutritional contents. Consumers could then see data sets labeled like “social media data, Twitter, 2021, U.S., 75% male users ages 35-40, 50% white.”
The benefit? A better way to determine what an algorithm is telling us, and why. It’s time for digital products to really show their ingredients so we can understand the results they’re putting out into the world.
We cannot have facial recognition datasets that are 75% male, 80% pale! #AINow2017
— Martin Tisné (@martintisne) July 10, 2017
Consider what happens when you Google “professional hair”: you see mainly white hairstyles. Google “unprofessional hair,” and you see mainly Black hairstyles. Such bias could be avoided, or at the very least made transparent, if the data set used in training had been labeled:
year assembled: 2001-2003
original size: 45,000 JPEGs
year updated: 2011
current size: 50,000 JPEGs
people: 60% “white women” + “blonde hair”; 30% “white women” + “brown hair”; 10% “Black women” + “black hair”
origin: Uni of X, Machine Learning Lab
creators: [creators listed here]
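To make the idea concrete, a label like the one above could be represented as a simple structured record that a data ethnographer fills in and that ships alongside the data set. The following is a minimal, hypothetical sketch in Python; the `DatasetLabel` class and its field names are illustrative assumptions, not an existing standard, and the values mirror the example label above.

```python
from dataclasses import dataclass


@dataclass
class DatasetLabel:
    """A hypothetical 'nutrition label' for a training data set (illustrative only)."""
    origin: str            # institution or lab that assembled the data
    creators: list         # people responsible for the data set
    year_assembled: str    # when the data was originally collected
    year_updated: str      # most recent update, or "" if never updated
    size: int              # number of records (here, images)
    demographics: dict     # share of records per demographic tag, summing to 1.0

    def summary(self) -> str:
        """Render the label as a short, human-readable string."""
        demo = "; ".join(
            f"{int(share * 100)}% {group}"
            for group, share in self.demographics.items()
        )
        return (
            f"{self.origin} ({self.year_assembled}, "
            f"updated {self.year_updated or 'never'}): "
            f"{self.size:,} records. {demo}"
        )


# Label for the hair-style data set described above (values from the example label)
label = DatasetLabel(
    origin="Uni of X, Machine Learning Lab",
    creators=["[creators listed here]"],
    year_assembled="2001-2003",
    year_updated="2011",
    size=50_000,
    demographics={
        "white women, blonde hair": 0.60,
        "white women, brown hair": 0.30,
        "Black women, black hair": 0.10,
    },
)
print(label.summary())
```

Anyone deciding whether to train on, or trust a product built from, this data set could read the summary and immediately see that 90% of the images depict white women, before that skew surfaces as biased search results.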
Instead, when data sets are opaque, consumers have no way of accurately assessing search results and other digital products. That, in turn, makes it easier to mistake specific data sets for universal ones.
Perhaps nothing highlights the need for data ethnography better than predictive policing. Predictive policing software is dangerous, not just because of how it dovetails with flawed U.S. police systems, and not just because it automates parts of the judicial system, but because the data it’s trained on is deeply problematic. Black neighborhoods and Black people are policed at higher rates than others, so the “policing” data is already skewed: models trained on it learn to recommend longer, stricter sentences for Black people, which reinforces the bias that already exists in society. Data ethnographers could highlight these biases and help make the case for other, more equitable policing strategies.
Data and artificial intelligence systems are a civil issue, a civic issue, and a human issue. Understanding that data is complicit in how AI works is a step toward making equitable technology systems. Imagine an open-source, transparent data ethnography group that combines the skill sets of data scientists and ethnographers, and imagine the kind of change that could create.