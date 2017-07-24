If artificial intelligence has any hope of expanding our understanding of the world, rather than simply reflecting old biases, it needs radical transparency. What does that mean? In part, it’s about giving consumers access to an algorithm’s data set, so they can make informed decisions about the technology that increasingly shapes their lives.

Data and the algorithm it feeds are deeply intertwined. In product design, the data fed to an algorithm determines the characteristics of the product. Say you have an algorithm designed for chatbots and you want to use it in an e-commerce setting. The data you feed the chatbot algorithm will determine what kind of chatbot it is. If you feed it pizza-ordering data, the chatbot will be trained for ordering pizza. Can other food be ordered? Sure, but the chatbot will probably get the content of the orders wrong, because the bot has only been trained to understand the data set of “pizza.” What’s in a data set is as important as how the algorithm was designed. But how can we determine or understand what’s in every data set? What technology needs is data ethnography, and data ethnographers to practice it.

Ethnography is the study of people and cultures, and ethnographic research is essential to design research. How does a group relate to or understand a product? What are that group’s needs? What are the tech trends in that group? I argue that we need data ethnography, a term I define as the study of the data that feeds technology, looking at it from a cultural perspective as well as a data science perspective. Data ethnography is a narrower, but no less crucial, field: Data is a reflection of society, and it is not neutral; it is as complex as the people who make it.

The job of a data ethnographer, then, would be to ask questions like: What is the culture of a data set? How old is it? Who made it? Who put it together? When was it updated–has it ever been updated? The ethnographer could then test data and label it, much in the same way that food labels break down nutritional contents. Consumers could then see data sets labeled like “social media data, Twitter, 2021, U.S., 75% male users ages 35-40, 50% white.”
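To make the food-label comparison concrete, here is a minimal sketch in Python of how such a label might be computed from a data set's records. Everything in it is an assumption made for illustration: the `DataSetLabel` class, the `summarize` helper, the field names, and the eight toy records are hypothetical, not an existing standard or tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DataSetLabel:
    """Hypothetical 'nutrition label' for a data set (illustrative only)."""
    kind: str
    source: str
    year: int
    region: str
    composition: dict  # attribute -> {value: share of records}

    def render(self) -> str:
        # Start with the provenance fields, then add the largest group
        # for each demographic attribute.
        parts = [self.kind, self.source, str(self.year), self.region]
        for attribute, shares in self.composition.items():
            top_value, top_share = max(shares.items(), key=lambda kv: kv[1])
            parts.append(f"{top_share:.0%} {top_value} ({attribute})")
        return ", ".join(parts)


def summarize(records, attributes):
    """Return the share of each value for every listed attribute."""
    composition = {}
    for attribute in attributes:
        counts = Counter(record[attribute] for record in records)
        total = sum(counts.values())
        composition[attribute] = {v: n / total for v, n in counts.items()}
    return composition


# Made-up toy records, chosen only so the shares echo the example label above.
records = (
    [{"gender": "male", "race": "white"}] * 3
    + [{"gender": "male", "race": "black"}] * 2
    + [{"gender": "male", "race": "asian"}]
    + [{"gender": "female", "race": "white"}]
    + [{"gender": "female", "race": "asian"}]
)

label = DataSetLabel(
    kind="social media data",
    source="Twitter",
    year=2021,
    region="U.S.",
    composition=summarize(records, ["gender", "race"]),
)
print(label.render())
# social media data, Twitter, 2021, U.S., 75% male (gender), 50% white (race)
```

The sketch only counts what is already recorded. The ethnographer's harder questions, such as who assembled the data and whether it was ever updated, still have to be answered by a person and entered by hand.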

The benefit? A better way to determine what an algorithm is telling us, and why. It’s time for digital products to really show their ingredients so we can understand the results they’re putting out into the world.

We cannot have facial recognition datasets that are 75% male, 80% pale! #AINow2017 — Martin Tisné (@martintisne) July 10, 2017

Consider what happens when you Google “professional hair.” You see mainly white hairstyles. Google “unprofessional hair,” and you see mainly black hairstyles. Such bias could be avoided–or at the very least, made transparent–if the data set used in training had been labeled: