In June 2020, the Parliament of the United Kingdom published a policy report with numerous recommendations aimed at helping the government fight the “pandemic of misinformation” powered by internet technology. The report is rather forceful in the conclusions it reaches: “Platforms like Facebook and Google seek to hide behind ‘black box’ algorithms which choose what content users are shown. They take the position that their decisions are not responsible for harms that may result from online activity. This is plain wrong.”
While preparing this report, Parliament collected oral evidence from a variety of key figures. One of these was Vint Cerf, a legendary internet pioneer now serving as vice president and chief internet evangelist at Google. He was asked: “Can you give us any evidence that the high-quality information, as you describe it, that you promote is more likely to be true or in the category, ‘the earth is not flat’, rather than the category, ‘the earth is flat’?” His intriguing response provided a sliver of daylight in the tightly sealed backrooms of Google:
“The amount of information on the World Wide Web is extraordinarily large. There are billions of pages. We have no ability to manually evaluate all that content, but we have about 10,000 people, as part of our Google family, who evaluate websites. . . . In the case of search, we have a 168-page document given over to how you determine the quality of a website. . . . Once we have samples of webpages that have been evaluated by those evaluators, we can take what they have done and the webpages their evaluations apply to, and make a machine-learning neural network that reflects the quality they have been able to assert for the webpages. Those webpages become the training set for a machine-learning system. The machine-learning system is then applied to all the webpages we index in the World Wide Web. Once that application has been done, we use that information and other indicators to rank-order the responses that come back from a web search.”
He summarized this as follows: “There is a two-step process. There is a manual process to establish criteria and a good-quality training set, and then a machine-learning system to scale up to the size of the World Wide Web, which we index.” Many of Google’s blog posts and official statements concerning the company’s efforts to elevate quality journalism come back to this team of 10,000 human evaluators, so to dig deeper into Cerf’s dense statement here, it would be helpful to better understand what these people do and how their work impacts the algorithm. Fortunately, an inside look at the job of the Google evaluator was provided in a Wall Street Journal investigation from November 2019.
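Cerf mentions that the model’s quality predictions are combined with “other indicators” to rank-order search results. Google does not disclose how that combination works, but the general idea can be sketched as a weighted blend of signals. Everything below is hypothetical: the field names, the weights, and the signals are invented for illustration, not drawn from Google’s actual system.

```python
# Hypothetical sketch of the final ranking step Cerf alludes to:
# blending a model-predicted quality score with another relevance
# indicator to rank-order results. All values here are invented.

results = [
    {"url": "example.com/a", "relevance": 0.9, "predicted_quality": 0.3},
    {"url": "example.com/b", "relevance": 0.7, "predicted_quality": 0.9},
    {"url": "example.com/c", "relevance": 0.5, "predicted_quality": 0.5},
]

def rank(results, w_rel=0.6, w_qual=0.4):
    """Order results by a weighted combination of the two signals."""
    score = lambda r: w_rel * r["relevance"] + w_qual * r["predicted_quality"]
    return sorted(results, key=score, reverse=True)

for r in rank(results):
    print(r["url"])
```

With these made-up weights, the moderately relevant but high-quality page outranks the highly relevant but low-quality one, which is the kind of trade-off a quality signal is meant to enforce.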
While Google employees are famously well compensated, these 10,000 evaluators are hourly contract workers who work from home and earn around $13.50 per hour. One such worker profiled in the Wall Street Journal article said he was required to sign a nondisclosure agreement, that he had zero contact with anyone at Google, and that he was never told what his work would be used for (and remember, these are the people Cerf referred to as “part of our Google family”). The contractor said he was “given hundreds of real search results and told to use his judgment to rate them according to quality, reputation, and usefulness, among other factors.” The main task these workers perform, it seems, is rating individual sites as well as evaluating the rankings for various searches returned by Google. These tasks are closely guided by the 168-page document the workers are provided. Sometimes, the workers also receive notes from Google, passed through their contract work agencies, telling them the “correct” results for certain searches. For instance, at one point, the search phrase “best way to kill myself” was turning up how-to manuals, and the contract workers were sent a note saying that all searches related to suicide should return the National Suicide Prevention Lifeline as the top result.
This window into the work of the evaluators, brief though it is, helps us unpack Cerf’s testimony. Google employees—presumably high-level ones—make far-reaching decisions about how the search algorithm should perform on various topics and in various situations. But rather than trying to directly implement these in the computer code for the search algorithm, they codify these decisions in the instruction manual that is sent to the evaluators. The evaluators then manually rate sites and search rankings according to this manual, but even with this army of 10,000 evaluators, there are far too many sites and searches to go through by hand—so as Cerf explained, these manual evaluations provide the training data for supervised learning algorithms whose job is essentially to extrapolate these evaluations so that hopefully all searches, not just the ones that have been manually evaluated, behave as the Google leadership intends.
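The extrapolation step described above can be made concrete with a minimal supervised-learning sketch. To be clear, this is not Google’s system: the features (say, a sourcing score and an error rate), the page vectors, and the nearest-neighbor method are all assumptions chosen to keep the example self-contained. The point is only the shape of the process Cerf describes: a small human-labeled sample, then a learned rule that generalizes those labels to pages no evaluator has seen.

```python
# A minimal sketch of the two-step process: (1) humans label a small
# sample of pages with quality judgments; (2) a learned model
# extrapolates those labels to unlabeled pages. Features, values,
# and the k-NN method are hypothetical, not Google's actual system.

from math import dist

# Step 1: the evaluator-labeled sample. Each page is a feature vector
# (e.g. sourcing_score, error_rate) with a human quality label.
labeled = [
    ((0.9, 0.1), "high"),   # well-sourced, few errors
    ((0.8, 0.2), "high"),
    ((0.2, 0.8), "low"),    # poorly sourced, many errors
    ((0.1, 0.9), "low"),
]

def predict_quality(page, k=3):
    """Step 2: extrapolate the human labels to an unseen page via a
    k-nearest-neighbors vote over the labeled sample."""
    nearest = sorted(labeled, key=lambda item: dist(item[0], page))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Apply the learned rule to pages no evaluator ever rated.
print(predict_quality((0.85, 0.15)))  # → high
print(predict_quality((0.15, 0.85)))  # → low
```

Google uses neural networks rather than nearest neighbors, but the division of labor is the same: humans supply a few thousand judgments, and the model stretches them across billions of pages.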
While some of the notable updates to the Google search algorithm have been publicly announced by the company, Google actually tweaks its algorithm extremely often. In fact, the same Wall Street Journal investigation found that Google modified its algorithm over 3,200 times in 2018. And the number of algorithm adjustments has been increasing rapidly: in 2017, there were around 2,400, and back in 2010 there were only around 500. Google has developed an extensive process for approving all these algorithm adjustments that includes having evaluators experiment with them and report on the impact to search rankings. This gives Google a sense of how the adjustments will work in practice before turning them loose on Google’s massive user base. For instance, if certain adjustments are intended to demote the rankings of fake news sites, the evaluators can see whether that actually happens in the searches they try.
Let me return now to Vint Cerf. Shortly after the question that led to his description of Google’s “two-step” process that I quoted above, the chair of the committee asked Cerf another important, and rather pointed, question: “Your algorithm took inaccurate information, that Muslims do not pay council tax, which went straight to the top of your search results and was echoed by your voice assistant. That is catastrophic; a thing like that can set off a riot. Obviously, 99% of what you do is not likely to do that. How sensitized are your algorithms to that type of error?”
Once again, Cerf’s frank answer was quite intriguing. He said that neural networks (the modern framework for AI) are “brittle,” meaning sometimes tiny changes in input can lead to surprisingly bad outputs. Cerf elaborated further:
“Your reaction to this is, ‘WTF? How could that possibly happen?’ The answer is that these systems do not recognize things in the same way we do. We abstract from images. We recognize cats as having little triangular ears, fur and a tail, and we are pretty sure that fire engines do not. But the mechanical system of recognition in machine-learning systems does not work in the same way our brains do. We know they can be brittle, and you just cited a very good example of that kind of brittleness. We are working to remove those problems or identify where they could occur, but it is still an area of significant research. To your primary question, are we conscious of the sensitivity and the potential failure modes? Yes. Do we know how to prevent all those failure modes? No, not yet.”
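The brittleness Cerf describes can be seen even in a toy model: any learned rule with a hard decision boundary will give opposite outputs for two nearly identical inputs that straddle it. The function, feature, and threshold below are invented purely to illustrate the point; real neural-network brittleness is far subtler, but the boundary effect is the same in miniature.

```python
# A toy illustration of "brittleness": a hard decision boundary means
# two nearly identical inputs can receive opposite outputs. The
# feature and threshold are hypothetical, invented for this sketch.

def classify(quality_score, threshold=0.5):
    """Hypothetical learned rule: promote pages whose predicted
    quality clears a fixed threshold, demote the rest."""
    return "promote" if quality_score > threshold else "demote"

# Inputs differing by 0.0002 land on opposite sides of the boundary.
print(classify(0.5001))  # → promote
print(classify(0.4999))  # → demote
```

In a real neural network the “boundary” lives in a space of millions of learned features that do not align with human abstractions like ears and tails, which is why the failures are so hard to anticipate.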
In short, we trust Google’s algorithms to provide society with the answers to all its questions—even though they sometimes fan the flames of hate and fake news and we don’t entirely know how to stop them from doing so.
Noah Giansiracusa is assistant professor of mathematical sciences at Bentley University.
Adapted from How Algorithms Create and Prevent Fake News: Exploring the Impacts of Social Media, Deepfakes, GPT-3, and More, by Noah Giansiracusa, published by Apress (a division of Springer Nature). Copyright © 2021 by Noah Giansiracusa.