MIT and Qatari scientists are training computers to detect fake news sites

Thanks to social media, it’s easy to come across reporting from unfamiliar news sources around the world. But it can often be difficult to tell which sites are presenting the straight truth, which have a political bias, and which are spreading outright lies.

A new research project from the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Lab and the Qatar Computing Research Institute aims to use machine learning to detect which sites focus on facts and which are more likely to churn out misinformation.

“If a website has published fake news before, there’s a good chance they’ll do it again,” said Ramy Baly, a postdoctoral associate at MIT CSAIL and lead author on a paper about the technology, in a statement.

The tool uses a machine learning technique known as support vector machines to learn to predict how media organizations will be classified by Media Bias/Fact Check, an organization that tracks the level of factual content and political bias in thousands of news sites. It takes into account the actual content of articles on the sites, as well as external factors such as the site’s Twitter presence, the structure of its online domain name, and how it’s described on Wikipedia.
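For readers curious what a support vector machine does with such features, here is a minimal sketch in Python. It is not the researchers' code: the two features, the toy values, the labels, and the function names `train_linear_svm` and `predict` are all assumptions made for illustration, and it handles only two classes (more factual versus less factual), while the system described here sorts sites into three levels. It trains a linear support vector machine by plain subgradient descent on the hinge loss.

```python
# Hypothetical toy data, NOT from the study. Each site is described by two
# made-up features, both scaled to the range [0, 1]:
#   [share of emotionally charged words, relative length of Wikipedia description]
X = [
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.70],  # labeled more factual (+1)
    [0.80, 0.20], [0.90, 0.10], [0.70, 0.30],  # labeled less factual (-1)
]
y = [1, 1, 1, -1, -1, -1]


def train_linear_svm(X, y, epochs=500, lr=0.1, lam=0.01):
    """Fit weights w and bias b by minimizing
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))
    with full-batch subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # example is misclassified or inside the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (more factual) or -1 (less factual) for feature vector x."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


w, b = train_linear_svm(X, y)
```

Calling `predict(w, b, [0.05, 0.95])` then classifies a site the model has never seen. A real system would use far more features and examples, and most likely an established library rather than a hand-written training loop.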

“The most useful information source for judging both factuality and bias turns out to be the actual articles,” says Preslav Nakov, a senior scientist at QCRI, in an interview.

Perhaps unsurprisingly, less factual sites were more likely to use hyperbolic and emotional language than those reporting more factual content. Additionally, Nakov says, news sources with longer descriptions on Wikipedia tend to be more reliable. The online encyclopedia can also provide verbal indications that news sources are suspect, such as references to bias or a tendency to spread conspiracy theories, he says.

“If you, for example, open the Wikipedia page of Breitbart, you read things like ‘misogynistic,’ ‘xenophobic,’ ‘racist,’” Nakov says.

Separately, sites with more complex domain names and URL structures were generally less trustworthy than sites with simpler ones. Some of the more complex URLs belonged to sites with longer addresses essentially impersonating familiar ones with simpler domains.
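The article does not say exactly how the researchers measured URL complexity, but simple counts give a sense of what such signals might look like. The sketch below, using Python's standard `urllib.parse` module, is an assumption for illustration only: the function name `url_features`, the chosen counts, and the example addresses (built on reserved example domains) are all hypothetical.

```python
from urllib.parse import urlparse


def url_features(url):
    """Return a few simple complexity counts for a URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    path_segments = [seg for seg in parsed.path.split("/") if seg]
    return {
        "host_length": len(host),                      # characters in the host name
        "num_labels": len(labels),                     # dot-separated parts of the host
        "num_hyphens": host.count("-"),                # hyphens in the host name
        "num_digits": sum(c.isdigit() for c in host),  # digits in the host name
        "path_depth": len(path_segments),              # non-empty path segments
    }


simple = url_features("https://example.com/")
lookalike = url_features("http://example.com.breaking-news24.example.net/world/story")
```

Here the second, made-up address tucks a familiar-looking name inside a longer host, so it scores higher on every count than the simple one. Counts like these could serve as inputs to a classifier alongside the article text.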

The researchers focused on tracking the reliability of entire news outlets rather than individual stories, partly in the expectation that algorithms might be better at handling full bodies of work than short posts. A system that classifies entire sites can also help readers evaluate new content from a site, even if that content hasn’t been studied by the kind of human fact-checkers that social networks like Facebook are increasingly employing. Fact-checkers could also use the algorithm’s ratings in evaluating cases where different sites report differently on the same subject, Nakov suggests.

When presented with an unfamiliar news outlet, the system was roughly 65% accurate at detecting whether the outlet had a high, medium, or low level of factuality and 70% accurate at detecting whether it leaned to the left, right, or center. The researchers plan to present the paper in a few weeks at the Empirical Methods in Natural Language Processing conference in Brussels.

The research is far from the only ongoing effort to combat misinformation. The Defense Advanced Research Projects Agency has been funding research into detecting forged images and videos, and other researchers have delved into how to teach people to spot suspect news.

In the future, the MIT and QCRI researchers plan to test the English-trained system on other languages and see how it fares with biases other than left and right, such as spotting religious or secular-leaning news in the Islamic world. The group also has plans for an app that could offer users a look at news stories from a variety of political perspectives.
