In the 21st century, a crack reporter’s best source for juicy leads might not come from pounding the pavement and hounding sources. Instead, it might be an algorithm that gathers the information hidden in documents and data sets and displays it in a eureka-inspiring format. To that end, Google and the Knight Foundation recently announced millions of dollars in grants to digital journalism projects, with the overarching goal of funneling the firehose of new data into an easily digestible format.
One of the awards went to The Chicago Tribune’s “PANDA,” which aims to sew otherwise incompatible datasets together, allowing journalists to find unknown relationships in the archived data that normally sits dormant on individual hard drives. Joe Germuska, Senior Developer of News Applications at the Tribune, told Fast Company the idea was motivated by reporters who would “inadvertently” stumble upon an old dataset that helped piece together a story. “One thing we wanted was to make that less of an accident and more straightforward,” he said.
“All journalists should be data journalists,” said a PANDA team document, referring to the practice of applying statistical techniques to mine raw data for investigative purposes. Without computers, news outlets have been forced to rely on readers to comb through data dumps (such as the Palin emails), or dedicate a substantial number of man-hours from their staffs (what the New York Times had to do for WikiLeaks).
Moreover, teams of individual readers can’t spot connections between data on pages they’ve never read, nor can they cross-reference them with other archived data collecting dust inside of forgotten computer files. Project lead Brian Boyer hopes that “when a reporter gets a 10,000-row campaign contribution list, they can reconcile it against databases we keep on file to see what things pop up.”
For now, Germuska says the Tribune team will be trying to solve some of the prosaic but annoying problems of working with large data sets, like figuring out how to stitch slightly incompatible datasets together (for instance, if one has first and last names in a single field, and another separates them into two fields, or uses a nickname). Additionally, the first iteration of PANDA will not be a hosted service, but an application owned and operated by each newsroom.
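To make the stitching problem concrete, here is a minimal Python sketch of how a contribution list with a single name field might be reconciled against a file that stores first and last names separately. It is an illustration only, not PANDA’s actual code: the field names (`name`, `first_name`, `last_name`), the tiny `NICKNAMES` table, and the `reconcile` function are all hypothetical, and real name matching would need far more care (middle names, suffixes, misspellings).

```python
# Hypothetical nickname table for illustration; a real one would be much larger.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def split_full_name(full_name):
    """Split 'First Last' or 'Last, First' into a (first, last) pair.

    Middle names are dropped, which is a simplifying assumption.
    """
    full_name = full_name.strip()
    if "," in full_name:
        last, first = [part.strip() for part in full_name.split(",", 1)]
        return first, last
    parts = full_name.split()
    if not parts:
        return "", ""
    return parts[0], parts[-1]


def name_key(first, last):
    """Build a normalized key: lowercase, with known nicknames expanded."""
    first = first.strip().lower()
    last = last.strip().lower()
    return NICKNAMES.get(first, first), last


def reconcile(contributions, on_file):
    """Pair each contribution row with every on-file record sharing its name key."""
    index = {}
    for record in on_file:
        key = name_key(record["first_name"], record["last_name"])
        index.setdefault(key, []).append(record)

    matches = []
    for row in contributions:
        first, last = split_full_name(row["name"])
        for record in index.get(name_key(first, last), []):
            matches.append((row, record))
    return matches
```

Indexing the on-file records by normalized key first means each contribution row needs only one dictionary lookup, which keeps a 10,000-row list cheap to check. Exact matching on a normalized key will still miss misspellings and will occasionally pair two different people with the same name, so any hit is a lead for a reporter to verify rather than a finding.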
The team’s longer-term aspiration is to bring the power of statistical analysis to those without math degrees. The PANDA team describes the goal as a kind of sophisticated middle ground between consumer software such as Excel (which the team says “sucks”) and the PhD-grade stats program R, which they refer to as a “visitor from an evil alternate universe (though it’s often the best tool for the job).”
Other grants went to the DocumentCloud Reader, which allows public annotations on documents, and OpenBlock Rural, which aims to help rural news organizations become public-data storehouses like their big-city counterparts. The Associated Press’s “Overview” project also received funding. Overview will explore raw data by using visualization tools to highlight patterns and collusion that journalists might otherwise never have found while poring over reams of paper for hours on end.
[Image: Flickr user Marius B]