Last weekend, media outlets around the world released stories based on the Paradise Papers, a collection of more than 13 million leaked files detailing how the world’s wealthiest people have parked their money in offshore tax havens.
The International Consortium of Investigative Journalists analyzed the documents along with 380 reporters from at least 95 organizations around the globe, after they were first obtained by the Süddeutsche Zeitung, a major German newspaper. Stories so far have uncovered offshore accounts used by multinational companies like Apple looking to minimize taxes and detailed Commerce Secretary Wilbur Ross’s investments in companies tied to associates of Russian President Vladimir Putin. More reports are expected to follow in the coming weeks.
But while journalists from around the world analyzed the materials from international law firm Appleby, trust company Asiaciti and corporate registries in 19 jurisdictions, much of the technical work to turn the raw leaked documents into a searchable, usable, and secure database was done by a skeleton crew at the ICIJ. About 9 or 10 developers, analysts, and product managers worked for about 14 months on the technical side of the project, says Pierre Romera, the ICIJ’s chief technology officer.
“We are a very, very small team,” he says.
Of course, it wasn’t the ICIJ’s first time dealing with a massive document dump: the famous Panama Papers leaked in 2015 included a similar volume of material, and the ICIJ has worked with smaller-scale offshore leaks as well. Over the years, ICIJ’s team has developed digital tools and procedures to understand the stories such huge networks of documents can tell. They’ve built their own software when needed, some of which is available on GitHub, and learned to harness tools like the open source Apache Solr search engine and the Neo4J connected data analysis platform.
“Most of the time we used existing tools that we improved ourselves, according to our need,” says Romera.
Essentially every story the ICIJ tells with its leaked documents includes an interactive network graph, built with Neo4J and Linkurious, a visualization tool. That lets readers see how the companies and individuals in the story interconnect. The graph database is also useful to reporters looking to uncover stories and fact-checkers working to verify them, Romera says.
To protect the consortium’s sources, Romera can’t go into too many details about how the Paradise documents were received, but the materials included emails, spreadsheets and plenty of PDF files.
“One of the biggest challenges was the PDFs, because some of them were only images without text in it,” he says.
That meant that before the files could be loaded into any kind of search engine, they’d have to be processed with optical character recognition software that could turn images of words and numbers into actual data a computer could understand. The ICIJ has built an open-source tool called Extract for efficiently pulling useful data from these kinds of file dumps, and it made some improvements this go-round to handle some new document formats, Romera says.
Emails, too, present their own challenges for analysis, because there’s so much duplicated data across files. Earlier messages are reproduced in replies and forwards, and the same message can be included multiple times in the data repository: once for its sender and once for each of its recipients.
“If we were not deduplicating the database, the number of the total documents would be much bigger,” says Romera.
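To make the deduplication step concrete, here is a minimal Python sketch of one way the sender's and recipients' copies of a message could be collapsed into a single record. It is an illustration under stated assumptions, not a description of ICIJ's Extract tool or its actual pipeline: the function names `message_key` and `deduplicate` are hypothetical, and the approach (match on the Message-ID header, fall back to a hash of whitespace-normalized sender, date, subject and body) is just one common technique. It does not address earlier messages quoted inside replies and forwards.

```python
import hashlib
from email import message_from_string


def message_key(raw: str) -> str:
    """Return a key that is the same for every stored copy of one email.

    Hypothetical helper: prefers the Message-ID header and falls back to a
    hash of a few normalized fields when that header is missing.
    """
    msg = message_from_string(raw)
    message_id = msg.get("Message-ID")
    if message_id:
        return "id:" + message_id.strip()

    # Fallback: hash fields that should match across copies of a message.
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # multipart bodies are ignored here
    fields = [msg.get("From", ""), msg.get("Date", ""), msg.get("Subject", ""), body]
    # Collapse runs of whitespace so trivial formatting differences do not matter.
    normalized = "\n".join(" ".join(field.split()) for field in fields)
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(raw_messages):
    """Keep the first copy of each distinct message, preserving input order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        key = message_key(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

In this sketch, a message stored once in the sender's mailbox and once in a recipient's mailbox would count as a single document, even if the recipient's copy carries extra delivery headers.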
Then, the data can be loaded into the graph database, a tool that Neo4J CEO Emil Eifrem says is becoming more prevalent in journalism overall, presumably as datasets outlining social and commercial networks become that much more common. ICIJ’s success using Neo4J to analyze the Panama Papers has helped stir interest among other organizations working with similar financial data, like banks and tax agencies around the world, Eifrem says.
On the journalistic side, the company counts BuzzFeed, The Guardian and The New York Times as users, offers training and support to news organizations and recently sponsored a connected-data fellowship at ICIJ.
Recent releases of Neo4J have included more support for data visualization without the need for developers or external tools, software for loading information from other common databases, and algorithms that can help find interesting patterns in the data.
“Clustering algorithms are very interesting for finding, for example, clusters of transactions,” says Eifrem.
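Eifrem doesn't specify which algorithms he has in mind. As a rough illustration of the idea, the Python sketch below groups entities into clusters using connected components, one of the simplest ways to find parties linked by a chain of transactions. It is plain Python rather than Neo4J's own implementation, and the function name `transaction_clusters` and the entity labels in the usage example are hypothetical.

```python
from collections import defaultdict


def transaction_clusters(transactions):
    """Group entities connected, directly or indirectly, by transactions.

    `transactions` is an iterable of (payer, payee) pairs. Returns a list of
    clusters, each a sorted list of entity names, largest cluster first.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for payer, payee in transactions:
        root_a, root_b = find(payer), find(payee)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical example: two separate groups of entities.
example = [("Shell Co A", "Trust B"), ("Trust B", "Person C"),
           ("Firm D", "Firm E")]
print(transaction_clusters(example))
```

A graph database applies the same idea at much larger scale, and can layer on more sophisticated measures of how tightly a group of entities is connected.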
The ICIJ plans to release additional stories and data in the coming weeks, Romera says, and it’s likely the Paradise Papers will continue to have an impact for some time to come. Nawaz Sharif, the former prime minister of Pakistan, continues to face corruption charges relating to revelations from the Panama Papers.
Given the high stakes involved for so many powerful people, the ICIJ and its partner organizations naturally emphasize digital security: Romera says ICIJ helps its partners install security software and build secure computing environments where they can analyze the data without leaks of their own.
Phishing emails are a regular hazard of work on Romera’s team, he says, with some even directed at newer employees whose email addresses aren’t widely circulated. The team uses two-factor authentication for its own network logins and takes other security measures as well, Romera says, including carefully validating the leaked data it receives.
“You should never trust your data,” he says. “You have to check everything.”