Graph, search and the Panama Papers

The Panama Papers, released on April 3, 2016, represent a watershed moment in international journalism. For the first time, the layers of the onion around offshore banking and the opaque system of private banking are being peeled back.

How could so much money move through so many systems, on behalf of so many owners, for so long, outside of the purview of the law or taxation?

The Panama Papers

Panama Papers, image © ICIJ

The so-called Panama Papers are a collection of documents leaked by an anonymous whistleblower at a law firm in Panama. This is one of the most massive disclosures of confidential data in human history, by measures including:

  • Volume of raw data
  • Number of documents
  • Number of people implicated
  • Number of organizations cited
  • Amount of money cited

Let’s look at the structure of the data cited in the Panama Papers:

Person A -> Invested money into -> Organization X

But also

Organization X <- Controlled by <- Persons B, C, D etc.

And when you combine these things, you get a picture of who owns whom, and how money flows from entity to entity.
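To make that structure concrete, here is a minimal sketch of the two relationship types stored as a directed graph and walked with a breadth-first search. The entity names, the INVESTED_IN and CONTROLS labels, and the extra "Organization Y" hop are illustrative assumptions, not records from the leak.

```python
from collections import defaultdict, deque

# Hypothetical edges in (source, relationship, target) form.
EDGES = [
    ("Person A", "INVESTED_IN", "Organization X"),
    ("Person B", "CONTROLS", "Organization X"),
    ("Person C", "CONTROLS", "Organization X"),
    ("Organization X", "INVESTED_IN", "Organization Y"),  # assumed extra hop
]


def build_graph(edges):
    """Map each entity to its outgoing (relationship, target) pairs."""
    graph = defaultdict(list)
    for source, relationship, target in edges:
        graph[source].append((relationship, target))
    return graph


def reachable(graph, start):
    """Return every entity reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


def controllers(edges, organization):
    """Return the entities that have a CONTROLS edge into organization."""
    return sorted(s for s, r, t in edges if r == "CONTROLS" and t == organization)


graph = build_graph(EDGES)
print(reachable(graph, "Person A"))
print(controllers(EDGES, "Organization X"))
```

Following the outgoing edges answers "where does this person's money end up?", while filtering the incoming CONTROLS edges answers "who is behind this organization?".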

Graph alone isn’t enough

Is it all about finding the connections? Yes and no. There’s much more than the graph. Wired has an in-depth overview of the analysis. Some interesting observations:

The leak contains 2.6 terabytes of information.

“Heterogeneous data is hard to ingest and cross-reference,” Gabriel Brostow, an associate professor in computer science, at University College London, told WIRED. “Tables, figures, PDFs are almost impenetrable.”

Once the text was extracted it could then be inserted into the index and database. The final database size was predicted by Barron to be 30 per cent of the original data size. “We allowed ICIJ and Süddeutsche Zeitung to run their keyword searches, we could also bring out entities: first names, second names and figures,” Barron said. “We could also use our analytics to find how these names refer to the documents. If you find a person’s name in an email, you may want to find out where else that person has been mentioned across all of the other data.”
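The "where else has this person been mentioned" lookup described in the quote can be sketched with a tiny inverted index. This is only an illustration of the idea, not the tooling the journalists used; the document IDs, names, and text are made up, and the search simply requires every word of the query to appear somewhere in the document rather than matching an exact phrase.

```python
import re
from collections import defaultdict

# Hypothetical extracted text, keyed by document ID.
DOCUMENTS = {
    "email-001": "Jane Doe asked about Organization X.",
    "contract-017": "Organization X is controlled by John Roe.",
    "email-042": "John Roe forwarded the invoice to Jane Doe.",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return sorted IDs of documents containing every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(matches)


index = build_index(DOCUMENTS)
print(search(index, "Jane Doe"))
```

A real system would also need the text extraction step the quote mentions (for PDFs, tables, and scanned images) before anything could be indexed.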

Graph and search together are the key: search finds the data you need, and the graph finds the connections in that data.
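One way the two ideas can meet, sketched under the assumption that an entity-extraction step has already produced a list of names per document: treat every pair of names mentioned in the same document as a connection, and keep the document IDs as evidence for each link. The data and function names below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of an entity-extraction step: document ID -> names found.
ENTITIES_BY_DOC = {
    "email-001": ["Jane Doe", "Organization X"],
    "contract-017": ["John Roe", "Organization X"],
    "email-042": ["Jane Doe", "John Roe"],
    "memo-100": ["Organization Y"],
}


def co_mention_graph(entities_by_doc):
    """Link every pair of names that share a document, keeping the evidence."""
    graph = defaultdict(lambda: defaultdict(set))
    for doc_id, names in entities_by_doc.items():
        for a, b in combinations(sorted(set(names)), 2):
            graph[a][b].add(doc_id)
            graph[b][a].add(doc_id)
    return graph


def connections(graph, name):
    """Return each entity linked to name with the documents that link them."""
    return {other: sorted(docs) for other, docs in graph.get(name, {}).items()}


graph = co_mention_graph(ENTITIES_BY_DOC)
print(connections(graph, "Jane Doe"))
```

Co-mention is a weak signal on its own, so a link found this way is a lead to check against the underlying documents rather than proof of a relationship.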

Some related links: