The Next Data Goldmine: Paper

Captricity uses a mix of image analysis and Mechanical Turk to turn paperwork into digital data with 99.98% accuracy.

[Photo: Flickr user MedillNSZ]

Data analysis is a runaway growth field these days–which is all fine and good when the data’s digital to begin with. But what about when it’s in paper form?


The Veterans Administration has more than 600,000 outstanding cases, for instance. Those cases must be processed into data before qualified personnel can make decisions about patients, and many of the processing delays are logistical–like differing descriptions of medical issues that have to be reconciled.

Captricity has built a service to convert that mountain of paperwork into digital data that is ripe for analysis. The company ingests the paperwork using a combination of algorithm-powered image engines and crowdsourced manual image processing via Mechanical Turk, a cheap Internet-sourced labor service owned by Amazon.

Mechanical Turks manually process form fields, and that data is fed into Captricity’s algorithmic engines. Each round trains the algorithms to recognize all allowable answers in a field, from a binary Yes/No to a specific village of origin within the nation of Malawi, so every subsequent round requires fewer Mechanical Turks as the algorithms process forms with increasing accuracy.

Kuang Chen was inspired to start Captricity after doing his PhD work in African medical clinics. The nurses would lay a baby on a cot and check their vaccine records on a paper form, because paper doesn’t run out of batteries. But without the tools or resources to process that paper data, the medical personnel could only slowly track a village’s rate of malarial infection.

Bolstered by an FDA contract and a $10 million investment boost last July, the San Francisco company has since attracted contracts from the insurance industry. That revenue is being used to help fund Captricity’s benevolent arm, and to follow through on the idealistic plan that first sparked founder Chen’s interest: providing data processing at cost for NGOs and civil-serving institutions.

“It’s like that William Gibson quote: ‘The future is here, it just isn’t equally distributed yet,’” Chen says.


Not everyone is comfortable with machines processing their forms, and Captricity honors client requests that every form be checked by human eyes in a secure way. Captricity automatically chops forms up into anonymized batches of fields and sends them off without any identifying data to the humans working for Mechanical Turk.
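A minimal sketch of how that chopping step could work, assuming each form is a mapping from field names to cropped field images. The function name `shred_forms`, the random snippet IDs, and the shuffling across forms are illustrative assumptions, not Captricity's actual implementation.

```python
import random
import uuid
from typing import Dict, List, Optional, Tuple

Snippet = Tuple[str, bytes]  # (opaque snippet id, cropped field image)


def shred_forms(
    forms: Dict[str, Dict[str, bytes]],
    batch_size: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[Snippet]], Dict[str, Tuple[str, str]]]:
    """Split forms into per-field snippets and group them into batches.

    Returns the batches to send out plus a private key that maps each
    snippet id back to (form id, field name). Only the batches leave
    the building; the key stays behind for reassembly.
    """
    rng = rng or random.Random()
    key: Dict[str, Tuple[str, str]] = {}
    snippets: List[Snippet] = []

    for form_id, fields in forms.items():
        for field_name, image in fields.items():
            snippet_id = uuid.uuid4().hex  # carries no form or field info
            key[snippet_id] = (form_id, field_name)
            snippets.append((snippet_id, image))

    # Assumption: mixing snippets across forms so that no single batch
    # is likely to contain one person's whole form.
    rng.shuffle(snippets)

    batches = [
        snippets[i:i + batch_size]
        for i in range(0, len(snippets), batch_size)
    ]
    return batches, key
```

When the answers come back, the private key is what lets each transcribed snippet be put back into the right field of the right form.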

There are multiple failsafes as well. If one algorithm is uncertain about the content in a form, it hands the form to another algorithm. If the second can’t read the form, it goes to two Mechanical Turks; if they disagree, the form is sent to a third to evaluate both prior human answers. After five Turks deem the form illegible, the form is sent back. That’s a lot of steps to pick apart sloppy handwriting, but sometimes the answer isn’t even in a language.
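The escalation chain described above can be sketched roughly as follows. The `Reader` and `Arbiter` types, the function name `transcribe_field`, and the exact way illegible votes are counted are assumptions made for illustration; the article does not spell out those details.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interfaces: a reader returns a transcription, or None
# when it cannot read the field. An arbiter picks between two answers.
Reader = Callable[[bytes], Optional[str]]
Arbiter = Callable[[bytes, str, str], str]


def transcribe_field(
    image: bytes,
    algorithms: Sequence[Reader],
    workers: Sequence[Reader],
    arbiter: Arbiter,
    max_illegible: int = 5,
) -> Optional[str]:
    """Return a transcription, or None if the field should be sent back."""
    # Machines first: the first algorithm that is confident wins.
    for algorithm in algorithms:
        answer = algorithm(image)
        if answer is not None:
            return answer

    # Then humans: gather two legible answers, counting illegible votes.
    legible: List[str] = []
    illegible_votes = 0
    for worker in workers:
        answer = worker(image)
        if answer is None:
            illegible_votes += 1
            if illegible_votes >= max_illegible:
                return None  # enough people gave up; send it back
        else:
            legible.append(answer)
            if len(legible) == 2:
                break

    if len(legible) < 2:
        return None  # ran out of workers before getting two answers

    first, second = legible
    if first == second:
        return first
    # Disagreement: a third person evaluates both prior answers.
    return arbiter(image, first, second)
```

The point of the ordering is cost: the cheap automated readers see every field, and each more expensive human step only sees what the previous step could not settle.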

“They’ll mark things as impossible if they’re written in crayon, but they’ll also mark things as impossible if something like a snowman is drawn in the form. That’s actually happened,” Chen says.

While Mechanical Turk’s policy won’t explicitly name how many Turks are in use, the 26 Captricity employees are responsible for wrangling a Mechanical Turk image processing army “in the low tens of thousands,” Chen says.

“The computer vision was very hard, but getting the Turks to behave was equally hard,” Chen says. The accuracy of a workforce like the Mechanical Turk isn’t standardized, so how can Chen trust them?

“The thing is, we don’t trust them,” Chen says. “We verify all the time and so far it’s worked very well.”


Many large-scale employers of Mechanical Turks pre-vet them and retain a trusted pool. But people can change, Chen says, and so repeated quality testing is necessary to make sure the data you’re processing is accurate, especially with the sensitive and health-associated data that Captricity is processing. Hidden within every collection of form snippets sent to the Turks are planted questions that Captricity matches against known answers. Get enough wrong and Captricity will dismiss the Turk.
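The planted-question check can be illustrated with a short sketch. The `WorkerRecord` class, the case-insensitive comparison, and the threshold values are hypothetical; the article says only that enough wrong answers lead to dismissal, not where the line is drawn.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class WorkerRecord:
    """Running tally of one worker's performance on planted questions."""
    planted_seen: int = 0
    planted_wrong: int = 0


def score_batch(
    record: WorkerRecord,
    answers: Dict[str, str],
    planted: Dict[str, str],
) -> None:
    """Update the tally using only snippets whose answers are known."""
    for snippet_id, expected in planted.items():
        if snippet_id not in answers:
            continue
        record.planted_seen += 1
        given = answers[snippet_id].strip().lower()
        if given != expected.strip().lower():
            record.planted_wrong += 1


def should_dismiss(
    record: WorkerRecord,
    max_error_rate: float = 0.1,  # assumed threshold, not from the article
    min_seen: int = 20,           # assumed minimum sample before judging
) -> bool:
    """Dismiss only after enough planted questions have been seen."""
    if record.planted_seen < min_seen:
        return False
    return record.planted_wrong / record.planted_seen > max_error_rate
```

Because the planted snippets look like any other snippet in the batch, a worker cannot tell which answers are being graded, which is what makes continuous re-testing possible.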

Imagine Captricity as a pipeline where the algorithms filter out answers at each stage. Stage one: Is the field blank? Stage two: Is it a binary solution? With 30 stages, they can account for solutions at 98% accuracy, Chen says. Before greenlighting the contract, the FDA required Captricity to undergo an accuracy test. Captricity processed 99.98% of form data accurately, better than human manual processing.
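As a toy illustration of that staged filtering, the sketch below runs a field through an ordered list of checks and stops at the first one that resolves it. Fields are plain strings here as a stand-in for images, and the stage functions `is_blank` and `is_binary` are hypothetical; the real stages are computer vision models that the article does not describe.

```python
from typing import Callable, List, Optional, Tuple

# A stage returns an answer if it can resolve the field, else None.
Stage = Tuple[str, Callable[[str], Optional[str]]]


def is_blank(field: str) -> Optional[str]:
    """Stage one: resolve empty fields to an empty answer."""
    return "" if field.strip() == "" else None


def is_binary(field: str) -> Optional[str]:
    """Stage two: resolve Yes/No fields."""
    value = field.strip().lower()
    return value.capitalize() if value in ("yes", "no") else None


STAGES: List[Stage] = [("blank", is_blank), ("binary", is_binary)]


def run_pipeline(
    field: str, stages: List[Stage] = STAGES
) -> Tuple[Optional[str], Optional[str]]:
    """Return (stage name, answer), or (None, None) if nothing resolved it."""
    for name, stage in stages:
        answer = stage(field)
        if answer is not None:
            return name, answer
    return None, None  # left over for later stages or human review


print(run_pipeline("   "))       # resolved by the blank stage
print(run_pipeline(" YES "))     # resolved by the binary stage
print(run_pipeline("Lilongwe"))  # unresolved by these two stages
```

Each stage removes the fields it can settle, so later and more expensive stages only see what is left.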

Captricity is in the fortunate position of having to prioritize an influx of insurance clients, which has led the company to double its workforce since July. Still, Chen is excited to parlay this experience into helping more civil organizations.

Interested organizations must meet certain criteria, including a clear inability to afford Captricity’s services and a follow-up report about what was done with the data. But it’s a step toward fulfilling a social responsibility that Chen sold his employees on. Captricity has been in development since 2011 and his team signed on to solve this massive legacy problem for civil organizations.

“They’re the true believers,” Chen says of his team. “The way I was able to recruit them wasn’t with startup sweets–it was to really help.” When Captricity was contracted to help the FDA last summer, they ended up processing a backlog of adverse event forms. And when the FDA has a backlog, the nation isn’t getting up-to-date information on drug effects. All because the data only exists on paper.