Parse.ly doesn’t sound like a typical name for a defense contractor.
But the New York-based web analytics startup has been awarded more than $1 million through a Defense Advanced Research Projects Agency (DARPA) program called Memex, focused on developing the next generation of web search.
“It’s trying to explore all of the myriad use cases that search and webcrawling can do for you that aren’t just commercial web search,” says Parse.ly cofounder and CTO Andrew Montalenti.
One early anticipated application, for instance: tracking and shutting down online transactions related to human trafficking and modern-day slavery.
Eliminating human trafficking is a “key [Defense Department] mission,” according to DARPA, and a White House report citing the Memex project says the new search tools can help detect human trafficking activity online, identify groups engaged in trafficking, and discover ties to other nefarious activities.
“The use of forums, chats, advertisements, job postings, hidden services, etc., continues to enable a growing industry of modern slavery,” DARPA said in a statement announcing the program earlier this year. “An index curated for the counter-trafficking domain, along with configurable interfaces for search and analysis, would enable new opportunities to uncover and defeat trafficking enterprises.”
IST Research, a lead contractor on the project that has also worked on using SMS to gather and disseminate information in low-connectivity areas, pointed to further potential applications in epidemiology and in tracking the sale of counterfeit goods.
In its ordinary business, Parse.ly provides media companies with tools to analyze who’s visiting websites, how long they’re spending on different pages, what they’re sharing on social media and so forth. And to answer some of those questions, the company has developed tools to crawl customer websites, finding new content and automatically extracting author, section, tag and other information.
That work, some of which Parse.ly has released as open source, drew DARPA’s attention last year, says Montalenti. Parse.ly probably won’t work on specific applications like anti-trafficking; instead, it will continue developing general-purpose tools for crawling sites and analyzing content in real time.
When DARPA approved funding for Parse.ly, says Montalenti, the agency offered a suggestion: “The way you use the research dollars is you basically work on research projects with your team and basically work on open source projects like you’re already doing.” DARPA, which famously funded the 1970s-era Internet predecessor known as ARPANET, increasingly backs projects designed to produce peer-reviewed, reproducible scientific results or publicly available open source code, according to Montalenti.
And since Parse.ly’s technically a subcontractor on the project, most of the bureaucratic overhead of government contracting—which Montalenti freely admits is outside Parse.ly’s expertise—is handled by the lead contractor on the project, leaving Parse.ly free to focus on the science and engineering.
“They basically said to us, [an option] for you is you can team up with a more established government contractor,” Montalenti says. “They can take care of all the red tape for you; you can focus on the more fundamental research.”
One goal is essentially to build an open source, distributed webcrawling API, which would make it possible for anyone to launch high-performance crawlers using technologies such as Amazon’s Elastic Compute Cloud, similar to the proprietary tools that search engines like Google and Bing have in-house.
“We’re coming up with a way to scale out nodes to crawl a specific part of the web and do whatever you want with the results,” says Montalenti. “You could take a list of news domains that you want to monitor, and you could spin up a bunch of Amazon EC2 instances and have web crawlers running there that are crawling that specific area of the web on a more frequent basis, and give you real-time results when new stuff shows up in that area of the web.”
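The scale-out idea Montalenti describes, with each node responsible for crawling "a specific part of the web," can be sketched as a simple partitioning scheme: a stable hash assigns every domain in a watch list to one crawler node. This is an illustrative stand-in, not Parse.ly's actual code; the hashing scheme, function names, and example domains are all assumptions.

```python
import hashlib

def assign_node(domain: str, num_nodes: int) -> int:
    """Map a domain to one of num_nodes crawler nodes.

    Uses a stable hash so the same domain always lands on the same
    node, letting each node keep its own crawl state (politeness
    delays, last-crawled timestamps) for its slice of the web.
    """
    digest = hashlib.sha256(domain.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def partition(domains, num_nodes):
    """Split a domain list into per-node work queues."""
    queues = {n: [] for n in range(num_nodes)}
    for d in domains:
        queues[assign_node(d, num_nodes)].append(d)
    return queues

# Hypothetical list of news domains to monitor, spread across
# two crawler nodes (e.g. two EC2 instances).
news_sites = ["example-news.com", "example-times.com", "example-post.com"]
work = partition(news_sites, num_nodes=2)
```

Because the assignment is deterministic, adding a new domain never reshuffles the others between nodes, so each node's crawl history stays valid.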
The company already works with a number of open source projects to build its crawlers, including the Python scraping framework Scrapy, the distributed real-time processing engine called Apache Storm, and the distributed message-passing and logging framework known as Apache Kafka.
Kafka manages streams of data like URLs, page content, and metadata for scraping and analysis projects, and Storm makes it possible to run analyses of massive sets of documents—“a great technology to use if you need to do large-scale document processing,” says Montalenti.
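The pattern Montalenti describes, with URLs, page content, and metadata flowing as a stream of messages from crawlers to downstream processors, can be illustrated with an in-memory stand-in. A real deployment would publish to and consume from a Kafka broker; the queue below only mimics the message-stream shape, and the message fields are assumptions.

```python
import json
import queue

# In-memory stand-in for a Kafka topic: a stream of serialized
# messages. A real pipeline would use a Kafka producer/consumer
# against a running broker instead.
page_topic = queue.Queue()

def produce(topic, message: dict):
    """Serialize a message onto the topic, as a producer would."""
    topic.put(json.dumps(message).encode("utf-8"))

def consume(topic):
    """Drain and deserialize messages, as a consumer would."""
    while not topic.empty():
        yield json.loads(topic.get().decode("utf-8"))

# A crawler publishes each fetched page as a message of URL,
# raw content, and extracted metadata ...
produce(page_topic, {"url": "http://example.com/story",
                     "content": "<html>...</html>",
                     "metadata": {"author": "A. Reporter",
                                  "section": "news"}})

# ... and a downstream processor (Storm, in a stack like
# Parse.ly's) consumes the stream and analyzes each document.
authors = [msg["metadata"]["author"] for msg in consume(page_topic)]
```

The point of the broker in the middle is decoupling: crawlers and analyzers scale independently, and neither has to know how many instances of the other exist.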
Since Parse.ly normally works in the Python programming language, some of its open source contributions have involved building interfaces from that language—with its powerful natural language processing and scientific computing libraries—to Storm and Kafka, which are more typically used from Java and other languages that run on the Java Virtual Machine.
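Storm structures document processing as "bolts" that receive tuples one at a time and act on them; interfaces like the ones Parse.ly has contributed let such bolts be written in Python. The minimal class below only mimics that interface shape without the Storm runtime or any particular bridge library; the class name and tuple fields are illustrative.

```python
class AuthorCountBolt:
    """A Storm-style bolt sketch: receives one tuple at a time via
    process() and accumulates per-author document counts. A real
    bolt running under Storm would also emit tuples downstream."""

    def __init__(self):
        self.counts = {}

    def process(self, tup):
        # For this sketch, tup is assumed to be (url, author).
        url, author = tup
        self.counts[author] = self.counts.get(author, 0) + 1

bolt = AuthorCountBolt()
for tup in [("http://a.example/1", "A. Reporter"),
            ("http://a.example/2", "A. Reporter"),
            ("http://b.example/1", "B. Writer")]:
    bolt.process(tup)
```

In a real topology, Storm would feed each bolt instance its share of the tuple stream in parallel, which is what makes the "large-scale document processing" Montalenti mentions tractable.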
In addition to funding work that Parse.ly can use in its usual line of business, the Memex program has also connected the company’s developers with others working in similar areas, including in the academic world.
That’s helped the company stay on the cutting edge of crawling and data-crunching technology, Montalenti says.
At DARPA, “they have full-team meetings and get-togethers where all the researchers from different organizations get together and present what they’ve been doing,” he says. “It’s really awesome and humbling.”