In a San Jose federal court today, Apple will attempt to capitalize on this summer’s $1 billion win against Samsung by alleging that six more Samsung devices, including the wildly popular Galaxy S III, boast features stolen from the iPhone.
To demonstrate that the Korean phone-maker intentionally ripped off Apple, lawyers will have to show a pattern of let’s-copy-Apple talk among tens of millions of Samsung’s internal documents–too many for any legal team to sort through manually, except at enormous cost. So when a billion-dollar verdict is on the line, companies like Samsung and Apple are turning to a relatively new forensic artificial intelligence technique to prove who’s wrong. And since this type of cyberforensics is being used by the offense, the ramifications for business could be massive.
With a single Linux box and some open-source machine-learning software, corporations can now undertake risky data-driven lawsuits that, until recently, would have been expensive, prolonged, and suspect in court. But “predictive coding,” as it’s known in legal circles, is also a boon for any company trying to take stock of what it knows, comply with regulators, reverse-engineer a decision, or complete a merger. And since the software is based on an open-source project, almost any company can put it to use. Here’s how it works.
In 2009, Joe Looby led the team, appointed by case trustee Irving Picard, that cracked Bernie Madoff’s “black box” servers to determine whether any of Madoff’s $65 billion in trades were real. From his office high above Times Square, Looby can see into a neighboring dance studio where Broadway performers rehearse their acts. In much the same way, predictive coding gives his firm, FTI Consulting, a detailed forensic window into a company’s practice.
“A couple good things are happening now,” Looby says. “Courts are beginning to endorse predictive coding, and training a machine to do the information retrieval is a lot quicker than doing it manually.”
The process of “information retrieval” (or IR) is the first part of the “discovery” phase of a lawsuit, dubbed “e-discovery” when computers are involved. Normally, a small team of lawyers would have to comb through documents and manually search for pertinent patterns. With predictive coding, they can manually review a small portion and use that sample to teach the computer to analyze the rest. (A variety of machine learning technologies were used in the Madoff investigation, says Looby, but he can’t specify which.)
“Every case is different, so it’s hard to give an estimate of the IR time saved,” says Looby, “but if we use reasonable assumptions and model the time for two similar teams to pass through a million documents, a team with a predictive discovery machine should finish the job in less than a third of the time it takes the manual team to finish.”
The savings are significant. “We’ve had clients save anywhere from several hundred thousand to a million dollars on these types of projects,” estimates Looby. Since companies perform a cost-benefit analysis to decide whether to settle or litigate, machine learning can mean the difference between an affordable suit and a frivolous one.
E-discovery has been in practice for more than 10 years, but only this year has it become defensible and reliable enough to play a substantive role in major litigation. Predictive coding is so new that being unaware of it almost cost Samsung this summer’s entire trial. The lede paragraph of a July 31 Law.com article about the suit began:
[In] one of this year’s most dramatic e-discovery decisions… Magistrate Judge Paul Grewal of the U.S. District Court for the Northern District of California ordered an adverse inference jury instruction against Samsung for failure to take adequate steps to prevent the destruction of relevant emails.
An “adverse inference jury instruction” is bad: it means the judge bad-mouthed Samsung straight to the jury for trying to conceal evidence. According to Law.com, the judge’s statements to the jury were:
- Samsung failed to prevent the destruction of relevant evidence for Apple’s use in this litigation;
- The evidence was destroyed because Samsung failed to meet its discovery obligations;
- The jury “may presume” both that the lost evidence would have been used at trial and that it would have been favorable to Apple.
Why did the judge get so pissed? Because Samsung wasn’t prepared for e-discovery. The company’s error was an expiration date on all corporate emails, which caused them to be deleted automatically every two weeks. The problem, the court said, was that Samsung knew as early as 2010 that it would be facing an Apple suit, and never turned off its auto-delete feature to make way for the e-discovery process. Destroying evidence this way is called “e-discovery spoliation.”
Even with a two-week auto-delete mechanism in its email system, Samsung still managed to supply the court with 12 million pages of documents from other sources, providing a benchmark for the sheer scale of information retrieval required by mega-lawsuits like this one.
Samsung lost anyway, but this “adverse inference instruction” might have sealed the outcome. The only thing that saved Samsung from an early and definitive loss was a subsequent decision by the court to sanction Apple for the exact same thing: e-discovery spoliation owing to Apple’s email quota policy, which effectively forces employees to manually delete older internal communications.
Don’t fear! Predictive coding is a killer technology even when you’re not being sued for billions. The goal of these projects is mass document review, which companies can use for almost anything. The technology is so versatile (and legally necessary) that Microsoft has built new e-discovery features into SharePoint 2013, making it possible for attorneys and other subject-matter experts to search across multiple SharePoint and Exchange repositories to find relevant documents.
Just this week at its Discover conference in Frankfurt, Germany, HP released an Internet appliance box for AppSystem, its ERP stack, which acts as a modular e-discovery component for corporate servers. Trade publication ChannelBuzz says:
While the value a company gets out of eDiscovery solutions varies widely depending on the litigiousness of a given country and/or vertical, it’s an attractive space for solution providers to play in, said Rafiq Mohammadi, general manager of HP Autonomy Promote, because of the margins involved with services around deployment, customization and optimization of eDiscovery. And it’s not an opportunity that depends entirely on company size–a fact that Mohammadi says he knows all too well from having been involved in a patent litigation when he was part of a five-person consulting organization earlier in his career.
Even with Microsoft and HP launching products, Looby says predictive coding (and the human process around it) are both still “untapped in the field of Information Governance.” He says most companies can use it to save bucks on storage without taking the legal risks that Apple and Samsung did.
“Companies can use it to identify and defensibly delete electronic documents,” he says. “Corporations are usually reluctant to delete anything because they may violate a legal hold or a law requiring document retention, but a computer model can be trained to delete documents that don’t resemble those requiring retention.” The benefit isn’t just cost savings on storage, but also cost savings on backup and easier “cloud” migrations away from data-bogged legacy systems.
In corporate M&A, Looby says, predictive coding can help companies find, review, and classify documents for the FTC, which reviews documents to assess whether a proposed merger will have anti-competitive effects, and ultimately harm the consumer. So how does it work?
To make its output more legally defensible, FTI’s predictive coding engine has two other components: a statistical analysis engine built by the firm’s in-house statistician, plus a “document mapper” technology that clusters documents with similar content. But if you’re not doing legal work, getting a predictive learning application off the ground is far easier thanks to an unusually named open-source library called Vowpal Wabbit.
VW, as it’s abbreviated, is the basis for FTI’s Predictive Discovery product, and it was originally developed at Yahoo Research and later Microsoft Research. Its purpose: to allow a small team of non-specialist engineers to build a fast-learning, scalable, web-based learning engine that could take text input in almost any format and rank the results using a “pairing engine.” The project, maintained by Microsoft’s John Langford, can employ a single inexpensive Linux machine to process tens of millions of documents. You can find the code and documentation in the project’s GitHub repository.
The process starts out the old-fashioned way: a team of attorneys pulls a sample set of documents from the whole, and reviews each one for relevance. The sample is large enough that the attorneys can be confident it is representative of the whole, within an acceptable tolerance of 1%.
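How big does such a sample have to be? The article doesn’t say how FTI sizes it, but a rough sketch is possible if we read the 1% tolerance as a margin of error on an estimated proportion. The 95% confidence level (z = 1.96), the worst-case proportion of 0.5, and the function name `sample_size` are all assumptions made for illustration, not details from FTI’s process.

```python
import math

def sample_size(population, margin=0.01, z=1.96, p=0.5):
    """Estimate how many documents to review by hand.

    Assumptions (not from the article): 95% confidence (z=1.96)
    and a worst-case relevant-document proportion of p=0.5.
    """
    # Sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite-population correction shrinks it slightly for real collections
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

if __name__ == "__main__":
    # A hypothetical collection of 10 million documents
    print(sample_size(10_000_000))
```

Under these assumptions the sample stays in the low thousands of documents even for a very large collection, because the required size depends far more on the tolerance than on the size of the whole.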
When each sample document has been declared relevant or irrelevant, the documents are fed into the predictive coding software, which looks at the whole sample and tries to figure out how the experts made their judgments. It does this by assigning a “weight” of importance to each document’s features: “tokens,” which are single keywords, and “hashes,” which are groups of one to three words. Features that tend to indicate a relevant document are assigned a positive weight, while features that tend to indicate an irrelevant document are assigned a negative weight.
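To make the idea of features concrete, here is a minimal sketch of pulling groups of one to three consecutive words out of a document. The tokenizing rule (lowercase, letters, digits, and apostrophes only) and the name `extract_features` are illustrative assumptions; the article does not describe how FTI’s product or Vowpal Wabbit actually splits text.

```python
import re

def extract_features(text, max_words=3):
    """Return every run of 1..max_words consecutive words in the text."""
    # Assumed tokenizer: lowercase, keep letters, digits and apostrophes
    words = re.findall(r"[a-z0-9']+", text.lower())
    features = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features

if __name__ == "__main__":
    for feature in extract_features("Copy the iPhone"):
        print(feature)
```

Each of these strings would then get its own weight, so a phrase can count for more (or less) than the individual words inside it.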
Then the software checks how well it measures up to the humans. Going back to the sample set, the software selects a document and sums the weight of its features. The weight is then used to guess whether the document is relevant or irrelevant to the case, and the software compares the guess to the attorney’s conclusion. If the guess was wrong, predictive coding software rethinks how much weight should be assigned to each feature. As this is done for each document in the sample set, the software continuously “learns” how to assign the correct weight values to achieve results that best reflect the experts’ judgment.
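The guess-compare-adjust loop can be sketched in a few lines. What follows is a simple mistake-driven update in the style of a perceptron, used here as a stand-in; it is not the algorithm Vowpal Wabbit or FTI’s product actually runs, and the sample documents, step size, and function names are invented for illustration.

```python
from collections import defaultdict

def features(text):
    # Simplest possible features: single lowercase words
    return text.lower().split()

def score(weights, text):
    # Sum the weights of every feature in the document
    return sum(weights[f] for f in features(text))

def train(labeled_docs, passes=5, step=0.1):
    """Nudge feature weights whenever the guess disagrees with the reviewer."""
    weights = defaultdict(float)
    for _ in range(passes):
        for text, relevant in labeled_docs:
            guess = score(weights, text) > 0
            if guess != relevant:
                # Wrong guess: push this document's features toward the right side
                delta = step if relevant else -step
                for f in features(text):
                    weights[f] += delta
    return weights

# Hypothetical attorney-reviewed sample: (document text, judged relevant?)
sample = [
    ("let us copy the iphone design", True),
    ("cafeteria menu for friday", False),
    ("copy the iphone icons", True),
    ("friday parking schedule", False),
]
weights = train(sample)

if __name__ == "__main__":
    for text, relevant in sample:
        print(text, "->", score(weights, text) > 0, "(reviewer said", relevant, ")")
```

The `passes` parameter simply repeats the walk through the sample, so weights that were adjusted late in one pass get re-tested against the earlier documents on the next.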
The program can be run over the whole sample multiple times to further “train” its guessing accuracy, and therein lies the real advantage over searching a group of documents by keywords. A keyword search looks for only a few terms, rather than looking at every feature and assessing its importance.
When the weights have been properly tweaked, the software pores over the entire sample again, using the new criteria to sift the relevant wheat from the chaff. The results still won’t match perfectly, but the errors can be adjusted. Let’s look back at weights: the total weight required for a document to be considered relevant, or the “judgment line,” is arbitrary. If a document needed a total weight greater than, say, zero to be deemed relevant, the software might select 90% of the documents deemed relevant by the experts, with the other 10% tossed in the irrelevant pile. The percentage of relevant documents selected is called “recall,” and tends to decrease as the judgment line is set at higher weight values. The percentage of selected documents that are actually relevant is called “precision,” and tends to increase as the judgment line is set higher.
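Both measures are easy to compute once every sample document has a total weight and an expert label. The sketch below uses made-up scores and labels; `precision_recall` is a hypothetical helper, not part of any product named in this article.

```python
def precision_recall(scores, labels, judgment_line=0.0):
    """Compute precision and recall when documents scoring above the line are selected."""
    selected = [s > judgment_line for s in scores]
    true_pos = sum(1 for sel, rel in zip(selected, labels) if sel and rel)
    n_selected = sum(selected)
    n_relevant = sum(labels)
    # Precision: share of selected documents that the experts called relevant
    precision = true_pos / n_selected if n_selected else 1.0
    # Recall: share of expert-relevant documents that were selected
    recall = true_pos / n_relevant if n_relevant else 1.0
    return precision, recall

# Invented total weights and expert judgments for six sample documents
scores = [0.9, 0.5, 0.2, 0.1, -0.3, -0.6]
labels = [True, True, False, True, False, False]

if __name__ == "__main__":
    for line in (0.0, 0.3):
        p, r = precision_recall(scores, labels, line)
        print(f"judgment line {line:+.1f}: precision {p:.0%}, recall {r:.0%}")
```

In this toy data, raising the line from 0 to +0.3 drops one irrelevant document (precision goes up) but also drops one relevant document (recall goes down).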
If we set the bar a little higher, perhaps at +0.3, we might change our outcome to 70% precision with 80% recall. There is usually a trade-off between the two, because they respond in opposite ways to the number of documents that the software selects. Predictive coding software will look at the results from placing the judgment line at every possible value, and choose the one that best fits the precision and recall parameters that the attorneys are asking for. Compared to other methods, though, predictive coding has higher precision and recall rates, and the trade-off can be easily examined and managed. By contrast, the precision and recall of a keyword search can only be adjusted through trial and error, with a human changing the inputs each time.
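Trying every possible judgment line is a small loop. In this sketch the attorneys’ requirement is assumed to be a minimum recall, and the code picks the line with the best precision that still meets it; that selection rule, the data, and the name `choose_judgment_line` are assumptions for illustration, since the article does not say how the software weighs the two measures.

```python
def choose_judgment_line(scores, labels, min_recall=0.8):
    """Return (line, precision, recall) with the best precision meeting min_recall.

    A document counts as selected when its score is at or above the line.
    Returns None if no candidate line satisfies the recall requirement.
    """
    n_relevant = sum(labels)
    best = None
    # Every distinct score is a candidate place to draw the line
    for line in sorted(set(scores)):
        selected = [s >= line for s in scores]
        true_pos = sum(1 for sel, rel in zip(selected, labels) if sel and rel)
        precision = true_pos / sum(selected)
        recall = true_pos / n_relevant if n_relevant else 1.0
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (line, precision, recall)
    return best

# Invented total weights and expert judgments for six sample documents
scores = [0.9, 0.5, 0.2, 0.1, -0.3, -0.6]
labels = [True, True, False, True, False, False]

if __name__ == "__main__":
    print(choose_judgment_line(scores, labels, min_recall=0.8))
    print(choose_judgment_line(scores, labels, min_recall=0.6))
```

Loosening the recall requirement lets the line move up, which is the trade-off described above made visible.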
Finally, the predictive coding software applies its refined weight values and judgment line to the entire collection of documents, reducing the number of documents that need to be examined by a human from, say, 10 million to as few as a few thousand. To be fully confident in the quality of the results, attorneys can look at sets of documents (usually a few thousand) from the relevant and irrelevant piles that the software generated, to evaluate how well the software met expectations.
Predictive coding is known in the annals of artificial intelligence as “supervised machine learning.” FTI can do it so effectively and defensibly because it adds in human training, human checking, and statistical mapping. “Could Watson do this job someday? Sure,” says Looby, referring to the natural-language IBM computer that beat top winners Brad Rutter and Ken Jennings at Jeopardy. “But you’d still have to train Watson.”
Thanks to Lance Johnson for his contributed reporting.
[Image: Flickr user Adam Ward]
Follow the author Chris Dannen on Twitter.