
Reprinted with permission from the February 18, 2014 "Predictive Coding Supplement" of The Legal Intelligencer. © 2014 ALM Media Properties, LLC. Further duplication without permission is prohibited. All rights reserved.
In the five-plus years since predictive coding first entered the vernacular of electronic discovery, the process has received remarkable attention from the e-discovery community. Nevertheless, many clients and their counsel remain wary of the technology. The following facts and fictions should quell the anxieties of the uninitiated, while also curbing the unbridled enthusiasm of those who expect it to work miracles.
Fiction: A Partner Must Code the Training Set
Without appropriate guidance, predictive coding technologies cannot fulfill their promise to efficiently categorize vast volumes of documents. But consistency in training is just as important as subject-matter expertise, and coding the entire seed set can be a taxing endeavor for a single reviewer, particularly where additional training sets are needed to address shifts in the scope of responsiveness or the introduction of supplemental data sets. Realistically, finding a knowledgeable attorney (perhaps a mid-level to senior associate) with the time to dedicate to the exercise is more important than planting a rainmaker in the review room.
Where even that is not feasible, it may be possible to exploit a sufficiently large set of previously coded documents concerning the same or similar subject matter. More and more predictive coding tools are able to take precoded document sets — as opposed to random, judgmental or active-learning samples — as training inputs. The value of such coding for training can be enhanced by limiting the seed set to documents handled by the most accurate reviewers on the prior team or taken from the most substantively similar phase of the earlier review. Then, as in a normal predictive coding process, rounds of quality control sampling can be used to align the engine’s training with the contours of the current matter.
No matter the source of the substantive training, it would be a mistake to ignore other advanced analytical tools — including e-mail threading, near-duplicate detection and concept clustering — to generate more and better candidates for inclusion in the seed set. These technologies can magnify the efforts of a single subject-matter expert while mitigating the less consistent results inherent in using multiple trainers over time.
Fact: No Two Predictive Coding Tools Are Exactly Alike
Notwithstanding the tendency of most publications to refer generically to predictive coding and not to particular implementations, there are fundamental differences between various vendors’ predictive coding platforms, and the algorithm in the proverbial black box is perhaps the least significant. Among the more noteworthy distinctions to users is how the different engines generate new document samples for iterative training. Some pundits reject any predictive coding tool that does not incorporate active learning principles (where the tool selects particular documents in an effort to refine its understanding of responsiveness), while others are vocal adherents to the school of iterative random sampling. Another disparity between platforms concerns whether document families (e.g., an e-mail and its attachments) are initially coded and subsequently categorized as a unit or instead as independent documents each assessed on its own content.
Whichever predictive coding engine is chosen, practitioners need to recognize and implement the requisite adjustments to the optimal workflow for the platform in use, from initial training and refinement to final validation and everything in between. Ignoring these idiosyncrasies could result in the application of an unsuitable e-discovery protocol that imposes either unattainable or inappropriate transparency and validation obligations. It could also make it impossible to achieve acceptable levels of recall and precision, cannibalizing the desired cost savings in the process.
Fiction: It Replaces the Human Element
Mark Twain might say that rumors of the demise of the human document reviewer have been greatly exaggerated. Indeed, predictive coding is designed to leverage attorney expertise, not replace it. In the ordinary course, a knowledgeable case attorney is tasked with conveying his or her understanding of responsiveness to the predictive coding engine by tagging a seed set of representative documents. In addition, a defensible predictive coding process customarily requires that attorneys with a comprehension of the claims and defenses of the case perform certain quality control steps against a sampling of the resulting documents to verify or correct the engine’s predictions.
Predictive coding also has practical limitations in areas in which trained document reviewers may be better positioned to excel. For example, predictive coding algorithms can struggle with highly specific document requests that require splitting hairs on responsiveness calls, particularly where the production of even a small number of false-positive documents could be damaging. Moreover, most predictive coding engines have yet to demonstrate reliable results in identifying privileged, highly confidential or “hot” documents. Even if the predictive coding workflow is confined to filtering out the clearly irrelevant material, doing so still permits the attorney review team to focus its efforts on the documents with highest value to the case. The convergence of predictive coding technology with human expertise invariably yields the best outcomes.
Fact: Final Cost Savings May Be Overblown
The primary allure of predictive coding to most clients rests with the potential cost savings associated with reviewing fewer documents than in a traditional linear review. Less often discussed, though, is the potential for those cost savings to be undermined by common side effects of employing predictive coding. Setting aside the technology charges associated with running predictive coding against the document population, clients may incur substantial upfront costs for outside counsel to negotiate predictive coding protocols, followed thereafter by mounting fees to resolve potential disputes — all at more seasoned attorney, not contract attorney, rates.
Additionally, parties facing document populations dominated by Excel spreadsheets, structured data exports and multimedia content may find predictive coding less effective on those document types, triggering the very attorney review fees they were looking to avoid. Privilege review and deposition preparation may also mandate attorney review. The cost-benefit analysis for using predictive coding should account for these variables.
Fiction: Search Terms Are Extinct
One can argue that, where properly administered, predictive coding offers an objectively superior means of culling a large document set compared to traditional search terms. A colorable counterargument suggests that if search terms were subjected to the same rigors of quality control and validation as are routinely applied to predictive coding, then search terms would compare more favorably. But a direct apples-to-apples comparison of the two approaches is not only inherently flawed, it is also unnecessary, because the two can coexist quite productively. Courts have already endorsed the use of search terms as a precursor to employing predictive coding, as in In re Biomet M2a Magnum Hip Implant Products Liability Litigation, No. 3:12-MD-2391, 2013 U.S. Dist. LEXIS 84440, at *5-8 (N.D. Ind. Apr. 18, 2013). Also, certain predictive coding tools encourage the use of judgmental sampling — using keywords alongside other faceted culling techniques — to isolate highly relevant documents for inclusion in the seed set.
Furthermore, search terms can be used very effectively to help validate predictive coding results. For instance, documents predicted to be nonresponsive can be searched for keywords that correlated highly with responsive or hot documents. Search terms also still have a place in identifying potentially privileged documents for attorney review. And in the event an imminent production deadline provides insufficient time to train a predictive coding engine, targeted search terms may suffice to identify a likely responsive production set to be supplemented by post-production predictive coding and other analytics.
Fact: Comfort with Statistical Concepts Is Needed
A successful application of predictive coding does not require a degree in statistics. Nonetheless, practitioners would be well advised to have a lawyer on the team who is conversant in the handful of basic statistical concepts that play a central role in the defensibility of the process. Having such skills on hand may even be an ethical obligation, as Comment 8 to Rule 1.1 of the recently amended Model Rules of Professional Conduct advises that lawyers “keep abreast of changes in the law ... including the benefits and risks associated with relevant technology.”
While a thorough description of the statistical concepts underpinning predictive coding is beyond the scope of this article, there are a few important terms to understand. Depending upon the tool being used, random sampling, judgmental sampling, active learning or a hybrid approach may be used to assemble the seed set of documents that trains the predictive engine. And the performance of the algorithm is typically measured by recall (how well the predictive coding algorithm is locating all of the sought-after documents) and precision (how well it is locating only the desired documents). Even the arithmetically challenged among us can and should gain comfort with these terms before embarking on a predictive coding-driven document review.
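For readers who prefer to see the arithmetic, recall and precision reduce to two simple ratios. The short sketch below uses entirely hypothetical numbers (not drawn from any actual matter) to show how a validation sample comparing the engine’s predictions against attorney coding yields the two measures:

```python
# Illustrative sketch with hypothetical figures: computing recall and
# precision from a validation sample in a predictive coding review.

def recall(true_positives: int, false_negatives: int) -> float:
    """Share of all truly responsive documents that the engine found."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives: int, false_positives: int) -> float:
    """Share of the engine's 'responsive' calls that were actually responsive."""
    return true_positives / (true_positives + false_positives)

# Suppose sampling indicates 800 responsive documents were correctly
# predicted, 200 responsive documents were missed, and 100 nonresponsive
# documents were wrongly flagged as responsive.
tp, fn, fp = 800, 200, 100

print(f"Recall:    {recall(tp, fn):.0%}")    # 800 of 1,000 responsive docs found
print(f"Precision: {precision(tp, fp):.1%}")  # 800 of 900 flagged docs were responsive
```

In this hypothetical, recall is 80 percent (the engine found 800 of the 1,000 truly responsive documents) while precision is roughly 89 percent (800 of the 900 documents it flagged were in fact responsive) — a reminder that the two measures answer different questions and often trade off against each other.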
Armed with these facts and relieved of these fictions, practitioners can harness the power of predictive coding with more realistic expectations and a greater likelihood of success.
Jason Lichter and Michael I. Frankel
The material in this publication was created as of the date set forth above and is based on laws, court decisions, administrative rulings and congressional materials that existed at that time, and should not be construed as legal advice or legal opinions on specific facts. The information in this publication is not intended to create, and the transmission and receipt of it does not constitute, a lawyer-client relationship.