• August 8, 2016

In July, I posted on “computer assisted review” (CAR). After that, I had the pleasure of a meeting with Ben Legatt of Shepherd Data Services. Shepherd Data offers a “CAR product” and Ben offered to discuss how it works. The following is an edited version of my talk with Ben about some of the mechanics of how CAR works.

[Full Disclosure: Shepherd Data is a sponsor of Minnesota Litigator. This post, however, is NOT PAID ADVERTISING. Shepherd Data neither paid for this post nor requested that Minnesota Litigator write this post.]

ML: Help me understand how “CAR,” or computer-assisted review, works.

Ben Legatt: Essentially, CAR software takes the entire universe of documents, or the documents that you actually want to analyze, and it treats all the documents as “a bag of words,” you might say. It doesn’t care what language documents are in. It doesn’t actually understand the words at all, but what it does is it looks at the relationship of words to other words in your data set and the proximity of different words to each other.

It creates essentially a matrix of concepts. What it can then do is reorganize, in a way, all the case documents in your database by how conceptually similar they are. The concepts are derived from that matrix.

Documents where the concepts are similar are grouped together in the matrix, in a sort of three-dimensional matrix. The documents that are less conceptually similar are grouped farther apart from each other in the matrix in a different area. Essentially what it does is it gives you different ways to search for your documents by concept as opposed to just what happens to be on the face of the document.
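
[To make the “bag of words” and concept-matrix idea concrete, here is a minimal Python sketch using latent semantic analysis, a common technique behind concept-analytics tools. Relativity’s actual implementation is proprietary; the documents below are hypothetical.]

```python
# Illustrative only: a tiny "bag of words" concept space built with
# latent semantic analysis (LSA). The document texts are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The widget gearbox failed during stress testing in March.",
    "QA reported a manufacturing defect in the widget gearbox.",
    "Anyone have extra Super Bowl tickets for Sunday?",
]

# 1. Treat each document as a bag of words -- no grammar, no word order.
term_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# 2. Reduce the term matrix to a small number of latent "concepts".
#    Real systems use many more dimensions and far more documents.
concept_matrix = TruncatedSVD(n_components=2, random_state=0).fit_transform(term_matrix)

# 3. Documents that share concepts sit close together in this space.
print(cosine_similarity(concept_matrix))
```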

ML: I’ve actually seen this done in visual form – a graphic looking like a network, a spider web, or a “concept map,” where similar documents are grouped and similar groups are closer to one another. Irrelevant emails about Super Bowl tickets or whatnot are grouped at the outside, and documents about problems with such-and-such a widget or manufacturing defect – the subject of the litigation – are centrally grouped.

Ben Legatt: That’s right.

ML: Shepherd Data offers a CAR product through Relativity, right?

Ben Legatt: Relativity Analytics is the main tool. Then Relativity Analytics has a bunch of other tools within it. One of them is the concept analytics. The concept analytics allows you to use CAR. Relativity calls it Relativity-assisted review. It also gives you other tools like clustering. Clustering is a way to look at documents by how conceptually similar they are. They’re concept clusters. You could do this for the entire database or you could do this for a subset. You can continually re-cluster these if you want.

This is really useful to get a high-level overview of what type of documents you have and what’s related to it. If you have no idea what you’ve collected from your client, this is a great way to start poking around. These clusters have sub-clusters within them.
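
[A rough sketch of the clustering idea under the same assumptions: k-means run over LSA concept vectors, with hypothetical documents and toy-sized parameters. Relativity’s clustering algorithm may differ.]

```python
# Illustrative only: concept clustering via k-means over LSA vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "Widget gearbox failed stress testing in March.",
    "QA flagged a manufacturing defect in the gearbox.",
    "Gearbox tolerances were out of spec on line 4.",
    "Who has extra Super Bowl tickets for Sunday?",
    "Lunch order for the Super Bowl party: pizza or sandwiches?",
    "Reminder: Super Bowl potluck signup sheet is in the break room.",
]

term_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)
concept_vectors = TruncatedSVD(n_components=2, random_state=0).fit_transform(term_matrix)

# Two concept clusters: one about the gearbox defect, one about the party.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(concept_vectors)
print(labels)  # e.g. [0 0 0 1 1 1]
```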

ML: If the analytics tool does not even understand any of the language in the documents in the database, how does it name all of these clusters and subclusters?

Ben Legatt: That’s a really good question. Basically, programmers have written algorithms that generate the cluster names. There might be a cluster just called “Rubarski,” say, and, by looking at it, you won’t immediately know whether that is the sender’s name in a group of documents, the subject discussed in the documents, the name of a loan file, or a competing business. It could be anything.
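
[One common, generic way to name clusters, offered here only as an illustration, is to pick the highest-weighted terms in each cluster; Relativity’s naming algorithm may work differently.]

```python
# Illustrative only: label a cluster with its highest-weighted terms.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

cluster_docs = [  # hypothetical documents assigned to a single cluster
    "Rubarski loan file review for the Rubarski account.",
    "Updated terms on the Rubarski loan, see attached.",
]

tfidf = TfidfVectorizer(stop_words="english")
weights = np.asarray(tfidf.fit_transform(cluster_docs).sum(axis=0)).ravel()
terms = tfidf.get_feature_names_out()

# The top-weighted term(s) become the cluster label -- here, "rubarski".
print(terms[weights.argsort()[::-1][:3]])
```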

ML: Relativity Analytics is an add-on to your basic Relativity product that can be used for document review, correct? It is something you offer as an add-on for more money?

Ben Legatt: Yes. You always have access to use it, and you can certainly use it. The concept part, I think, is really only useful if you have a data set of a certain size. I would say the bare minimum would be about 50,000 documents. The reason for that is that if you don’t have enough documents to begin with, you don’t have enough conceptual richness to train the system and make it understand the concepts in there. Conceptually rich documents are essential to these systems working properly. If your data, by and large, is, say, spreadsheets or databases, these systems will not work. Even though these analytics systems do not actually “understand” language, they are based on this idea of “concepts” or “ideas” expressed in the data set.

ML: Relativity Analytics is something that you can add on to your normal Relativity use, and it costs extra based on the size of the database (per gigabyte of data)?

Ben Legatt: Yes.

ML: Can law firms use Shepherd Data’s Relativity without the CAR/analytics piece and, if the data becomes too massive and unmanageable over time, upgrade to Relativity Analytics?

Ben Legatt: Yes. That’s one thing that people almost always do. The way that we do it is, if you decide to use Relativity Analytics, we include all the feature sets that you would want in one price.

ML: What does that include?

Ben Legatt: That includes the concept searching. It includes the clustering. With concept searching, you can take a block of text, and you can essentially run that as a search where you’re not necessarily searching for the exact words, as you would in the keyword searches you’re very familiar with. You’re actually searching for the concept that’s contained within that block of text.

Or you could take a document, and you can say, “I want to find documents that are similar to this. Not necessarily containing the same words, but containing the same concepts.” You can run a concept search on a document and bring back all documents that are conceptually similar.
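
[A minimal sketch of the concept-search idea: embed the query text into the same concept space as the documents and rank by cosine similarity. The documents and query are hypothetical, and this is the general technique, not Relativity’s implementation.]

```python
# Illustrative only: a concept search ranks documents by how close they
# sit to the query text in the shared concept space, not by keyword hits.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The gearbox seized during the March stress test.",
    "Minutes from the Super Bowl party planning meeting.",
    "Engineering memo: tolerances on the widget drive assembly.",
]
query = "problems with the widget drive mechanism failing under load"

tfidf = TfidfVectorizer(stop_words="english").fit(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(tfidf.transform(docs))

doc_vectors = lsa.transform(tfidf.transform(docs))
query_vector = lsa.transform(tfidf.transform([query]))

# Highest score = most conceptually similar document to the query text.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
print(max(zip(scores, docs)))
```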

You also get email threading. That doesn’t actually use the concept index, but it uses a different analysis of the email headers and the data within the emails to thread emails together, even if they’re from different parts of the database, so that you can look at the entire email thread. This can save a huge amount of time and make review much more efficient.
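
[Email threading is conventionally driven by headers such as Message-ID and In-Reply-To. The sketch below shows only that header-based idea with two hypothetical messages; Relativity’s threading also analyzes the data within the emails, as Ben notes.]

```python
# Illustrative only: threading two hypothetical emails by their
# In-Reply-To headers.
from email import message_from_string

raw_emails = [
    "Message-ID: <a@example.com>\nSubject: Gearbox issue\n\nFirst message.",
    "Message-ID: <b@example.com>\nIn-Reply-To: <a@example.com>\n"
    "Subject: Re: Gearbox issue\n\nReply.",
]

threads = {}   # thread root Message-ID -> subjects of messages in the thread
root_of = {}   # Message-ID -> its thread root

for raw in raw_emails:
    msg = message_from_string(raw)
    parent = msg.get("In-Reply-To")
    root = root_of.get(parent, msg["Message-ID"])  # inherit the parent's root
    root_of[msg["Message-ID"]] = root
    threads.setdefault(root, []).append(msg["Subject"])

print(threads)  # {'<a@example.com>': ['Gearbox issue', 'Re: Gearbox issue']}
```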

Ben Legatt: We provide “near duplicate analysis,” too. With near duplicate analysis, you may have documents that have been … In fact, we have done that for you in the past many times, where you get a data set in and you want to de-dupe the documents. You only want to look at one version of the document, but you may have 16 different versions of a contract that look very similar, and you want to be able to look at them together and see what the differences are. We can apply near duplicate analysis to that, and you can look at a group of these similar documents together.

Instead of finding the draft version of the agreement and then a subsequent version of the agreement later on in the database, you can look at these all together.
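
[A toy illustration of the near-duplicate idea using a simple similarity ratio; production near-dupe engines use more scalable techniques such as shingling, but the grouping concept is the same. The draft texts are hypothetical.]

```python
# Illustrative only: a simple similarity ratio between two draft versions.
from difflib import SequenceMatcher

draft_v1 = "This Agreement is made by Acme Corp and Widget LLC on March 1, 2016."
draft_v2 = "This Agreement is made by Acme Corp and Widget LLC on April 7, 2016."

similarity = SequenceMatcher(None, draft_v1, draft_v2).ratio()
print(f"{similarity:.0%} similar")  # drafts this close would be grouped together
```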

Ben Legatt: The final thing that you get with this is language identification. If you have a data set that contains many different languages, it will identify the primary language or languages that exist within each document. Then you can look for all the Korean-language documents or all the Japanese-language documents. Then you could have somebody who is proficient in those languages review those without having to figure out where they are in the database.
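
[As an illustration only, a third-party library such as langdetect can tag each document with its likely primary language; it is not the engine Relativity uses, and the documents here are hypothetical.]

```python
# Illustrative only: tagging documents with their primary language using
# the third-party langdetect package (pip install langdetect).
from langdetect import detect

docs = {
    "doc_001": "The gearbox failed during the March acceptance test.",
    "doc_002": "기어박스가 3월 시험 중에 고장났습니다.",
}

for doc_id, text in docs.items():
    print(doc_id, detect(text))  # e.g. doc_001 en, doc_002 ko
```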

ML:  Have you encountered any pushback or any negative responses to this product?

Ben Legatt: The negative responses that we’ve received are not related to the product as much as to the workflow, which is the biggest drawback, and something we haven’t really talked about here yet. In addition to all this, what you get is the ability to do the computer-assisted review process, which is essentially a review process, very similar to what you’re used to in some ways, where you’re looking at groups of documents and making a decision about them. Instead of looking at all the documents in the database one by one, it sets up batches for you, usually based on either a fixed number, a percentage, or a statistical sample size.

Let’s say you’re looking at 500,000 documents. It’ll give you several iterations of maybe 1,500 documents to look at. You’ll just make a simple decision. Yes, no, yes, no. Is it relevant? Is it not relevant? Do I use this as an example? Or do I not use it as an example? In other words, a document might be relevant, but it might have three words, so it’s not conceptually rich enough to pull out other documents. Then you go through several rounds of that, and it will pull out other documents that are conceptually similar. Instead of looking at half a million documents, you can look at a much, much, much smaller percentage and have a high degree of certainty that you’re producing the relevant data set.
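
[A very rough sketch of what one training round does conceptually: fit a relevance model on the documents coded so far and score the rest of the collection. All texts and labels here are hypothetical, and Relativity-assisted review’s internals differ.]

```python
# Illustrative only: the gist of one training round -- fit a relevance
# model on what's been coded so far, then score the uncoded documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

coded_texts = ["gearbox defect memo", "super bowl party invite",
               "widget failure report", "lunch order thread"]
coded_labels = [1, 0, 1, 0]  # 1 = relevant, 0 = not relevant
uncoded_texts = ["follow-up on the gearbox warranty claim",
                 "parking ramp closed on friday"]

tfidf = TfidfVectorizer().fit(coded_texts + uncoded_texts)
model = LogisticRegression().fit(tfidf.transform(coded_texts), coded_labels)

# Scores near 1.0 are predicted relevant; the next batch to review would
# be drawn from these predictions (or from a validation sample).
scores = model.predict_proba(tfidf.transform(uncoded_texts))[:, 1]
for text, score in zip(uncoded_texts, scores):
    print(f"{score:.2f}  {text}")
```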

Ben Legatt: The biggest drawback to that, though, and the biggest complaint that I’ve seen, at least initially, a few years ago when we really started to use this, was that, especially for larger law firms who are used to having associates or paralegals doing a lot of the initial document review, the success or failure of assisted review is really dependent on the knowledge of the attorney who’s doing the review. It’s a different mindset. You really want to have the attorney or attorneys, and it shouldn’t be more than a few people, or sometimes actually the subject matter experts, which could be your client, if you want to open that up. The people who really understand the case need to be the ones doing the review. That’s the biggest hurdle, I would say, for people to use it.

When you’re doing a computer-assisted review, you’re teaching the computer what’s relevant and what’s not relevant. Each of these batches, each of these iterations that we’re reviewing, is essentially training the computer on what’s relevant and what’s not relevant. There are what are called training rounds, and then there are quality control (“QC”) rounds as well. What you’ll do is you’ll train the computer, and you get reports back. Once you’re confident in the confidence level of what it’s done, then you can do a few QC rounds to validate your choices. It’ll give you documents that it thinks are relevant, and it’s going to see if you think that they’re relevant too. If you have a high degree of overturn on that, then you’re going to need to continue to do some more work.
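
[In simplified form, the QC “overturn” check Ben describes amounts to comparing the system’s calls against the reviewer’s calls on a QC sample; the coding decisions below are hypothetical.]

```python
# Illustrative only: the overturn rate on a QC round is the share of
# documents where the reviewer disagreed with the system's call.
system_calls   = ["relevant", "relevant", "not relevant", "relevant", "not relevant"]
reviewer_calls = ["relevant", "not relevant", "not relevant", "relevant", "not relevant"]

overturns = sum(s != r for s, r in zip(system_calls, reviewer_calls))
overturn_rate = overturns / len(system_calls)
print(f"Overturn rate: {overturn_rate:.0%}")  # a high rate means more training rounds
```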

That’s really where I see the biggest problems with it with certain teams. I had a case a couple years ago where we were getting a lot of “overturns.” In other words, the computer had been fed documents coded as relevant or non-relevant, and the reviewers were overturning those determinations as they reviewed. Reviewers were saying, “No. This is not relevant.” Of course, that confused the system.

In one case, what happened was, the review ended up taking longer than they had wanted because there were all these overturns. Eventually they figured out that the two lead lawyers who were “teaching” the analytic system had different views of what was relevant and what wasn’t. One of them had a broader understanding of what relevance was, and the other one had a narrower view of it. They were constantly overturning each other’s decisions. Once they figured that out, they got together and said, “Okay, this is actually how we’re going to treat these.” Then there were only a couple more rounds they had to do, and they were able to get it done.

ML: You say “rounds,” and that’s the training rounds to train the CAR analytics system?

Ben Legatt: Yes. Once it’s done, then you get a report. You predetermine the degrees of confidence and the margin of error. There are certain preset margins of error and degrees of confidence that the system will recommend, but then you can tweak those. Then when you’re done, you’ve got your data set. Then what you could do is maybe a little bit more QC on your own just to make sure. You could do some quick privilege checks, you know, just to make sure that you’re not producing privileged documents that the system might not have understood to be privileged. Then you’re good to go, and you can produce your data set.
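
[The confidence level and margin of error map to the standard statistical sample-size formula. The sketch below, with a finite-population correction and example settings of 95% confidence and a 2.5% margin, lands in the neighborhood of the 1,500-document batches mentioned above for a 500,000-document collection; the exact settings Relativity uses may differ.]

```python
# Illustrative only: standard sample-size calculation behind
# confidence-level / margin-of-error settings, with a finite-population
# correction. The defaults below are an example, not Relativity's settings.
import math

def sample_size(population, z=1.96, margin=0.025, p=0.5):
    # z = 1.96 corresponds to 95% confidence; p = 0.5 is the most
    # conservative assumption about the prevalence of relevant documents.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(500_000))  # about 1,530 documents for a 500,000-document set
```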

We just recently did a case, it was just a few weeks ago, where we had almost a half a million documents. It worked really well. We had people here at Shepherd coming in from an out of town law firm. The two main attorneys were working on it here. They also had two people from their client that they were actually working on the review. They were trained on how to use Relativity, so they were really good at that. All they were doing is just going through their rounds and clicking yes, no, yes, no. They were able to get their production set ready. I think it was about three days.

ML: That was 500,000 of their own documents?

Ben Legatt: Of their own client’s documents.

ML:  Is this used on the other side’s documents in your experience?

Ben Legatt: It can be. The way that it’s done well in reviewing the other side’s documents, I guess, is that you can do an issue review. You can set up a list of issues, and you can train the system on which documents are responsive to one or more issues. It’ll bring back other documents that are conceptually similar to those. It’ll essentially categorize your whole data set, once you’re done with it, with those issues. Then later on you can do prep for depositions and for trial and for things like that on the other side’s documents. That’s a way I think it really works well. Clustering is another great tool to use for reviewing the other side’s documents. You cluster them, and then you get a good sense of what they are.

ML: In terms of the training and the QC, are you or someone from Shepherd Data normally involved in that and billing at an hourly rate?

Ben Legatt: Yes. We’re usually involved with the technical aspects of it and then making sure that the process that we run makes sense for the project. We’re typically not involved in actually reviewing the documents.

ML: In terms of training the analytics system, are you involved in that?

Ben Legatt: We’re involved in the technical process.

ML: When you say the technical process, I guess I’m a little unclear what that means. Let’s say I collect my documents from my client. It’s twenty gigabytes of data. I say to myself, “This is way too much for my firm to review in the time we have, and I would like the data to be organized in some way.” I meet with Shepherd Data. We talk through the case and the concepts of the case and the issues in the case, and you help me devise … What happens next?

Ben Legatt: If you wanted to do assisted review, we’d set up a project for you. We would work together to set up what we call a “seed set” of documents, a pre-coded seed set. The lawyer or law firm certainly could do that itself. In fact, in the case that I was just talking about, they did start out doing an “eyes on every document” review. They reviewed several thousand documents. What we did is we used the results of that review to initially train the analytics system. Really, all that does is maybe save you a round or two. It doesn’t save you a ton of time to do that, but you can use the results, your own work product, to help train the system. The best way of doing that, though, is to find a group of those documents that are the most conceptually rich to feed the system.

Let’s say you have 1,000 emails, and they all say one or two sentences. Those are not good examples to use as seed set documents. You want really good conceptually rich documents that are good examples of relevant documents, as well as good examples of non-relevant documents: conceptually rich documents that really are not relevant to the case, but that will train the system to know that those documents are not relevant, and that you’ve already coded. You can use those to feed the system and start what’s called a pre-coded seed set round. That just saves you a round of review if you’ve already done the work.
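
[A simple illustration of one way to screen a pre-coded seed set for conceptual richness: drop documents that are too short to carry much concept signal. The 100-word cutoff and the documents themselves are hypothetical, not a Relativity setting.]

```python
# Illustrative only: a crude "conceptual richness" screen for a
# pre-coded seed set -- drop documents too short to carry much signal.
coded_docs = {
    "doc_A": ("Sounds good, thanks!", "not relevant"),
    "doc_B": ("Detailed engineering memo describing the gearbox failure "
              "and the March acceptance test results. " * 15, "relevant"),
}

seed_set = {doc_id: designation
            for doc_id, (text, designation) in coded_docs.items()
            if len(text.split()) >= 100}

print(seed_set)  # only the longer, conceptually richer document makes the cut
```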

ML: I take it that that original seed set or whatever could include documents that aren’t even in the set of documents you’re looking at, for example, your complaint.

Ben Legatt: Not typically, no.

ML: It’s theoretically possible?

Ben Legatt: It’s theoretically possible. The problem with that is, if you did that, then it’s conceivable that if you had other pleadings type of documents, it would think that those are relevant as well.

ML: I’m just thinking that your complaint, more than anything, talks about what you’re looking for and what you’re going to need to find.

Ben Legatt: It certainly does. One thing you could do, I guess, in that case, is, if you had a paragraph in your pleading that really had a relevant concept, especially if it contained some sort of a quote from some email that you knew existed, then you could apply just that excerpt as an example. The other thing you could do with Relativity-assisted review during that process where you’re training the system … Besides just deciding, “Yes, it’s relevant. No, it’s not relevant,” you also decide whether it’s a good example, whether the whole document is a good example to use or not. By default, most of them are, and then you can turn it off if it’s not a good example to train the system but it’s still relevant.

The next thing you can do, though, is, if there is just one paragraph in a document that is the key, you can click and drag and copy and paste that text into a little box and use that excerpt as what the system is trained on for that document. You’re telling the system that this is what is really relevant in this document.
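
[A small sketch of how an excerpt-based training example might be represented: the key paragraph, rather than the full document text, is what the system trains on. The field names and identifier are hypothetical.]

```python
# Illustrative only: an excerpt-based training example, where the pasted
# excerpt stands in for the full document text during training.
training_example = {
    "doc_id": "doc_123",            # hypothetical identifier
    "designation": "relevant",
    "use_as_example": True,
    "text_for_training": (
        "The gearbox seized at 4,000 RPM during the March acceptance test."
    ),  # excerpt pasted from the document, rather than its full text
}
print(training_example["text_for_training"])
```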
