Industry Leading
eDiscovery Insight

Learn from renowned eDiscovery thought leaders


Lexbe’s Erin Derby, Published in Paralegal Today

Erin Derby, a Certified eDiscovery Specialist (CEDS) and member of the Lexbe technical services team, was recently published in Paralegal Today. Her article, Finding the Needle in a Data Haystack, draws on Erin’s expertise in advanced search methodology. It offers techniques for culling data, constructing quality search queries, and uncovering personally identifiable information (PII), and it explains how to keep records of search processes.

Erin has presented two webinars for Lexbe: Best Practices: eDiscovery Search and Best Practices to Avoid Missing Key Evidence in Large Doc Review (Uber Index).

You can read Erin’s article here.

Understanding your eDiscovery Index and how it finds (or misses) evidence

How your eDiscovery platform parses and organizes your electronically stored evidence can be the difference between finding and missing that smoking gun, or worse, unwittingly handing a smoking gun to opposing counsel. Pulling back the curtain on how an eDiscovery platform ingests electronically stored documents and makes their text searchable reveals the places where evidence may be hiding. This article explains indexing, breaks down the types of search indexes used in eDiscovery software platforms, discusses the pros and cons of each, and offers solutions to help ensure that you do not miss crucial evidence.

Indexing occurs during the upload of your documents to your eDiscovery review platform. A number of processes run that separate and organize your data. The text, in particular, is extracted from your documents and stored in a database, or index. When you enter a search query, your software does not review each document looking for the word; that could take hours or days. Rather, your software refers to the index (just as you would in a textbook) to quickly pull the relevant documents for your review. The process by which text is extracted from your documents and placed into that index is critical to the quality of search results.
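To make the textbook analogy concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not how any particular eDiscovery platform is built. The document IDs, the sample text, and the function names (tokenize, build_index, lookup) are all made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, term):
    """Return the IDs of documents containing the term, without rescanning any text."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical extracted text, keyed by made-up document IDs.
documents = {
    "DOC-001": "Load growth model for the November forecast",
    "DOC-002": "Meeting notes on the growth forecast",
    "DOC-003": "Invoice for scanning services",
}

index = build_index(documents)
print(lookup(index, "growth"))   # ['DOC-001', 'DOC-002']
print(lookup(index, "Invoice"))  # ['DOC-003']
```

The index is built once, at upload. Each search is then a single dictionary lookup, which is why it returns quickly. It also shows why extraction quality matters: a word that never reached the index cannot be found.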

There are two basic types of index used in eDiscovery software platforms: an OCR Index and a Text-based (also called Native extraction) Index.

OCR stands for Optical Character Recognition. In this process, the page image comes either from a scanned document or from a native document rendered through a virtual print driver. Specialty OCR software then recognizes alphanumeric text patterns on that image. For example, an uploaded Word doc would be “printed” within the software engine, and the text that appears on that virtual print would be lifted off the page and indexed.

Text-based Indexing is also called Native Extraction Indexing because, instead of processing the document as a printed page, it looks at all of the underlying code and data within a document. Where OCR sees the document as a print, Text-based indexing lifts the hood and extracts all of the computer-embedded text in a file. It also captures data that you do not see, such as comments.

The pros of one indexing approach are the cons of the other. Specifically, an OCR-based index may miss hidden fields, such as hidden columns on an Excel spreadsheet, while a text-based index would not. Conversely, a Native extraction-based index will not read (index) the text on an image, including scanned documents and image-only PDFs, where an OCR index will.

This is an example of a native PowerPoint document. When you receive this doc as a .ppt file, an OCR-based index would create a virtual print of each slide and lift any text that appears on that print for indexing. The embedded images with text, like this chart titled “Load Growth Model”, would have all text that appears on the chart indexed. Speaker notes, however, like this one regarding “November Data”, could be missed, as notes do not show on a print by default.

Conversely, a native extraction-based index would only recognize the .jpg title of the image of the chart and index that file name as text. It cannot “read” an image (as OCR can) and so none of the text appearing on the chart would be indexed. It would, however, pick up the speaker notes regarding November Data. When you search for the company name “CAISO” an OCR-based Index would retrieve this document but a Native Extraction-based index would not. When you search for “November Data” the Native Index would retrieve this document, but an OCR index would miss it. If you were to perform a Boolean search for “CAISO AND November Data” neither index alone would return this document as responsive as it would only see one term or the other.

Some modern eDiscovery software providers offer both indexes. However, they are siloed, so you would have to run your entire search twice, once through each index. This not only doubles your search time but still leaves you vulnerable to missing evidence when you use Boolean searches to narrow results. Some eDiscovery vendors will instruct you to write additional language into your ESI order in an attempt to mitigate the loss of potential evidence. Unfortunately, the more complex an ESI request, the more likely it is that mistakes will be made and evidence missed.

Lexbe has solved this false ‘index dilemma’ by creating the first concatenated eDiscovery search index, our Uber-Index℠. At ingestion, documents are run through both OCR and Native extraction indexing simultaneously. Then the OCR and Native-Extracted indices are compiled into one single, searchable database. All text is captured by these two complementary processes, and all evidence is searchable.
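The difference between siloed indexes and a single combined one can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Lexbe’s actual implementation. The sample text for each index, the document ID DECK-1, and the helper names (build_index, merge_indexes, search_all) are hypothetical. The phrase “November Data” is treated as two separate AND terms for simplicity.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts):
    """Map each term to the set of document IDs whose extracted text contains it."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def merge_indexes(*indexes):
    """Combine several indexes into one by taking the union of each term's documents."""
    merged = defaultdict(set)
    for index in indexes:
        for term, doc_ids in index.items():
            merged[term] |= doc_ids
    return merged


def search_all(index, terms):
    """Boolean AND: return the documents that contain every term."""
    hits = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*hits)) if hits else []


# Hypothetical text that each process might capture from the same slide deck.
ocr_text = {"DECK-1": "Load Growth Model CAISO"}     # text visible on the virtual print
native_text = {"DECK-1": "chart.jpg November Data"}  # image file name and speaker notes

ocr_index = build_index(ocr_text)
native_index = build_index(native_text)
combined_index = merge_indexes(ocr_index, native_index)

query = ["CAISO", "November", "Data"]
print(search_all(ocr_index, query))       # []
print(search_all(native_index, query))    # []
print(search_all(combined_index, query))  # ['DECK-1']
```

Running the query against each siloed index and pooling the hits still returns nothing, because neither index alone satisfies the AND. Only the merged index sees both terms in the same document.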

Additionally, Lexbe offers an integrated translation feature which is also included in our Uber Index for seamless search in either language. Whether you opt for Lexbe to perform your document translation or upload your own translated docs, our software will tie the original doc to the English translated one for integrated search and document review.

Finally, Lexbe also performs an advanced metadata extraction at ingestion for precision searches. Details such as the author of a document are extracted and will be searchable.

Features compared across the OCR Index, the Text-Based Index, and the Lexbe Uber Index:

  • Embedded Text
  • Scanned Docs
  • Hidden Cells/Sheets
  • Tracked Changes
  • BCC Field
  • Metadata Extraction
  • Translated Text

With the Lexbe eDiscovery platform, your search is faster and more complete than with any other index on the market. For more information on how indexing works, watch our webinar Best Practices to Avoid Missing Evidence in Large Document Reviews, part of the Lexbe eDiscovery Webinar Series.

Protecting eDiscovery Privilege, the Case Against File Sharing Sites

File sharing services, such as Dropbox, are increasingly used as eDiscovery repositories for incoming data and outgoing productions. With easy sharing via a simple URL link, it’s understandable why these tools appear to offer an optimal solution for sending and receiving the massive amounts of data involved in eDiscovery litigation. Unfortunately, this “solution” can become a massive liability, and we caution clients against using these services because it is simply too easy to accidentally share privileged documents. In fact, there are several cases in which information was inadvertently shared and the results were disastrous for the offending party.

What’s the problem?

It is not that it can’t be done correctly; rather, an open platform like this invites problems. With default settings in place, the “owner” of a file relinquishes control of the data within the file when it is shared with other users. Once shared, the data within the file can be copied, changed and shared without the owner’s permission. New users can be added to the file to view the data and, with seemingly unlimited “cooks in the kitchen,” it is too difficult to maintain chain of custody and ensure responsible sharing. A few specific issues with file sharing services include:

  1. Shared files and folders are not static. This is not the equivalent of sending a document attachment via email. The shared file or folder remains “live”, so any future additions or changes can still be seen by people with the link in perpetuity.
  2. On many platforms, user groups are created and can be duplicated to other folders with a simple click. For example, if several users have access to the “Case X Final Production” folder, another attorney could grant all users in that folder access to “Case X Notes”, not realizing that opposing counsel was part of the original group.
  3. The link is not automatically password protected so anyone with the link can view the file unless proper authentication measures are manually enabled. This literally means that without setting up a password, anyone on the internet could potentially access your file.

What have the courts said?

In Harleysville Ins. Co. v. Holding Funeral Home, Inc., Case No. 1:15cv00057 (W.D. Va. February 9, 2017), an insurance company refused a funeral home’s fire damage claim after determining the fire was caused by arson. An investigator for the insurance company uploaded video taken at the scene to a file sharing site. The investigator sent the link to the insurance company’s attorney, who then shared it with the funeral home’s attorney in order to substantiate the arson claim. Later, however, the insurance investigator uploaded additional files to that same folder, which the funeral home’s attorneys could still access. The court found that because the link and the files within were not properly password protected, the insurance company had, in essence, “left the files on the park bench” in a virtual sense and thus waived privilege.

From the court:

Whether a company chooses to use a new technology is a decision within that company’s control. If it chooses to use a new technology, however, it should be responsible for ensuring that its employees and agents understand how the technology works, and, more importantly, whether the technology allows unwanted access by others to its confidential information.

What does Lexbe recommend?

We have developed the Lexbe eDiscovery Platform to include a number of checks against inadvertent disclosure of privileged docs. We create a secure encrypted link specific to each production that can then be safely shared. By insulating exports with secure production links, we help prevent user error that could result in sharing documents not meant for opposing counsel or outside parties.
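For readers curious how a link can be tied to one production, here is a generic sketch in Python. It is not Lexbe’s implementation, and it illustrates signing with an expiry time rather than encryption. The secret key, the example.com URL, the production IDs, and the function names (sign, make_link, is_valid) are all hypothetical.

```python
import hashlib
import hmac
import time

# Hypothetical server-side secret; in practice this would be random and kept private.
SECRET_KEY = b"replace-with-a-random-server-side-secret"


def sign(production_id, expires_at):
    """Compute an HMAC-SHA256 signature over the production ID and expiry time."""
    message = f"{production_id}:{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_link(production_id, ttl_seconds, now=None):
    """Build a link that is valid for one production and expires after ttl_seconds."""
    now = int(time.time()) if now is None else now
    expires_at = now + ttl_seconds
    signature = sign(production_id, expires_at)
    # example.com is a placeholder host, not a real service endpoint.
    return f"https://example.com/productions/{production_id}?expires={expires_at}&sig={signature}"


def is_valid(production_id, expires_at, signature, now=None):
    """Accept the link only if it has not expired and the signature matches this production."""
    now = int(time.time()) if now is None else now
    if now > expires_at:
        return False
    return hmac.compare_digest(signature, sign(production_id, expires_at))


print(make_link("PROD-001", ttl_seconds=7 * 24 * 3600))
```

Because the signature covers the production ID, a link issued for one production cannot be reused to open another. Because it covers the expiry time, the link does not stay “live” forever, unlike the shared folders described above.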

‘Above the Law’ features Lexbe for Cloud Document Management

Lexbe was recently featured in the leading legal blog ‘Above the Law’, in a post entitled Today’s Tech: A Litigation Attorney Uses Technology To Level The Playing Field.

A practicing litigator was interviewed and noted that technology in general, and cloud computing in particular, helps a small law firm stay competitive in today’s constantly changing legal landscape. “Technology is the great equalizer between solo and small law firms and large firms. For smaller law firms it puts you on a level field with law firms many times your size. I can now handle 2 or 3 complex construction litigation cases at one time and it doesn’t take over my practice. I can get through 30,000 documents quickly and narrow them down to what I really need to look at.”

Above the Law noted that Lexbe helps small firms handle the big cases. The interviewed attorney noted that “I handle a lot of cases where upwards of 30,000 docs are produced. These documents used to fill the room. Now I buy access to Lexbe on a case-by-case basis and then search the documents using Boolean logic.”

Lexbe speeds up review and allows for attorney, case and time leverage. Above the Law continues: “So a process that used to require six associates days to read through the documents can now be accomplished in minutes. Because of the Lexbe software, the entire playing field has been leveled for my firm.”

Read the entire post at Above the Law.

Choosing a Production Format for Your Case

There are a variety of acceptable production formats, each with their own benefits and drawbacks. To determine the best fit for your case, look down the road and consider the scope, goals, and methodology of your review.

An ‘electronic search’ approach to discovery requires that all documents be converted to an electronically searchable form and that a method of searching across all files be available. For electronic documents delivered in native file format, search is usually possible in some form or another. This is particularly true for standard Microsoft Office documents. Email presents more difficulties, as email attachments may need to be extracted from the electronic file holding the email before they can be searched. Paper-based documents must be scanned and OCRed to make them searchable as electronic files. The OCR process inevitably introduces errors, which diminish the effectiveness of electronic search compared with the search of native files or electronic documents based on native files.

The ‘electronic search’ approach also requires that all documents be addressable as a collection from a single search query. Litigation document repositories may be established to make all documents accessible and searchable, often between multiple parties in different locations. These systems may be comprehensive and expensive. Alternatively, a law firm may make documents searchable from a file server on its local area network, or run LAN-based case management software, which may allow for indexing and searching of litigation files. For a very small case, all documents might be stored on a single CD or DVD, or kept on a portable hard drive, and searched from the Windows operating system.

Attorneys are now taking several approaches to e-Discovery when searchability or metadata are important. Each approach has its own advantages and disadvantages.


TIFF Format

A TIFF file is a raster-based image most commonly used in the transmission of faxed pages. Many litigation document management programs were developed using TIFFs as a key part of their program architecture. TIFF files are images and usually do not store computer-readable text within the file. Instead, litigation document management systems associate text from a separate text file as part of what is known in the litigation support industry as a ‘load file’.

Advantages of TIFF productions:

  • Ease of Bates Numbering: Bates stamping identifies which documents have been produced, which have been withheld for privilege, and particular documents and pages during witness examinations. TIFFs can be single or multi-paged. Historically, litigation support vendors have often scanned paper documents, or converted electronic documents, into single-paged or multi-paged TIFFs, with each file name being the Bates Number or Bates Number Range. Each individual page in a production would have its own Bates Number.
  • Improved Redaction: Documents sometimes need to be partially redacted to remove references to privileged information, work product, or trade secret information. As a raster image, a TIFF file is relatively easy to redact compared with native or PDF files. However, the release of Acrobat Professional 8, with its built-in PDF redaction tool, has lessened this advantage of TIFF files.
  • Requirements of Legacy eDiscovery Platforms: Several legacy litigation support management systems work best or exclusively with TIFF files because these systems were designed when TIFF files were the only viable option. These systems predate the development and popularity of PDF and native file review tools.

Disadvantages of TIFF productions:

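  • Complex Load Files: Because TIFF files are raster images, they do not retain computer-readable text as part of the file. The text must be supplied separately and tied back to each image through a load file, which adds complexity.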
  • Not Very Usable Outside of Legacy Systems: Because of the complexities of the TIFF load file, these files are not very accessible or usable outside of the legacy litigation management systems for which they were designed.
  • Metadata Not Retained in TIFFs: Metadata is not retained as part of a TIFF conversion. To address this shortcoming, many e-Discovery providers now separately save file metadata in a database prior to a TIFF conversion.
  • Cost of TIFF Conversion and Load File Creation: Because of the shortcomings above, a TIFF production requires that the producing party pay to convert electronic files to TIFF images and create the associated text load file so that TIFF-based litigation management systems can read it. This can be very expensive in large productions.
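To show what a load file has to do, here is a minimal Python sketch that reads a simplified, made-up load file and groups single-page TIFFs back into documents. Real load file formats vary by platform and are usually more involved. The column names (bates_number, image_path, text_path, doc_start), the file paths, and the group_pages function are assumptions for illustration only.

```python
import csv
import io

# Hypothetical, simplified load file: one row per page image.
# doc_start marks the first page of each document.
LOAD_FILE = """bates_number,image_path,text_path,doc_start
ABC000001,IMAGES/ABC000001.tif,TEXT/ABC000001.txt,Y
ABC000002,IMAGES/ABC000002.tif,TEXT/ABC000002.txt,N
ABC000003,IMAGES/ABC000003.tif,TEXT/ABC000003.txt,Y
"""


def group_pages(load_file_text):
    """Group page rows into documents; a row with doc_start == 'Y' begins a new document."""
    documents = []
    for row in csv.DictReader(io.StringIO(load_file_text)):
        if row["doc_start"] == "Y" or not documents:
            documents.append([])
        documents[-1].append(row)
    return documents


docs = group_pages(LOAD_FILE)
for pages in docs:
    first = pages[0]["bates_number"]
    last = pages[-1]["bates_number"]
    print(f"{first}-{last}: {len(pages)} page(s)")
```

Even in this toy version, the image, its text, and the document boundaries live in three separate places. That separation is the source of the complexity and cost described above.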


PDF Format

A more modern approach is to convert electronic files to searchable PDF files for a discovery production. PDF files overcome many of the limitations of working with native files. Indeed, Adobe maintains both formats (it created PDF and acquired the TIFF specification along with Aldus in 1994), and PDF is the more functional of the two. PDFs have become ubiquitous in business and in law.

Advantages of PDF Format:

  • Viewable in Adobe Acrobat: Files are searchable and easy to work with. Anyone with Adobe Acrobat can view a file without the need to worry about having the right application program or viewer installed.
  • Bates Stamping: Documents can be bates-stamped and pages specifically identified using a variety of software tools.
  • Redaction: Pages or specific passages can be redacted with the redaction tool built into Acrobat Professional since version 8.
  • Some Metadata Retained: A PDF conversion can be set up to retain some of the metadata, which can then be viewed by reviewing certain properties in the PDF file. Retention of metadata in a PDF file is not automatic and depends on the conversion software and the settings used in the conversion process.

Disadvantages of PDF Format:

  • Conversion Cost: As with TIFF files, conversion of electronic files to PDF requires expenditures, as compared with simply delivering native file format.
  • Not all Metadata Available: A standard PDF conversion only captures some of the available metadata. Information such as the document author and title typically may be captured. The document creation date may be changed to the date the PDF is created. Other key metadata, such as last save, last print, edit time, deletions, comments and hidden text usually are not captured in the PDF copy.

Native Format

Some litigation professionals pursue discovery in native file format, the original file format in which the electronic file was created, such as Word, Excel or Outlook. This has become more popular since the federal e-Discovery Amendments, which give the requesting party greater leeway in requesting files in native format.

Advantages of Native Format:

  • No Conversion Expense: Unlike TIFF or PDF productions, there is no conversion expense in delivering files in native format.
  • All Metadata Retained: All file metadata can be retained in a native production.
  • Text Searchable: Text is usually most searchable in native format. There is no chance of text being lost or corrupted in a conversion to PDF or to a TIFF load file, and no OCR errors are introduced.
  • Some Documents Don’t Display Well in other Formats: Native may be the only practicable format for some file formats, such as spreadsheets. Excel and other spreadsheet files are notorious for converting poorly to TIFF or PDF, often becoming unintelligible. Plus, spreadsheet formulas, hidden cells, and hidden text usually do not make the conversion to other formats.

Disadvantages of Native Format:

  • Difficulty of Pre-Release Review of Metadata: Metadata, by design, are not easy to review in native file format. Some metadata in Office files can be found by clicking through various property screens, but this is time-consuming, requires a consistent methodology to view all viewable metadata, and in the end does not reach all of the metadata available in the file. Newer litigation management systems will display metadata of native files.
  • Difficulty in Bates Stamping at the Page Level: Documents in native file format cannot be easily Bates-stamped, and any Bates stamping will change the metadata. Often Bates stamping of native files is handled instead through a file naming convention, in which the file name is modified to include a Bates designation. This can work well, but does not allow for page-level identification.
  • Inability to Easily Redact: Documents produced in native file format cannot be easily redacted. For this reason, in a native production, documents that need to be redacted are often handled in a different manner, such as converting redacted documents to another format that can be redacted, such as PDF.
  • Difficulty of Pre-Release Review: Attorneys for the party producing electronic files must review the files to see if they are responsive to the discovery request or include privileged information or trade secrets. This can be difficult as electronic files may have been created in multiple applications. Modern litigation support applications allow most native file formats to be reviewed without installing the applications that created the file. Plus, modern litigation support applications allow metadata of native files to be reviewed in an easy fashion.
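The file naming convention mentioned above for Bates-designating native files can be sketched as follows. This is one possible convention, offered as an assumption rather than a standard. The prefix “ABC”, the six-digit width, the sample file names, and the bates_names function are all hypothetical.

```python
from pathlib import PurePath


def bates_names(filenames, prefix, start=1, width=6):
    """Pair each original file name with a Bates-style name, keeping the file extension."""
    mapping = []
    for offset, name in enumerate(filenames):
        number = f"{prefix}{start + offset:0{width}d}"
        mapping.append((name, number + PurePath(name).suffix))
    return mapping


# Hypothetical native files in a production.
files = ["budget.xlsx", "memo.docx", "inbox.pst"]
for original, renamed in bates_names(files, prefix="ABC"):
    print(f"{original} -> {renamed}")
```

This prints budget.xlsx -> ABC000001.xlsx, memo.docx -> ABC000002.docx, and inbox.pst -> ABC000003.pst. Each file receives exactly one number, so the scheme identifies documents but not individual pages. Keeping the mapping also gives you a cross-reference back to the original file names.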

Advances in technology are reshaping how litigation discovery is handled. The use and availability of electronic documents is changing how discovery is done, with an increasing emphasis on search. Additionally, the availability of metadata in electronic files requires that litigators find effective tools to review and analyze this new source of information. New discovery rules reflect the reality of available technology, and prior paper-based approaches have become ineffective and outmoded.

The best eDiscovery production format will usually turn on the methodologies and workflows that attorneys and litigation teams plan to use to review the files. Document management systems are usually optimized for files in certain formats. Plus, consideration should be given to how Bates numbering and redaction will be handled before choosing a format.
