
Lexbe OCR Explained
Optical Character Recognition of PDFs Technical Note

This technical note explains how Lexbe's free, server-side OCR service works. OCR, or 'optical character recognition', is the process of taking scanned images (from documents) and electronically converting them into searchable text. Lexbe uses a form of OCR to convert image PDFs into what are known as 'text-under-image' PDFs or searchable PDFs. This means that the original document image (or scan) is saved and the text is added to the file in a hidden layer, so that the document can be searched (and 'copy and paste' is available), but the appearance of the document remains unchanged. Lexbe offers this service for free, in that no additional charge is assessed for OCR beyond the normal monthly charges for the account.

Adobe's PDF format is complicated and ever-evolving. For example, a PDF that has been created from a scan differs programmatically from a PDF that has been created directly from a Word or Excel document. In addition, although PDF is an open standard, different PDF creation programs may create PDFs in different ways with different characteristics. PDFs optionally have a number of security features, such as password protection, print protection and text-extraction prevention, that can complicate or confound OCR. Finally, the PDF standard is evolving, and new features added by Adobe or other developers can impair the ability of files to be OCRed.

How OCR Works on Lexbe
The following describes, in general terms, how Lexbe's OCR works:

  • Recognizable image-based PDFs will be converted to searchable, text-under-image PDFs. 
  • PDFs that contain text within them already, or PDFs that include text and images, will usually be skipped. In this case our search index will use the original PDF text. (OCR almost inevitably introduces errors, so we do not OCR a document that already contains text.)
  • OCR will not be done if the option 'Perform OCR on PDFs' is not selected at time of upload in the upload window.
  • OCR works on simple 'flattened' image PDFs only, and not on some complex PDFs, including PDFs with embedded attachments in the Acrobat PDF Portfolio format.
  • OCR can take a substantial time to complete when many documents have been uploaded at or about the same time.  This is a shared service and is subject to server utilization.  Uploading a few documents is usually very fast, but the free OCR service can still be slow if there is high utilization.
  • After OCR is completed, the search index will be updated to include the new OCR text.  Until then the search results may be incomplete.
  • Lexbe OCR recognizes Unicode and will therefore apply OCR to many non-English languages.  However, our OCR engine uses an English-language dictionary look-up only to aid in OCR accuracy, and does not use a dictionary for other languages.  This reduces the accuracy of non-English OCR.
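
The rules above can be summarized as a small decision function. This is only an illustrative sketch and not Lexbe's actual implementation: the PdfInfo fields, the OcrDecision names and the order of the checks are all assumptions made for the example, and the note itself says only that PDFs with existing text are 'usually' skipped.

```python
from dataclasses import dataclass
from enum import Enum


class OcrDecision(Enum):
    """Possible outcomes for an uploaded PDF (names are hypothetical)."""
    OCR = "ocr"                            # convert to a text-under-image PDF
    SKIP_OPTION_OFF = "skip_option_off"    # 'Perform OCR on PDFs' not selected
    SKIP_HAS_TEXT = "skip_has_text"        # index the original PDF text instead
    SKIP_UNSUPPORTED = "skip_unsupported"  # complex PDF, e.g. a PDF Portfolio


@dataclass
class PdfInfo:
    """Hypothetical summary of an uploaded PDF; field names are illustrative."""
    has_text_layer: bool  # already contains text (alone or mixed with images)
    is_portfolio: bool    # Acrobat PDF Portfolio with embedded attachments
    is_flat_image: bool   # simple 'flattened' image-only PDF


def decide_ocr(pdf: PdfInfo, perform_ocr_selected: bool) -> OcrDecision:
    """Apply the rules from the list above; the check order is an assumption."""
    if not perform_ocr_selected:
        return OcrDecision.SKIP_OPTION_OFF
    if pdf.has_text_layer:
        # Simplification: the note says such PDFs are "usually" skipped.
        return OcrDecision.SKIP_HAS_TEXT
    if pdf.is_portfolio or not pdf.is_flat_image:
        return OcrDecision.SKIP_UNSUPPORTED
    return OcrDecision.OCR


scan = PdfInfo(has_text_layer=False, is_portfolio=False, is_flat_image=True)
print(decide_ocr(scan, perform_ocr_selected=True).value)
```

In this sketch, a document that is skipped because it already has text would still be searchable through its original text, while a document skipped as unsupported would not gain any searchable text from OCR.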

Other Limitations on What OCR on Lexbe Does
OCR is a highly useful tool, but is far from perfect. OCR does best with clearly readable text from high-quality scans. OCR quality degrades with copy quality. OCR quality can also degrade, or OCR may not be done at all, with skewed or rotated pages, pages with unusual fonts, pages with dirty or speckled backgrounds, pages scanned at low resolution, etc.

There are many reasons why OCR will not complete successfully on PDF files. File corruption is one reason. Even when a file will open, some pages may be corrupt and prevent OCR from running successfully. File print security is another. Producers of PDFs often place print or content extraction restrictions on PDFs. This will prevent OCR from running. Passwords required to open a file will also prevent PDFs from being OCRed.
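
Some of these problems can be spotted before upload. The sketch below is a rough heuristic only, not a Lexbe tool and not a PDF parser: the function name precheck_pdf and the problem labels are made up for illustration. It relies on two general properties of PDF files: a file normally starts with a '%PDF-' header and ends with an '%%EOF' marker, and a file protected by an open password or by permission restrictions normally carries an '/Encrypt' entry. It cannot detect corrupt individual pages, and it can give false positives (for example, when the bytes '/Encrypt' happen to appear in document content).

```python
from typing import List


def precheck_pdf(data: bytes) -> List[str]:
    """Return heuristic warnings for raw PDF bytes (labels are hypothetical)."""
    problems: List[str] = []

    # A PDF normally begins with a '%PDF-' header near the start of the file.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing_header")

    # A complete PDF normally ends with an '%%EOF' marker; its absence
    # suggests a truncated or otherwise damaged file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing_eof")

    # Open passwords and print/extract restrictions are both implemented
    # through an encryption dictionary referenced by an '/Encrypt' key.
    if b"/Encrypt" in data:
        problems.append("encrypted")

    return problems


sample = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 2 0 R >>\n%%EOF\n"
print(precheck_pdf(sample))
```

To check a file on disk, read it in binary mode and pass the bytes to the function. An empty result does not guarantee that OCR will succeed; it only means that none of these three warning signs was found.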

OCR almost always produces errors, and sometimes will produce many errors.  OCR is best thought of as an adjunct to actual review of the file itself in PDF format, rather than a complete substitution.

A non-exclusive list of other possible errors includes: omitting materials to be OCRed, missing pages or files, skipping password-protected files, skipping files with print, extract or other limitations on the file permissions, missing text in files that are corrupted or of an unrecognized format, and failing to recognize rotated or skewed pages.

Options to Increase Accuracy or Speed
If you need greater assurance of accuracy or speed for your OCR, you might consider these options.

  • To make sure searchable text is available quickly or at a certain time, you or your scan vendor can apply OCR before uploading documents to Lexbe.  Then you will not need to wait for the OCR to be performed and your documents will be searchable faster.
  • To ensure accuracy of results, you or your scan provider can have the OCR done by someone who will conduct a manual review of the search accuracy and make corrections to the OCR text as necessary. This is a time-consuming process, but will provide greater accuracy. Lexbe's free on-server OCR is unmonitored and no one checks accuracy.
  • Lexbe provides a paid litigation OCR service if you have large numbers of documents and you need to make sure that documents are OCRed and searchable in a specific timeframe.

Do you have other questions or does this help document not address your needs?  Please let us know at our Support Site.

All services described on this and related pages are subject to Lexbe's Services Agreement.