Lexbe OCR Explained
Optical Character Recognition of PDFs Technical Note
This technical note explains how Lexbe's free, server-side OCR service
works. OCR, or 'optical character recognition', is the
process of taking scanned images (from documents) and electronically
converting them into searchable text. Lexbe uses a form of OCR to convert image
PDFs into what is known as 'text-under-image' PDFs or searchable PDFs. This
means the original document image (or scan) is saved and the text is added
to the file in a hidden layer, so that the document can be searched
(and 'copy and paste' is available), but
the appearance of the document remains unchanged. Lexbe offers this
service for free, in that no additional charge is assessed for OCR beyond the
normal monthly charges for the account.
Adobe's PDF format is complicated and ever-evolving.
For example, a PDF created from a scan is structured very differently
from a PDF created directly from a Word or Excel
document. In addition, although PDF is an open standard, different PDF creation
programs may create PDFs in different ways with different characteristics.
PDFs can also carry security features such as password protection,
print protection and text-extraction prevention, which can complicate or confound
OCR. Finally, the PDF standard is evolving, and new features added by Adobe
or other developers can impair the ability of files to be OCRed.
How OCR Works on Lexbe
This describes generally how Lexbe's OCR works:
- Recognizable image-based PDFs will be converted to searchable, text-under-image PDFs.
- PDFs that contain text within them already, or PDFs that include text and images,
will usually be skipped. In this case our search index will use the original PDF
text. (OCR almost inevitably introduces errors, so we do not OCR a
document that already contains text.)
- OCR will not be done if the option 'Perform OCR on PDFs' is not selected at time
of upload in the upload window.
- OCR works on simple 'flattened' image PDFs only, and not on some complex PDFs,
including PDFs with embedded attachments in the Acrobat PDF Portfolio format.
- OCR can take a substantial time to complete when many documents have been
uploaded at or about the same time. This is a shared service
and is subject to server utilization. Uploading a few documents is usually
very fast, but the free OCR service can still be slow if there is
high utilization.
- After OCR is completed, the search index will be updated to include the new OCR
text. Until then the search results may be incomplete.
- Lexbe OCR recognizes Unicode and will therefore apply OCR to many
non-English languages. However, our OCR engine uses only an English-language
dictionary look-up to aid OCR accuracy, and does not use a dictionary for
other languages. This reduces the accuracy of non-English OCR.
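The rules in the list above can be read as a simple decision sequence. The following Python sketch is illustrative only: the `UploadedPdf` class, its field names, the result labels and the order of the checks are assumptions made for this example, not Lexbe's actual implementation. Note also that PDFs containing text are described above as "usually" skipped, so the real behavior may be less strict than this.

```python
from dataclasses import dataclass


@dataclass
class UploadedPdf:
    """Hypothetical summary of an uploaded PDF (not Lexbe's data model)."""
    ocr_option_selected: bool  # 'Perform OCR on PDFs' was ticked at upload
    has_text_layer: bool       # the PDF already contains extractable text
    is_flat_image_pdf: bool    # simple scanned-image PDF, not e.g. a Portfolio


def ocr_decision(pdf: UploadedPdf) -> str:
    """Return 'ocr' or a label explaining why OCR would be skipped."""
    if not pdf.ocr_option_selected:
        return "skip-option-off"
    if pdf.has_text_layer:
        # Existing text is indexed as-is; OCR would only risk adding errors.
        return "skip-has-text"
    if not pdf.is_flat_image_pdf:
        # Complex PDFs, such as PDF Portfolios, are not supported.
        return "skip-unsupported"
    return "ocr"
```

For example, `ocr_decision(UploadedPdf(True, False, True))` returns `"ocr"`, while the same file with a text layer returns `"skip-has-text"`.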
Other Limitations on What OCR on Lexbe Does
OCR is a highly useful tool, but is far from perfect. OCR does best with clearly readable
text from high-quality scans. OCR quality degrades with copy quality.
OCR quality can also degrade, or OCR may not be done at all, with skewed or rotated
pages, pages with unusual fonts, pages with dirty or speckled backgrounds, pages
scanned at low resolution, etc.
There are many reasons why OCR will not complete
successfully on PDF files. File corruption is one reason. Even
when a file will open, some pages may be corrupt and prevent OCR from running
successfully. File print security is another. Producers of PDFs
often place print or content extraction restrictions on PDFs. This will
prevent OCR from running. File open passwords will also prevent PDFs
from being OCRed.
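Some of these problems can be spotted before upload. The sketch below is a rough heuristic, not part of Lexbe's service: the function name `preflight_pdf` and its messages are invented for this example. It relies on two general facts about PDF files: they begin with a `%PDF-` header and end with a `%%EOF` marker, and both open passwords and print or extraction restrictions are recorded through an `/Encrypt` entry. A plain byte search can give false positives and cannot detect corruption inside individual pages.

```python
def preflight_pdf(data: bytes) -> list[str]:
    """Return likely OCR blockers found in raw PDF bytes (heuristic only)."""
    problems = []
    # A PDF should have a '%PDF-' header near the start of the file.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing PDF header (possibly corrupt or not a PDF)")
    # The end-of-file marker should appear near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker (possibly truncated)")
    # Passwords and permission restrictions both rely on an /Encrypt entry.
    if b"/Encrypt" in data:
        problems.append("encrypted (password or permission restrictions)")
    return problems
```

A typical use would be `preflight_pdf(open(path, "rb").read())` for each file in a batch, setting aside any file for which the returned list is not empty.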
OCR almost always produces errors, and sometimes
will produce many errors. OCR is best thought
of as an adjunct to actual review of the file itself in PDF format, rather
than a complete substitute.
A non-exclusive list of other possible errors includes:
omitting materials to be OCRed, missing pages or files, skipping password-protected files,
skipping files with print, extract or other limitations on the file permissions,
missing text in files that are corrupted or of an unrecognized format, and failing to recognize
rotated or skewed pages.
Options to Increase Accuracy or Speed
If you need greater assurance of accuracy or speed for your OCR, you
might consider these options.
- To make sure searchable text is available quickly or at a certain time, you or
your scan vendor can apply OCR before uploading documents to Lexbe. Then
you will not need to wait for the OCR to be performed and your documents will be
searchable faster.
- To ensure accuracy of results, you or your scan provider can have the OCR
done by someone who will conduct a manual review of the search accuracy and
make corrections to the OCR text as necessary. This is a time-consuming
process, but will provide greater accuracy. Lexbe's free on-server OCR is
unmonitored and no one checks accuracy.
- Lexbe provides a paid litigation OCR service if you have large numbers
of documents and you need to make sure that documents are
OCRed and searchable in a specific timeframe.
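As an illustration of the first option (applying OCR yourself before upload), the sketch below prepares command lines for the open-source OCRmyPDF tool. This note does not name or endorse any particular OCR tool; OCRmyPDF is assumed here purely as an example, and the function names are made up for the sketch. The `--skip-text` flag tells OCRmyPDF to leave pages that already contain text untouched.

```python
from pathlib import Path


def build_ocr_command(src: Path, dst: Path) -> list[str]:
    """Build an OCRmyPDF command line for one file (does not run it)."""
    return [
        "ocrmypdf",
        "--skip-text",  # do not re-OCR pages that already contain text
        str(src),
        str(dst),
    ]


def plan_batch(in_dir: Path, out_dir: Path) -> list[list[str]]:
    """One command per PDF in in_dir, writing the same file name to out_dir."""
    return [
        build_ocr_command(pdf, out_dir / pdf.name)
        for pdf in sorted(in_dir.glob("*.pdf"))
    ]
```

Each command could then be run with `subprocess.run(cmd, check=True)`, assuming OCRmyPDF is installed, and the resulting searchable PDFs uploaded in place of the original scans.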
Do you have other questions or does this help document not address your needs? Please let us know at our
Support Site.
All services described on this and related pages are subject to Lexbe's
Services Agreement.