How to Scan to Searchable PDF
in Litigation Matters
The searchable PDF (Portable Document Format) is becoming
increasingly popular and important for lawyers and litigation teams
in discovery, litigation and related legal matters. Several factors
are driving its adoption:
- PDF's ubiquity in the business world
- Court rules in many jurisdictions requiring that pleadings and
  motions be filed in PDF
- Availability of low-cost scanners and multi-function copier/scanners
  that allow law offices to create PDFs inexpensively
- Release of the new Acrobat Professional 8 from Adobe, which now
  supports Bates numbering and redaction
We offer these tips for scanning paper documents to PDF:
Choose the "Text-Under-Image" Option. When scanning
a document, you will be presented with different options for types of PDF
files. You will usually want to choose the option that applies optical
character recognition (OCR) to make the document text searchable. Depending
on your specific hardware and software, this may appear as a "make
searchable (apply OCR)" option, or as a "text-under-image" or "searchable PDF" file
type. With this option,
your scanned document will be text searchable within the Acrobat viewer and many
other programs designed to search PDF files. The other type of PDF you could choose is called an
"image-only PDF", which is not text searchable. When viewing a PDF file,
you can check whether it is searchable with the 'select tool' on the top
bar in Acrobat Reader: if you can select text on the page, the file is text searchable.
Get the Resolution Right.
When scanning images to PDF for litigation purposes, 300 dpi (dots per inch)
will usually be the best option. Scanning at a lower resolution (e.g. 200 dpi) is
usually OK, but legibility can suffer with smaller fonts (e.g. 6 pt. in
financial documents). OCR quality can also suffer at lower scan
resolutions. The trade-off is that higher scan resolutions result in
larger file sizes. Scans above 300 dpi usually do not
appreciably increase the readability of a document or its OCR quality.
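The resolution trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a US Letter page (8.5 x 11 inches) and the usual typographic convention of 72 points per inch, and the function names are made up for this example. It prints the nominal height in pixels of 6 pt text and the pixel count per page at several resolutions.

```python
# Illustrative arithmetic only. Assumes a US Letter page (8.5 x 11 in)
# and 72 typographic points per inch; function names are hypothetical.

POINTS_PER_INCH = 72
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0


def font_height_px(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of a font of the given point size."""
    return point_size / POINTS_PER_INCH * dpi


def page_pixels(dpi: int) -> int:
    """Total number of pixels in one scanned page at the given resolution."""
    return round(PAGE_WIDTH_IN * dpi) * round(PAGE_HEIGHT_IN * dpi)


for dpi in (150, 200, 300, 600):
    print(
        f"{dpi} dpi: 6 pt text is about {font_height_px(6, dpi):.1f} px tall, "
        f"{page_pixels(dpi) / 1e6:.1f} megapixels per page"
    )
```

Under these assumptions, moving from 200 to 300 dpi multiplies the pixel count by 2.25 while giving 6 pt text about 25 pixels of height instead of about 17. Doubling again to 600 dpi quadruples the pixel count.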
Scan to B&W, Grayscale or Color.
For OCR purposes, 'Grayscale' is usually the safest choice, as more information
is retained for the OCR engine to work with, generally resulting in the highest
quality OCR. 'Black & White' is often good enough,
particularly with good quality originals, and creates a much smaller file than a
grayscale scan. Color or grayscale may be
required for photos (which do not display well with a 'Black & White'
setting). For some documents, a color scan may be critical to
understanding the document, such as some charts (e.g. in PowerPoint
presentations) or CAD (computer-aided design) documents. Color scans are
much larger than B&W or grayscale.
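To see why the color mode matters so much for file size, it helps to compare the raw data behind one page. The sketch below assumes a US Letter page at 300 dpi and the common bit depths of 1 bit per pixel for black and white, 8 for grayscale and 24 for color; the function name is hypothetical. Real PDFs are compressed, so actual files are far smaller and the ratios between modes will differ, but the ordering holds.

```python
# Uncompressed size of one scanned page by color mode. Assumes US Letter
# at 300 dpi and bit depths of 1, 8 and 24 bits per pixel. These are
# illustrative assumptions, not measurements from any scanner.

BITS_PER_PIXEL = {"black & white": 1, "grayscale": 8, "color": 24}


def raw_page_bytes(mode: str, dpi: int = 300,
                   width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Bytes needed to store one page before any compression."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * BITS_PER_PIXEL[mode] // 8


for mode in BITS_PER_PIXEL:
    print(f"{mode}: {raw_page_bytes(mode) / 1e6:.1f} MB uncompressed")
```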
Watch the Other Settings.
Scanners will often have a number of other settings that can help improve scan
quality and OCR. These include 'deskew' (rotates any page that is
not square with the sides of the scanner bed, to make the PDF page align
vertically),
'background removal' (whitens nearly white areas of grayscale and color input), and
'edge shadow removal' (removes dark streaks that occur at the edges of scanned
pages, where the scanner light is shadowed by the paper edge). 'Deskew'
will help with OCR accuracy, while 'background removal' and 'edge shadow
removal' can improve readability, but can sometimes impair OCR accuracy.
For important documents, it's best to run some tests.
Get a Quality OCR Program. Not all
OCR is created equal. The quality of optical character recognition
varies substantially based on the quality of the program and the
settings chosen when running it. Programs often have a 'fast' and a
'slow' mode, with the slow mode usually delivering better quality OCR.
Some programs will auto-rotate pages when necessary; others will not, and
will produce OCR errors on rotated pages as a result.
Pay Special Attention to the Numbers.
One secret of
OCR programs is that they routinely rely on dictionaries to help decide which
characters they are looking at. This works pretty well with
words (if they are in the dictionary), but doesn't help with numbers or
other arbitrary character strings not in a dictionary.
Expect to see lower quality OCR in financial reports and other number-intensive
documents.
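A toy example shows why a dictionary helps with words but not with numbers. This is only a sketch: real OCR engines are far more sophisticated, the vocabulary below is a made-up stand-in for an engine's dictionary, and `dictionary_check` is a hypothetical helper built on Python's standard `difflib` module.

```python
import difflib

# Toy vocabulary standing in for an OCR engine's dictionary (hypothetical).
VOCABULARY = ["deposition", "discovery", "litigation",
              "motion", "pleading", "redaction"]


def dictionary_check(token: str) -> tuple[str, bool]:
    """Return (best guess, verified) for one OCR token.

    A token counts as verified only if it is, or closely resembles,
    a vocabulary word. Numbers never do, so they pass through unchecked.
    """
    lowered = token.lower()
    if lowered in VOCABULARY:
        return lowered, True
    close = difflib.get_close_matches(lowered, VOCABULARY, n=1, cutoff=0.8)
    if close:
        return close[0], True
    return token, False


# "litigatlon" has an 'l' where an 'i' belongs; the dictionary can repair it.
# "1,284.50" may or may not be a misread "1,234.50"; nothing here can tell.
for raw in ["litigatlon", "discovery", "1,284.50", "1,234.50"]:
    print(raw, "->", dictionary_check(raw))
```

The misread word is pulled back to its dictionary spelling, while both dollar amounts come back unverified. That is why number-heavy documents deserve a manual spot check.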
Make Sure Your Litigation Support
Software Really Supports PDF.
Many legacy litigation software systems were designed around files saved
as TIFFs (Tagged Image File Format), an older file format that does not support integrated text
as part of the file the way PDF does. These older software systems have usually added some support
for PDFs, but the integration is often incomplete and some features are
not supported with PDF files.
Do Redactions the Right Way. Redactions can be tricky in PDF, and this has been a primary reason why
TIFF has survived as a popular format in legal matters. A trap for the
unwary is that it is possible in a PDF file to redact text on the image of a document and
still have the redacted text be searchable! In a text-under-image PDF
file, the redaction must be applied to both the text and the image. This problem
has been fixed in the latest version of Acrobat (Acrobat Professional 8), and this program
can be used for PDF
redactions. Third-party redaction tools should be released soon as
well. Many practitioners play it safe with redacted documents by printing,
marking out by hand, and rescanning. This method is manual but foolproof, and works well if
the number of documents to be redacted is limited.
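The redaction trap is easier to see with a simplified model of a text-under-image page. The sketch below is not a real PDF and does not use any PDF library: `Page` is a hypothetical structure with one list standing in for the visible image and another for the hidden OCR text. Actual redaction should be done with a tool built for the purpose.

```python
from dataclasses import dataclass, field

MASK = "[REDACTED]"


@dataclass
class Page:
    """Toy text-under-image page: a visible image layer plus hidden OCR text."""
    image_words: list[str]                               # what a reader sees
    text_words: list[str] = field(default_factory=list)  # what search sees

    def is_searchable_for(self, term: str) -> bool:
        return term.lower() in (w.lower() for w in self.text_words)


def mask_image(words: list[str], term: str) -> list[str]:
    """Cover every occurrence of the term on the image layer."""
    return [MASK if w.lower() == term.lower() else w for w in words]


def redact_image_only(page: Page, term: str) -> Page:
    """Unsafe: covers the word on the image but leaves the OCR text intact."""
    return Page(mask_image(page.image_words, term), list(page.text_words))


def redact_image_and_text(page: Page, term: str) -> Page:
    """Safe: covers the word on the image and removes it from the OCR text."""
    kept = [w for w in page.text_words if w.lower() != term.lower()]
    return Page(mask_image(page.image_words, term), kept)


words = "The settlement amount is confidential".split()
original = Page(image_words=words, text_words=list(words))

unsafe = redact_image_only(original, "confidential")
safe = redact_image_and_text(original, "confidential")

print(unsafe.image_words, unsafe.is_searchable_for("confidential"))
print(safe.image_words, safe.is_searchable_for("confidential"))
```

Both results look identical to a reader, but only the second one stops a search for the redacted word from finding it.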
Be Specific in Discovery Requests.
Litigators are increasingly
asking that documents produced in response to discovery requests be provided in
electronic form as PDFs. If you do this, be specific as to the matters
above. In particular, be sure to specify that the scan resolution be
300 dpi and that OCR be applied. You may also wish to ask what OCR
software is used and what settings will be applied. To be non-specific is
to invite an adversary to return documents scanned at 150 dpi without OCR,
which may be unsearchable, illegible and unintelligible!
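If you receive productions regularly, the requirements in your request can be written down as a simple checklist. The sketch below is a hypothetical helper, not part of any litigation support product: `ScanSpec` and `check_production` are names invented here, and the resolution and OCR status of each file are assumed to be known already (for example, from inspecting the file).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSpec:
    """Hypothetical production requirements taken from the tips above."""
    min_dpi: int = 300
    ocr_required: bool = True


def check_production(dpi: int, has_ocr: bool,
                     spec: ScanSpec = ScanSpec()) -> list[str]:
    """Return the ways a produced PDF falls short of the requested spec."""
    problems = []
    if dpi < spec.min_dpi:
        problems.append(
            f"resolution {dpi} dpi is below the requested {spec.min_dpi} dpi"
        )
    if spec.ocr_required and not has_ocr:
        problems.append("OCR was not applied, so the file is not text searchable")
    return problems


print(check_production(dpi=150, has_ocr=False))  # two problems reported
print(check_production(dpi=300, has_ocr=True))   # empty list: meets the spec
```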
Lexbe.com fully supports searchable PDF files in a web-based
review and analysis format.