
"Working with image-only PDFs"

By Leah Lothringer
PDF Store Support Team
Issue 17 for 2006

When it comes to PDF extraction, there is some confusion over what an image-only PDF
is. It is crucial to know what kind of source PDF you are working with before attempting
to extract data. PDF is a broad format that comes in several "flavors": text only,
image plus text, and image only.
Image-only PDFs can be created by an imaging application or by scanning a document
directly into a new PDF. In such files, text is not stored as individual characters but
as part of a single flat image. This can be sufficient for some purposes, but if you want
to select, search or extract text, you will need to perform Optical Character
Recognition (OCR). OCR is the process of comparing shapes in the image with characters
in a database to determine which shapes represent text. Over at Planet PDF, Ernest
Svenson expands on the benefits and complexities of OCR technology in "OCR, PDFs,
and Bates-numbered documents".
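
To make the three flavors concrete, here is a minimal Python sketch that sorts a PDF
into "text only", "image plus text" or "image only". It is an illustration only and is
not how any PDF Store product works. The names classify_pdf and page_summaries are
hypothetical, and page_summaries assumes the third-party pypdf library is installed.
The check is coarse: it only asks whether any page yields text and whether any page
contains an image.

```python
from typing import Iterable, List, Tuple


def classify_pdf(pages: Iterable[Tuple[str, int]]) -> str:
    """Classify a PDF from per-page (extracted_text, image_count) pairs."""
    has_text = False
    has_images = False
    for text, image_count in pages:
        if text.strip():          # any non-whitespace text counts as real text
            has_text = True
        if image_count > 0:
            has_images = True
    if has_text and has_images:
        return "image plus text"
    if has_text:
        return "text only"
    if has_images:
        return "image only"
    return "empty"


def page_summaries(path: str) -> List[Tuple[str, int]]:
    """Collect (text, image_count) per page. Assumes pypdf is available."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(path)
    return [(page.extract_text() or "", len(page.images)) for page in reader.pages]
```

A call such as classify_pdf(page_summaries("scan.pdf")) would report "image only" for
a scanned file with no text layer, which is the signal that OCR is needed before any
text can be extracted. The file name here is a placeholder.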
With Acrobat's Paper Capture plug-in, it's possible to perform OCR and add an
invisible layer of text (known as "hidden text") to the image PDF. In effect, this makes it
an image plus text PDF document. Alternatively, Gemini offers a character
mapping facility that can be used to convert image-only PDFs into a variety of editable
formats such as HTML and RTF.
For automating OCR on an unlimited number of PDF files, take a look at
AutoCaptureX4. Hot folder support is included.
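
For readers unfamiliar with the term, a hot folder is a directory that is watched for
new files, each of which is processed automatically as it arrives. The Python sketch
below shows the general idea using only the standard library. It is not
AutoCaptureX4's implementation: scan_hot_folder is a hypothetical name, and the
process callback stands in for whatever OCR step you would actually run.

```python
import os
from typing import Callable, List, Set


def scan_hot_folder(
    folder: str, seen: Set[str], process: Callable[[str], None]
) -> List[str]:
    """Process each PDF in `folder` that has not been handled yet.

    `seen` is updated in place so repeated scans skip earlier files.
    Returns the paths handled during this scan.
    """
    handled = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        is_pdf = name.lower().endswith(".pdf") and os.path.isfile(path)
        if is_pdf and path not in seen:
            process(path)      # placeholder for the real OCR step
            seen.add(path)
            handled.append(path)
    return handled
```

Calling scan_hot_folder on a timer, for example every few seconds, gives a simple
polling watcher. A production tool would also need to cope with files that are still
being written when the scan runs.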
These are just some of the many tools available from PDF Store's range of PDF
Create/Convert and Edit/Prepare products.
