
"OCR and Image-Only PDFs"

By Rowan Hanna
PDF Store Support Team
Issue 23 for 2005

When it comes to PDF Extraction, it's crucial to know what kind of source PDF you will be working with. PDF is a broad format, and it comes in three different "flavors": text only, image plus text, and image only.
Image-only PDFs can be created by imaging applications or by scanning. In such files, text may not be recognized: while the resulting PDFs look like the printed originals, they are in fact flat images. This can be sufficient for some purposes, but if you want to select, search or extract text, you will need to perform Optical Character Recognition (OCR). OCR is the process of comparing the shapes in a page image with a database of known characters to determine which shapes represent text, and which characters they are.
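To make the comparison step concrete, here is a minimal sketch of template matching, one simple way of checking a shape against a character database. It is an illustration only, not how any product mentioned in this article works: the tiny 3x5 glyph bitmaps, the TEMPLATES table, and the names mismatches, recognize and max_mismatches are all hypothetical, and real OCR engines use far more sophisticated models.

```python
# Hypothetical "database" of known characters: each glyph is a 3x5 bitmap,
# written as five strings where "1" is an inked pixel and "0" is blank.
TEMPLATES = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}


def mismatches(glyph, template):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph, max_mismatches=2):
    """Return the closest known character, or None if nothing is close enough.

    max_mismatches is an assumed tolerance for scanning noise; a shape that
    differs from every template by more than this is treated as "not text".
    """
    best_char, best_score = None, None
    for char, template in TEMPLATES.items():
        score = mismatches(glyph, template)
        if best_score is None or score < best_score:
            best_char, best_score = char, score
    if best_score is not None and best_score <= max_mismatches:
        return best_char
    return None


# A "T" with one stray pixel of scanning noise in the third row.
noisy_t = ("111", "010", "011", "010", "010")
print(recognize(noisy_t))

# A solid block matches no template closely, so it is not treated as text.
solid_block = ("111", "111", "111", "111", "111")
print(recognize(solid_block))
```

The tolerance is the key design choice in a sketch like this: set it too low and slightly smudged characters go unrecognized, set it too high and pictures or stray marks are misread as text.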
With Acrobat's Paper Capture plug-in, it's possible to run OCR and add an invisible layer of text (known as "hidden text") to the image PDF. In effect, this turns it into an image plus text PDF document. The AdLib OCR Add On for the AdLib eXpress range takes this approach, handling large volumes of such documents. Gemini, on the other hand, boasts a character mapping facility that can be used to convert image-only PDFs into a variety of editable formats such as HTML and RTF.
These are just some of the many tools available from PDF Store's range of PDF Extraction products.