PDFlib Text Extraction Toolkit (TET) is a developer product for reliably extracting text and raster images from PDF documents. TET makes available the text contents of a PDF as Unicode strings, plus detailed glyph and font information as well as the position on the page. Raster images are extracted in common formats. TET optionally converts PDF documents to an XML-based format called TETML which contains text and metadata as well as resource information. TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, interactive elements, etc.
The new version of TET extracts text from PDF documents, retrieves raster image data and tables, and converts PDF documents to XML.
In addition to low-level text retrieval, TET contains advanced content analysis algorithms for determining word boundaries and removing duplicate text (such as shadows and artificial bold). Using the auxiliary pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, hypertext, etc.
Note: Please review the following Platform and Product Requirements, Limitations of the Evaluation version, Supported Programming Languages and License Definitions before downloading or purchasing this software.
Feature Summary
- Implement the PDF indexer for a search engine
- Repurpose the text and images in PDFs
- Convert the contents of PDFs to other formats
- Process PDFs based on their contents, e.g. splitting based on headings (requires PDFlib+PDI in addition to TET)
Major Features in Depth
Accepted PDF Input
TET supports all relevant flavors of PDF input:
- All PDF versions up to Acrobat 9, including ISO 32000-1
- Protected PDFs which do not require a password for opening the document
- Damaged PDF documents will be repaired
Unicode
Since text in PDF is usually not encoded in Unicode, PDFlib TET normalizes the text in a PDF document to Unicode:
- TET converts all text contents to Unicode. In C the text will be returned in the UTF-8 or UTF-16 formats, and as native Unicode strings in all other language bindings.
- Ligatures and other multi-character glyphs will be decomposed into a sequence of their constituent Unicode characters.
- Vendor-specific Unicode assignments (PUA characters) are identified, and mapped to characters in the common Unicode area if possible.
- Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character.
- TET implements various workarounds for problems with specific document creation packages, such as InDesign and TeX documents or PDFs generated on mainframe systems.
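The normalization steps listed above can be illustrated with a small standalone sketch. It does not use TET and is not how TET is implemented: it relies on Python's standard unicodedata module, treats NFKC normalization as a stand-in for ligature decomposition, and uses a hypothetical user-supplied table (pua_map) plus U+FFFD as an assumed replacement character.

```python
import unicodedata

# Assumed default for this sketch only; in TET the replacement character is configurable.
REPLACEMENT = "\uFFFD"

def is_private_use(ch):
    """Return True if ch lies in one of Unicode's Private Use Areas (category Co)."""
    return unicodedata.category(ch) == "Co"

def normalize_text(text, pua_map=None, replacement=REPLACEMENT):
    """Map PUA characters via a table (or a replacement), then decompose ligatures."""
    pua_map = pua_map or {}  # hypothetical table: PUA character -> regular Unicode text
    mapped = []
    for ch in text:
        if is_private_use(ch):
            # Known PUA assignment: use the table entry; otherwise mark as unmappable.
            mapped.append(pua_map.get(ch, replacement))
        else:
            mapped.append(ch)
    # NFKC replaces compatibility ligatures such as U+FB01 with their constituent letters.
    return unicodedata.normalize("NFKC", "".join(mapped))

if __name__ == "__main__":
    print(normalize_text("of\ufb01ce"))                  # ligature "fi" decomposed
    print(normalize_text("a\ue000b", {"\ue000": "x"}))   # PUA character mapped via table
```

Characters that are neither ligatures nor PUA code points pass through unchanged, so the function can be applied to text that is already clean.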
Image Extraction
Images on PDF pages can be extracted as TIFF, JPEG, or JPEG 2000 files. Precise geometric information (position, size, and angles) is reported for each image. Fragmented images will be combined into larger images to facilitate repurposing. Images are extracted without downsampling or color space conversion, which preserves the original image quality.
Content Analysis and Word Detection
TET includes advanced content analysis algorithms:
- Patented algorithm for determining word boundaries, which is required to retrieve proper words
- Recombine the parts of hyphenated words (dehyphenation)
- Remove duplicate instances of text, e.g. shadow and artificially bolded text
- Recombine paragraphs in reading order
- Correctly order text which is scattered over the page
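To make the idea of dehyphenation concrete, here is a deliberately naive sketch. It is not TET's algorithm: it simply joins a word fragment that ends a line with a hyphen to the start of the next line, and the function name dehyphenate is chosen for this example only. A real implementation has to distinguish soft line-break hyphens from genuine hyphens (as in "well-known"), which this sketch does not attempt.

```python
def dehyphenate(lines):
    """Rejoin words that were split by a trailing hyphen at the end of a line."""
    result = []
    carry = ""  # word fragment carried over from the previous line
    for raw in lines:
        line = carry + raw.strip()
        carry = ""
        if len(line) > 1 and line.endswith("-") and line[-2].isalpha():
            # Split off the last (hyphenated) fragment and keep the rest of the line.
            head, _, fragment = line.rpartition(" ")
            carry = fragment[:-1]  # drop the trailing hyphen
            if head:
                result.append(head)
        else:
            result.append(line)
    if carry:
        # The last line ended with a hyphen and has no continuation; restore it.
        result.append(carry + "-")
    return result

if __name__ == "__main__":
    for line in dehyphenate(["Fonts may not be embed-", "ded in the file."]):
        print(line)
```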
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
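The effect of including or excluding page areas can be sketched with plain rectangles. The example below is an illustration only and does not use the TET API: the Glyph class, the box layout (llx, lly, urx, ury in points with the origin at the lower left), and the function names are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    text: str
    x: float  # horizontal position in points
    y: float  # vertical position in points, origin at the lower left

def inside(glyph, box):
    """Return True if the glyph position lies within box = (llx, lly, urx, ury)."""
    llx, lly, urx, ury = box
    return llx <= glyph.x <= urx and lly <= glyph.y <= ury

def filter_glyphs(glyphs, include=None, exclude=()):
    """Keep glyphs inside the include box (if given) and outside every exclude box."""
    kept = []
    for g in glyphs:
        if include is not None and not inside(g, include):
            continue
        if any(inside(g, box) for box in exclude):
            continue
        kept.append(g)
    return kept

if __name__ == "__main__":
    page = [Glyph("Header", 72, 800), Glyph("Body", 72, 400), Glyph("Footer", 72, 20)]
    # Exclude a header band and a footer band on an A4 page (595 x 842 points).
    bands = [(0, 780, 595, 842), (0, 0, 595, 40)]
    print([g.text for g in filter_glyphs(page, exclude=bands)])
```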
Configuration Options for Problematic PDFs
TET contains special handling and workarounds for various kinds of PDF where the text cannot be correctly extracted with other products. In addition, it includes various configuration features to improve processing of problem documents:
- Unicode mapping can be customized via user-supplied tables for mapping character codes or glyph names to Unicode.
- PDFlib FontReporter is an auxiliary tool for analyzing fonts, encodings, and glyphs in PDF. It works as a plugin for Adobe Acrobat. This plugin is freely available for Mac and Windows.
- Embedded fonts are analyzed to find additional hints which are useful for Unicode mapping. External font files or system fonts are used to improve text extraction results if a font is not embedded.
XMP Metadata
TET supports XMP metadata in several ways:
- Using the integrated pCOS interface, XMP metadata for the document, individual pages, images, or other parts of the document can be extracted programmatically.
- TETML output contains XMP document and image metadata if present in the PDF.
- Images extracted in the TIFF or JPEG formats contain image metadata if present in the PDF.
TET Command-Line Tool and TET Library
TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer the same features, but are suitable for different deployment tasks. Here are some guidelines for choosing between the two TET flavors:
- The TET programming library can be used for integration into your desktop or server application. Examples for using the library with all supported language bindings are included in the TET packages.
- The TET command-line tool is suited for batch processing PDF documents. In addition to creating plain text it can convert PDF documents to XML. It doesn’t require any programming, but offers command-line options which can be used to integrate it into complex workflows.
TETML Represents PDF Contents in XML
TET optionally represents the PDF contents in an XML flavor called TETML. It contains a variety of PDF information in a form which can easily be processed with common XML tools. TETML contains the actual text plus optionally font and position information, resource details (fonts, images, colorspaces), and metadata. TETML is governed by a corresponding XML schema to make sure that TET always creates consistent and reliable XML output. TETML can be processed with XSLT stylesheets, e.g. to apply certain filters or to convert TETML to other formats. Sample XSLT stylesheets for processing TETML are included in the TET distribution.
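As an example of processing such output with common XML tools, the following sketch collects the words on each page using Python's standard xml.etree.ElementTree module. The embedded sample is a simplified, hand-written fragment, not real TET output: its element names (Page, Word, Text) and the missing XML namespace are assumptions for illustration, so the TETML schema shipped with TET remains the authority on the actual structure.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical TETML-like fragment (no namespace, no font or position data).
SAMPLE = """<TET><Document><Pages>
<Page number="1"><Content><Para>
<Word><Text>Hello</Text></Word>
<Word><Text>world</Text></Word>
</Para></Content></Page>
<Page number="2"><Content><Para>
<Word><Text>Second</Text></Word>
</Para></Content></Page>
</Pages></Document></TET>"""

def words_per_page(xml_text):
    """Return a dict that maps each page number to the list of words on that page."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("Page"):
        number = int(page.get("number"))
        pages[number] = [t.text for t in page.iter("Text")]
    return pages

if __name__ == "__main__":
    for number, words in words_per_page(SAMPLE).items():
        print(number, " ".join(words))
```

With real TETML the element lookups would need to be qualified with the namespace declared in the file.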
Document Domains
PDF documents may contain text in places other than the page contents. While most applications will deal with the page contents only, in many situations other document domains may be relevant as well. TET extracts the text from all of the following document domains:
- page contents
- predefined and custom document info entries
- XMP metadata on document and image level
- bookmarks
- file attachments and PDF portfolios (processed recursively)
- form fields
- comments (annotations)
- general PDF properties, such as page count and conformance to standards like PDF/A or PDF/X
TET Connectors
TET connectors provide the necessary glue code to interface TET with other software. The following TET connectors make PDF text extraction functionality available for various software environments:
- TET connector for the Lucene Search Engine
- TET connector for the Solr Search Server
- TET connector for Oracle Text
- TET connector for MediaWiki
- TET PDF IFilter for Microsoft products is available as a separate product. It extracts text and metadata from PDF documents and makes them available to search and retrieval software on Windows (see separate datasheet for details).
Limitations of the Evaluation Version
- The evaluation version enables all features of the product and produces fully valid output but will only process PDF documents with up to 10 pages and 1 MB size.
Supported Programming Languages
- COM for use with VB, ASP, Borland Delphi, etc.
- C and C++.
- Java, including servlets.
- .NET for use with C#, VB.NET, ASP.NET, etc.
- PHP hypertext processor.
- Perl.
- Python.