PDF Store Planet PDF

PDFlib TET


Price: $199.00
Platform:
Licensing:
Version: 3.0



PDFlib Text Extraction Toolkit (TET) is a developer product for reliably extracting text and raster images from PDF documents. TET makes available the text contents of a PDF as Unicode strings, plus detailed glyph and font information as well as the position on the page. Raster images are extracted in common formats. TET optionally converts PDF documents to an XML-based format called TETML which contains text and metadata as well as resource information. TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, interactive elements, etc.

The new version of TET extracts text from PDF documents, retrieves raster image data and tables, and converts PDF documents to XML.

In addition to low-level text retrieval, TET contains advanced content analysis algorithms for determining word boundaries and removing duplicate text (such as shadows and artificial bold). Using the auxiliary pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, hypertext, etc.

Note: Please review the following Platform and Product Requirements, Limitations of the Evaluation version, Supported Programming Languages and License Definitions before downloading or purchasing this software.

Feature Summary

  • Implement a Search Engine for Processing PDF
  • Extract Text from PDF
  • Extract Image from PDF
  • Convert Text Contents to other Formats
  • Process PDFs based on Content
  • Supported PDF Input
  • Unicode
  • Full CJK Support
  • Content Analysis and Word Detection
  • Geometry
  • Configuration Options for problematic PDF
  • pCOS Interface for simple Access to PDF Objects
  • Programming and Performance
  • TET Command-Line Tool and TET Library
  • TETML represents PDF contents in XML
  • TET Connectors

Major Features in Depth

Supported PDF Input.

  • all PDF versions up to PDF 1.6 (Acrobat 7)
  • all font and encoding types: base 14 fonts, TrueType, PostScript, OpenType, CID fonts
  • encrypted PDF with 40- and 128-bit encryption (appropriate permission settings or password required)

Unicode. Although text in PDF is usually not encoded in Unicode, PDFlib TET will normalize the text from a PDF document to Unicode:

  • TET converts all text contents to Unicode. In C the text will be returned in the UTF-8 or UTF-16 formats, and as native Unicode strings in all other language bindings.
  • Ligatures and other multi-character glyphs will be decomposed into a sequence of their constituent Unicode characters.
  • Vendor-specific Unicode assignments (PUA characters) are identified, and mapped to characters in the common Unicode area if possible.
  • Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character.
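
The normalization steps listed above can be illustrated without TET itself. The following is a minimal sketch using only Python's standard `unicodedata` module. The function name `normalize_extracted_text` and the default replacement character are assumptions made for illustration and are not part of the TET API. Unlike TET, the sketch does not try to map PUA characters to common Unicode characters; it simply substitutes the replacement character.

```python
import unicodedata

def normalize_extracted_text(text, replacement="\ufffd"):
    """Decompose ligatures and replace Private Use Area (PUA) characters.

    Illustrative only: this is not TET's algorithm or API.
    """
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Co":
            # "Co" is the Unicode category for private-use code points,
            # which carry no standard meaning.
            out.append(replacement)
        else:
            # NFKC compatibility normalization expands ligatures such as
            # U+FB01 (fi) into their constituent letters.
            out.append(unicodedata.normalize("NFKC", ch))
    return "".join(out)

print(normalize_extracted_text("\ufb01nal o\ufb03ce"))  # final office
```

Passing a different `replacement` value mimics the idea of a configurable replacement character for glyphs that cannot be mapped.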

Image Extraction. Images on PDF pages can be extracted as TIFF, JPEG, or JPEG 2000 files. Precise geometric information (position, size, and angles) is reported for each image. Fragmented images will be combined into larger images to facilitate repurposing. Since no downsampling or color space conversion occurs, the extracted images retain the quality of the originals.

Page Layout and Table Detection. The page content is analyzed to determine text columns. Tables are detected, including cells which span multiple columns. This improves the ordering of the extracted text. Table rows and the contents of each table cell can be identified.

Full CJK Support. TET includes full support for extracting Chinese, Japanese, and Korean text. All predefined CJK CMaps (encodings) are recognized; horizontal and vertical writing modes are supported.

Content Analysis and Word Detection. TET can be used to retrieve low-level glyph information, but also includes advanced algorithms for content analysis:

  • Detect word boundaries to retrieve words instead of characters.
  • Recombine the parts of hyphenated words.
  • Remove duplicate instances of text, e.g. shadow and artificial bold text.
  • Recombine paragraphs into reading order.
  • Reorder text which is scattered over the page.
  • Reconstruct lines of text.
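
Two of the steps listed above, removing shadow duplicates and recombining hyphenated words, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not TET's actual algorithm: glyphs are modeled as hypothetical `(char, x, y)` tuples, the tolerance value is arbitrary, and the dehyphenation rule is naive (it would also join words whose hyphen is genuine).

```python
def remove_shadow_duplicates(glyphs, tolerance=1.0):
    """Drop a glyph if the same character was already kept at nearly
    the same position (typical for shadow or artificial bold text)."""
    kept = []
    for ch, x, y in glyphs:
        is_duplicate = any(
            ch == kch and abs(x - kx) <= tolerance and abs(y - ky) <= tolerance
            for kch, kx, ky in kept
        )
        if not is_duplicate:
            kept.append((ch, x, y))
    return kept

def dehyphenate(lines):
    """Join a line ending in '-' with the line that follows it."""
    result = []
    carry = ""
    for line in lines:
        line = carry + line.strip()
        carry = ""
        if line.endswith("-") and len(line) > 1:
            carry = line[:-1]       # hold back the text, drop the hyphen
        else:
            result.append(line)
    if carry:
        result.append(carry + "-")  # trailing hyphen with nothing to join
    return result

glyphs = [("A", 10.0, 20.0), ("A", 10.4, 19.8), ("B", 18.0, 20.0)]
print(remove_shadow_duplicates(glyphs))  # [('A', 10.0, 20.0), ('B', 18.0, 20.0)]
print(dehyphenate(["The text extrac-", "tion toolkit"]))  # ['The text extraction toolkit']
```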

Geometry. TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded from or included in the text extraction, e.g. to ignore headers, footers, or margins.
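
To make the idea of including and excluding page areas concrete, here is a minimal sketch that filters positioned words against rectangles. The data layout (`(text, x, y)` tuples, boxes as `(llx, lly, urx, ury)` in points with the origin at the lower left) and the function names are assumptions for illustration; TET's own options for area selection are not shown here.

```python
def inside(box, x, y):
    """Return True if point (x, y) lies within box (llx, lly, urx, ury)."""
    llx, lly, urx, ury = box
    return llx <= x <= urx and lly <= y <= ury

def select_words(words, include=None, exclude=()):
    """Keep words inside the include box (if given) and outside all
    exclude boxes."""
    selected = []
    for text, x, y in words:
        if include is not None and not inside(include, x, y):
            continue
        if any(inside(box, x, y) for box in exclude):
            continue
        selected.append(text)
    return selected

# A US Letter page measures 612 x 792 points.
words = [("Page header", 72, 770), ("Body text", 72, 400), ("3", 300, 30)]
header = (0, 742, 612, 792)
footer = (0, 0, 612, 50)
print(select_words(words, exclude=[header, footer]))  # ['Body text']
```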

Configuration Options for problematic PDF. TET contains special handling and workarounds for various kinds of PDF where the text cannot be correctly extracted with other products. In addition, it includes various configuration features to improve processing of problem documents:

  • Unicode mapping can be customized via user-supplied tables for mapping character codes or glyph names to Unicode.
  • PDFlib FontReporter is an auxiliary tool for analyzing fonts, encodings, and glyphs in PDF. It runs as a plugin for Adobe Acrobat 5, 6, or 7. This plugin is freely available for Mac and Windows.
  • TET parses embedded fonts to find additional hints which are useful for Unicode mapping. External font files or system fonts can be used to improve text extraction results if a font is not embedded.

pCOS Interface for simple Access to PDF Objects. TET includes the pCOS (PDFlib Comprehensive Object System) interface for retrieving arbitrary PDF objects. With pCOS you can retrieve PDF metadata, hypertext, or any other information from a PDF document outside the actual page content with a simple query interface without the need for low-level programming.

Programming and Performance. TET has been developed with portability, performance, and robustness in mind. TET is thread-safe for deployment in multi-threaded server applications. The core library is written in highly optimized C code for maximum performance and small overhead. Language bindings are available for COM, C, C++, Java, and .NET.

TET Command-Line Tool and TET Library. TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer the same features, but are suitable for different deployment tasks. Here are some guidelines for choosing between the two TET flavors:

  • The TET programming library can be used for integration into your desktop or server application. Examples for using the library with all supported language bindings are included in the TET packages.
  • The TET command-line tool is suited for batch processing PDF documents. In addition to creating plain text it can convert PDF documents to XML. It doesn’t require any programming, but offers command-line options which can be used to integrate it into complex workflows.

TETML represents PDF contents in XML. TET optionally represents the PDF contents in an XML flavor called TETML. It contains a variety of PDF information in a form which can easily be processed with common XML tools. TETML contains the actual text plus optionally font and position information, resource details (fonts, images, colorspaces), and metadata. TETML is governed by a corresponding XML schema to make sure that TET always creates consistent and reliable XML output. TETML can be processed with XSLT stylesheets, e.g. to apply certain filters or to convert TETML to other formats. Sample XSLT stylesheets for processing TETML are included in the TET distribution.
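
As an illustration of processing such output with common XML tools, the following sketch collects the words on each page using Python's standard `xml.etree.ElementTree`. The sample markup is a hypothetical, heavily simplified stand-in: its element and attribute names are assumptions, it omits namespaces, and it should not be read as the real TETML schema, which is defined in the TET distribution.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; NOT the actual TETML schema.
SAMPLE = """
<Document filename="example.pdf">
  <Page number="1">
    <Word><Text>Hello</Text><Box llx="72" lly="700" urx="100" ury="712"/></Word>
    <Word><Text>world</Text><Box llx="104" lly="700" urx="134" ury="712"/></Word>
  </Page>
  <Page number="2">
    <Word><Text>Goodbye</Text><Box llx="72" lly="700" urx="120" ury="712"/></Word>
  </Page>
</Document>
"""

def words_by_page(xml_text):
    """Map each page number to the list of word texts found on that page."""
    root = ET.fromstring(xml_text.strip())
    pages = {}
    for page in root.iter("Page"):
        number = int(page.get("number"))
        pages[number] = [word.findtext("Text") for word in page.iter("Word")]
    return pages

print(words_by_page(SAMPLE))  # {1: ['Hello', 'world'], 2: ['Goodbye']}
```

With real TETML the same approach applies, but the element names and the XML namespace must be taken from the schema shipped with TET.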

TET Connectors. TET connectors provide the necessary glue code to interface TET with other software. The following TET connectors make PDF text extraction functionality available for various software environments:

  • TET connector for the Lucene Search Engine
  • TET connector for the Solr Search Server
  • TET connector for Oracle Text

The TET Plugin for Adobe Acrobat is a free utility for extracting text and images from PDF. It offers better functionality than Acrobat’s built-in tools, and can be used to evaluate TET interactively.

Platform and Product Requirements

  • Windows, Mac OS X, Linux, FreeBSD and other operating systems.

Limitations of the Evaluation version.

  • The evaluation version enables all features of the product and produces fully valid output, but will only process PDF documents of up to 10 pages and 1 MB in size.

Supported Programming Languages

  • COM for use with VB, ASP, Borland Delphi, etc.
  • C and C++.
  • Java, including servlets & Java Application Server
  • .NET for use with C#, VB.NET, ASP.NET, etc.
  • Perl.
  • PHP hypertext processor.
  • RPG (IBM eServer iSeries).

License Definitions

PDFlib TET is licensed regardless of the number of CPUs.

Planet PDF, PDF Store, Nitro PDF Software and ARTS PDF are all copyright © 2009 Nitro PDF, Inc. and Nitro PDF Pty Ltd. All Rights Reserved.