For the past few years, businesses around the world have been using PDF as an archival format, not necessarily
because they want a paperless office, but because they recognize the benefits of an electronic document
archive, namely convenient access for anyone with
permission. The PDF format is particularly popular because it preserves the exact look and feel of the original
physical document.
Strengths
Longevity Paper documents age with time and use. Electronic documents don't. Even though the programs written to view electronic documents do age, the basic specification of the PDF file format is openly published, so the health of your electronic archive is not tied to the health of the company that developed the PDF viewer.
Preserves look and feel PDF was originally created to duplicate paper
documents, and as such is second to none for preserving the physical character of the original source. PDF's support
for multi-page documents is another advantage.
Metadata Metadata (information that describes a document, such as author, date created, keywords and so on) is one of the big advantages electronic documents have over paper. The latest version of PDF (1.5) includes support for XMP (the Extensible Metadata Platform). You still need to develop processes to identify the metadata you want and to ensure that it is preserved, but the format itself supports metadata, and supports it quite well.
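To make the XMP point concrete, here is a minimal sketch of reading Dublin Core fields from an XMP packet. It assumes the packet is stored uncompressed in the file, which is common but not guaranteed, and simply scans the raw bytes for the packet wrapper rather than parsing the PDF properly. The function names `extract_xmp` and `dublin_core_values` are illustrative only and do not belong to any product mentioned here.

```python
import re
import xml.etree.ElementTree as ET

# Namespaces used inside XMP packets (RDF and Dublin Core).
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def extract_xmp(pdf_bytes):
    """Return the body of the first uncompressed XMP packet, or None."""
    match = PACKET_RE.search(pdf_bytes)
    return match.group(1).strip() if match else None


def dublin_core_values(xmp_xml, field):
    """Collect the values of a Dublin Core field such as 'title' or 'creator'."""
    root = ET.fromstring(xmp_xml)
    values = []
    for element in root.iter("{%s}%s" % (DC_NS, field)):
        # Values are stored as rdf:li items inside an rdf:Alt, rdf:Seq or rdf:Bag.
        for item in element.iter("{%s}li" % RDF_NS):
            if item.text:
                values.append(item.text.strip())
    return values
```

A production archive would normally rely on a proper PDF library, since metadata streams can be compressed or encrypted; the byte scan above is only meant to show what the embedded metadata looks like.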
Searchable Properly created PDF documents contain all their content in machine-readable form. This, combined with PDF's metadata capabilities, makes it possible to search an indexed PDF-based archive in ways that are simply impossible with a conventional paper-based archive.
Weaknesses
Bells and whistles While the basic features of PDF are appropriate for archival applications, some of the extended features of PDF are not - specifically, functionality that depends on extensions to the published specification or on external helper applications. For example, embedded video or audio files can be encoded using standards that are not part of the published PDF specification, which undermines the longevity of the document in which they are included. For this reason, care needs to be taken to ensure that PDFs created for archival purposes do not include unnecessary frills.
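One rough way to screen incoming files for such frills is to look for the PDF name tokens that typically accompany multimedia or helper-application features. The sketch below is a heuristic under stated assumptions: the list of names is a starting point rather than a complete inventory, the function name `find_risky_features` is made up for illustration, and a plain byte scan will miss anything stored inside compressed object streams.

```python
import re

# PDF name tokens that usually signal multimedia content or reliance on
# external applications. This list is an assumption, not an exhaustive set.
RISKY_NAMES = [b"/Movie", b"/Sound", b"/Screen", b"/RichMedia", b"/Launch"]


def find_risky_features(pdf_bytes):
    """Return the risky name tokens that appear in the raw PDF bytes."""
    found = []
    for name in RISKY_NAMES:
        # Require a non-alphanumeric character (or end of data) after the
        # name so that, for example, /Sound does not match /Soundtrack.
        if re.search(re.escape(name) + rb"(?![A-Za-z0-9])", pdf_bytes):
            found.append(name.decode("ascii"))
    return found
```

Files that come back with a non-empty list can be set aside for manual review before they enter the archive.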
Getting down to business
Creation The first step in establishing a PDF-based archive is acquiring the PDF documents themselves. If you are creating your PDFs from other document types then you'll need some form of creation or conversion application. Creation applications appear in the source program as an additional printer; to create a PDF document, the user simply prints the file using the PDF printer, and a PDF file is created. Metadata can then be added via the document properties. Popular examples of PDF creation products are DocuCom PDF Driver, Amyuni PDF converter and Adobe Acrobat.
Scan and OCR The situation is a little different if you are converting physical paper to PDF. You will need to scan and OCR the documents, a process that involves scanning your physical documents to the TIFF image format, converting the TIFF images to PDF and then running OCR on the PDF documents. This process is somewhat labour-intensive if done manually, but if automated the results are surprisingly good, if not quite perfect. Note: Image-based PDFs must be OCR'd if you intend to index them for searching. Applications suited to this process are AdLib eXpress and AdLib OCR.
Batch conversion While simple creation tools are useful, they tend to become unwieldy if you have large numbers of documents to convert. An alternative is one of the tools that allow batch processing of documents. With one of these tools, all you need to do is specify the PDF creation settings you want and click OK, and the tool starts crunching away while you go for lunch. Typically, products that support batch processing also offer 'hot' folder functionality - any file copied to a hot folder is automatically converted to PDF using whatever settings you specified when you set up the folder. A popular application that falls into this category is AdLib eXpress.
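The hot-folder idea can be sketched in a few lines. This is not how any particular product works; it is a minimal illustration that assumes a caller-supplied `convert(source, target)` function (hypothetical here) which does the real conversion, and a `processed` sub-folder (also an assumption) where originals are moved once converted.

```python
import shutil
from pathlib import Path


def process_hot_folder(hot_dir, out_dir, convert, done_dir=None):
    """Convert every file currently waiting in hot_dir and return the outputs.

    convert(source, target) is a hypothetical callback that writes a PDF
    to `target`; it stands in for whatever conversion tool is in use.
    """
    hot_dir, out_dir = Path(hot_dir), Path(out_dir)
    done_dir = Path(done_dir) if done_dir else hot_dir / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)
    done_dir.mkdir(parents=True, exist_ok=True)

    outputs = []
    for source in sorted(hot_dir.iterdir()):
        if not source.is_file():
            continue  # skip sub-folders such as "processed"
        target = out_dir / (source.stem + ".pdf")
        convert(source, target)
        # Move the original out of the hot folder so it is not converted twice.
        shutil.move(str(source), str(done_dir / source.name))
        outputs.append(target)
    return outputs
```

Calling this function in a loop with a short pause between passes gives the "copy a file in, get a PDF out" behaviour described above. A real tool would also need to handle files that are still being copied and conversions that fail.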
Using your archive
Searching PDFs and PDF-based archives are searchable. If you open a PDF document then you can search for text within that document, but if you create an index of your PDF archive then you can search the entire archive without having to open each PDF individually. An indexed PDF archive also gives you the option of making your archive available for searching on networks, websites, CDs or DVDs. A popular application that falls into this category is ARTS PDF Search.
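To show why an index avoids opening each file, here is a minimal sketch of an inverted index: a table mapping every word to the documents that contain it. It assumes the text has already been extracted from each PDF (by OCR or a PDF library) and is passed in as plain strings; the names `build_index` and `search` are illustrative and unrelated to any product named above.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document names containing it.

    `documents` is a dict of {file name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

For example, with an index built from `{"invoice-001.pdf": "Invoice for office chairs", "memo.pdf": "Office relocation memo"}`, the query `"office memo"` returns only `memo.pdf`, and the lookup never touches the PDF files themselves.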
Conclusions
The PDF file format has a number of features that make it a good choice for archival applications, even in situations that don't involve setting up a full-blown indexed, searchable document repository. Hopefully this discussion has given you a good first impression of the possibilities associated with PDF archives; if you have any questions please feel free to get in touch.