Tuesday, September 11, 2012

Optical Character Recognition: Digital Tools for a Digital Age

Each year the Dartmouth College Library, and indeed libraries in general, expand further into digital territory. The ease with which the modern student can access materials online is nothing short of astonishing, especially for those of us who remember the days of card catalogs and microfilm. As traditional publishers look to e-publishing and online journals, so too does the Library, consistently finding new and better ways to provide our services to students and faculty.

When Dartmouth sets out to create a digital version of an item in our collection, there are many factors to consider. Is it visual material or textual? If it's a combination of both, which is more important and/or useful? How should it be displayed? How should it be cataloged?

One example of a digitized item that includes both visual and textual material.

These are all large questions, ones currently being figured out by libraries and repositories around the world. The Dartmouth College Library has recently created a Digital Production Unit under the Preservation Services Department to tackle these issues. While each project the Digital Production Unit takes on is unique in its requirements, each one also gives us new tools and skills to approach the next one.

One of the most important tools is Optical Character Recognition (OCR) software, a program designed to "read" text from an image. This technology has been explored since the early 20th century, originally intended to assist blind readers without requiring a costly conversion to braille for each individual book. Early models proved too expensive for general use until 1965, when the United States Post Office began using the technology to great effect in mail sorting. As the technology advanced, more and more applications for these machines appeared. However, OCR remained severely limited by the amount of hardware required.

Fast-forward forty years: the technology needed for image capture and storage is ubiquitous, even at the consumer level. Now we begin to see OCR technology made widely available, and with only a few basic pieces of equipment and software the Dartmouth College Library has been able to capitalize on this technology.

The first steps in any OCR project are image capture and processing. Using document scanners, we create digital versions of original materials. The exact process varies from project to project, but when a project needs OCR we generally apply image processing at this stage too, typically adjusting contrast and sharpening. The aim is to improve the legibility of the text so that the OCR software can "read" it more reliably.
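To make the two adjustments mentioned above more concrete, here is a minimal sketch in Python. It is not the Library's actual workflow or the behavior of any particular scanning package. It assumes a grayscale image held as a list of rows of pixel values from 0 to 255, and the function names `stretch_contrast` and `sharpen` are made up for illustration.

```python
def stretch_contrast(image):
    """Rescale pixel values so the darkest becomes 0 and the brightest 255."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:
        # A perfectly flat image has no contrast to stretch.
        return [row[:] for row in image]
    # Integer arithmetic keeps the result exact and reproducible.
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in image]


def sharpen(image):
    """Apply a simple 3x3 sharpening kernel; border pixels are left unchanged."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Boost the center pixel and subtract its four direct neighbors.
            value = (5 * image[y][x]
                     - image[y - 1][x] - image[y + 1][x]
                     - image[y][x - 1] - image[y][x + 1])
            out[y][x] = max(0, min(255, value))  # clamp to the valid range
    return out


# A faint, low-contrast patch: values only span 100..200.
patch = [[100, 150], [125, 200]]
print(stretch_contrast(patch))
```

Stretching spreads faded ink and yellowed paper further apart in brightness, and sharpening exaggerates the edges of each letterform. Both make it easier to separate a character from its background.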

A digitized book that has been run through OCR to create an HTML-based version for web viewing.

Next, the document is run through an OCR program. These programs come in many varieties and have become quite advanced over the years, but they are still not all-powerful. Many factors affect the document's legibility, such as the age and condition of the document and the overall quality of the type. While the OCR software has very advanced algorithms for sorting out problematic or indistinct characters, for better quality control we also have the option to "teach" the program by transcribing the type manually whenever the program is unsure of its results. The OCR software then incorporates this information into future readings.
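The "teaching" step can be pictured with a short Python sketch. This is a simplified illustration and not how any specific OCR product learns: it assumes the software reports each word with a confidence score between 0 and 1, and it remembers manual transcriptions in a plain lookup table. The names `review`, `ask_human`, and the 0.80 threshold are hypothetical.

```python
def review(words, corrections, ask_human, threshold=0.80):
    """Return the final text for a page, asking for help on uncertain words.

    words       -- list of (text, confidence) pairs from the OCR pass
    corrections -- dict of earlier manual fixes, updated in place
    ask_human   -- function that takes the doubtful text and returns the fix
    """
    result = []
    for text, confidence in words:
        if confidence >= threshold:
            result.append(text)               # the program is sure enough
        elif text in corrections:
            result.append(corrections[text])  # reuse what it was taught before
        else:
            fixed = ask_human(text)           # manual transcription
            corrections[text] = fixed         # remember it for future readings
            result.append(fixed)
    return result


# Example: a stand-in for the person doing the transcription.
def operator(text):
    return {"tbe": "the"}.get(text, text)

learned = {}
page_one = [("Read", 0.97), ("tbe", 0.55), ("book", 0.91)]
page_two = [("tbe", 0.60), ("end", 0.95)]
print(review(page_one, learned, operator))
print(review(page_two, learned, operator))
```

On the second page the doubtful word is resolved from the lookup table without asking again, which is the basic idea behind incorporating corrections into future readings. Real OCR training typically works on character shapes and not on whole words.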

The result is a collaboration between the great advantages of modern digital information and the inimitable visual character of the historical object. It is our hope to provide the Dartmouth community with the best of both worlds.

Written by Ryland Ianelli.