In my previous post on preserving subscription-based electronic resources, I talked about some of the challenges of ensuring long-term access to these materials. I talked a little about our participation in Portico and LOCKSS, and I also mentioned that for some of our e-resources, we manage the archival content ourselves. Here are the details about how we’re handling that here at Dartmouth.
For many electronic resources, including e-journals and digital collections, the publisher sends us a backup copy of the content, which usually arrives on DVDs or on external hard drives. This content consists of a large set of files, generally in XML, PDF, and/or image formats, although we occasionally receive other formats as well, ranging from document files to proprietary database files.
We run this data through a series of programs to check for viruses (just in case…better safe than sorry) and to identify and validate the file formats of the content. We've been experimenting with tools for file format identification, including DROID, JHOVE, JHOVE2, and FITS, which combines DROID, JHOVE, Exiftool, the National Library of New Zealand Metadata Extractor, and the Windows File Utility. We've found FITS particularly useful because it provides such comprehensive information about our data.
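If you're curious what format identification looks like under the hood, here's a minimal sketch (not our actual tooling, and only a tiny illustrative subset of signatures). Tools like DROID work from a much larger registry of "magic number" signatures, but the core idea is just matching the first bytes of a file:

```python
# Illustrative subset of "magic number" signatures; real tools such as
# DROID use the full PRONOM signature registry, not a hardcoded list.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"<?xml", "XML document"),
]

def identify(data: bytes) -> str:
    """Return a best-guess format name based on leading magic bytes."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

The real tools go much further than this, of course: they also validate that a file actually conforms to its format's specification, not just that it starts with the right bytes.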
Then we use the BagIt packaging tool to create packages, or “bags”, containing all of the content. BagIt creates a manifest for each bag, which lists all of the files within that bag, along with a checksum for each file that we can use in the future to verify that none of the data in the file has changed over time. We also add some additional metadata about the content…mostly to help our future selves remember what the content is, where it came from, and what we’ve done to it.
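The manifest is the heart of a bag. As a rough sketch (not the BagIt tool itself, which also adds tag files and other metadata), building one amounts to walking the content directory and recording a checksum for every file, much like BagIt's manifest-sha256.txt:

```python
import hashlib
import os

def make_manifest(bag_dir: str) -> dict:
    """Walk a directory and compute a SHA-256 checksum for every file,
    keyed by relative path, mirroring a BagIt-style manifest."""
    manifest = {}
    for root, _dirs, files in os.walk(bag_dir):
        for name in files:
            path = os.path.join(root, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in chunks so large files don't exhaust memory.
                for chunk in iter(lambda: f.read(8192), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, bag_dir)] = digest.hexdigest()
    return manifest
```

Those stored checksums are what make future fixity checks possible: recompute, compare, and any silent corruption shows up as a mismatch.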
Finally, we move the content to redundant external hard drives and store them in secure locations. The purpose of the redundant hard drives is to ensure that there are multiple copies of every file we’re preserving, in case something should happen to one copy (basically, backups of our backups).
Of course, having a copy of these files sitting on hard drives across campus is hardly a sound preservation strategy! So we take some steps to ensure that the content will be accessible and usable over time. This includes periodically retrieving the hard drives and checking the data to make sure it’s still valid, accessible, and usable. Over time we might migrate some of the data to new file formats, add new information to the metadata files, or even de-accession the backup files if we no longer need to preserve them.
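That periodic validity check is often called a fixity check. Continuing the earlier manifest sketch (again, an illustration rather than our production workflow), verifying a bag means recomputing each checksum and flagging anything that has changed or gone missing:

```python
import hashlib
import os

def verify_manifest(bag_dir: str, manifest: dict) -> list:
    """Recompute each file's SHA-256 and return the relative paths whose
    checksums no longer match, or which are missing entirely."""
    failures = []
    for rel, expected in manifest.items():
        path = os.path.join(bag_dir, rel)
        if not os.path.exists(path):
            failures.append(rel)
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            failures.append(rel)
    return failures
```

An empty result means every file still matches its original checksum; anything else tells us exactly which copies need to be restored from one of the redundant drives.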
It’s a time-consuming process, but an important one if we want to ensure that we can provide access to these materials 10, 20, or even 100 years from now. Looking forward, we’re hoping to automate the majority of this process to save time. Digital preservation is an ongoing work-in-progress!
Written by Helen Bailey