Tuesday, March 17, 2015

File Validation Woes

Over the last few months I have been preparing and ingesting the master TIFF files for the Photo Files collection into our local repository system for safe keeping. The first step is to package the files using the BagIt specification. BagIt was developed by the Library of Congress and the California Digital Library as a way to package files along with some basic metadata that can be used to validate the bag’s contents. It's the digital equivalent of putting a bunch of things in a box, along with a list of the box’s contents and a unique identifier that can be used to identify each item. Since our Photo Files collection is enormous (so far I’ve deposited over 45,000 images, and we’re not even halfway through the collection), I break the bags into manageable chunks for uploading and processing in our repository.
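For the curious, the box-and-list idea can be sketched in a few lines of Python. This is a simplified illustration, not the BagIt tool itself: it assumes the files sit in a data/ folder inside the bag and that the list is a manifest-sha256.txt file of checksum and path pairs, and it skips the other files a real bag carries. The function names are my own, made up for this example.

```python
import hashlib
from pathlib import Path

MANIFEST_NAME = "manifest-sha256.txt"  # assumed manifest file name


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large TIFFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def payload_files(bag_dir: Path) -> list[Path]:
    """Every file under bag_dir/data, in a stable order."""
    return sorted(p for p in (bag_dir / "data").rglob("*") if p.is_file())


def write_manifest(bag_dir: Path) -> Path:
    """Write the 'list of the box's contents': one checksum and path per line."""
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}"
        for p in payload_files(bag_dir)
    ]
    manifest = bag_dir / MANIFEST_NAME
    manifest.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return manifest


def check_manifest(bag_dir: Path) -> list[str]:
    """Report files that are missing, altered, or present but not listed."""
    listed = {}
    text = (bag_dir / MANIFEST_NAME).read_text(encoding="utf-8")
    for line in text.splitlines():
        if line.strip():
            checksum, rel_path = line.split(maxsplit=1)
            listed[rel_path] = checksum

    problems = []
    for rel_path, checksum in listed.items():
        path = bag_dir / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(path) != checksum:
            problems.append(f"altered: {rel_path}")
    for path in payload_files(bag_dir):
        rel_path = path.relative_to(bag_dir).as_posix()
        if rel_path not in listed:
            problems.append(f"unlisted: {rel_path}")
    return problems
```

An empty list from check_manifest means every file is still exactly as it was when the manifest was written.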

Once a bag is uploaded onto the server, it is validated using the BagIt tool. This is a programmatic way of checking that all the files are still exactly as they should be, and no file has been altered or gone missing or snuck in on the sly. Finally, the contents of the bags are run through the File Information Tool Set, or FITS. FITS brings together a bunch of open-source tools that identify file types, check to see if those files are valid, and extract technical metadata. So, for instance, when I deposit a bag from the Photo Files collection, FITS produces a report that says “These files are TIFFs! These TIFFs are well formed and valid! Here’s some technical info you might want to have around!”, only with fewer exclamation marks:

Sample FITS report

So, this process had been going along just swimmingly until a few weeks ago. Like I said, I’d made it through about 45,000 images, and then suddenly, BAM! An error report for every single image:

page-masters/Icon1647-0875-0000010A.tif is not valid: "Type mismatch for tag 700; expecting 1, saw 7"


All about the Tagged Image File Format (TIFF):

The first thing I discovered was that this error message had something to do with the T part of the TIFF. The TIFF file format has what’s called a header that uses tags to describe the content of the file. These tags, and the information in them, can be manipulated using various types of tools. The capture software we use to create our master images automatically inserts certain tags. As part of our process, we add additional information into the headers of our TIFFs. This is called embedded metadata, or information about the file that is part of the file itself.

The problem with these images was the 700 tag. From the Library of Congress’ super useful guide to TIFF tags I learned that this tag has something to do with the XMP metadata within the file. XMP is a data model for structuring embedded metadata. Data models for metadata help standardize how metadata is stored. For instance, I could edit an image to say “Author: Jane Doe”, while someone else might edit it to say “Photographer: Jane Doe” and we could both mean the same thing. A data model would say, “Ok, everyone, we’re going to use the term Creator.” This makes it easier for both humans and computers to make use of embedded metadata, making digital objects more discoverable and easier to maintain.
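To make the "everyone use Creator" idea concrete, here is a tiny Python sketch. It is not XMP and not how any real tool stores metadata; the mapping table and the function name are hypothetical, made up just to show how agreeing on one term lets two differently labeled records line up.

```python
# Hypothetical lookup table: free-form labels people might type, mapped to
# the single shared term the data model asks everyone to use.
SHARED_TERMS = {
    "author": "Creator",
    "photographer": "Creator",
    "creator": "Creator",
}


def normalize(record: dict[str, str]) -> dict[str, str]:
    """Rename known labels to the shared term; leave unknown labels as they are."""
    return {
        SHARED_TERMS.get(label.strip().lower(), label): value
        for label, value in record.items()
    }


mine = normalize({"Author": "Jane Doe"})
theirs = normalize({"Photographer": "Jane Doe"})
print(mine == theirs)  # both records now use the "Creator" label
```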

So, now I knew that there was a problem with the metadata we were embedding in the files. Something about a 1 and a 7? Deep inside the Photoshop user forums, I found that I was not the first one to run across this problem. These numbers refer to the data type recorded for the 700 tag, with 1 meaning “byte” and 7 meaning “undefined” (often glossed as “unknown”). So these files said "unknown" when they should have said “byte”, right? Well, not really. According to a response in the user forum from David Franzen, who is listed there as an employee, both the 1 and the 7 were valid values. So why was I getting this error message?
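If you want to see the 1 or the 7 for yourself, the type can be read straight out of the file. The sketch below is a minimal reader, not a validator: it assumes a classic (non-BigTIFF) file and only looks at the first image file directory, which is the table of tag entries that the header points to. The function name is my own invention.

```python
import struct

XMP_TAG = 700
FIELD_TYPE_NAMES = {1: "BYTE", 7: "UNDEFINED"}  # only the two types discussed here


def xmp_field_type(tiff_bytes: bytes):
    """Return the numeric field type of tag 700 in the first IFD, or None if absent."""
    byte_order = tiff_bytes[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian
    elif byte_order == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF: unrecognized byte-order mark")

    # Bytes 2-7 of the header: the magic number 42, then the offset of the first IFD.
    magic, ifd_offset = struct.unpack_from(endian + "HI", tiff_bytes, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF: magic number is not 42")

    # The IFD starts with a 2-byte entry count, followed by 12-byte entries:
    # tag (2 bytes), field type (2), value count (4), value or offset (4).
    (entry_count,) = struct.unpack_from(endian + "H", tiff_bytes, ifd_offset)
    for index in range(entry_count):
        entry_offset = ifd_offset + 2 + 12 * index
        tag, field_type, _count, _value = struct.unpack_from(
            endian + "HHI4s", tiff_bytes, entry_offset
        )
        if tag == XMP_TAG:
            return field_type
    return None
```

For simplicity it takes the whole file as bytes; for very large master files, seeking to the directory instead of reading everything would be kinder on memory.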

JHOVE and FITS:

As mentioned above, FITS packages together a number of tools. The tool that was giving this error message was JHOVE, or JSTOR/Harvard Object Validation Environment. According to Wikipedia, JHOVE tells us whether or not objects are “well-formed (consistent with the basic requirements of the format) and valid (generally signifying internal consistency).” The version of JHOVE that is packaged in FITS says that in order for a TIFF to be valid, tag 700 needs to have a type of “1”, and anything else is invalid. But it seems that “7” is also a valid value for this tag. So, why is there this discrepancy in what makes a valid TIFF? Well, it turns out that when JHOVE was first developed, the TIFF format specifications weren’t exactly easy to decipher. The TIFF specifications encoded in the tool were based on confusing, incomplete and scattered documentation. When others started getting the same error message as I got, they turned to Adobe for clarification. As a result, JHOVE’s code was updated in version 1.8 to accept both type 1 (“byte”) and type 7 (“unknown”) as valid values in the 700 tag.
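The difference between the two behaviors is small enough to sketch. To be clear, this is not JHOVE's code (JHOVE is written in Java); it is a Python illustration of the rule as described above, with a made-up function name and a flag I added to switch between the older strict check and the later, more forgiving one. The message text is copied from the error report I received.

```python
from typing import Optional

BYTE = 1
UNDEFINED = 7


def check_tag_700(field_type: int, accept_undefined: bool) -> Optional[str]:
    """Return an error message if the type of tag 700 is rejected, else None.

    accept_undefined=False mimics the older, strict rule (only type 1 passes);
    accept_undefined=True mimics the updated rule (types 1 and 7 both pass).
    """
    allowed = {BYTE, UNDEFINED} if accept_undefined else {BYTE}
    if field_type in allowed:
        return None
    return f"Type mismatch for tag 700; expecting {BYTE}, saw {field_type}"


print(check_tag_700(UNDEFINED, accept_undefined=False))
print(check_tag_700(UNDEFINED, accept_undefined=True))
```

The same file fails under the strict rule and passes under the updated one, which is exactly the gap between the validator bundled in FITS and the newer release.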

However, the updated version of JHOVE didn’t make its way into FITS. Apparently, there were some other changes in JHOVE 1.8 that would make integrating the newer version into FITS a rather large job. Making the necessary changes to FITS to accept newer versions of JHOVE currently isn’t a priority for the FITS developers.

The Real Culprit:

Now that I knew what was causing the error message, I circled back to the big question: why now? The first 45,000 files had been just fine. What changed? In discussion with our digital production team, I learned that there had been a significant change to the production workflow, specifically in how they were adding embedded metadata. What before had been a time-consuming process was greatly simplified by using Adobe Bridge to quality check images and add metadata. In researching this error message, I had seen people mention Bridge as the culprit in changing the 700 tag.


                          

Testing embedded metadata settings:

To be sure, I decided to play around with the settings in both our capture software and Bridge to see if I could get a different result. I created a number of test images with different metadata settings using our capture software, then ran these through FITS. All checked out okay. Next, I played around with the metadata settings in Bridge, and made changes to the embedded metadata in my test files. I ran the files through FITS again, and all failed to validate. No matter what settings I used in Bridge, the 700 tag was changed.

So Now What?

Now that we knew what was causing the error, there were a number of different approaches we could take. To find out what we did, stay tuned for my next blog post...



Written by Jenny Mullins

1 comment:

  1. I'm in a similar predicament. FITS seems to be using JHOVE 1.5 rather than 1.11. JHOVE 1.11 validates my XMP enhanced files just fine. Certainly interested in next post.
