I attended the International Digital Curation Conference in Amsterdam last week. It was a useful venue for presenting my work on The Skeleton Test Corpus, as it allowed me to follow up on a presentation I made at last year’s conference on behalf of former colleagues at The National Archives, calling for a comprehensive file format test suite. I’ve already covered much of the rationale behind The Skeleton Test Corpus in my previous post and GitHub wiki pages. The presentation from the conference can be viewed on Prezi.
The paper will be published in the International Journal of Digital Curation in a few months. For those attending the conference, I was able to add a few more thoughts to my presentation that have come out of doing this work: namely, the different permutations of skeleton file that can be created from a single DROID signature, and what the implications might be for generating a comprehensive test corpus for evaluating the format identification, feature extraction and validation capabilities of various tools.
What The Skeleton Test Suite Generator doesn’t do…
Put simply, it doesn’t generate all possible permutations of skeleton file. That is, given syntax that suggests multiple options, e.g. (01|02|03|04), meaning hex 01 or 02 or 03 or 04, the generator will always choose the first option instead of creating four files (in this instance), one for each hex value. This is by design: I didn’t feel it would have a significant impact on the initial testing of the tool and suite, and it could easily be added to the generation code later.
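To make the difference concrete, here is a minimal Python sketch contrasting the first-option behaviour with a full expansion. It is not the generator’s actual code: it only understands simple bracketed alternations of hex bytes, and the function names and the literal bytes in the example string are invented for illustration rather than taken from any PRONOM signature.

```python
import itertools
import re

# Matches a bracketed alternation of hex bytes, e.g. (01|02|03|04).
# Other DROID syntax (wildcards, ranges, etc.) is deliberately ignored here.
GROUP = re.compile(r"\(([0-9A-Fa-f]{2}(?:\|[0-9A-Fa-f]{2})*)\)")


def split_signature(sig):
    """Split a signature into a list of option lists.

    Literal runs become single-option lists; each bracketed group
    becomes a list of its alternatives.
    """
    parts = []
    pos = 0
    for match in GROUP.finditer(sig):
        if match.start() > pos:
            parts.append([sig[pos:match.start()]])
        parts.append(match.group(1).split("|"))
        pos = match.end()
    if pos < len(sig):
        parts.append([sig[pos:]])
    return parts


def first_option(sig):
    """Mimic the current behaviour: always take the first alternative."""
    return "".join(options[0] for options in split_signature(sig))


def all_options(sig):
    """Expand every alternative into its own hex string."""
    return ["".join(combo) for combo in itertools.product(*split_signature(sig))]


# Hypothetical example string, not a real signature.
example = "AABB(01|02|03|04)CCDD"
print(first_option(example))      # AABB01CCDD
print(len(all_options(example)))  # 4
```

Each string returned by `all_options` would correspond to one skeleton file, so the number of files is the product of the sizes of the bracketed groups.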
There are other syntactic forms in the DROID regular expression syntax that suggest multiple configurations of file too. These are better understood by reviewing the help section of the PRONOM Signature Development Utility.
You can cherry-pick most signatures in PRONOM and find examples that would result in multiple skeleton files. What we begin to discover is that the number of files we have to deal with grows quite rapidly. This is something I have been considering since I handed in the final paper last year. For my conference presentation I stepped through the first handful of signatures in PRONOM and selected the Still Picture Imaging File Format (SPIFF), for its composition, to analyze a little deeper.
The signature, as of 19 January 2012 (Signature File v66), looks like this:
Focusing on the round-bracketed syntax, I could naively generate 5 x 6 = 30 skeleton files from this string.
Dissecting the SPIFF format header, however, we discover that the first section of syntax, (00|01|02|03|04), relates to Profile ID and the second, (00|01|02|03|04|05), relates to Compression Type. In the specification, a single Profile ID applies to bi-tonal images and the remaining four to continuous-tone images; similarly, five Compression Types are associated with bi-tonal images and two with continuous-tone images. This limits the number of real-world configurations of file to (1 x 5) + (4 x 2) = 13 skeleton files.
This demonstrates a flaw in the automated generation of the skeleton test suite as the signature doesn’t contain enough metadata to tell us this information. It might be easier and more effective to create 13 skeleton files by hand.
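The gap between the two counts can be checked with a few lines of Python. This is only a sketch of the arithmetic: the class names and dictionary layout are assumptions of mine, and the per-class counts are the ones quoted above rather than anything read from the signature, which is precisely the information the generator lacks.

```python
# Naive count: every Profile ID option paired with every Compression Type option,
# as read directly from the two bracketed groups in the signature.
naive_count = 5 * 6

# Counts per image class, as I read them from the specification.
# Which specific byte values fall into each class is not modelled here.
image_classes = {
    "bi-tonal": {"profile_ids": 1, "compression_types": 5},
    "continuous-tone": {"profile_ids": 4, "compression_types": 2},
}

# Only pair Profile IDs with Compression Types from the same image class.
constrained_count = sum(
    counts["profile_ids"] * counts["compression_types"]
    for counts in image_classes.values()
)

print(naive_count)        # 30
print(constrained_count)  # 13
```

An automated generator would need something like the `image_classes` mapping supplied by hand for each format before it could avoid producing the 17 combinations that the specification rules out.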
As I’ll demonstrate, I believe this also has quite a large impact on the effort involved in creating a fully featured, comprehensive test-suite, the likes of which the community might benefit from to test more than just format identification techniques.
Permutations, permutations and configurations
Given the SPIFF format header above, we can begin to draw out other components that describe different configurations of file. If I select four components that I feel are easiest to comprehend and list the number of options in each of those fields (those associated with bi-tonal images, those associated with continuous-tone images, and those associated with both), we can begin to calculate the number of SPIFF files that can be generated:
|Component|Bi-tonal options|Continuous-tone options|
|Profile ID|1|4|
|Bits Per Sample|1|5|
|Compression Type|5|2|
|Resolution Units|3|3|
This is a simplified break-down and interpretation of components within the header and we can calculate the number of permutations as follows:
(1 x 1 x 5 x 3) bi-tonal + (4 x 5 x 2 x 3) continuous-tone = 15 + 120:
135 different permutations
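The same calculation can be written as a short Python sketch driven by the table, which makes it easy to see how adding another component changes the total. The dictionary layout and function name are assumptions for illustration; the numbers are my simplified reading of the header, as given in the table above.

```python
from math import prod

# Number of options per component for each image class, from the table above.
COMPONENTS = {
    "Profile ID": {"bi-tonal": 1, "continuous-tone": 4},
    "Bits Per Sample": {"bi-tonal": 1, "continuous-tone": 5},
    "Compression Type": {"bi-tonal": 5, "continuous-tone": 2},
    "Resolution Units": {"bi-tonal": 3, "continuous-tone": 3},
}


def configurations_per_class(components):
    """Multiply the option counts within each image class."""
    classes = sorted({name for counts in components.values() for name in counts})
    return {
        name: prod(counts[name] for counts in components.values())
        for name in classes
    }


per_class = configurations_per_class(COMPONENTS)
total = sum(per_class.values())

print(per_class)  # {'bi-tonal': 15, 'continuous-tone': 120}
print(total)      # 135
```

Products are taken within each image class and only then summed, because options from the bi-tonal column are never combined with options from the continuous-tone column.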
That’s 135 exemplar test files we might want to create before we take into account the ‘S’ parameter in the header which describes 16 different color spaces that can be attached to the SPIFF file format, putting this number up even higher. And of course before we take into account the many thousands of file formats in existence.
“U” and “umption” and concluding remarks
I’ve made an assumption that a format corpus might include all permutations of every file format. That is before we make any mention of content we are interested in. Creating a corpus like this I am sure would prove to be a meticulous task – the likes of which would require detailed metadata for each file and a suitable storage and retrieval mechanism – nothing short of a complete, web enabled, digital archive.
It raises the question: is this a solution the digital preservation community needs? My response is: not really. I’m an advocate of sampling and believe that with quite a small sample of files and up-to-date format specifications we can learn enough about the structure of a file format to create the tools we need for feature extraction and validation. Once we have those tools, sampling a real-world population from our own collections or around the internet can provide us with enough information about exceptions to help improve them and make them more robust.
So what is the point of this post?
Well, it’s fun to think about the numbers! More importantly, if we are gathering collections of files, there are books on sampling and measuring techniques that describe how small collections of measurements can provide us with more information than we might think. Following that idea, perhaps well-managed corpora of samples, such as the one maintained by the Open Planets Foundation with its simple licensing and contribution model, are all we require when it comes to format test-suites for feature extraction and validation, and perhaps more comprehensive test-suites are simply something that belongs in a museum (albeit a digital one!).
Note: The numbers and figures I have presented in this blog post are as I understand them to be, and as I presented them at the conference. Please take time in the comments section to suggest any corrections that might need to be made. I fully appreciate the part interpretation has to play in understanding format specifications and so I value the additional input. Thank you.
Van Gogh: The featured image and painting from this post can be seen at the Van Gogh Museum in Amsterdam; a highly recommended experience. For more information about Vincent Van Gogh, check out Artsy’s page about him: https://www.artsy.net/artist/vincent-van-gogh.