Hacking the DROID Signature File: Keep It Simple Stupid!
One of the nicest features introduced by the DROID 6.01 development team (Matt Palmer, Richard Flitcroft, Richard Brennan and Alok Dash) was a simplified signature file format. The format cuts out DROID shift bytes, reducing the pre-processing any system has to do in order to output a signature file. The feature has yet to be adopted: it generates signature files that are only compatible with DROID 6.01 onward, and it would require a change to PRONOM stored procedures. You can, however, see traces of it in the DROID container signature files, which appear in 6.01.
Once again, using the Eyeglass sample format as our example format and making use of the signature files created last week, we can demonstrate the difference this simplification makes like so:
The contraction of the shift bytes using DROID 6.01 syntax can turn a byte sequence (within any particular signature) from this:
<ByteSequence Reference="BOFoffset">
  <SubSequence MinFragLength="0" Position="1" ... >
    <Sequence>BB0D0A657965676C6173731A0AAB</Sequence>
    <DefaultShift>15</DefaultShift>
    <Shift Byte="BB">14</Shift>
    <Shift Byte="0D">13</Shift>
    <Shift Byte="0A">2</Shift>
    <Shift Byte="65">9</Shift>
    <Shift Byte="79">10</Shift>
    <Shift Byte="67">8</Shift>
    <Shift Byte="6C">7</Shift>
    <Shift Byte="61">6</Shift>
    <Shift Byte="73">4</Shift>
    <Shift Byte="1A">3</Shift>
    <Shift Byte="AB">1</Shift>
    <RightFragment MaxOffset="1" MinOffset="1" Position="1">[!00:01]</RightFragment>
  </SubSequence>
</ByteSequence>
To this:
<ByteSequence Reference="BOFoffset">
  <SubSequence MinFragLength="0" Position="1" ... >
    <Sequence>BB0D0A657965676C6173731A0AAB{1}[!00:01]</Sequence>
  </SubSequence>
</ByteSequence>
As you can see, the DROID regular expression syntax, ‘{1}[!00:01]’, is now encoded in the primary ‘<Sequence>’ tag and the shift bytes no longer exist.
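To give a feel for the pre-processing that disappears, here is a minimal Python sketch that reproduces the shift values in the first listing. The rule it uses (shift = sequence length minus the index of the byte's last occurrence, default shift = sequence length plus one) is inferred from the numbers in that listing, not taken from the DROID or PRONOM source, and the function name `shift_table` is hypothetical.

```python
def shift_table(hex_sequence):
    """Return (default_shift, {byte: shift}) for a beginning-of-file sequence.

    Assumption: the rule is inferred from the example listing above,
    not from the DROID or PRONOM source code.
    """
    # Split the hex string into two-character byte values.
    seq_bytes = [hex_sequence[i:i + 2].upper()
                 for i in range(0, len(hex_sequence), 2)]
    length = len(seq_bytes)

    shifts = {}
    for index, byte in enumerate(seq_bytes):
        # A later occurrence overwrites an earlier one, so the last
        # occurrence of each byte decides its shift.
        shifts[byte] = length - index

    # Bytes that never occur in the sequence get the default shift.
    return length + 1, shifts


default_shift, shifts = shift_table("BB0D0A657965676C6173731A0AAB")
print(default_shift)
for byte, shift in shifts.items():
    print(byte, shift)
```

Every one of those values has to be written out as its own `<Shift>` element in the older format; in the simplified format none of them are needed.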
Converting all the signatures in the original signature file created last week results in a new, simpler signature file. When we run the new file across the Eyeglass sample files in DROID we get the same format identifications and characterizations. The two pretty-printed files come to 6410 and 3570 bytes respectively, so the simplified file is roughly 44% smaller. Given that these are extremely basic signatures, the effect on the entire DROID signature file (currently at 1,339,853 bytes) should be even bigger.
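The conversion itself can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. This is only an illustration under stated assumptions: it handles the single case shown above (right fragments appended after the main sequence), it ignores left fragments and XML namespaces, and the way the gap is written (`{n}` for a fixed offset, `{n-m}` for a range, nothing for zero) is my reading of the DROID expression syntax rather than a rule quoted from the documentation. The names `gap` and `simplify` are hypothetical, and the sample input is abbreviated.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the older-style listing above (two Shift elements
# kept, elided attributes left out).
OLD_STYLE = """<ByteSequence Reference="BOFoffset">
  <SubSequence MinFragLength="0" Position="1">
    <Sequence>BB0D0A657965676C6173731A0AAB</Sequence>
    <DefaultShift>15</DefaultShift>
    <Shift Byte="BB">14</Shift>
    <Shift Byte="AB">1</Shift>
    <RightFragment MaxOffset="1" MinOffset="1" Position="1">[!00:01]</RightFragment>
  </SubSequence>
</ByteSequence>"""


def gap(min_offset, max_offset):
    """Write a fragment offset as a gap expression (assumed syntax)."""
    if min_offset == max_offset:
        # Assumption: a zero offset means the fragment is adjacent.
        return "" if min_offset == "0" else "{%s}" % min_offset
    return "{%s-%s}" % (min_offset, max_offset)


def simplify(byte_sequence_xml):
    """Fold right fragments into <Sequence> and drop the shift elements."""
    root = ET.fromstring(byte_sequence_xml)
    for sub in root.iter("SubSequence"):
        sequence = sub.find("Sequence")
        expression = sequence.text.strip()
        # Right fragments are appended in document order; left fragments
        # and alternative fragments are not handled in this sketch.
        for fragment in sub.findall("RightFragment"):
            expression += gap(fragment.get("MinOffset"),
                              fragment.get("MaxOffset"))
            expression += fragment.text.strip()
        sequence.text = expression
        # Remove everything the simplified format no longer carries.
        for child in list(sub):
            if child.tag in ("DefaultShift", "Shift", "RightFragment"):
                sub.remove(child)
    return root


print(ET.tostring(simplify(OLD_STYLE), encoding="unicode"))
```

Run against the sample, the `<Sequence>` element ends up holding `BB0D0A657965676C6173731A0AAB{1}[!00:01]`, matching the simplified listing.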
Having worked through the pre-processing algorithm to create the PRONOM Signature Development Utility (mirrored here), I can say that the amount of string manipulation required is phenomenal. It creates a barrier for anyone who wants to create a DROID signature file or who is attempting to interpret one or reuse it in their own identification engine.
The simplified approach:
- Reduces string manipulation required in tools creating signature files.
- Makes signature files easier to code by hand.
- Reduces complexity required to interpret and parse a DROID signature file.
A DROID signature is just a bit of data that says something about how to identify a file format. PRONOM, in pre-processing this data, made an assumption about the implementation of any system that was going to consume it. It was the skill of the DROID 6.01 team to recognize this and, in doing so, to separate the architectures of PRONOM and DROID. That opens up the potential for a greater number of systems to create signature files that DROID can use, and it enables more tools to consume the DROID signature file format as they choose.
The first of these, more systems creating signature files, is perhaps the biggest boon to the digital preservation community, as systems like UDFR continue to appear that store DROID sequences such as these for PDF 1.3:
- PDF 1.3: Beginning of File Sequence
- PDF 1.3: End of File Sequence
As these systems begin to store a larger number of sequences that don’t appear in PRONOM, they begin to add value to the community and to the identification tools that are out there. It would be an achievable hack to take the UDFR dataset and begin to output signature files using this simplified format. Whether UDFR or someone else takes up that challenge, it is hoped that the creators of the format registries that appear in the near future aren’t put off by a seemingly complex signature file syntax, and that they adopt a DROID 6.01+ syntax as at least one of their available data representations.
—
Notes: The pre-processing of DROID signatures is described in The National Archives’ Digital Preservation Technical Paper 1: Appendix 2.
Hi Ross,
Your links to the UDFR beginning of file sequence and end of file sequence don’t seem to be working.
Great posts though!
Thanks,
Euan
Curious. They don’t work for me either unless I go through their ‘Start Here’ links. I’ve tweeted to see if I get a response until I find a better way to contact their team. The form of the links should work:
[http:] // [udfr.org] / [ontowiki] / [view] / [r] / [u1r3039]
Thanks for the comment!
Ross