Hacking the DROID Signature File for Characterization
Identification of a format can be approached from many angles. Often a magic number will be used at the beginning of a file. This may be strengthened by the addition of similarly consistent bytes at the end of a file, or indeed in any part of the bitstream in between. Using the sample file format we created last week, the magic number to identify it is specified as:
'\xBB\x0D\x0A\x65\x79\x65\x67\x6C\x61\x73\x73\x1A\x0A\xAB'
Alone this translates to a DROID signature as follows:
Position: Absolute beginning of file (BOF)
Sequence: BB0D0A657965676C6173731A0AAB
Max Offset: 0
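As a quick sanity check on that translation, here is a minimal Python sketch that renders the magic number as the hex string DROID expects. The function name to_droid_sequence is my own for illustration and is not part of DROID or any signature tooling.

```python
# Magic number exactly as given in the Eyeglass specification.
MAGIC = b"\xBB\x0D\x0A\x65\x79\x65\x67\x6C\x61\x73\x73\x1A\x0A\xAB"


def to_droid_sequence(raw: bytes) -> str:
    """Render raw bytes as an upper-case hex string (hypothetical helper)."""
    return raw.hex().upper()


print(to_droid_sequence(MAGIC))  # BB0D0A657965676C6173731A0AAB
```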
To strengthen such a signature we may also make use of the end of file (EOF) sequence and, as the file is of fixed size (582 bytes), we could also add an {n}-byte wildcard between the BOF and EOF sequences to ensure DROID only matches files of a particular size that contain both sequences. My usual approach is to create two separate sequences to ensure the EOF is found as expected at the end of the bitstream. Given that the specification of the Eyeglass format contains a version field, I will also include that in the BOF sequence:
Eyeglass Version 1.0: Signature
---
Position: Absolute beginning of file (BOF)
Sequence: BB0D0A657965676C6173731A0AAB01
Max Offset: 0

Position: Absolute end of file (EOF)
Sequence: BB656F66
Max Offset: 0
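To make the intent of the two sequences concrete, the following Python sketch mimics the check outside of DROID: the BOF sequence must sit at offset 0 and the EOF sequence must sit at the very end of the bitstream. It is a rough approximation of the matching logic under those assumptions, not DROID's implementation, and matches_eyeglass_v1 is a name I have made up for the example.

```python
# Sequences copied from the signature above.
BOF = bytes.fromhex("BB0D0A657965676C6173731A0AAB01")  # magic number + version byte
EOF = bytes.fromhex("BB656F66")


def matches_eyeglass_v1(data: bytes) -> bool:
    """Return True if data starts with BOF and ends with EOF (max offset 0 for both)."""
    # Guard against the two sequences overlapping in a very short bitstream.
    if len(data) < len(BOF) + len(EOF):
        return False
    return data.startswith(BOF) and data.endswith(EOF)


# Build a 582-byte stand-in for a sample file: BOF, zero padding, EOF.
sample = BOF + bytes(582 - len(BOF) - len(EOF)) + EOF
print(matches_eyeglass_v1(sample))       # True
print(matches_eyeglass_v1(sample[:-1]))  # False, the EOF sequence is truncated
```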
Identification aside, signatures can encapsulate a lot of information, the detail of which we lose but which could potentially be useful. Manipulating DROID signatures allows us to use DROID to return this more detailed characterization information. A common scenario might be spotting files without their terminating end of file sequences; PDF and JPEG2000 are two examples where EOF sequences feature prominently.
If a format specifies that it must begin with a BOF sequence and must terminate with an EOF sequence to be considered ‘valid’, there is a preservation risk in a tool making an assertion about the identification of an ‘invalid’ file where either of these sequences does not appear. With validation existing in a preservation workflow to mitigate this risk, identification tools really just provide clues as to what digital objects we are looking at, even though strict identification of formats might be preferred (a commentary on which is beyond the scope of this blog entry).
An alternative to simply encapsulating combinations of sequences, with or without end of file markers, into a single signature is to use the <HasPriorityOverFileFormatID> element in the DROID signature file to create two signatures, one for each scenario. Priorities enable us to return more detailed characterization information about our results as appropriate:
<FileFormat ID="1" Name="Eyeglass Format: BOF and EOF" PUID="id/1">
<InternalSignatureID>1</InternalSignatureID>
<Extension>eygl</Extension>
<HasPriorityOverFileFormatID>2</HasPriorityOverFileFormatID>
</FileFormat>
<FileFormat ID="2" Name="Eyeglass Format: No EOF" PUID="id/2">
<InternalSignatureID>2</InternalSignatureID>
<Extension>eygl</Extension>
</FileFormat>
DROID will correctly identify each file, but it won’t output multiple identifications.
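The effect of the priority element can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described here, not DROID's actual resolution code. It assumes internal signature 1 requires both BOF and EOF while internal signature 2 requires only the BOF, so a complete file matches both formats before priorities are applied. The names PRIORITIES and apply_priorities are hypothetical.

```python
from __future__ import annotations

# FileFormat ID -> IDs it has priority over, taken from the XML above.
PRIORITIES = {1: {2}}


def apply_priorities(matched_ids: set[int], priorities: dict[int, set[int]]) -> set[int]:
    """Drop any matched format that another matched format has priority over."""
    suppressed: set[int] = set()
    for fmt_id in matched_ids:
        suppressed |= priorities.get(fmt_id, set())
    return matched_ids - suppressed


# A complete file matches both formats; only "BOF and EOF" (ID 1) is reported.
print(apply_priorities({1, 2}, PRIORITIES))  # {1}
# A file missing its EOF matches only "No EOF" (ID 2).
print(apply_priorities({2}, PRIORITIES))     # {2}
```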
We can delve deeper into characterization using DROID and start creating signatures for other features that we might want information about and that might cause issues interpreting the file in future. Given that the Eyeglass format specification talks about endianness, we can imagine three scenarios for which characterization signatures could be created:
- Eyeglass format with big-endian encoding
- Eyeglass format with little-endian encoding
- Eyeglass format with invalid endian flag value
To ensure this information is read accurately in Eyeglass files specifically, I will need to duplicate our BOF sequence; however, these entries can be safely duplicated using different PUIDs to distinguish them. Our three new sequences will be as follows:
Eyeglass Format: Big Endian
---
Position: Absolute beginning of file (BOF)
Sequence: BB0D0A657965676C6173731A0AAB{1}01
Max Offset: 0

Eyeglass Format: Little Endian
---
Position: Absolute beginning of file (BOF)
Sequence: BB0D0A657965676C6173731A0AAB{1}00
Max Offset: 0

Eyeglass Format: Invalid Endian Flag
---
Position: Absolute beginning of file (BOF)
Sequence: BB0D0A657965676C6173731A0AAB{1}[!00:01]
Max Offset: 0
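Read together, the three sequences amount to a small decision on one byte: the magic number, a one-byte gap ({1}, which skips the version field), then the endian flag. The Python sketch below reproduces that decision outside of DROID as an illustration. The function name classify_endianness is my own, and the flag values (01 for big-endian, 00 for little-endian) are taken from the sequences above.

```python
from __future__ import annotations

from typing import Optional

MAGIC = bytes.fromhex("BB0D0A657965676C6173731A0AAB")
FLAG_OFFSET = len(MAGIC) + 1  # the {1} wildcard skips one byte after the magic number


def classify_endianness(data: bytes) -> Optional[str]:
    """Name the matching signature, or None if data is not an Eyeglass file."""
    if not data.startswith(MAGIC) or len(data) <= FLAG_OFFSET:
        return None
    flag = data[FLAG_OFFSET]
    if flag == 0x01:
        return "Eyeglass Format: Big Endian"
    if flag == 0x00:
        return "Eyeglass Format: Little Endian"
    # Equivalent of [!00:01]: any value outside the range 00 to 01.
    return "Eyeglass Format: Invalid Endian Flag"


print(classify_endianness(MAGIC + b"\x01\x01"))  # Eyeglass Format: Big Endian
print(classify_endianness(MAGIC + b"\x01\x02"))  # Eyeglass Format: Invalid Endian Flag
```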
With other formats exhibiting greater variability we could begin to gather even more characterization data using the simplicity of DROID signatures.
We might not want all of this data, but my main message is that we have to be careful not to lose too much information about our formats by wrapping it all into a single identification encapsulating multiple combinations of a file format.
Currently the data model used by PRONOM and DROID that would enable this freedom of expression is limited; however, the data is certainly there for some formats. Take the results of this SPARQL query on the prototype PRONOM endpoint for Portable Network Graphics (PNG) 1.1:
PREFIX pronom: <http://reference.data.gov.uk/technical-registry/>
PREFIX format: <http://reference.data.gov.uk/id/file-format/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

select * where {
  format:12 rdfs:label ?formatName .
  format:12 pronom:version ?version .
  format:12 pronom:internalSignature ?signatureID .
  ?signatureID rdfs:label ?internalSignature .
  ?signatureID dc:description ?description .
}
Results:
The results show three signatures attached to PNG 1.1. Each represents a different characteristic of the PNG color space definition. When DROID identifies a PNG with any of these features it will simply identify it with PUID: fmt/12; however, because of the granularity with which these signatures have been expressed, and because the prototype triplestore makes them information resources in their own right, we can use and talk about them as first-class citizens in whatever end use we decide.
With flexible data models and methods of accessing data like this, users in digital preservation will be able to cherry-pick the data consumed by tools such as DROID, thus allowing for deeper analysis of files in the earliest stages of our preservation workflows. It is my hope that we see approaches like this appear sooner rather than later, and that we no longer need to ‘hack’ our own resources so much as build them with appropriate production-quality tools made available to us.
Until then, hopefully some of the approaches demonstrated here prove useful. It is conceivable that as identification tools become less database dependent and become more streamlined we can point at multiple signature files during characterization to return additional information as required. Unfortunately the effort involved in creating these custom files is still quite high.
—
Notes: The signature files created for this blog and their demonstration files can be found in a dedicated folder in the Eyeglass GitHub repository.