Prototype

PRONOM: Signature Development Utility









Help using the utility


In PRONOM, an internal signature is composed of one or more byte sequences, each comprising a continuous sequence of hexadecimal byte values and, optionally, regular expressions. A signature byte sequence is modelled by describing its starting position within a bitstream and its value.

The starting position can be one of two basic types:

Absolute: The byte sequence starts at a fixed position within the bitstream. The position is described as an offset from either the beginning or the end of the bitstream. The byte sequence can therefore be located by moving to the specified offset, counting from either the beginning-of-file (BOF) or end-of-file (EOF) position. If counting from the EOF position, the offset is to the final byte in the sequence.

Variable: The byte sequence can start at any offset within the bitstream. The byte sequence can be located by examining the entire bitstream.
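
The two position types can be illustrated with a minimal Python sketch. The ByteSequence class, its field names and the locate function are hypothetical names introduced here for illustration; they are not part of PRONOM or DROID. The sketch handles fixed byte values only (no wildcards) and assumes that an EOF offset of 0 means the final byte of the sequence is the last byte of the bitstream.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ByteSequence:
    """Hypothetical model of one signature byte sequence (fixed bytes only)."""
    value: bytes      # literal bytes to look for (no wildcards in this sketch)
    anchor: str       # "BOF", "EOF" or "VARIABLE"
    offset: int = 0   # used only by the absolute anchors (BOF and EOF)


def locate(seq: ByteSequence, data: bytes) -> Optional[int]:
    """Return the index in data where seq.value starts, or None if absent."""
    n = len(seq.value)
    if seq.anchor == "VARIABLE":
        # Variable position: examine the entire bitstream.
        idx = data.find(seq.value)
        return idx if idx != -1 else None
    if seq.anchor == "BOF":
        # Absolute position counted forwards from the beginning of the file.
        start = seq.offset
    elif seq.anchor == "EOF":
        # Absolute position counted backwards from the end of the file.
        # Assumption: the offset points at the FINAL byte of the sequence,
        # and offset 0 is the last byte of the bitstream.
        start = len(data) - seq.offset - n
    else:
        raise ValueError(f"unknown anchor: {seq.anchor!r}")
    if start < 0 or start + n > len(data):
        return None
    return start if data[start:start + n] == seq.value else None
```

For example, with data = b"%PDF-1.4 body %%EOF", the call locate(ByteSequence(b"%PDF", "BOF"), data) returns 0, while a VARIABLE sequence is found wherever it first occurs in the data.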

The value of the byte sequence is defined as a sequence of hexadecimal values, optionally incorporating any of the following regular expressions:

??: wildcard matching any pair of hexadecimal digits (i.e. a single byte).

e.g.: 0x0A FF ?? FE would match 0x0A FF 6C FE or 0x0A FF 11 FE.

*: wildcard matching any number of bytes (0 or more).

e.g.: 0x0A FF * FE would match 0x0A FF 6C FE or 0x0A FF 6C 11 FE.

{n}: wildcard matching n bytes, where n is an integer.

e.g.: 0x1C 20 {2} 4E 12 would match 0x1C 20 FF 15 4E 12.

{m-n}: wildcard matching between m and n bytes inclusive, where m and n are integers or *.

e.g.: 0x03 {1-2} 4D would match 0x03 3C 4D or 0x03 3C 88 4D.

e.g.: 0x03 {2-*} 4D would match 0x03 3C 88 4D or 0x03 3C 88 3F 4D.

(a|b): wildcard matching one from a list of values (e.g. a or b), where each value is a hexadecimal byte sequence of arbitrary length containing no wildcards.

e.g.: 0x0E (FF|FE) 17 would match 0x0E FF 17 or 0x0E FE 17.

[a:b]: wildcard matching any sequence of bytes which lies lexicographically between a and b, inclusive (where both a and b are byte sequences of the same length, containing no wildcards, and where a is less than b). The endian-ness of a and b is the same as the endian-ness of the signature as a whole.

e.g.: 0xFF [09:0B] FF would match 0xFF 09 FF, 0xFF 0A FF or 0xFF 0B FF.

[!a]: wildcard matching any sequence of bytes other than a itself (where a is a byte sequence containing no wildcards).

e.g.: 0xFF [!09] FF would match 0xFF 0A FF, but not 0xFF 09 FF.

(Digital Preservation Technical Paper 1: Automatic Format Identification Using PRONOM and DROID)

[!a:b]: wildcard matching any sequence of bytes which does not lie lexicographically between a and b, inclusive (where a and b are both byte sequences of the same length, containing no wildcards, and where a is less than b).

e.g.: 0xFF [!01:02] FF would match 0xFF 00 FF and 0xFF 03 FF, but not 0xFF 01 FF or 0xFF 02 FF.
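
One way to experiment with the expressions listed above is to translate a byte sequence value into a Python regular expression over bytes. The following is a rough sketch only, not the matching engine used by DROID: the names TOKEN, _hex and to_regex are hypothetical, ranges [a:b] and [!a:b] are supported for single-byte a and b only, and a leading "0x" and any spaces are simply discarded.

```python
import re

# One alternative per construct described above (spaces are removed first).
TOKEN = re.compile(r"""
    (?P<any>\?\?)                                                |
    (?P<star>\*)                                                 |
    \{(?P<lo>\d+)(?:-(?P<hi>\d+|\*))?\}                          |
    \((?P<alts>[0-9A-Fa-f|]+)\)                                  |
    \[(?P<neg>!)?(?P<a>[0-9A-Fa-f]+)(?::(?P<b>[0-9A-Fa-f]+))?\]  |
    (?P<byte>[0-9A-Fa-f]{2})
""", re.VERBOSE)


def _hex(h: str) -> str:
    """Turn a hex string such as 'FF0A' into the regex escapes \\xFF\\x0A."""
    if len(h) % 2:
        raise ValueError(f"odd number of hex digits: {h!r}")
    return "".join(f"\\x{h[i:i + 2]}" for i in range(0, len(h), 2))


def to_regex(sig: str) -> re.Pattern:
    """Compile a byte sequence value into a bytes regex (illustrative only)."""
    s = sig.replace(" ", "")
    if s.lower().startswith("0x"):
        s = s[2:]
    out, pos = [], 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if m is None:
            raise ValueError(f"unrecognised syntax at position {pos}: {s[pos:]!r}")
        if m.group("any"):                       # ??
            out.append(".")
        elif m.group("star"):                    # *
            out.append(".*")
        elif m.group("lo") is not None:          # {n}, {m-n} or {m-*}
            lo, hi = m.group("lo"), m.group("hi")
            if hi is None:
                out.append(".{%s}" % lo)
            elif hi == "*":
                out.append(".{%s,}" % lo)
            else:
                out.append(".{%s,%s}" % (lo, hi))
        elif m.group("alts"):                    # (a|b)
            options = m.group("alts").split("|")
            out.append("(?:" + "|".join(_hex(o) for o in options) + ")")
        elif m.group("a"):                       # [a:b], [!a] or [!a:b]
            a, b, neg = m.group("a"), m.group("b"), m.group("neg")
            if b is None:
                if not neg:
                    raise ValueError("[a] without ':' or '!' is not defined")
                out.append("(?!%s).{%d}" % (_hex(a), len(a) // 2))
            elif len(a) == 2 and len(b) == 2:
                out.append("[%s%s-%s]" % ("^" if neg else "", _hex(a), _hex(b)))
            else:
                raise NotImplementedError("multi-byte ranges are not covered here")
        else:                                    # plain hexadecimal byte
            out.append(_hex(m.group("byte")))
        pos = m.end()
    # DOTALL so that '.' also matches the byte 0x0A.
    return re.compile("".join(out).encode("ascii"), re.DOTALL)


pattern = to_regex("0x0A FF ?? FE")
print(bool(pattern.fullmatch(bytes.fromhex("0AFF6CFE"))))  # True
```

Calling fullmatch tests a candidate against the whole expression, which is how the examples above are phrased; locating a sequence inside a larger bitstream would instead use search or match together with the starting position.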

Note: In the examples above, spaces are included between byte values for reasons of clarity, but are omitted in actual byte sequence values.

The signature is processed left-to-right if the signature is measured relative to BOF and right-to-left if measured relative to EOF. The endian-ness of the signature is only relevant for sequences inside square brackets.

A byte sequence must contain a fixed subsequence of at least one byte between each occurrence of *, or between the beginning or end of the sequence and an occurrence of *. Thus, sequences of the following form are not permitted:

[BOF] (a|b)*

*(a|b) [EOF]

*(a|b)*...
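
The restriction on * can be checked mechanically before a signature is submitted. The sketch below is one possible reading of the rule, not the utility's own validation: it splits the value at each *, and treats a part as containing a fixed subsequence only if at least one plain hexadecimal byte remains once ??, {..}, (..) and [..] constructs are removed. The function names are hypothetical.

```python
import re
from typing import List


def star_segments(sig: str) -> List[str]:
    """Split a byte sequence value at each * wildcard."""
    s = sig.replace(" ", "")
    if s.lower().startswith("0x"):
        s = s[2:]
    # Hide the * inside {m-*} so that it is not treated as a separator.
    s = re.sub(r"\{(\d+)-\*\}", r"{\1-}", s)
    return s.split("*")


def has_fixed_byte(segment: str) -> bool:
    """True if the segment contains at least one plain hexadecimal byte."""
    # Assumption: only bytes outside ??, {..}, (..) and [..] count as fixed.
    stripped = re.sub(r"\{[^}]*\}|\([^)]*\)|\[[^\]]*\]|\?\?", "", segment)
    return re.search(r"[0-9A-Fa-f]{2}", stripped) is not None


def satisfies_star_rule(sig: str) -> bool:
    """Check that every part on either side of a * has a fixed byte."""
    segments = star_segments(sig)
    if len(segments) == 1:
        # No * in the value, so the rule does not apply.
        return True
    return all(has_fixed_byte(seg) for seg in segments)
```

Under this reading, satisfies_star_rule("0A FF * FE") is True, whereas "(0A|0B)*", "*(0A|0B)" and "*(0A|0B)*" are all rejected, matching the three forms listed above.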