Genesis of a File Format
Working from the bottom up and ignoring formal preservation descriptions of a file format (see: File Format a rdfs:Class) – a file format represents a place to store information outside of a software application. It mirrors internal memory but promotes the use of the same data outside of, and in between applications.
Formats can be taken for granted. As a software developer I will have written a number of my own in the past; usually adopting a de-facto standard mirroring the data structures used by the software application, (as discussed in part in a previous blog for The National Archives), to be loaded back into memory to be acted upon at a later date.
To explore the concept of the file format in more detail and to understand it from genesis, or, ‘creation to preservation’, this blog entry goes through the process of creating a new file format. It looks at some of the considerations which can be made to make it robust and hopefully make it suitable for preservation. It asks the question of readers – what other considerations do you think I could have made in creating this format which might make it more amenable to future preservation.
In the beginning…
Having recently attended an eye examination I decided a reasonable format to create would be one that stores the results of the prescription ordered by the optician. If we look at the sample document below we can see that its structure may lend itself to being encoded in a formal specification:
Coding this information into a format specification will allow me to create files that opticians can use to store patient data, read it back in future and render it in various output formats such as the one that created prescription document above.
Dissecting our data
Taking the complete document we can separate it into structured fields and data types. Without going into the full detail regarding the data type choices, the specification for the first version of my format is described below.
Eyeglass Format Specification 1.0 --- Magic number - 14 bytes - String Version - 1 bytes - Unsigned Char Big-endian - 1 byte - Bool Date/time - 19 bytes - String #YYYY-MM-DDTHH:MM:SS Expansion room - 88 bytes - Undefined Sphere - 8 bytes - R: Float L: Float Cylinder - 8 bytes - R: Float L: Float Axis - 8 bytes - R: Integer L: Integer Prism - 8 bytes - R: Float L: Float Base - 8 bytes - R: Float L: Float Distance acuity - 8 bytes - R: Float L: Float Near acuity - 8 bytes - R: Integer L: Integer Purpose - 140 bytes - String Observation - 255 bytes - String Next checkup - 4 bytes - Float End of file - 4 bytes - String
At this point I should state that the formal realization of this work is maintained in a GitHub repository dedicated to the project. More information about eye prescriptions can be found on Wikipedia, and also in the main eyeglass class implementing the specification described above.
The numerical data types selected for the specification were constraints set by the prescription itself. They are either integer or float data types and simply need to encode values up to the correct magnitude. Other points of features of interest are described below:
Magic number (BOF) / End of file (EOF)
The ‘magic number’ identifying the file format will be as follows:
The ‘end of file’ sequence terminating the stream will be as follows:
EOF and BOF sequences were selected in concert as I feel the existence of both can aid in the unambiguous identification of a file format by tools such as DROID.
The design of the BOF sequence is based on elements described in the JPEG2000 specification (ISO/IEC 15444-1) and the Portable Network Graphics Rationale document (12.11. PNG file signature):
The type of the JPEG 2000 Signature box shall be ‘jP\040\040′ (0x6A502020). The length of this box shall be 12 bytes. The contents of this box shall be the 4-byte character string ‘<CR><LF><0x87><LF>’ (0x0D0A870A). For file verification purposes, this box can be considered a fixed-length 12-byte string which shall have the value: 0x0000 000C 6A50 2020 0D0A 870A. The combination of the particular type and contents for this box enable an application to detect a common set of file transmission errors. The CR-LF sequence in the contents catches bad file transfers that alter newline sequences. The final linefeed checks for the inverse of the CR-LF translation problem. The third character of the box contents has its high-bit set to catch bad file transfers that clear bit 7.
This signature both identifies the file as a PNG file and provides for immediate detection of common file-transfer problems. The first two bytes distinguish PNG files on systems that expect the first two bytes to identify the file type uniquely. The first byte is chosen as a non-ASCII value to reduce the probability that a text file may be misrecognized as a PNG file; also, it catches bad file transfers that clear bit 7. Bytes two through four name the format. The CR-LF sequence catches bad file transfers that alter newline sequences. The control-Z character stops file display under MS-DOS. The final line feed checks for the inverse of the CR-LF translation problem.
The eyeglass format adopts features including a non-ASCII value at the beginning of the signature, this byte has its high-bit set to catch bad file transfers that clear bit 7. The signature also adopts additional elements of both signatures described above which help detect file-transfer problems. It also contains the ctrl-zcharacter (0x1A) to prevent the display of the file in MS-DOS.
The EOF sequence simply consists of the same starting character as the BOF sequence followed by the ASCII characters ‘eof’.
The big-endian flag exists for the purpose of experimentation and to highlight the difference between formats, and their interpretation when encoded either as big-endian or little-endian.
As a developer I am torn between removing redundancy from my format and ensuring that it is as future-proofed as much as realistically possible. Redundancy affects the maintenance of code and file sizes and I never really want to transmit or store more data than absolutely necessary. Redundancy introduces costs that I don’t need to introduce, however, if a customer comes to me in a couple of weeks with a new requirement or the specification I have been working from (in this case the nature of eye glass prescriptions) changes, I need to be able to incorporate those changes flexibly.
88 bytes is the equivalent of 22×4-bytes worth of data. This may store 22 float values or integers or indeed any other data type. The expansion room is free to be used at any point by any amount or type of data formally specified. It ensures future file sizes remain consistent (582 bytes) and may be the least destructive way of adding information while pointers to existing data remain consistent.
PNG describes a different approach to format extensions. Other formats may adopt different mechanisms too. I’ve simply adopted a method familiar to my previous development background in C++. I may use future blog posts to explore the relative merits of various format expansion mechanisms.
Features I’ve avoided in the format specification include separator characters between fields. Given the rigidity of my file and that it mandates the existence of each field I expect to be able to do some rudimentary checking on the validity of the file before reading it into memory in one operation. Responsibility for the accuracy and validity of values encoded in the format are left up to creating and rendering applications.
And there was light…
There’s our file format! As mentioned above this work exists in a GitHub project entitled ‘eyeglass’ and it will be added to as the blog evolves to study this format in more detail. Specific files of interest are the eyeglass class which describes constructors for building the structured output which we are calling the eyeglass file format and various methods for adding data and saving the digital objects.
The other file of interest is an application which builds three of these objects. The first file created is the object put into memory using just the default constructor. The second and third objects are big- and little-endian representations of a sample prescription denoted by the naming suffixes ‘-be’ and ‘-le’ respectively.
All three generated files are in the repository and have been given the .eygl file extension for those who are unable to run the scripts but are interested in looking at the output.
After all that… Where next?
In dissecting this work to create this blog entry I’ve already spotted that I may need to extend the format to include a ‘patient identifier’. This will be more effective than file naming conventions, however, I will need to understand what format this should take. Further, while my float values necessarily are four-bytes long, I believe 2-byte shorts (255 possible values) will be enough to store values for axis and near field acuity – the implications of changing these data types to create a smaller file size will need to be considered.
Time and resource permitting, future blog posts will use this work to explore the identification and validation of this file format among other digital preservation concerns such as rendering and definition of ‘the record‘.
For now, my question to anyone who has read this far is: ‘In designing this format what other considerations could I have made to make it more suitable for future preservation?’ Given that this work potentially mirrors the initial processes and thinking of the software applications and file formats we are dealing with today and have to continue dealing with long into the future, what could we get more right at this point in time to make formats easier to handle 20 years down the line? …Answers (and any other comments) on a postcard, in the comments below or on indeed on twitter, thank you!