Shattering the eyeglass: Using Kaitai Structs to dissect the eyeglass’ contents

In my post from 2012: Genesis of a File Format, I created a new file format – the eyeglass file format. The format provides a mechanism to persist information about a patient’s eye health following a checkup at an optician’s. NB. I am not actually a doctor, so the format is purely illustrative for those interested in understanding binary formats in digital preservation.

A recent post on digipres.club prompted me to revisit this work, starting with refactoring the original GitHub repository and leading to this blog post where I thought it might be an interesting idea to break down the binary format that is output by the tool.

Thus, in the first blog we create the binary format from a specification. In this blog we try and understand the format from the binary itself – more akin to a transfer and digital preservation workflow in an archival institution.

Breaking down the eyeglass

The binary file output in the eyeglass project looks as follows (xxd allows me to pretty print the hex (hexadecimal)):

xxd -g1 prescription-sample-1.0-be.eygl

00000000: bb 0d 0a 65 79 65 67 6c 61 73 73 1a 0a ab 01 01 ...eyeglass.....
00000010: 32 30 31 32 2d 31 31 2d 30 38 54 31 32 3a 33 37 2012-11-08T12:37
00000020: 3a 35 30 00 00 00 00 00 00 00 00 00 00 00 00 00 :50.............
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 c0 56 66 66 3f ............Vff?
00000080: 00 00 00 be 80 00 00 bf 80 00 00 00 00 00 82 00 ................
00000090: 00 00 50 00 00 00 00 00 00 00 00 00 00 00 00 00 ..P.............
000000a0: 00 00 00 3f 28 f5 c3 3f 00 00 00 00 00 00 0c 00 ...?(..?........
000000b0: 00 00 0c 44 69 73 74 61 6e 63 65 20 61 6e 64 20 ...Distance and
000000c0: 43 6c 6f 73 65 20 57 6f 72 6b 2e 00 00 00 00 00 Close Work......
000000d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
000000e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
000000f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 50 ...............P
00000140: 61 74 69 65 6e 74 27 73 20 65 79 65 73 69 67 68 atient's eyesigh
00000150: 74 20 6e 65 65 64 73 20 63 6f 72 72 65 63 74 69 t needs correcti
00000160: 6f 6e 2e 20 48 69 73 74 6f 72 79 20 6f 66 20 64 on. History of d
00000170: 69 61 62 65 74 65 73 20 69 6e 20 66 61 6d 69 6c iabetes in famil
00000180: 79 20 62 75 74 20 69 6e 64 69 63 61 74 6f 72 73 y but indicators
00000190: 20 66 6f 75 6e 64 2e 20 53 74 61 6e 64 61 72 64 found. Standard
000001a0: 20 63 68 65 63 6b 75 70 20 69 6e 74 65 72 76 61 checkup interva
000001b0: 6c 20 72 65 63 6f 6d 6d 65 6e 64 65 64 2e 00 00 l recommended...
000001c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
000001d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
000001e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
000001f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3f 80 ..............?.
00000240: 00 00 bb 65 6f 66 ...eof

The specification was first described as follows:

Eyeglass Format Specification 1.0
---
Magic number - 14 bytes - String
Version - 1 bytes - Unsigned Char
Big-endian - 1 byte - Bool
Date/time - 19 bytes - String #YYYY-MM-DDTHH:MM:SS
Expansion room - 88 bytes - Undefined
Sphere - 8 bytes - R: Float L: Float
Cylinder - 8 bytes - R: Float L: Float
Axis - 8 bytes - R: Integer L: Integer
Prism - 8 bytes - R: Float L: Float
Base - 8 bytes - R: Float L: Float
Distance acuity - 8 bytes - R: Float L: Float
Near acuity - 8 bytes - R: Integer L: Integer
Purpose - 140 bytes - String
Observation - 255 bytes - String
Next checkup - 4 bytes - Float
End of file - 4 bytes - String

 

Between the binary object and this specification we have the two lenses through which we see file formats in digital preservation.
Reading the original blog is a good idea for background, but in summary: a file format is persisted by writing sequences of bytes to disk. The sequences are bounded by the data type and respective data length that they represent (a boolean is one byte, a float is four bytes, and so on; the exact sizes are dictated by the software encoding them, e.g. 16-bit vs. 32-bit software, golang vs. rust). In total, the eyeglass format takes up 582 bytes on disk. If we draw the boundaries between the different data types described in the specification it looks as follows:

The different data types and their lengths are like fields. Each field can be interpreted in some way that may be useful to the original rendering application and, of course, to the digital preservation researcher trying to glean information from a file format.
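As a rough sketch of how those boundaries fall, the field sizes from the specification can be added up to give a start and end offset for each field. The names `SPEC`, `field_offsets`, `OFFSETS` and `TOTAL_SIZE` are my own and not part of the eyeglass project.

```python
# Field names and sizes copied from the Eyeglass Format Specification 1.0.
SPEC = [
    ("Magic number", 14),
    ("Version", 1),
    ("Big-endian", 1),
    ("Date/time", 19),
    ("Expansion room", 88),
    ("Sphere", 8),
    ("Cylinder", 8),
    ("Axis", 8),
    ("Prism", 8),
    ("Base", 8),
    ("Distance acuity", 8),
    ("Near acuity", 8),
    ("Purpose", 140),
    ("Observation", 255),
    ("Next checkup", 4),
    ("End of file", 4),
]


def field_offsets(spec):
    """Return (name, start, end) tuples, where end is exclusive."""
    offsets = []
    position = 0
    for name, size in spec:
        offsets.append((name, position, position + size))
        position += size
    return offsets


OFFSETS = field_offsets(SPEC)
TOTAL_SIZE = OFFSETS[-1][2]

# Print each field with its first and last byte offset in hexadecimal.
for name, start, end in OFFSETS:
    print(f"{start:#06x}-{end - 1:#06x}  {name}")
print(f"total: {TOTAL_SIZE} bytes")
```

The offsets line up with the hex dump: Sphere, for example, starts at 0x7b, which is where “c0 56 66 66” appears.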

But what does that meaning look like?

Manual decoding of the eyeglass

Let’s take a small number of the fields in the eyeglass format and take a look.

Date/time

Date/time is the date and time that the file was written. It’s a plain old string and uses the ISO 8601 date and time format. In hexadecimal it looks as follows:

32 30 31 32 2d 31 31 2d 30 38 54 31 32 3a 33 37 3a 35 30

Which renders as: 2012-11-08T12:37:50

When we talk about endianness shortly, note that this field, like the other string fields in this format, doesn’t change: each character is a single byte, so there is no byte order to swap.
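The same decoding can be done in a couple of lines of Python. This is only a sketch: it assumes the field is plain ASCII, which holds for the sample bytes shown above.

```python
from datetime import datetime

# The 19 date/time bytes copied from the hex dump.
raw = bytes.fromhex("32 30 31 32 2d 31 31 2d 30 38 54 31 32 3a 33 37 3a 35 30")

# Each byte is one ASCII character, so decoding needs no byte-order handling.
text = raw.decode("ascii")

# Parse the string using the YYYY-MM-DDTHH:MM:SS layout from the specification.
written = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")

print(text)
print(written.year, written.month, written.day)
```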

Big-endian

The big-endian field was intended to be a flag that tells the reader of the format if they are looking at a big-endian version, or a little-endian version. This tells the reader how to interpret the values.

In hex we simply have a one-byte “boolean” reading as:

01

This should evaluate as “1” or “true” and tells us that we do need to interpret this file’s contents as big-endian.
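A reader of the format could use this flag to decide how to unpack everything that follows. The sketch below maps the flag onto the byte-order prefixes used by Python’s `struct` module. The function name is hypothetical, and treating every value other than 1 as little-endian is my assumption rather than something the specification states.

```python
# Magic number (14 bytes) + version (1 byte) precede the flag, per the specification.
ENDIAN_FLAG_OFFSET = 15


def struct_prefix(file_bytes: bytes) -> str:
    """Return the struct byte-order prefix implied by the big-endian flag.

    Assumption: 1 means big-endian and anything else means little-endian.
    """
    flag = file_bytes[ENDIAN_FLAG_OFFSET]
    return ">" if flag == 1 else "<"


# The first 16 bytes of the big-endian sample file.
header = bytes.fromhex("bb 0d 0a 65 79 65 67 6c 61 73 73 1a 0a ab 01 01")
print(struct_prefix(header))
```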

Sphere

Sphere is documented as containing two values, one for the right eye, and one for the left. The values are floating point numbers. In hexadecimal we have:

c0 56 66 66 3f 00 00 00

The two values: “c0 56 66 66” and “3f 00 00 00”

The easiest way to understand what these values are, knowing that they are floating point numbers is to use an online hexadecimal converter, e.g. online-hex-converter.

We know it is a big-endian number, so after entering the data we can look up the corresponding field:

The interpretation “Float – Big Endian (ABCD)” provides the two results we expect here: -3.35; and 0.5 respectively.

You can get a feel for the issues that might arise interpreting these values without knowing the correct endianness by looking at some of the other results in this hexadecimal converter tool.
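If an online converter is not to hand, Python’s standard `struct` module can do the same job. This sketch assumes the values are 4-byte IEEE 754 floats, which is consistent with the converter results above, and unpacks the same eight bytes both ways to show what the wrong byte order produces.

```python
import struct

# The eight Sphere bytes from the big-endian sample: right eye, then left eye.
sphere = bytes.fromhex("c0 56 66 66 3f 00 00 00")

# ">" reads the bytes as big-endian, "<" as little-endian; "2f" is two 4-byte floats.
right_be, left_be = struct.unpack(">2f", sphere)
right_le, left_le = struct.unpack("<2f", sphere)

print("big-endian:   ", round(right_be, 2), left_be)
print("little-endian:", right_le, left_le)
```

Only the big-endian reading gives the prescription values; the little-endian reading of the same bytes produces numbers that make no sense for an eye test.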

Near acuity

Near acuity contains two values which in the specification are two integers, each four bytes: “00 00 00 0c” and “00 00 00 0c”. Knowing these are integers makes their interpretation a bit easier. We can type these into a search engine as follows:

“0x0000000c in decimal” (the “0x” denotes hexadecimal or base-16)

A result of 12 will be returned. 0x0c is simply 12 in hex. In big-endian this is an easy conversion no matter how many hex values are here.

In the hexadecimal converter we can confirm our result:

Endianness in a file format

Endianness changes the layout of the bytes and thus how they need to be interpreted. There’s not a huge visual difference between the big-endian version of the eyeglass format and the little-endian version, but the differences are big enough to change the meaning we try to extract from the file.

We can see the fields that have been changed in one rendering of the big-endian file vs. the little-endian one using biodiff.

Interpreting the fields in little endian

Taking the example fields from before we can see the differences, and how they look.

| Field | Big-endian | Little-endian | Value |
| --- | --- | --- | --- |
| Date/time | 32 30 31 32 2d 31 31 2d 30 38 54 31 32 3a 33 37 3a 35 30 | 32 30 31 32 2d 31 31 2d 30 38 54 31 32 3a 33 37 3a 35 30 | 2012-11-08T12:37:50 |
| Big-endian | 01 | 00 | 1=True; 0=False |
| Sphere | c0 56 66 66; 3f 00 00 00 | 66 66 56 c0; 00 00 00 3f | R:-3.35; L:0.50 |
| Near acuity | 00 00 00 0c; 00 00 00 0c | 0c 00 00 00; 0c 00 00 00 | R:12; L:12 |

Let us take the value for near acuity here: “0c 00 00 00”. It is reversed compared to the big-endian version, but if we ask a search engine for “0x0c000000 in decimal” we get “201326592”, which is definitely not the right value! We need to know that little-endian puts the least significant byte at the first memory address (which equates to an earlier position in a file), and in hexadecimal, like decimal, the least significant digits are the ones on the right (the 1s and the 10s in decimal). To interpret this value correctly we need to reverse the bytes, giving 0x0000000c as we had in big-endian.
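The same mistake, and its fix, can be reproduced with Python’s built-in `int.from_bytes`, which takes the byte order as an argument. A minimal sketch using the little-endian near acuity bytes:

```python
# Near acuity for one eye as stored in the little-endian file.
near_acuity_le = bytes.fromhex("0c 00 00 00")

# Reading the bytes in the wrong order reproduces the search engine result.
as_big = int.from_bytes(near_acuity_le, byteorder="big")

# Reading them as little-endian gives the intended value.
as_little = int.from_bytes(near_acuity_le, byteorder="little")

print(as_big)     # 201326592
print(as_little)  # 12
```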

Kaitai

Kaitai is an example of tooling we can adopt in our work to automatically parse the structures of a binary format.

The animated graphic here shows how kaitai parses the eyeglass format and recognizes the data type boundaries (fields) we’ve been discussing in the examples above.

Kaitai is described as a domain specific language concerned with the parsing of arbitrary binary formats.

Further, in their introductory text:

Parsing binary formats is hard, and there’s a reason for that: such formats were designed to be machine-readable, not human-readable. Even when one’s working with a clean, well-documented format, there are multiple pitfalls that await the developer: endianness issues, in-memory structure alignment, variable size structures, conditional fields, repetitions, fields that depend on other fields previously read, etc, etc, to name a few.

Kaitai Struct tries to isolate the developer from all these details and allow them to focus on the things that matter: the data structure itself, not particular ways to read or write it.

Parsing eyeglass in Kaitai

We know how the eyeglass file format is structured. We helpfully (!) have a list of the data types and sizes and what those mean. We can translate those into the “ksy” (kaitai struct YAML) format consumed by kaitai, allowing kaitai-conformant tools to parse the binary format, identify the field boundaries, and attach meaning to them.

Given the right instructions kaitai is clever enough to understand how to interpret endianness and the values within certain fields. The format incorporates metadata in the form of a header, and additional documentation strings that can also be used by kaitai tools to output more information to the user.

A small snippet from the kaitai structs created for the eyeglass format:

meta:
  id: eyeglass_format
  file-extension: eygl
  xref:
    wikidata: Q105858419
  endian: be
  tags:
    - digipres
  license: CC-BY-SA
doc: |
  The eyeglass format...
seq:
  - id: magic
    size: 14
    type: str
    encoding: utf8
  - id: version
    size: 1
  - id: endianness
    type: s1
  - id: datetime
    size: 19
    type: str
    doc: "the date this file was created"
    encoding: utf8

Field types can be defined as well as field lengths.

In the snippet above we have strings (str); signed 1-byte integers (s1); and encodings (utf-8). If a field wasn’t known but a field length could be determined then size would be enough to start to parse the data and kaitai’s tooling can start to guess at the values to help with reverse engineering efforts.

We can parse the format with kaitai using the tool kaitai-struct-visualizer and the command:

  • ksv <eygl-file> eyeglass_big_endian.ksy

The result of parsing looks as follows:

[-] [root]
  [.] magic = "\xBB\r\neyeglass\u001A\n\xAB"
  [.] version = 01
  [.] endianness = 1
  [.] datetime = "2012-11-08T12:37:50"
  [.] format_expansion_room = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 …
  [-] sphere_right_left (2 = 0x2 entries)
    [.] 0 = -3.3499999046325684
    [.] 1 = 0.5
  [-] cylinder_right_left (2 = 0x2 entries)
    [.] 0 = -0.25
    [.] 1 = -1.0
  [-] axis_right_left (2 = 0x2 entries)
    [.] 0 = 130
    [.] 1 = 80
  [-] prism_right_left (2 = 0x2 entries)
    [.] 0 = 0.0
    [.] 1 = 0.0
  [-] base_right_left (2 = 0x2 entries)
    [.] 0 = 0.0
    [.] 1 = 0.0
  [-] distance_acuity_right_left (2 = 0x2 entries)
    [.] 0 = 0.6600000262260437
    [.] 1 = 0.5
  [-] near_acuity_right_left (2 = 0x2 entries)
    [.] 0 = 12
    [.] 1 = 12
  [.] purpose = "Distance and Close Work.
  [.] observation = "Patient's eyesight needs correction. History of diabetes in family
  [.] next_checkup_years = 1.0
  [.] eof = "\xBBeof"
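For comparison, roughly the same values can be pulled out by hand with Python’s `struct` module. This is a sketch and not how kaitai works internally. The sample is rebuilt from the hex dump so that the example is self-contained, and the names `SAMPLE`, `LAYOUT` and `parse_eyeglass` are my own. Treating any flag value other than 1 as little-endian is also an assumption.

```python
import struct

# Rebuild the big-endian sample from the hex dump so the example is self-contained.
OBSERVATION = (
    b"Patient's eyesight needs correction. History of diabetes in family "
    b"but indicators found. Standard checkup interval recommended."
)
SAMPLE = (
    b"\xbb\r\neyeglass\x1a\n\xab"         # magic number
    + bytes([1, 1])                        # version, big-endian flag
    + b"2012-11-08T12:37:50"               # date/time
    + bytes(88)                            # expansion room
    + bytes.fromhex("c0566666 3f000000")   # sphere
    + bytes.fromhex("be800000 bf800000")   # cylinder
    + bytes.fromhex("00000082 00000050")   # axis
    + bytes(16)                            # prism and base
    + bytes.fromhex("3f28f5c3 3f000000")   # distance acuity
    + bytes.fromhex("0000000c 0000000c")   # near acuity
    + b"Distance and Close Work.".ljust(140, b"\x00")
    + OBSERVATION.ljust(255, b"\x00")
    + bytes.fromhex("3f800000")            # next checkup
    + b"\xbbeof"                           # end of file
)

# One struct code per field in the specification, without a byte-order prefix.
LAYOUT = "14s B B 19s 88s 2f 2f 2i 2f 2f 2f 2i 140s 255s f 4s"


def parse_eyeglass(data: bytes) -> dict:
    """Unpack an eyeglass file, using the flag at offset 15 to pick the byte order."""
    prefix = ">" if data[15] == 1 else "<"
    values = struct.unpack(prefix + LAYOUT, data)
    return {
        "datetime": values[3].decode("ascii"),
        "sphere": values[5:7],
        "cylinder": values[7:9],
        "axis": values[9:11],
        "distance_acuity": values[15:17],
        "near_acuity": values[17:19],
        "purpose": values[19].rstrip(b"\x00").decode("ascii"),
        "next_checkup_years": values[21],
    }


record = parse_eyeglass(SAMPLE)
for key, value in record.items():
    print(key, "=", value)
```

The hand-written version has to track field positions in the tuple manually, which is exactly the bookkeeping a ksy definition takes care of.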

The kaitai structs for the eyeglass format are available in GitHub:

With sample files:

And they can be used immediately with tools such as the online kaitai IDE (integrated development environment).

Kaitai IDE online.

How we can use kaitai iteratively in digital preservation

The skills required to interpret file formats through kaitai are similar to those required for file format signature development. Signature developers and digital preservationists come to understand new file formats (or old, as yet undocumented ones) through their daily work, whether by reverse engineering or by finding snippets of file format specifications. As they do, they can build kaitai struct definitions piece by piece from the data that is immediately known to them, and slowly add to them over time as new fields are understood. As these definitions are generated and shared we may collectively get a better understanding of the files through a lens such as that presented here in this second eyeglass blog.

Steps to take

  • Document the field boundaries of the header that you can identify, e.g. the first field you can document might be the magic number and its size.
  • Try to identify header fields one-by-one and name those and provide their byte-sizes.
  • Mark unknown fields with a suitable name, and document how many bytes they use.
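The steps above can be sketched outside of kaitai too. The following is a hypothetical work-in-progress layout for the start of the eyeglass file, written as if only the magic number and the date/time had been identified so far. The names `PARTIAL_LAYOUT`, `slice_fields` and `unknown_1` are illustrative only.

```python
# Hypothetical work-in-progress layout: fields understood so far are named,
# the rest are marked unknown together with their byte counts.
PARTIAL_LAYOUT = [
    ("magic", 14),
    ("unknown_1", 2),
    ("datetime", 19),
]


def slice_fields(data, layout):
    """Yield (name, offset, raw_bytes) for each field in the layout."""
    offset = 0
    for name, size in layout:
        yield name, offset, data[offset:offset + size]
        offset += size


# The first 35 bytes of the big-endian sample file.
header = b"\xbb\r\neyeglass\x1a\n\xab" + bytes([1, 1]) + b"2012-11-08T12:37:50"
fields = list(slice_fields(header, PARTIAL_LAYOUT))

# Dump every field as hex so the unknown ones can be inspected later.
for name, offset, raw in fields:
    print(f"{offset:#06x}  {name:<10}  {raw.hex(' ')}")
```

Once `unknown_1` is understood it can be split into `version` and `endianness` without disturbing the fields around it.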

Questions

Kaitai does support some complicated use-cases such as fields positioned based on variable offsets but it doesn’t seem to offer much help for documenting the unknowns, or matching based on incomplete information such as the space between a header and fields at the end of a stream. Also, it isn’t clear to me that Kaitai supports relative positioning, e.g. relative from end of file. So I would like to understand that more.

Kaitai’s purpose goes beyond documenting file format structures and its tooling provides the ability to output working source code for parsing file formats; this could be an exciting prospect for the field, but it also means the tooling might not be a perfect fit for digital preservation. The eyeglass example is a simple one after all, but I will be interested to read about other experiments in this area.

Further reading

I have discussed collecting community built knowledge bases for file format signatures previously, perhaps we can do something for kaitai struct definitions? (NB. check out the PRONOM Research repository that the PRONOM team established that achieves something similar for PRONOM file format work)

Check out the awesome-kaitai list for more guides around the definitions and how to use them.

fq (jq for file formats) is another promising tool for this sort of effort that others should take a look at.

The Eyeglass file format has its own Wikidata entry which is pretty cool (thank you to the editors who put that up!). It can also be found on the ArchiveTeam Just Solve It Wiki (thank you Dan for that as well).

Epilogue: ksdump

5 November, 2023

Mattias Wadman pointed out on Mastodon that ksdump can be used to output YAML or JSON. As Mattias realized, this was actually difficult for me to get hold of for this blog, and I had found a workaround using the kaitai visualizer output.

ksdump works well; see the little-endian JSON output below:

{
   "axis_right_left": [
      130,
      80
   ],
   "base_right_left": [
      0.0,
      0.0
   ],
   "cylinder_right_left": [
      -0.25,
      -1.0
   ],
   "datetime": "2012-11-08T12:37:50",
   "distance_acuity_right_left": [
      0.6600000262260437,
      0.5
   ],
   "endianness": 0,
   "eof": {
      "eof_1": "BB",
      "eof_2": "eof"
   },
   "format_expansion_room": "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
   "magic": {
      "magic_1": "BB 0D 0A",
      "magic_2": "eyeglass",
      "magic_3": "1A 0A AB"
   },
   "near_acuity_right_left": [
      12,
      12
   ],
   "next_checkup_years": 1.0,
   "observation": "Patient's eyesight needs correction. History of diabetes in family but indicators found. Standard checkup interval recommended.\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
   "prism_right_left": [
      0.0,
      0.0
   ],
   "purpose": "Distance and Close Work.\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000",
   "sphere_right_left": [
      -3.3499999046325684,
      0.5
   ],
   "version": "01"
}

This could be very powerful paired with tools such as jq. It could also be parsed into technical metadata and committed into a digital repository system.
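As a small illustration of that second idea, the JSON can be loaded with Python’s standard `json` module and the NUL padding stripped from the string fields before the values are stored anywhere. The excerpt below is shortened from the output above, and `strip_padding` is a hypothetical helper of my own.

```python
import json

# A shortened excerpt in the shape of the ksdump JSON output above;
# the padding is trimmed to four NUL characters for brevity.
KSDUMP_JSON = r"""
{
   "datetime": "2012-11-08T12:37:50",
   "near_acuity_right_left": [12, 12],
   "purpose": "Distance and Close Work.\u0000\u0000\u0000\u0000"
}
"""


def strip_padding(value):
    """Recursively remove trailing NUL padding from string values."""
    if isinstance(value, str):
        return value.rstrip("\x00")
    if isinstance(value, list):
        return [strip_padding(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_padding(item) for key, item in value.items()}
    return value


metadata = strip_padding(json.loads(KSDUMP_JSON))
print(json.dumps(metadata, indent=2))
```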

A little more on debugging

ksdump provided me with another layer of debugging for the definition I wrote as I asked the tool to parse the format and the structural data.

I received an error when trying to output the JSON (or YAML) because I had incorrectly defined the beginning and end of file (BOF and EOF) sequences as UTF-8 encoded string types.

ksdump prescription-sample-1.0-le.eygl kaitai/eyeglass_little_endian.ksy 
Compilation OK
... processing kaitai/eyeglass_little_endian.ksy 0
...... loading eyeglass_format.rb
Classes loaded OK, main class = EyeglassFormat
/usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:268:in `visit_String': invalid byte sequence in UTF-8 (ArgumentError)
	   from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:136:in `accept'
	   from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:330:in `block in visit_Hash'
	   from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:328:in `each'
	   from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:328:in `visit_Hash'
	   from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:136:in `accept'
	   from /usr/lib/ruby/3.0.0/psych/visitors/yaml_tree.rb:118:in `push'
	   from /usr/lib/ruby/3.0.0/psych.rb:513:in `dump'
	   from /usr/lib/ruby/3.0.0/psych/core_ext.rb:13:in `to_yaml'
	   from /var/lib/gems/3.0.0/gems/kaitai-struct-visualizer-0.7/bin/ksdump:122:in `<top (required)>'
	   from /usr/local/bin/ksdump:25:in `load'
	   from /usr/local/bin/ksdump:25:in `<main>'

The two fields in the specification are meant to encode some binary data as well as some plain-text information. Mikhail of the Kaitai project was super helpful in pinpointing what I did wrong here and, from other projects, I understand the difficulty of writing arbitrary binary strings in text-based formats such as YAML or JSON.

I changed the structure of the objects to use custom types for the BOF (magic number) and EOF. The structure gives me more granularity to annotate the binary and plain-text data accordingly.
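The underlying problem is easy to reproduce outside of Ruby. The sketch below uses Python’s own UTF-8 decoder, not the Psych code from the traceback above, to show that the magic number is not valid UTF-8 as a whole, and then splits it the way the ksdump JSON output above groups it.

```python
# The 14-byte magic number from the start of the file.
magic = b"\xbb\r\neyeglass\x1a\n\xab"

# 0xBB cannot start a UTF-8 sequence, so decoding the whole field fails.
try:
    magic.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Split the field into three raw bytes, eight text bytes and three raw bytes.
magic_1, magic_2, magic_3 = magic[:3], magic[3:11], magic[11:]

print(decoded_ok)
print(magic_1.hex(" ").upper(), magic_2.decode("ascii"), magic_3.hex(" ").upper())
```

Only the middle part is treated as text, so the binary bytes on either side never reach a string encoder.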

Read more about the error on GitHub and take a look at how this now manifests in the ksy definition for eyeglass here.
