File format building blocks: primitives in digital preservation
A primitive in software development can be described as:
a fundamental data type or code that can be used to build more complex software programs or interfaces.
– via https://www.capterra.com/glossary/primitive/ (also Wiki: language primitives)
Like bricks and mortar in the building industry, or oil and acrylic for a painter, a primitive helps a software developer create increasingly complex software, from your shell scripts to entire digital preservation systems.
Primitives also help us to create file formats. As we’ve seen with the Eyeglass example I have presented previously, a file format is, at its most fundamental level, a representation of a data structure as a binary stream: something that can be written out of the data structure onto disk, and likewise read from disk back into a data structure in code.
For the file format developer we have at our disposal all of the primitives that the software developer has, and like them, we also have “file formats” (as we tend to understand them in digital preservation terms) that serve as our primitives as well.
Take, for example, plain-text (x-fmt/111). If I write this blog in plain-text we have a document, or record, something recognizable as written language – literally, it’s a blog. Yet, if I use this same foundational “file format” and instead write:
#! /usr/bin/bash

set -eux

echo "hello world"
I have a shell script that’s going to be evaluated as at least three commands in a Linux environment.
If I am asked what the significant properties of either my blog or the shell script are, answering through the lens of plain-text would make very little sense.
Asked, however, which properties of the thing that is the blog (the record) are important, or which properties of the shell script I need to care about, then we would have a very different conversation.
In the second instance I no longer have a simple record or document; I have something approaching code, and with that, an exponentially larger number of dependencies, including what platform it might run on, what commands might be invoked, and something that, when run, could succeed or fail depending on different conditions.
The first takeaway here is that plain-text is a file format primitive: it helps us to build more complex formats.
We can take this further with something approaching research data, in XML or JSON, to pick two examples.
These “formats”, to take a very high-level viewpoint, are built upon plain-text. In PRONOM we have the classifications “text” and “text (structured)” (structured text), and this is an important delineation: structured text means that we have a format that is likely to have very specific instructions about how to process the data that is encoded.
XML for example is a textual representation of a tree structure, and we talk about its elements in terms of roots, nodes, namespaces, parents, and children.
JSON (ECMA-404, RFC 7159) encodes, platform agnostically, the primitives of programming languages – slices/arrays, keys, values, dictionaries, and so on.
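As a small, purely illustrative sketch of both points (Python standard library only; the eyeglass-flavoured snippets are made up for this post rather than taken from any real specification), parsing XML hands us a tree of elements we can walk, while parsing JSON hands us the language primitives directly:

import json
import xml.etree.ElementTree as ET

# A made-up XML fragment: the root element and its children form a tree.
xml_doc = "<prescription><sphere eye='left'>-3.35</sphere><sphere eye='right'>+0.50</sphere></prescription>"
root = ET.fromstring(xml_doc)
print(root.tag)                          # the root: 'prescription'
for child in root:                       # its children: elements with attributes and text
    print(child.tag, child.attrib, child.text)

# The same idea in JSON decodes straight into native language primitives.
data = json.loads('{"sphere": ["-3.35", "+0.50"]}')
print(type(data), type(data["sphere"]))  # a dictionary holding an array (dict and list)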
XML and JSON both have important properties; both rely on being well-formed and valid, for example, to be functional. Piping JSON into jq will allow it to be queried – but if the JSON is missing a single curly bracket, it can’t really be accessed or queried at all, at least not without finding the problems with the file and fixing them, e.g. through the lens of JSONLint.
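To make the missing-bracket point concrete, here is a minimal sketch of the same failure mode in Python (the equivalent of what jq or JSONLint would complain about); the snippet is illustrative only:

import json

good = '{"sphere": ["-3.35", "+0.50"]}'
bad = '{"sphere": ["-3.35", "+0.50"]'    # the final curly bracket is missing

print(json.loads(good))                  # parses: {'sphere': ['-3.35', '+0.50']}
try:
    json.loads(bad)                      # not well-formed, so it cannot be parsed at all
except json.JSONDecodeError as err:
    print("cannot access or query this:", err)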
Access and query are interesting terms here – what does accessing and querying something like XML or JSON mean?
Well, it depends.
If we extend the concept of a file format primitive to structured text such as XML and JSON, we can begin to understand that we don’t simply “write XML” for the sake of writing XML. We write XML alongside its own semantics, either implied by some code or a programmer’s purpose or intention, or written explicitly through the use of a schema – this adds the property of validation to XML’s already very specific ideas of being well-formed and valid.
The point is that to access and query the XML means understanding its structure, what it encodes, and how it encodes it.
Likewise, when we write JSON it is usually to be conformant to some sort of specification as well; again, implied, say, through code, or explicit through a technology like JSON schema.
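A minimal sketch of making those implied rules explicit, assuming the third-party jsonschema package is installed; the schema itself is hypothetical and far simpler than anything a real Eyeglass specification would need:

from jsonschema import ValidationError, validate

# A hypothetical, deliberately tiny schema for part of the Eyeglass data.
schema = {
    "type": "object",
    "properties": {
        "sphere": {"type": "array", "items": {"type": "string"}},
        "purpose": {"type": "string"},
    },
    "required": ["sphere", "purpose"],
}

record = {"sphere": ["-3.35", "+0.50"], "purpose": "Distance and Close Work."}

try:
    validate(instance=record, schema=schema)   # well-formed AND conformant
    print("conforms to the schema")
except ValidationError as err:
    print("does not conform:", err.message)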
Stepping back for a second, our technology stack is growing: we have a plain-text file, in which we have Unicode code points for each character of an XML or JSON document, and all of the rules associated with those languages; and alongside those, we have another plain-text schema document, encoded in its own format, with its own rules, describing the rules of the document we have written. It’s getting complex.
Given this view we can see that preservation of the XML can’t be done in isolation and, perhaps less obviously, that the XML can’t be understood without context.
But what context?
If I give you a PRONOM record you need to understand PRONOM’s XML schema. If I give you a METS document you have to understand the METS schema. Both documents have very different semantics that are needed for the preservation and understanding of either as records.
Let’s briefly look at the Eyeglass format again. This time, here are three JSON representations of the data:
Representation 1: An array of tuples and two strings
[ [ "-3.35", "+0.50" ], [ "-0.25", "-1.00" ], [ 130, 80 ], [ 0, 0 ], [ 0, 0 ], [ 0.66, 0.5 ], [ 12, 12 ], "Distance and Close Work.", "Patient's eyesight needs correction. History of diabetes in family but indicators found. Standard checkup interval recommended." ]
Representation 2: One long array
[ "-3.35", "+0.50", "-0.25", "-1.00", 130, 80, 0, 0, 0, 0, 0.66, 0.5, 12, 12, "Distance and Close Work.", "Patient's eyesight needs correction. History of diabetes in family but indicators found. Standard checkup interval recommended." ]
Representation 3: A collection of name-value pairs
{ "sphere": [ "-3.35", "+0.50" ], "cylinder": [ "-0.25", "-1.00" ], "axis": [ 130, 80 ], "prism": [ 0, 0 ], "base": [ 0, 0 ], "distance_acuity": [ 0.66, 0.5 ], "near_acuity": [ 12, 12 ], "purpose": "Distance and Close Work.", "observations": "Patient's eyesight needs correction. History of diabetes in family but indicators found. Standard checkup interval recommended." }
At a human level, “representation 3” might be the easiest to understand as the “keys” are self-describing; we just need to understand when a value is for the left-eye or right-eye, and we might also want to understand the language of the text, and its encoding. Both representations “1” and “2” are terse in different ways and it is probably best to assume that the rules for decoding these will be in someone’s code for the Eyeglass format somewhere.
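To illustrate just how much out-of-band knowledge “representation 2” demands, here is a sketch of the kind of decoding code someone would have to write; the FIELDS list below is my guess at an ordering, and nothing in the JSON itself tells us it is right:

import json

# Hypothetical field order – this knowledge lives in someone's code, not in the data.
FIELDS = ["sphere", "cylinder", "axis", "prism", "base", "distance_acuity", "near_acuity"]

flat = json.loads(
    '["-3.35", "+0.50", "-0.25", "-1.00", 130, 80, 0, 0, 0, 0, '
    '0.66, 0.5, 12, 12, "Distance and Close Work.", '
    '"Patient\'s eyesight needs correction."]'
)

# Pair up the first fourteen values as left/right readings, then the two strings.
record = {name: [flat[i * 2], flat[i * 2 + 1]] for i, name in enumerate(FIELDS)}
record["purpose"], record["observations"] = flat[14], flat[15]
print(record["axis"])   # [130, 80] – meaningless without the FIELDS list above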
All three representations encode exactly the same information, and they’re not the only ways of encoding it – they are just the first three I landed on. Using JSON as a building block for the Eyeglass format we can be more or less expressive as desired. We can add more complexity, like Base64 for text, and we can even sign or encrypt values (if patient data is sensitive, e.g. the encryption used in SOPS).
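As a trivial sketch of that extra expressiveness (standard library only, and with no suggestion that the Eyeglass format actually does this), Base64 adds yet another decoding rule a future reader has to know about:

import base64
import json

observations = "Patient's eyesight needs correction."
encoded = base64.b64encode(observations.encode("utf-8")).decode("ascii")

print(json.dumps({"observations": encoded}))       # the value is now opaque in the JSON
print(base64.b64decode(encoded).decode("utf-8"))   # recoverable only if you know it is Base64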
The question is, which version will I receive at my research data archive if we don’t have reasonable standards around the design and documentation of research data? And does it matter if I haven’t a schema to work with that will allow me to reuse any of this information?
The answer is that it doesn’t really matter which version: at a human level it is very likely that I haven’t anywhere near enough information to decode what is here without the background information.
This is why a holistic, records-based analysis of digital objects is important in digital preservation. You can’t simply take one object out of context, on its own merit, and understand what preservation means.
With the JSON above, for example, in the context of research data it is far more important for me to know what the data was, what its schema was, and whether the output was repeatable as part of another script, i.e. for the recreation and verification of results. Even understanding whether the data can be replayed byte-for-byte is an incredible property, if present. I might have one version of the JSON on disk, and another I can output from code, and if they align byte-wise, i.e. their checksums match, I have a very strong link between the research project and its outputs (we’re getting close to those diplomatics again!).
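A minimal sketch of that byte-for-byte check, with both the “archived” bytes and the “regenerated” output stood in by hypothetical values (in practice the second would come from re-running the researcher’s script):

import hashlib
import json

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Stand-in for the JSON held on disk in the archive.
on_disk = b'{"purpose":"Distance and Close Work.","sphere":["-3.35","+0.50"]}'

# Stand-in for the output regenerated from code, serialised deterministically.
regenerated = json.dumps(
    {"sphere": ["-3.35", "+0.50"], "purpose": "Distance and Close Work."},
    separators=(",", ":"),
    sort_keys=True,
).encode("utf-8")

if sha256_of(on_disk) == sha256_of(regenerated):
    print("byte-wise identical: a strong link between project and output")
else:
    print("outputs differ: key order, whitespace, or the data itself changed")

It is worth noting that even identical data can hash differently if key order or whitespace changes between runs, which is exactly why repeatable, documented output is such a valuable property.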
When I was asked to analyse a dataset recently, most of the above was missing from the collection methodology and documentation. These are bigger problems than the preservation of the data itself.
When we wrote Re-imagining the Format Model: Introducing the work of the NSLA Digital Preservation Technical Registry (and presented it at iPRES 2014), we spoke about four structures that we can use to understand file formats:
- Specification
- Implementation
- Composition
- Aspect
The main take-away from that paper in the context of this blog is a recognition that file formats are building blocks and context all the way down.
It’s unlikely that this paper will catch fire a decade after it was written, but in the meantime, perhaps some terminology to consider when we talk about file formats is the concept of primitives. When we dissect a file to understand how it might be preserved, we ask what its primitives, or foundational elements, are. In the case of data formats like XML or JSON, these so-called “formats”, even with their own specifications, are only primitives of something much more complex: the record.
Digital preservation is almost always in support of another field, be that archives, museums, research libraries, or something else. You can save “data”, but without taking the holistic view you are almost certainly not preserving it.
At the point of transfer to the archive we’re already at risk of losing context, so we probably want to understand what we can do at the point of creation (create-to-maintain), including providing researchers with the digital literacy to understand the impact of their decisions about the use of data structures, and the impact of not having documentation or repeatable outputs.
In research data, it seems we might need to get past this first problem – understanding the building blocks versus the record – before we can understand the specific issue of preservation.
TL;DR: if you’re ever asked by someone what the significant properties of JSON are, let them know JSON certainly has interesting properties, but it’s only part of the story. What’s its context? Is there a schema available? What does the data represent? How should it be interpreted? Because those “properties” really are significant.
Born-networked
It strikes me that some of the language used in Emily Maemura’s recent writing about describing and understanding web-archives in aggregate, specifically, the idea of being born-networked, has potential application in the research data world.
— Conceptualizing aggregate-level description in web archives (2025)
Nestor
Nestor have a call-for-papers (CFP) open until May 12 for a special issue on the linking of research data management and long-term archiving. Those reading this blog might be interested in submitting something.
Nestor CFP: Linking of research data management and long-term archiving.
As Nestor describe it:
From the perspective of long-term archiving (LZA), one special feature and challenge of research data is that it cannot be treated as just another type of data or media. “Research data” is not just another type alongside, for example, “audio/video data” or “text data”. The research context itself is what is special, and the creation or use of data in the research context turns the data into research data. This means in particular that data can become research data even after it has been archived if it is used for research.
Which very much aligns with the message of this blog.
Welcome to the information layer cake
Context, as we know, is everything to the archive and the archivist. Without it, you just have something, but you don’t know what.
Stephen Clark covers pretty much all of what I try to touch on above; he uses the information layer cake to describe data and information (and records) with one very effective image.

Create to maintain
I also just wanted to amplify this blog that I found while writing this one. I like the message about always writing code with a maintenance mindset, the practice it gives you for larger projects, and its other benefits.
— The coding disappointment: create to maintain (2017)