Versioning as memory?

So, it turns out my theme of the moment is code hygiene (or maybe memory?).

Today I am thinking about versioning, especially in relation to its impact on digital preservation; both software preservation and the impact of versions on long-term preservation efforts in other contexts.

The more well rounded my software development practices become, the more I develop code I share with individuals and work with them on, and the more I combine it with my knowledge of digital preservation, the more versioning becomes an important concern of mine.

Maybe it is obvious to some, and in some ways of course, it was to me as well (I’ve made use of versions and needed to track them in other software); but it’s easy to be laissez-faire about it, e.g. when you’re scripting, you’re often changing things around quickly, you’re not always committing code to source control, you can often get away with what you last saved.

Also, it’s not straightforward. It relies on developing very good practices when doing things without source control, or deepening your knowledge of source control systems and integrating new tooling when you do.

I have been using Goreleaser for a few years in Golang and this allows me to embed GitHub tag and commit information directly into the “executable” (program) itself (this is like having fixity with super-powers (more below)). And for Python I make use of a build tool (setuptools_scm) that allows me to package my code with version information directly from Git tags that I can access at runtime.

In my Go code I can create snapshots (in-between release releases) that I can reliably refer to with versioning, but in Python, I haven’t anything so strong.

But I am glad to have something for Python, and providing I use something like the requirements.txt file format to describe and pin my dependencies for my code, and I use good Git commit and tagging practices I am able to trace code down to the version.

Fixity with super powers

For a long time I worked for a vendor that didn’t have a reliable way of versioning its software releases. You would often have a separate ticket to be actioned sometime during the release cycle that would be to bump a version of the software via its own database script or via a string somewhere in the source code.

This version information was, therefore, only especially useful if done correctly at the correct time in the release cycle, and of course while it helped to provide some indication of date ranges between one bug and another, e.g. determine some bug appeared it appeared after version X and before release Y, but only during what could be a very large number of changes.

We can encode so much more when we use source control management systems (SCM) such as Git which create a SHA1 sum for a commit. As a fixity value with super powers it contains:

In a commit: When you make a commit, a commit object is created. This commit object points to the top-level tree object representing the state of the repository at that commit. The commit itself is also identified by a unique SHA-1 hash, based on its content and metadata (including the tree it points to, the parent commit hash, author, and message).

NB. While I try to always start projects from good principles of versioning today, there are good reasons why it’s easier for me to start a greenfield project thinking about these things than trying to address it in sprawling legacy systems with a huge number of dependencies so there’s no judgement here about why it was difficult at the time.

So, the fixity for a commit refers to software as it existed at a very specific state with a very specific set of metadata including the date, the commit message, and author. (If you have access to a Git repository you can see how it changes with a single action (amend) using the git commit --amend command:

commit 9bbfa33df3a76c4704bd15e361defa46c6a974f8 (HEAD -> main, origin/main, origin/HEAD)
Author: ross-spencer <all.along.the.watchtower2001+github@gmail.com>
Date: Tue Dec 3 23:12:27 2024 +0100

Fixup workflow


commit d6edc9763e9be470531273a98bb5c290b898eada (HEAD -> main)
Author: ross-spencer <all.along.the.watchtower2001+github@gmail.com>
Date: Tue Dec 3 23:12:27 2024 +0100

Fixup workflow

I can also tag the commit so it gets even more meaning, e.g. git tag -a 0.5.0 -m "0.5.0 performance enhancements". The tag can be created on different branches on different commits and it provides a short name for me to refer to the software as it was “fixed” at that point in time.

If I then tie the build system to Git (or other SCM) I can embed this information somewhere as already discussed and it means I have an unambiguous way to refer to version of a piece of code. My app’s version command could look as follows:

{
"myapp": {
"user-agent": "myapp/0.5.0 (0.5.0)",
"commit": "6378e9d597beee195e02cb4b27923626514d5b67",
"date": "2024-11-20T07:46:08Z"
}
}

If someone comes to me and says version 0.5.0 isn’t working for them I can recall it from source control using: git checkout 0.5.0. Likewise if all they have is a commit, e.g. git checkout 6378e9d597beee195e02cb4b27923626514d5b67 or git checkout 6378e9 for short.

Tracking things in digital preservation

We need to be able to track things in forensic detail in this discipline. It might be an object that we are preserving, it might be a interconnected set of dependencies that form an archival object, it might be a part of our workflow that’s using some software that’s outputting something incorrectly, it might be the version of something we store, or something that output something we store, e.g. an AIP.

Things go wrong, but the more forensic the version information we can create for people when we write our software (version its code, and its outputs) the greater the fighting chance people investigating issues have, for example, being able to say, “I have an issue here, but I have been able to determine it only affected outputs between commit badf00d1 and ba5eballor between tags 1.0.11 and 1.1.0-rc1“ (we can pinpoint the number of other commits or release that were in the public sphere). It makes specific where remediation efforts should be focused and it reduces the cost of remediation, and it gives users something concrete to talk to others about when they have a problem,

Versioning as memory

I hope with this writing I am stating the obvious to many writing code in the discipline today. My primary motivation recently has been to put down some things that I felt extremely frustrated weren’t accessible to me in the past but which I am incredibly proud to have strategies for today that I feel might be beneficial or useful for others to read about. It makes me happy the more normal and straightforward it becomes.

Between my writing about testing earlier in the week, and this, I guess I am writing about the ways that software can be appropriately anchored in time so that it can be reasoned about in its different roles as an agent that might impact something in our digital repositories – as an artifact that might one day be preserved in our repositories, or something that might have a long lifespan within our systems and workflows. In that regard, like the more traditional memory (as experience) we encode in unit tests in the previous blog, versioning is memory (a fixed point in time), which is why I guess developers call them snapshots! Given the tooling to be very precise about these snapshots, using it will only benefit us.

Addendum: okay Mister Freelance Indie Dev

I do a lot of development for myself and I develop a lot of my strategies there nowadays. I am also fortunate to be in a gig where I can direct a lot of my efforts for myself, and so I bring my strategies into that role and I can implement and perfect them in that role most of the time.

I have also been in a position where I wasn’t being trained to write code better. I was just writing it using the skills I had taught myself, and even in the software house I worked for, while I earned “bonus” skills, there wasn’t a training program, a robust mentoring scheme, or even guidelines, and time and money and tooling was limited.

If time is too hard to find to practice writing tests, or integrating code-bases with source control systems (that you might not even have direct access to given the restrictions in your org) then I am simply not going to judge. It’s a complicated field and difficult to secure these things.

What I will say is organisations should support you as much as possible to develop code bases with good hygiene early, and testing, linting, and versioning are all part of that. It will save immeasurable amounts of time in the future and might even pay off in reputation, or who knows, funding streams somewhere down the line. They will also be investing in you and your future whether you start writing more and more intricate software for your department or start to share your knowledge with team members, improving their skills — or maybe you simply one-day move on, having given all you can, but at least you will move on with fond memories of the organisation that helped lift you up.

For the expense of a bit more setup time, maybe you can find a template library like the one I have for python (I know there are other templates with good tooling for documentation and other languages) and you can start your code from scratch with at least some baseline tooling needed to start and keep code moving in the right direction.

Fixity with super powers

Tracking things in digital preservation

Versioning as memory

Addendum: okay Mister Freelance Indie Dev

Leave a Reply Cancel reply

Follow ross spencer :: exponentialdecay.digipres :: blog