Contributing back to the commons in digital preservation hasn’t been for everyone.
We know the famous XKCD that touches on the underappreciated work of maintainers in obscurity. When you, or your institutions, or services are using free and open source software, or other information and data in the commons, and you’re not contributing back, you’re perpetuating this, and what’s more, there’s a virtuous cycle that we’re missing out on.
I read something the other day and it felt like a red flag.
When you move country you have to be prepared to change quite a lot about your life. Back at the end of 2020, apart from literally everything else going on, my partner and I also moved from Canada to Germany.
For me, this was my fifth or so international move (including shorter temporary stays) in as many years.
Being able to pick up sticks and move like that means living a drastically minimized life. Most of the things you have fit in a suitcase. Most of the things you have are small, and largely not overly whimsical. Sure, you can fit a few treasures into your bag, but you learn to value small ones, not things you might otherwise use to decorate an entire apartment!!
So, what do you do when you do have an apartment to decorate?
For a while back then I was into space flight again. Scientists, science communicators, and engineers were all excited for a new era of rocket launches and the potential unification of the human race as we look towards the future.
During that time I discovered Colin Fries’ work in the NASA History Division to document all NASA “Wake-up calls”. A wake-up call is simply a piece of music used to wake astronauts on missions, a different piece of music, daily, for the duration of the flight.
Take, for example, the last Space Shuttle mission, STS-135 (STS stands for Space Transportation System). It was in flight for 13 days; the wake-up call on day one was Coldplay’s Viva la Vida, while on day 13 it was Kate Smith singing God Bless America.
As a huge music buff who has the radio or music television on 18 hours a day, I really wanted to delve into this further. While Colin’s work is great, it’s just a PDF file (@wtfpdf). A PDF is not an ideal file format for querying data and gleaning new insights. So, while I wanted to explore it, I first decided to turn it into a true dataset. The result was a set of resources, a website, a JSON, a CSV, and an SQLite database which are each more functional and more maintainable over time.
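To give a feel for why a database is easier to query than a PDF, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, and helper functions are assumptions made for illustration and are not the schema of the published dataset; the two rows are simply the STS-135 examples mentioned above.

```python
import sqlite3

# Hypothetical rows: (mission, flight_day, artist, title).
# Only the two STS-135 examples from the text are included.
ROWS = [
    ("STS-135", 1, "Coldplay", "Viva la Vida"),
    ("STS-135", 13, "Kate Smith", "God Bless America"),
]


def build_db(rows):
    """Create an in-memory database with an assumed wakeup_calls table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wakeup_calls ("
        "mission TEXT, flight_day INTEGER, artist TEXT, title TEXT)"
    )
    conn.executemany("INSERT INTO wakeup_calls VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def songs_for_mission(conn, mission):
    """Return (flight_day, artist, title) tuples for one mission, in day order."""
    cur = conn.execute(
        "SELECT flight_day, artist, title FROM wakeup_calls "
        "WHERE mission = ? ORDER BY flight_day",
        (mission,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_db(ROWS)
    for day, artist, title in songs_for_mission(conn, "STS-135"):
        print(f"Day {day}: {artist}, {title}")
```

The same question asked of a PDF would mean scrolling and searching by eye; here it is one parameterized query, and the CSV and JSON versions can be loaded into the same shape.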
Like bricks and mortar in the building industry, or oil and acrylic for a painter, a primitive helps a software developer to create increasingly complex software, from your shell scripts to entire digital preservation systems.
Primitives also help us to create file formats. As we’ve seen with the Eyeglass example I have presented previously, the file format is, at its most fundamental level, a representation of a data structure as a binary stream: one that can be written from the data structure in code onto disk, and likewise read from disk back into a data structure.
For the file format developer we have at our disposal all of the primitives that the software developer has, and like them, we also have “file formats” (as we tend to understand them in digital preservation terms) that serve as our primitives as well.
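The idea of a file format as a data structure written out to a binary stream, and read back again, can be sketched with Python's `struct` module. This is a made-up toy layout (the `TOY1` magic number, the `Header` fields, and the function names are all assumptions for illustration) and not the Eyeglass format itself.

```python
import struct
from dataclasses import dataclass

# Hypothetical toy layout, little-endian with no padding:
#   4 bytes  magic number  b"TOY1"
#   1 byte   version       (unsigned)
#   2 bytes  width         (unsigned)
#   2 bytes  height        (unsigned)
MAGIC = b"TOY1"
LAYOUT = struct.Struct("<4sBHH")


@dataclass
class Header:
    version: int
    width: int
    height: int


def to_bytes(header: Header) -> bytes:
    """Serialize the data structure into its on-disk byte representation."""
    return LAYOUT.pack(MAGIC, header.version, header.width, header.height)


def from_bytes(data: bytes) -> Header:
    """Parse the first LAYOUT.size bytes of a stream back into a Header."""
    magic, version, width, height = LAYOUT.unpack(data[:LAYOUT.size])
    if magic != MAGIC:
        raise ValueError("not a TOY1 stream")
    return Header(version, width, height)


if __name__ == "__main__":
    original = Header(version=1, width=640, height=480)
    stream = to_bytes(original)
    print(stream.hex(" "))
    print(from_bytes(stream) == original)
```

The magic number at the front is also the kind of fixed byte sequence that format identification tools match against, which is why even a toy format like this one is identifiable.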
Today I want to showcase a Wikidata proof of concept that I developed as part of my work integrating Siegfried and Wikidata.
That work is wikiprov, a utility to augment Wikidata results in JSON with the Wikidata revision history.
For siegfried it means that we can showcase the source of the results being returned by an identification without having to go directly back to Wikidata. This might mean more exposure for individuals contributing to Wikidata. We also provide access to a standard permalink where records contributing to a format identification are fixed at their last edit. Because Wikidata is more mutable than a resource like PRONOM, this gives us the best chance of understanding differences in results if we are comparing siegfried+Wikidata results side-by-side.
I am interested to hear your thoughts on the results of the work. Let’s go into more detail below.
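As a rough illustration of what a permalink fixed at the last edit looks like, the sketch below builds a MediaWiki `oldid` URL from a list of revision records. It is not how wikiprov itself is implemented; the function names are hypothetical, the record shape (`revid`, `user`, `timestamp`) is assumed, and the revision numbers, users, and dates in the example are invented placeholders rather than real Wikidata history.

```python
from urllib.parse import urlencode

WIKIDATA_INDEX = "https://www.wikidata.org/w/index.php"


def permalink(qid: str, revision_id: int) -> str:
    """Build a link to one specific revision of a Wikidata entity."""
    query = urlencode({"title": qid, "oldid": revision_id})
    return f"{WIKIDATA_INDEX}?{query}"


def latest_revision(revisions: list) -> dict:
    """Pick the most recent revision; revision IDs only ever increase."""
    return max(revisions, key=lambda rev: rev["revid"])


if __name__ == "__main__":
    # Placeholder history for an example entity ID (all values invented).
    history = [
        {"revid": 1001, "user": "ExampleUserA", "timestamp": "2020-01-01T00:00:00Z"},
        {"revid": 1005, "user": "ExampleUserB", "timestamp": "2020-02-01T00:00:00Z"},
    ]
    last = latest_revision(history)
    print(last["user"], permalink("Q42", last["revid"]))
```

Because the link names a revision rather than the live page, two identification runs made months apart can each point at exactly the record they were built from.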
It reminds me of an unrealized article I wasn’t able to get written and into the wild, but it’s an important thought I would like to share nonetheless.
In an abstract proposed for James Lowry’s ACARM Symposium in 2015, I wanted to discuss what happens when government is unable to adequately fund day-to-day effort, and research and development, in the archive sector, leading to inefficient and potentially ineffective processing pipelines for records of archival value accessioned from government agencies and commissions.
It was just an abstract, but maybe folks have thoughts about this? Have we moved on since the early to mid 2010s? What modern metrics do we have available to us today to see the progress? What does the advent of the new US administration, as well as increasing worldwide authoritarianism, mean for issues like this?
Wikidata is a good service; Wikibase (on which Wikidata is built) is a better platform.
I have spoken before about its potential to be added into the file-format registry ecosystem in a federated model.
If we are to use it as a registry that can perhaps complement the pipelines going into PRONOM, e.g. in vendors’ digital preservation platforms such as the Rosetta Format Library, Wikidata should be able to output different serializations of a signature file for tools such as Siegfried, DROID or FIDO.
In March I was invited by the LD4 Wikidata Affinity Group to talk about my experiences using Wikibase with Siegfried, the file format identification tool. I don’t think I’ve talked about that work on here before but you can find links to my iPRES talk on my ORCID page.
Let’s look at the abstract and the content of the talk below.