In March I was invited by the LD4 Wikidata Affinity Group to talk about my experiences using Wikibase with Siegfried, the file format identification tool. I don’t think I’ve talked about that work on here before but you can find links to my iPRES talk on my ORCID page.
The abstract for the talk was as follows:
In 2020 Ross Spencer worked alongside the team at Yale University Library (Dr Kat Thornton, Euan Cochrane) and Siegfried Developer Richard Lehane to develop an integration for the popular file format identification tool, and Wikidata. The work has also been expanded with a little careful configuration to enable the integration to work with custom Wikibases as well. In this LD4 Wikibase working hour Ross will describe the Siegfried/Wikidata/Wikibase integration in more detail; engagement between the digital preservation and wikidata community; and the benefits he sees in this work for the discipline.
Wikibase is an extension to MediaWiki, the software platform that Wikipedia runs on. Wikidata is the most famous Wikibase implementation, but it is also possible for the casual user to run one for themselves.
Okay, previously casual would mean with a bit of development and software deployment experience, but soon, more users should have access via Wikibase Cloud.
While I was running three instances of Wikibase this time last year (each for different purposes), my major experience so far came from developing an integration between the file format identification tool Siegfried and Wikidata.
The work is currently idle, though there is lots more to be done, and you can get a view of what remains on GitHub.
I used the opportunity at the LD4 Wikidata event to describe the work, the remaining gaps, and the effort needed from the digital preservation community to make more of what we developed connecting file format identification to Wikidata.
The video via Google Drive: https://drive.google.com/file/d/1sAIfHjY3p33e7e467tTBW9EtYLZ7sK-I/view
I will come back to the topic on here, I am sure. All I want to emphasize today is that the Siegfried/Wikidata integration is a great win for the digital preservation community, giving folks access to an open platform for the development and use of file format signatures. It is both a sandbox to work from and a staging area that can be used to improve PRONOM. It could potentially stand on its own merit as a registry, but it requires massive amounts of data clean-up and some data-modelling work to make it fit for production use. The former issue can be worked around with filters I have developed and published recipes for.
My intermediate conclusion is that actually, what we need to do is create a federated network of Wikibase instances that take a subset of the Wikidata graph to act as a schema to enable integration with format identification. On top of that, we fill in the data modelling gaps that enable other format identification methods such as container identification – the modelling for which is unlikely to be compatible with Wikidata.
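To make the idea of "a subset of the Wikidata graph" a little more concrete, here is a minimal Python sketch of how one might pull just the format records that carry a PRONOM identifier from a SPARQL endpoint. It is an illustration only, not the query Siegfried itself uses. The property IDs are Wikidata's (P2748 for the PRONOM file format identifier, P1195 for file extension); a custom Wikibase would have its own property IDs and prefixes. The function names, the user-agent string and the sample record are all made up for the example.

```python
import json
import urllib.parse
import urllib.request

# Wikidata's public SPARQL endpoint; a custom Wikibase would expose its own.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(limit=10):
    """Build a SPARQL query for items that have a PRONOM identifier.

    P2748 (PRONOM file format identifier) and P1195 (file extension) are
    Wikidata property IDs; they are assumptions for any other Wikibase.
    """
    return f"""
SELECT ?format ?formatLabel ?puid ?extension WHERE {{
  ?format wdt:P2748 ?puid .
  OPTIONAL {{ ?format wdt:P1195 ?extension . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


def flatten_bindings(results):
    """Turn a standard SPARQL JSON result into a list of plain dicts."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows


def fetch_formats(endpoint=WIKIDATA_ENDPOINT, limit=10):
    """Run the query against an endpoint (needs network access)."""
    params = urllib.parse.urlencode({"query": build_query(limit), "format": "json"})
    request = urllib.request.Request(
        endpoint + "?" + params,
        # Hypothetical user agent; use something that identifies you.
        headers={"User-Agent": "format-subset-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return flatten_bindings(json.load(response))


if __name__ == "__main__":
    # Offline demonstration with a made-up record in SPARQL JSON shape.
    sample = {
        "results": {
            "bindings": [
                {
                    "format": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
                    "puid": {"type": "literal", "value": "fmt/0"},
                }
            ]
        }
    }
    print(flatten_bindings(sample))
```

The rows that come back could then be loaded into a separate Wikibase and extended with whatever extra modelling is needed there, which is the part Wikidata itself is unlikely to accommodate.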
The modelling issues run a little bit deeper, but none are insurmountable. Except…
I don’t have any sponsorship to continue working on Wikibase on my own. I do not have an institution behind me and so I don’t have access to the kind of resources that I need. In lieu of backing, I need the community to step up behind me to help build interest and to help iron out what’s left. Also, despite the outreach I have been able to muster, folks who have mooted some sort of collaboration or furthering of the effort have remained pretty quiet on the subject. Recently, opportunities have been missed in both Glasgow and the Netherlands.
This is a cool project I would love to remain involved with, and as an independent contractor there are options to make that happen. What I don’t want to see is the wheel reinvented. I would love to see the model developed in a decentralized and open way, but that’s a lot to ask of a field that is packed with institutions who have tried to own “the registry” problem for years now. Learn more about my views on that particular subject in my Glasgow short talk as part of “Registering our Preservation Intentions” at iPRES 2022 (and my slide-deck for that one here).