We love our fiefdoms: The DPC and Wikidata

Andy Jackson is doing some positive work for the DPC, especially in his low-level look at Wikidata recently (external blog).

Specifically, his work on a SPARQL query we use in Siegfried might just solve a problem we have with downloading Wikidata results in the United States (and maybe elsewhere).

In his blog post (above) he describes some issues with the results returned from the existing file formats SPARQL query, specifically gaps in the results; he describes the difficulty of fully understanding some of the syntax used in the query language (very relatable); and he ends with a request for help, which we will get to below.

But first, there are two red flags I’ve noticed in this work:

1. The Wikidata File Format Query snippet was first cribbed from Siegfried without acknowledgement or engagement: https://github.com/digipres/sentinel/issues/13#issuecomment-1235451598
2. The suggestion in Andy’s post above that someone is ‘doing it wrong’ (I read the someone as the Wikidata community).

On 1. as a developer, there are things that take minutes to work out, and things that take days (Andy’s own getting to grips with Wikidata probably reflects some of my own learning efforts). The query for Siegfried took days to develop and, overall, weeks to test, refine, iterate, and make sensible for use in the Siegfried Harvest process. While it is only an example of declarative programming, and its license is unclear in the scheme of things, a more overt shout-out to the project and more open communication about aligning efforts would be nice, as would, given the role of the Registries of Good Practice project, more visibility for extensive existing efforts.

Wikidata: A Magic Portal for Siegfried and Roy

On 2. I have shared my views about owning the file format problem previously.

I have also been part of a more specific call to the community in the iPRES paper Wikidata: A Magic Portal for Siegfried and Roy.

Specifically under “Creating a together community”:

The corollary to engagement with the integration is that the digital preservation community also needs to learn how to engage with Wikidata in ways that are sympathetic with the Wikidata community guides and guidelines. We are potentially two communities that need to come together, not a digital preservation community squatting in a Wikidata world.

Wikidata makes it very easy to edit entries. It offers a discussion platform to discuss those changes too. Where new properties are added to the Wikidata schema then those can be discussed in forums of increasing importance as well. While the service is community governed, that governance is somewhat different to the idea of a central point of focus to the registries of the past.

There is a certain amount of “Wikidata literacy” that needs to be developed alongside this resource to help make it work. In this sense two communities can become a “together” community with that of “digital preservation” branching out beyond our national memory institutions or vendors as leaders in our community registry work, to that of the wider technology field. An unintended consequence of that effort too is that the wider technology world may have resources to help bolster “our” efforts.

People who would like to contribute file format signature information to Wikidata are welcome to add it. Digital Preservationists could share signature data via Wikidata, and could even use the platform to discuss with one another. Each item in Wikidata has an associated “Discussion” page just as each article in Wikipedia does. If someone needs clarification about how a signature was determined it is possible to ask that question on Wikidata and see if others reply.

In earlier drafts and maybe my iPRES presentation on this, it was very much framed as we cannot simply colonize Wikidata. We must recognize that there are many smart eyes looking at this shared service, and who may indeed think ‘we’ are doing it wrong.

Avoiding our colonization of the service does not begin with us saying someone is doing it wrong; we need to recognize it’s a conversation, and a process.

Wikidata anxiety

With some of my own anxiety as a professional (trying to have my own work visible to institutions and a community bigger than myself) exposed above, I recognize in Andy’s request for help an anxiety, specifically about how to engage with the Wikidata community (Wikidatans).

I have huge amounts of sympathy for this. It’s a different world from digital preservation. There are thousands of people engaged with this service daily, and of course, they’re not all engaged with file formats like we are.

My first reference for Andy is to look at the work of one of my colleagues on the Siegfried Wikidata project, Dr. Thornton, in Wikidata for Digital Preservationists, a technical guide they wrote for the Digital Preservation Coalition (DPC). The guide should provide some sense of confidence about engaging with Wikidatans.

My second is to refer back to the similar questions my colleagues and I raised in Wikidata: A Magic Portal for Siegfried and Roy, and to some of the tips I have around discussions in Wikidata and getting started.

My third is to refer Andy to a repository I created for some of these issues, WikiDP-Issues. I created this repository back in 2020 so that digital preservationists could have a space to talk about Wikidata safely outside of the Wikidata community and then consider the best approach to take matters to Wikidata or, the other way, to our own registry maintainers and so on. I hoped to compartmentalize the anxiety about the breadth of the Wikidata dataset and its participants into manageable chunks and tackle them piece by piece. I also wanted to create this space to ask others for help, but my bigger hope was that the community would see GitHub as a community space, for building a community together; in this case, tackling the Digital Preservation Wikidata issues together.

Linting and the quality of Wikidata data

What metric is most important when we look at file formats in Wikidata?

With the changes made in the most recent pull-request to Siegfried, linting with ./roy build -wikidataDebug still shows 195 issues outstanding in the data Siegfried tries to consume.

{
  "AllSparqlResults": 9284,
  "CondensedSparqlResults": 8118,
  "SparqlRowsWithSigs": 9284,
  "RecordsWithPotentialSignatures": 8118,
  "FormatsWithBadHeuristics": 46,
  "RecordsWithSignatures": 8072,
  "MultipleSequences": 12,
  "AllLintingMessages": [
    "Use the `-wikidataDebug` flag to build the identifier to see linting messages"
  ],
  "AllLintingMessageCount": 195,
  "RecordCountWithLintingMessages": 156
}

These all need to be addressed in some way to improve what we’re doing with Siegfried and Wikidata. Linting is used as a term in the programming sense and just provides a way to statically analyse the data we want to use. Linting is part of the process of rejecting data that is unsuitable or might cause other harm to Siegfried users, e.g. by producing completely incorrect format identification results.
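To put the linting summary above in proportion, here is a minimal Python sketch that derives a few ratios from the counts. The function name lint_ratios and the three derived measures are my own illustration, not part of roy or Siegfried, and reading RecordsWithSignatures as "potential records minus those with bad heuristics" is an assumption that merely happens to agree with the figures quoted.

```python
import json

# Counts copied from the linting summary quoted above (message list omitted).
SUMMARY = json.loads("""
{
  "AllSparqlResults": 9284,
  "CondensedSparqlResults": 8118,
  "SparqlRowsWithSigs": 9284,
  "RecordsWithPotentialSignatures": 8118,
  "FormatsWithBadHeuristics": 46,
  "RecordsWithSignatures": 8072,
  "MultipleSequences": 12,
  "AllLintingMessageCount": 195,
  "RecordCountWithLintingMessages": 156
}
""")


def lint_ratios(summary: dict) -> dict:
    """Derive a few illustrative ratios from the summary counts (hypothetical helper)."""
    potential = summary["RecordsWithPotentialSignatures"]
    flagged = summary["RecordCountWithLintingMessages"]
    return {
        # Assumed reading: records left once those with bad heuristics are set aside.
        "usable_after_rejection": potential - summary["FormatsWithBadHeuristics"],
        # Share of condensed records carrying at least one linting message.
        "share_with_messages": flagged / potential,
        # Average number of linting messages per flagged record.
        "messages_per_flagged_record": summary["AllLintingMessageCount"] / flagged,
    }


if __name__ == "__main__":
    for name, value in lint_ratios(SUMMARY).items():
        print(f"{name}: {value}")
```

On these numbers, roughly 2% of the condensed records carry a linting message, which is a small share of the whole but still 156 records that someone has to look at by hand.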

Andy’s blog describes one specific issue out of many that need addressing, some more serious than others (from a quality assurance perspective).

We need both a (relatively) complete output of the records that Wikidata hosts, as well as an understanding of the quality of the data in those records — the former is rendered less useful without the latter.

Andy’s request for help

Andy ends his blog with the following questions:

Is futzing with the query a reasonable approach? Or should I just work on improving the data?
How should instance of and subclass of be used in these cases? Is the existing documentation clear enough on this point?
Is it okay for me to just go in and change that stuff? Do I need to point to a policy?
Who decides what the form should be?
Do I need to tell someone they are doing it ‘wrong’?
If these policies change, how would I know?
Should I try to set something up that will find and flag ‘broken’ File Format records?
Are there other ways I can help?

My answer to all of these: yes, no, and it depends. My takeaways are as follows:

1. This is still not a singular problem for Andy or the DPC to solve. It’s important to educate and build a sustainable community approach.
2. They are not the first to ask these questions but, contrary to their project principles, they are not amplifying existing work that already asks them.
3. Futzing with a query is one approach, but it’s only an entry point to a dataset that will always be less than uniform (unless we work on custom Wikibases). Based on the Pareto Principle and reasonable handling of edge cases, we can probably push Wikidata query output to 85-90% coverage of what’s available, and output more file format records than we can even imagine in resources like PRONOM right now.
4. Tracking changes and additional sentinel-like tooling can follow the same principle.
5. Do go in and improve the data, and look at the Wikidata discussions:
   - The discussion entry on Photoshop (Q2141903) is empty and waiting to be used.
   - We can explore the extensive provenance of the record, where it is shown that PSD was first changed to an instance of file format family in 2016. We can ping user Pixeldomain, who changed this, and open a discussion as to why, how we might modify the record, or whether we should add new (instance of) file format specific ones.
   - Let’s also look at the potential of existing Wikidata Groups for engaging with Wikidatans, e.g. Wikidata Informatics.
6. Use the resources of the DPC to host a workshop on working with Wikidata and bring in actual Wikidata experts to help us each to understand how we become good participants in that community. Let’s listen.
7. Please try and throw some resources behind the analyses output through Siegfried’s linting (born out of necessity, and identifying gaps and the suitability of the resource), and those listed as specific issues in WikiDP-Issues (and maybe let’s not try and reinvent the wheel). These efforts can always be discussed, changed, evolved, but perhaps it starts by documenting that they are there.
8. Tooling is always good; more tooling in support of the above would be great. We can even extract some of these functions into separate modules and executables to complement any new initiatives.
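As a concrete illustration of what "futzing with the query" can look like, here is a minimal Python sketch that builds a SPARQL query accepting items that are either an instance of or a subclass of a root class. This is not the query Siegfried uses; build_format_query is a hypothetical helper, and treating Q235557 ("file format") as the root class is my assumption. P31 and P279 are Wikidata's "instance of" and "subclass of" properties. The function only builds a string and does not contact Wikidata.

```python
# Assumed root class: Q235557 is the Wikidata item for "file format".
FILE_FORMAT_QID = "Q235557"


def build_format_query(root_qid: str = FILE_FORMAT_QID, limit: int = 100) -> str:
    """Return a SPARQL query for items that are instances OR subclasses of root_qid.

    The property path takes one step along either "instance of" (P31) or
    "subclass of" (P279), then zero or more "subclass of" steps up the tree.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"""
SELECT DISTINCT ?format ?formatLabel WHERE {{
  ?format (wdt:P31|wdt:P279)/wdt:P279* wd:{root_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    # Print a query that could be pasted into the Wikidata Query Service.
    print(build_format_query(limit=10))
```

Widening the path this way pulls in records that are modelled inconsistently, which is exactly why it is only an entry point: the results still need linting, and the underlying modelling questions still belong in a Wikidata discussion.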

And finally: 9. Please review the literature, and stop siloing information: from the exclusion of the Siegfried thread here in the blog post, which points at answers from the community (and clues to more information for others!), to the lack of visibility in the initial creation of the Sentinel task. We have a growing body of information that can help Andy, and others after the DPC Registries of Good Practice project, to really get to grips with this, but if it’s not good enough, it should also be evaluated and exposed as such so that there’s an opportunity to improve. I, for one, am certainly tired of screaming into a void, trying to garner community involvement, and seeing bigger institutions ignore it to justify their own ownership of a space; a space that needs to be open, and together, and complete, and go beyond the bounds of our #digipres world.

Suggested resources that go back in time and look at Wikidata in even greater depth:

Some of my other work on Wikidata that may also be relevant:

Resources that go back even further:

Wikidata as a digital preservation knowledge base was first promoted circa 2016 by Katherine Thornton and Euan Cochrane of Yale University Library: https://openpreservation.org/blogs/wikidata-as-a-digital-preservation-knowledgebase/ and the portal is maintained in part by the Open Preservation Foundation: https://github.com/WikiDP/wikidp-portal.

Andy’s response

2024-07-21: Andy responded thoughtfully here shortly after this blog. While a few points are talked around and decoupled from their original meaning, there is also the addition of some important project context that may be added to the goals of the Registries of Good Practice. I applaud the desire to work in the open and look forward to continuing to read the outputs. I also hope that the quest to be open means that questions and engagement can continue to be not just open, but contextualized (cited), and direct.

IDCC2025

2024-11-16: The IDCC25 conference (February 2025) will be one to watch, with the DPC presenting Twenty Years Of Format Registries: Are We Ready To Preserve The Born-Digital World?