wikidata + mediawiki = wikidata + provenance == wikiprov
Today I want to showcase a Wikidata proof of concept that I developed as part of my work integrating siegfried and Wikidata.
That work is wikiprov, a utility to augment Wikidata results in JSON with Wikidata revision history.
For siegfried it means that we can showcase the source of the results returned by an identification without having to go directly back to Wikidata, which might mean more exposure for individuals contributing to Wikidata. We also provide access to a standard permalink where the records contributing to a format identification are fixed at their last edit. Because Wikidata is more mutable than a resource like PRONOM, this gives us the best chance of understanding differences in results when comparing siegfried+Wikidata results side-by-side.
I am interested to hear your thoughts on the results of the work. Let's go into more detail below.
A (go)lang and winding road…
A lot of the work I do flies under the radar, and that is especially true here: my work on two tools, spargo and then wikiprov, was specifically about supporting my work on siegfried, a tool for format identification developed by Richard Lehane.
We’ve gone into the work before, but Yale University Library asked whether Richard or myself could add a Wikidata identifier to the tool. We could, and so we did!
Because siegfried uses no external dependencies, relying wholly on the Golang standard library (whether in siegfried itself or in the dependencies created by Richard), I didn't want to be the first to add anything external to the tool that couldn't be directly maintained, and so I created two SPARQL packages: spargo and wikiprov. While both libraries sit in my repositories, they only use standard library features and they can always be handed to Richard in the future.
spargo and wikiprov are both packages created for querying SPARQL endpoints in Golang. spargo is a generic library that can potentially be adopted in most SPARQL use cases. wikiprov gives this blog its title; it adds functionality specific to Wikidata, or more precisely MediaWiki+Wikidata, and is, I hope, far more interesting.
wikidata + provenance
We know that there is a complex history of edits that go into making resources like Wikipedia. In fact, developers have created ways to visualize or even listen to these edits, such as this by Stephen LaPorte and Mahmoud Hashemi.
It is no different with Wikidata.
Wikidata isn't that much different to Wikipedia. In fact, both sit on a technology called MediaWiki. Wikidata extends MediaWiki to create a graph of the underlying data as linked open data. This extension is also called Wikibase, and it gives users the ability to stand up their own Wikidata-like service with their own knowledge graph.
The linked open data in Wikidata is easy to access via its query service (the Wikidata Query Service, or WDQS). Its history is less easy to access via query. Where triples are used to represent data as it is now, as a graph, other linked open data models have a concept called named graphs that promotes detailed provenance and versioning by extending linked data statements over a second dimension (a separate graph). The named graph can be queried like the original graph, but it provides entirely meta information, allowing us to understand the source of the data.
For different reasons Wikidata doesn’t offer this functionality. Although clever folks are looking at it, I wanted a practical approach that had meaning in the context of siegfried+Wikidata, and so I found a different way around the problem.
wikidata as a snapshot
Using a named graph suggests to me some sort of dynamic querying, i.e. I am exploring a graph, and I want to retrieve different properties about that graph as I query. I might be querying live, or asking to update a query when I access another web resource, or something like that.
siegfried, like DROID, needs to access signature definitions that enable it to identify file formats. For siegfried, we really want to download a set of definitions once and then continue to use those as we identify our collections. As Wikidata is updated, another version of those signatures may be downloaded with new definitions, and we can use those. Just as DROID uses PRONOM, which has 119 versions of its own signature definitions, we're doing something less dynamic with Wikidata: something more analogous to taking a snapshot in time, a version, albeit a slightly more granular one depending on how often this 'snapshot' is taken.
This frees us.
We can take this snapshot out of the endpoint, or the WDQS, and dissect the contents of the file as data, and for the purposes of this blog, I think of it as a document — a JSON document — and we can create our own rules about how to parse and understand that document.
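To make that idea concrete, here is a minimal Go sketch of taking such a snapshot from the WDQS and writing it to disk as a plain JSON document. The query, the user agent string, and the output file name are placeholder assumptions for illustration only; this is not how siegfried builds its signature file.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// A trivial placeholder query: list a handful of items that are an
	// instance of (P31) file format (Q235557). siegfried's real harvest
	// query is much larger than this.
	query := `SELECT ?format ?formatLabel WHERE {
  ?format wdt:P31 wd:Q235557.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 5`

	req, err := http.NewRequest(
		"GET",
		"https://query.wikidata.org/sparql?query="+url.QueryEscape(query),
		nil,
	)
	if err != nil {
		panic(err)
	}
	// Ask the query service for SPARQL results as JSON, and identify
	// ourselves as WDQS etiquette asks.
	req.Header.Set("Accept", "application/sparql-results+json")
	req.Header.Set("User-Agent", "snapshot-sketch/0.0.1 (example only)")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Treat the response as a document: write it to disk now, parse it
	// later, on our own terms.
	snapshot, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("snapshot.json", snapshot, 0644); err != nil {
		panic(err)
	}
	fmt.Println("snapshot saved:", len(snapshot), "bytes")
}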
MediaWiki, our je ne sais quoi
I have mentioned that Wikidata extends MediaWiki. The Wikidata extension provides the ability to create and query triples as linked open data.
MediaWiki, on the other hand, provides the underlying capability to create articles and, alongside those articles, to record edit history.
Take, for example, my own file format Eyeglass (*.eygl). When a user clicks on view history, they can see the history here: https://www.wikidata.org/w/index.php?title=Q105858419&action=history
It is possible to access this history programmatically. Piped into jq:
curl \
  -s "https://www.wikidata.org/w/api.php?action=query&format=json&prop=revisions&titles=Q105858419&rvlimit=5&rvprop=ids|user|comment|timestamp|sha1" \
  | jq
The results:
{ "continue": { "rvcontinue": "20210412143356|1400469290", "continue": "||" }, "query": { "pages": { "101213417": { "pageid": 101213417, "ns": 0, "title": "Q105858419", "revisions": [ { "revid": 1866845623, "parentid": 1757596329, "user": "Renamerr", "timestamp": "2023-04-02T16:29:43Z", "sha1": "3f12a3371bbd490bb74dd4402283e8a897411e91", "comment": "/* wbsetdescription-add:1|uk */ формат файлу, [[:toollabs:quickstatements/#/batch/151018|batch #151018]]" }, { "revid": 1757596329, "parentid": 1533276464, "user": "A particle for world to form", "timestamp": "2022-10-25T04:53:04Z", "sha1": "31182afa67fd562b5c138ca5e0a41f865c643f3a", "comment": "/* wbeditentity-update-languages-short:0||ru */ формат файла" }, { "revid": 1533276464, "parentid": 1533275566, "user": "Beet keeper", "timestamp": "2021-11-24T13:40:17Z", "sha1": "f8bc9eec0e7d14910d11784b13ea0e1f464d5735", "comment": "/* wbsetclaim-create:2||1 */ [[Property:P973]]: https://exponentialdecay.co.uk/blog/genesis-of-a-file-format/" }, { "revid": 1533275566, "parentid": 1423307021, "user": "Beet keeper", "timestamp": "2021-11-24T13:37:33Z", "sha1": "e02a27d006766652038c91c758c148fe6534d875", "comment": "/* wbmergeitems-from:0||Q28600778 */" }, { "revid": 1423307021, "parentid": 1400469290, "user": "Edoderoobot", "timestamp": "2021-05-18T08:14:57Z", "sha1": "45521c5e1bd6bc3a8dfc70e7e0506d946bb48df7", "comment": "/* wbeditentity-update-languages-short:0||nl */ nl-description, [[User:Edoderoobot/Set-nl-description|python code]] - fileformat" } ] } } } }
We can see users who last edited this record, and we have some indication of what those edits were.
Given programmatic access to this data, as well as programmatic access to the triple data via the query service, we can begin to see opportunities to combine the two data sets.
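As a rough sketch of that programmatic access, again using only the Go standard library, fetching and printing the revision history might look like the following. The struct fields are trimmed to a few of the keys visible in the response above and are not wikiprov's own types.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// revision mirrors a subset of the fields returned by the MediaWiki
// revisions API, as seen in the curl example above.
type revision struct {
	RevID     int    `json:"revid"`
	User      string `json:"user"`
	Timestamp string `json:"timestamp"`
	Comment   string `json:"comment"`
}

// revisionResponse captures just enough of the response structure to
// reach the revisions array.
type revisionResponse struct {
	Query struct {
		Pages map[string]struct {
			Title     string     `json:"title"`
			Revisions []revision `json:"revisions"`
		} `json:"pages"`
	} `json:"query"`
}

func main() {
	params := url.Values{}
	params.Set("action", "query")
	params.Set("format", "json")
	params.Set("prop", "revisions")
	params.Set("titles", "Q105858419")
	params.Set("rvlimit", "5")
	params.Set("rvprop", "ids|user|comment|timestamp|sha1")

	resp, err := http.Get("https://www.wikidata.org/w/api.php?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var data revisionResponse
	if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
		panic(err)
	}
	for _, page := range data.Query.Pages {
		for _, rev := range page.Revisions {
			fmt.Printf("%s (oldid: %d): '%s' edited: '%s'\n",
				rev.Timestamp, rev.RevID, rev.User, rev.Comment)
		}
	}
}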
wikiprov file format
I describe this in more detail in the wikiprov README.
The standard response from any SPARQL endpoint looks something like the following in JSON:
{ "head": {}, "results": { "bindings": [{}] } }
The head describes the different parameters requested in the query. The results contain the different values of the triples that are returned.
I assumed most people will access these values using the keys head and results, and with no desire to break compatibility, I felt that instead of adding provenance somewhere within the existing results format, I could safely add another key, provenance, to this structure, creating:
{ "head": {}, "results": { "bindings": [{}] }, "provenance": {} }
Where provenance could now hold revision history for each of the objects returned in any given SPARQL query.
In siegfried, the bindings might describe format Q105858419, and in the provenance array we end up with a snippet of revision history for the format as follows:
{ "Title": "Q105858419", "Entity": "http://wikidata.org/entity/Q105858419", "Revision": 1866845623, "Modified": "2023-04-02T16:29:43Z", "Permalink": "https://www.wikidata.org/w/index.php?oldid=1866845623&title=Q105858419", "History": [ "2023-04-02T16:29:43Z (oldid: 1866845623): 'Renamerr' edited: '/* wbsetdescription-add:1|uk */ формат файлу, [[:toollabs:quickstatements/#/batch/151018|batch #151018]]'", "2022-10-25T04:53:04Z (oldid: 1757596329): 'A particle for world to form' edited: '/* wbeditentity-update-languages-short:0||ru */ формат файла'", "2021-11-24T13:40:17Z (oldid: 1533276464): 'Beet keeper' edited: '/* wbsetclaim-create:2||1 */ [[Property:P973]]: https://exponentialdecay.co.uk/blog/genesis-of-a-file-format/'", "2021-11-24T13:37:33Z (oldid: 1533275566): 'Beet keeper' edited: '/* wbmergeitems-from:0||Q28600778 */'", "2021-05-18T08:14:57Z (oldid: 1423307021): 'Edoderoobot' edited: '/* wbeditentity-update-languages-short:0||nl */ nl-description, [[User:Edoderoobot/Set-nl-description|python code]] - fileformat'" ] }
And this will look pretty consistent for each file format returned by our in-built query.
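If you want to consume the augmented results in your own code, the shape is simple to model. The structs below are a sketch for illustration, not wikiprov's actual types: the field names are lifted from the JSON above, I am treating provenance as a list of these records, and the input file name is purely hypothetical.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// provRecord mirrors the per-entity provenance snippet shown above.
type provRecord struct {
	Title     string   `json:"Title"`
	Entity    string   `json:"Entity"`
	Revision  int      `json:"Revision"`
	Modified  string   `json:"Modified"`
	Permalink string   `json:"Permalink"`
	History   []string `json:"History"`
}

// sparqlWithProv is the standard SPARQL JSON response with the extra
// provenance key sitting alongside head and results.
type sparqlWithProv struct {
	Head       json.RawMessage `json:"head"`
	Results    json.RawMessage `json:"results"`
	Provenance []provRecord    `json:"provenance"`
}

func main() {
	// "results.json" is a hypothetical file holding wikiprov output.
	raw, err := os.ReadFile("results.json")
	if err != nil {
		panic(err)
	}
	var res sparqlWithProv
	if err := json.Unmarshal(raw, &res); err != nil {
		panic(err)
	}
	for _, prov := range res.Provenance {
		fmt.Println(prov.Title, prov.Permalink)
	}
}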
Each time we get a result from siegfried we also get access to a permalink to the last update of the record providing the source data for the signature file. In the case of eygl today:
filename : 'example.eygl'
filesize : 14
modified : 2025-01-24T13:59:31+01:00
errors   :
matches  :
  - ns        : 'wikidata'
    id        : 'Q105858419'
    format    : 'Eyeglass format'
    URI       : 'http://www.wikidata.org/entity/Q105858419'
    permalink : 'https://www.wikidata.org/w/index.php?oldid=1866845623&title=Q105858419'
    mime      : 'application/octet-stream'
    basis     : 'extension match eygl; byte match at 0, 14 (Wikidata reference is empty)'
    warning   :
wikiprov the package and command
While wikiprov was created for siegfried, it can be used for any Wikidata (or Wikibase) query.
wikiprov package
The package is documented using Go best practices.
A runnable example might look as follows:
package main

import (
	"fmt"

	"github.com/ross-spencer/wikiprov/pkg/wikiprov"
)

func main() {
	var qid = "Q105858419"
	res, err := wikiprov.GetWikidataProvenance(qid, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(res)
}
Command line
I had a little fun creating the command line apps for this.
There are two apps, including a provenance-enhanced version of spargo, discussed below.
I created a new text-based executable format for SPARQL queries. The format utilizes a shebang, like in Unix, to allow the file to be interpreted when the spargo executable is in a suitable place on the path.
An example of a file that can be used to query Wikidata and return provenance:
#!/usr/bin/spargo

ENDPOINT=https://query.wikidata.org/sparql
WIKIBASEURL=https://www.wikidata.org/
HISTORY=3

# subject, predicate, or object can all be used here. I have elected for
# ?subject as it outputs more information.
SUBJECTPARAM=?subject

# Describe JPEG2000 in Wikidata database.
describe wd:Q931783
A simplified version can be called without provenance by removing the WIKIBASEURL and HISTORY fields.
Another simple example:
#!/usr/bin/spargo

ENDPOINT=https://query.wikidata.org/sparql
WIKIBASEURL=https://www.wikidata.org/
HISTORY=5
SUBJECTPARAM=?item

# Default query example on Wikidata:
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} limit 1
With /usr/bin/spargo correctly set up, these can be run with ./path/to/file.sparql, and results will be output to the command line and can be parsed with jq.
The file format is documented and there are examples in the wikiprov spargo directory.
As an alternative to adding executable permissions to your queries, you can also pipe the query into the app, so with spargo on the $PATH you can do something like:
cat /path/to/my/query.sparql | spargo
And the results will be output to the terminal.
wikiprov command line
The wikiprov command line utility is a simple tool for returning the latest information about a given Wikidata ID (QID). It can be studied as a brief example of how to call the wikiprov library.
Example CLI options:
wikiprov: return info about a QID from Wikidata
usage: wikiprov <QID e.g. Q27229608> {options}
       OPTIONAL: [-history] ...
       OPTIONAL: [-version]

output: [JSON]   {wikidataProvenace}
output: [STRING] 'wikiprov/0.0.0 (https://github.com/ross-spencer/wikiprov; all.along.the.watchtower+github@gmail.com) ...'

Usage of ./wikiprov:
  -demo
        Run the tool with a demo value and all provenance
  -history int
        length of history to return (default 10)
  -qid string
        QID to look up provenance for
  -version
        Return version
Command line releases
The command line tools can be found in GitHub under releases.
Inspecting siegfried
Richard has documented the $HOME folder for siegfried here.
If you have built a Wikidata identifier, or you have run sf -update wikidata or sf -update deluxe, you should be able to inspect the wikidata results using tools like jq or your own libraries.
My wikidata definitions, for example, are at: /home/ross-spencer/.local/share/siegfried/wikidata.
Improvements to the format
Trade-offs with complexity
Ethan Gates first reported an issue in 2022, and Tyler has also been suffering from the same issue until now.
Because we're not just going out to a single endpoint, we're going out to two, we have two potential sources of failure. In fact, my first response to Ethan assumed this was going to be a problem with the Wikidata query. I worked on this premise for a good while, even going so far as to start writing a mirror service to make results available from sources other than the WDQS. None of my ideas worked.
It turned out the problem was the revision history from MediaWiki. Essentially, both services return the same error when a process takes too long, e.g. requesting 8000+ records, but my own experience had taught me Wikidata was more likely to take too long. That may have been the case once, but now I was seeing the same with MediaWiki.
My analysis in January 2025 taught me to be kinder to the MediaWiki API. Their instructions include a directive to use the Retry-After value in their HTTP response when the server is overloaded or busy. I ignored this the first time around, but I have implemented it now. I have also made sure that siegfried can use a -noprov flag when downloading Wikidata so that testing is never impacted by the inability to download revision history. We still get revision history by default, and this should pretty much always work now, but it's still good to have both options out there.
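For anyone writing something similar against the MediaWiki API, the gist of honoring Retry-After looks something like the sketch below. This is a generic illustration, not the code in wikiprov; it handles the seconds form of the header and retries on the usual busy status codes.

package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// getWithRetry issues a GET and, when the server says it is busy or
// overloaded, waits for the period advertised in Retry-After before
// trying again.
func getWithRetry(url string, attempts int) (*http.Response, error) {
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		busy := resp.StatusCode == http.StatusTooManyRequests ||
			resp.StatusCode == http.StatusServiceUnavailable
		if !busy {
			return resp, nil
		}
		// Seconds to wait; a small default if the header is absent or
		// not a plain number.
		wait := 5
		if retryAfter := resp.Header.Get("Retry-After"); retryAfter != "" {
			if seconds, err := strconv.Atoi(retryAfter); err == nil {
				wait = seconds
			}
		}
		resp.Body.Close()
		time.Sleep(time.Duration(wait) * time.Second)
	}
	return nil, fmt.Errorf("gave up after %d attempts: %s", attempts, url)
}

func main() {
	const revisions = "https://www.wikidata.org/w/api.php?action=query&format=json&prop=revisions&titles=Q105858419"
	resp, err := getWithRetry(revisions, 3)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}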
A note of thanks
I also want to thank both Ethan and Tyler for reporting the issue, and giving me the excuse today to write up this part of the siegfried+Wikidata process in more detail for those who may be interested. While I wish I had realized the problem sooner, I am grateful to have been able to make my libraries and tooling more robust as a result.
Bei der Buche
Today’s image is one I photographed in 2021 in Warberg near Stuttgart.
The image was selected to conjure the idea of a trail (a provenance trail). It turns out it also represents a false impression: where I labored under the idea that Stuttgart had what I thought were ancient ruins that folks could still visit, I found out that it is instead a vast architectural art installation, built as part of the International Horticultural Exhibition in 1993.
Bei der Buche translates roughly to "At the Beech" and is an installation incorporating a beech tree (the mother among the trees) by the architect and photographer Karina Raeck.
More about the Beech Tree: https://www.europeanbeechforests.org/world-heritage-beech-forests/germany.
@exponentialdecay