
Making DROID work with Wikidata

Wikidata is a good service, and Wikibase (on which Wikidata is built) is an even better platform.

I have spoken before about its potential to be added into the file-format registry ecosystem in a federated model.

If we are to use it as a registry that can complement the pipelines going into PRONOM, e.g. in vendors’ digital preservation platforms such as the Rosetta Format Library, Wikidata should be able to output different serializations of a signature file for tools such as Siegfried, DROID or FIDO.

And what about DROID?

Conversion to DROID

It’s not straightforward to say to a Wikibase/Wikidata Query Service, “output XML in the shape of a DROID signature file”, but it is straightforward to write a converter script.

I had this very thought last week while presenting with colleagues at a File Format Workshop at iPRES in Ghent.

It dawned on me that the conversion script would actually be simple thanks to a change in DROID’s signature file format, whereby DROID can process raw signatures itself, where previously they had to be pre-processed. It’s a long story; a simpler rendition is that DROID no longer requires DROID byte-code to record information about an identification pattern, and can instead store a signature as-is in an attribute of a byte sequence element, i.e. as a PRONOM formatted regular expression from PRONOM itself, or from Wikidata.

This realization resulted in my writing a conversion script (it took just over a half-day) during some down-time on the train home this past weekend.

The script is called wddroidy (after WD-40 🙄🥁) and can be found here: https://github.com/ross-spencer/wddroidy

Results

We can see using the skeleton suite from Richard Lehane’s Builder that we can positively identify files using the new signature file.

Screenshot of DROID showing results for skeleton files using the Wikidata signature file.

A screenshot showing more results in DROID, this time for the X-FMT puid type.

Links can also be made to work with Wikidata identifiers by modifying the PUID URL pattern in the DROID configuration, e.g. to:

http://wikidata.org/entity/%s
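As a rough illustration of what that pattern does, the sketch below fills the `%s` placeholder with an identifier, which is presumably how DROID builds the link for each result. The identifier `Q12345` is a hypothetical placeholder, not a real format entry, and the substitution shown is Python's own string formatting rather than DROID's code.

```python
# The PUID URL pattern from the DROID configuration, as given above.
PUID_URL_PATTERN = "http://wikidata.org/entity/%s"

# Hypothetical Wikidata identifier, used only to show the substitution.
identifier = "Q12345"

# Replace the %s placeholder with the identifier.
url = PUID_URL_PATTERN % identifier
print(url)
```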

The screenshot below shows where in the dialog that setting is:

Screenshot highlighting the PUID URL Pattern in DROID's configuration dialog.

Reference signature file

A reference signature file can be found in the wddroidy repository here. There are approximately 8119 file formats listed and 8195 file format signatures for those.

NB. We know there are different issues with Wikidata including how to identify a “format” and the quality of the signatures. We capture some of these in a global repository: https://github.com/ffdev-info/wikidp-issues/issues

DROID simplified format

The real headline here might be how easy it was to create the output using the DROID simplified format.

I have spoken about it briefly before but not in any detail.

In short, DROID no longer uses its own byte-code encoding, which included strange terms such as DefaultShift, Shift Byte, and SubSequence (instructions to DROID about how to perform a Boyer-Moore-Horspool search). See below and note especially how the bytes are split across Shift Byte attributes and elements:

<?xml version="1.0" encoding="UTF-8"?>
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2024-09-23T18:16:09+00:00">
  <InternalSignatureCollection>
    <InternalSignature ID="1" Specificity="Specific">
      <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
          <Sequence>255044462D312E34</Sequence>
          <DefaultShift>9</DefaultShift>
          <Shift Byte="25">8</Shift>
          <Shift Byte="50">7</Shift>
          <Shift Byte="44">6</Shift>
          <Shift Byte="46">5</Shift>
          <Shift Byte="2D">4</Shift>
          <Shift Byte="31">3</Shift>
          <Shift Byte="2E">2</Shift>
          <Shift Byte="34">1</Shift>
        </SubSequence>
      </ByteSequence>
    </InternalSignature>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
      <InternalSignatureID>1</InternalSignatureID>
      <Extension>ext</Extension>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>
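To give a feel for what that byte-code encodes, here is a minimal sketch that reproduces the shift values in the listing above from the hex sequence alone. The rule is inferred only from this one example (DefaultShift is the sequence length plus one, and each byte's shift is its distance from the end of the sequence); it is not taken from DROID's source. The function name `shift_table` and the handling of repeated bytes (the last occurrence wins) are assumptions.

```python
def shift_table(hex_sequence: str) -> tuple[int, dict[str, int]]:
    """Return (default_shift, {byte_hex: shift}) for a hex-encoded sequence."""
    data = bytes.fromhex(hex_sequence)
    length = len(data)
    # Bytes that do not occur in the sequence get the default shift.
    default_shift = length + 1
    shifts = {}
    for index, byte in enumerate(data):
        # Assumption: a repeated byte keeps the shift of its last occurrence.
        shifts[f"{byte:02X}"] = length - index
    return default_shift, shifts


default_shift, shifts = shift_table("255044462D312E34")
print(f"DefaultShift: {default_shift}")
for byte_hex, shift in shifts.items():
    print(f'Shift Byte="{byte_hex}": {shift}')
```

None of this needs to be computed by whoever writes the signature file once the simplified format is used.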

The updated format was made possible by Matt Palmer through his ByteSeek work, and DROID can now accept a regularly encoded PRONOM formatted regular expression (regex) in an attribute of the ByteSequence element. See here for a signature file equivalent to the above:

<?xml version="1.0" encoding="UTF-8"?>
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2024-09-23T18:16:09+00:00">
  <InternalSignatureCollection>
    <InternalSignature ID="1" Specificity="Specific">
      <ByteSequence Reference="BOFoffset" Sequence="255044462D312E34" Offset="0" />
    </InternalSignature>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="1" Name="Development Signature" PUID="dev/1" Version="1.0" MIMEType="application/octet-stream">
      <InternalSignatureID>1</InternalSignatureID>
      <Extension>ext</Extension>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>

The format is much easier to read, and after a bit of time sitting with the DROID signature file format you realize it is fairly easy to output as well. I use some very rudimentary templates in wddroidy using Python’s f-strings.
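To illustrate the idea, here is a minimal sketch of an f-string template that renders the simplified format. It is not the code from wddroidy: the function `signature_file` and the dictionary keys are hypothetical, and every signature is assumed to be anchored at the beginning of the file with an offset of 0, as in the example above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from xml.sax.saxutils import escape, quoteattr

NAMESPACE = "http://www.nationalarchives.gov.uk/pronom/SignatureFile"


def signature_file(formats: list[dict]) -> str:
    """Render a simplified DROID signature file from a list of dicts."""
    created = datetime.now(timezone.utc).replace(microsecond=0).isoformat()
    signatures, file_formats = [], []
    for idx, fmt in enumerate(formats, start=1):
        # quoteattr() escapes a value and wraps it in quotes for us.
        sequence = quoteattr(fmt["sequence"])
        name = quoteattr(fmt["name"])
        puid = quoteattr(fmt["puid"])
        mime = quoteattr(fmt["mime"])
        ext = escape(fmt["ext"])
        signatures.append(
            f'    <InternalSignature ID="{idx}" Specificity="Specific">\n'
            f'      <ByteSequence Reference="BOFoffset" Sequence={sequence} Offset="0" />\n'
            f"    </InternalSignature>"
        )
        file_formats.append(
            f'    <FileFormat ID="{idx}" Name={name} PUID={puid} Version="1.0" MIMEType={mime}>\n'
            f"      <InternalSignatureID>{idx}</InternalSignatureID>\n"
            f"      <Extension>{ext}</Extension>\n"
            f"    </FileFormat>"
        )
    signature_block = "\n".join(signatures)
    format_block = "\n".join(file_formats)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<FFSignatureFile xmlns="{NAMESPACE}" Version="1" DateCreated="{created}">\n'
        f"  <InternalSignatureCollection>\n{signature_block}\n  </InternalSignatureCollection>\n"
        f"  <FileFormatCollection>\n{format_block}\n  </FileFormatCollection>\n"
        "</FFSignatureFile>\n"
    )


xml_text = signature_file(
    [
        {
            "sequence": "255044462D312E34",
            "name": "Development Signature",
            "puid": "dev/1",
            "mime": "application/octet-stream",
            "ext": "ext",
        }
    ]
)

# Sanity check: the output parses and the sequence survives the round trip.
root = ET.fromstring(xml_text.encode("utf-8"))
ns = {"droid": NAMESPACE}
byte_sequence = root.find(".//droid:ByteSequence", ns)
file_format = root.find(".//droid:FileFormat", ns)
print(byte_sequence.get("Sequence"))
```

Escaping the values before they go into the template matters more with Wikidata than with PRONOM, since labels there can contain quotes or ampersands.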

It means other sources of PRONOM encoded signatures can output much simpler signature files and they can be used by DROID. I myself need to add it to the signature development utility – this would allow the utility to run standalone on anyone’s PC.

One next step for this approach might be to confirm that it does work entirely as expected by extracting all of PRONOM’s signatures proper and performing a mapping to the simplified format – if we can match against all the skeleton files in the latest Builder release then we should be looking good!

Priorities

I am always reminded, but always forget about priorities! This is part of how DROID resolves a file format into a single identifier, e.g. where SVG can match XML, we often want the more specific format returned, and so a priority is used to prioritize that one over the other, resulting in a single unambiguous identification for the DROID user. It manifests in the signature file as:

<FileFormat ID="634" MIMEType="image/svg+xml"
Name="Scalable Vector Graphics" PUID="fmt/91" Version="1.0">
  <InternalSignatureID>24</InternalSignatureID>
  <Extension>svg</Extension>
  <HasPriorityOverFileFormatID>638</HasPriorityOverFileFormatID>
</FileFormat>
More work needs to be done with Wikidata to understand if priorities can be properly applied to a DROID signature file. They are not written into the reference signature file above.
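As a rough sketch of the idea behind priorities (not DROID's actual implementation), the snippet below drops any matched format that another matched format has priority over. The function name `resolve_priorities` is hypothetical, and the IDs come from the example above, assuming 638 is the more generic format that SVG should win against.

```python
def resolve_priorities(
    matches: set[int], priority_over: dict[int, set[int]]
) -> set[int]:
    """Remove matches that are superseded by another, higher priority match."""
    superseded = set()
    for format_id in matches:
        # Only formats that actually matched can be superseded.
        superseded |= priority_over.get(format_id, set()) & matches
    return matches - superseded


# From the FileFormat element above: 634 has priority over 638.
priority_over = {634: {638}}

result = resolve_priorities({634, 638}, priority_over)
print(result)
```

When only the generic format matches, nothing is superseded and it is returned unchanged.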

Using the results

The results can be used for two things:

  1. (Probably) There are a greater number of patterns in the Wikidata output than in PRONOM. If you have a file that remains unidentified, you can try the reference file for clues as to what it may be. I’d use caution, though: investigate the exact byte sequence used for a match and understand its properties. I’d also check that the mapping looks accurate. I’ve tried one or two runs using the identifier and it looks good, but there may still be mistakes.
  2. For improving the quality of the sources in Wikidata. As you can see from the Skeleton suite there are a lot of gaps. We a) have a rough idea what these are, and b) know the identification doesn’t work via Wikidata. Why is that? Is the signature in Wikidata simply not good enough? Are patterns missing? Is there another error or issue we can help with given our expertise in file format identification?

Hacking wddroidy

You can hack wddroidy. Currently it allows you to limit the number of results returned, and also modify the ISO language code used by the tool. You can see this in the command line arguments:

python wddroidy.py --help
usage: wddroidy [-h] [--definitions DEFINITIONS] [--wdqs] [--lang LANG] [--limit LIMIT] [--output OUTPUT] [--output-date] [--endpoint ENDPOINT]

create a DROID compatible signature file from Wikidata

options:
 -h, --help show this help message and exit
 --definitions DEFINITIONS
   use a local definitions file, e.g. from Siegfried
 --wdqs, -w live results from Wikidata
 --lang LANG, -l LANG change Wikidata language results
 --limit LIMIT, -n LIMIT
   limit the number of resukts
 --output OUTPUT, -o OUTPUT
   filename to output to
 --output-date, -t output a default file with the current timestamp
 --endpoint ENDPOINT, -url ENDPOINT
   url of the WDQS

for more information visit https://github.com/ross-spencer/wddroidy

The actual SPARQL query used can be manually edited in the src folder. E.g. you can limit the query by format or family or classification. I provide some more inspiration in the Siegfried Wiki.

Let me know if it’s useful!

This is really just a quick hack and it needs a lot more testing to improve the quality of the output. Most issues can be dealt with on the Wikidata side I am sure, but some might need to be fixed in the tool. If it’s useful, reach out, and let’s discuss what can be changed or how it can be used in your work.

Data quality

It will quickly become apparent that the data quality isn’t what it is with PRONOM, and that is why a curated and authoritative service such as PRONOM is always going to be needed. As mentioned in previous talks, this can in theory be complemented with downstream data in federated databases. This might mean curating Wikidata better using some of the tools available, or curating data into a Wikibase (the platform Wikidata is built upon). Both options bring different benefits, such as creating a bigger tent of signature developers on Wikidata or, as another example, more expressive signatures being made available via federated Wikibases.

And a word on Wikiba.se

A reminder too that setting up a Wikibase can take some effort (I was once running three at the same time 😬), but a service called https://wikiba.se/ exists. wikiba.se could form an excellent scratch pad to begin thinking about mapping PRONOM-like data to a Wikibase, and also to begin solving some of the other issues around mapping container signatures and outputting those in a way that is compatible with DROID. Let me know if you give it a whirl, or want to collab on any of that.

Otherwise, thanks in advance! And enjoy wddroidy!
