PRONOM release statistics
My contribution to PRONOM research week 2023 (held in November 2023) is a PRONOM summary website and Application Programming Interface (API). The idea is to provide a snapshot of the current state of the PRONOM database and to prompt the community to complete the records that remain incomplete. Hopefully it complements the descriptathon described by Francesca Mackenzie and the PRONOM team, and provides automation that will help with producing descriptathon lists in the future.
The summary web-page and API can be found at the following links:
- Summary page: https://pronom.ffdev.info
- API: https://api.pronom.ffdev.info/docs
But it’s January?!
Yes, it’s a little late to the party! It’s fashionable, right? Well, actually, I proposed the idea to the team at the regular PRONOM drop-in back in October; it has just taken a bit longer to put together than anticipated.
The release was built in three phases:
- Development of generic PRONOM tooling.
- Development of an API.
- Development of a web-page using the API.
PRONOM tools
The generic PRONOM tools (‘pronom-tools’) are released as a package on PyPI and offer a number of command line options for generating statistics about PRONOM and for accessing information such as links to the most recent signature files.
- PyPI link to pronom-tools.
The idea of the tooling is to enable the download of all of the important PRONOM files and then produce statistics about what’s there. This isn’t straightforward at present. For example, to generate statistics about the signatures that are available in a PRONOM release, the standard PRONOM export must also be cross-referenced with the DROID container signature file. Only then can you see the number of file formats with some sort of signature that can be used by DROID and Siegfried.
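The cross-referencing step can be sketched in a few lines of Python. This is only an illustration and not the pronom-tools implementation: the identifiers are made-up sample values and the function names are hypothetical. It assumes the sets of identifiers have already been extracted from the standard export and from the container signature file.

```python
# Illustrative sketch only: the sample identifiers and function names are
# hypothetical and are not taken from pronom-tools or a real PRONOM release.


def formats_with_signatures(standard: set, container: set) -> set:
    """Return identifiers covered by a standard or a container signature."""
    return standard | container


def formats_without_signatures(all_formats: set, standard: set, container: set) -> set:
    """Return identifiers that no signature of either kind covers."""
    return all_formats - formats_with_signatures(standard, container)


# Every identifier listed in the (hypothetical) standard export.
all_formats = {"fmt/1", "fmt/2", "fmt/3", "x-fmt/4"}
# Identifiers with a standard signature in that export.
standard = {"fmt/1", "fmt/2"}
# Identifiers with a container signature, read from the container signature file.
container = {"fmt/2", "fmt/3"}

covered = formats_with_signatures(standard, container)
missing = formats_without_signatures(all_formats, standard, container)
print(f"{len(covered)} of {len(all_formats)} formats have a signature")
print(f"Still missing: {sorted(missing)}")
```

Taking the union before counting matters because a format can appear in both files, and counting each file separately would overstate coverage.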
Statistics produced by the PRONOM tools also distinguish between the following:
- File formats requiring signatures.
- File formats that need descriptions.
- File formats that need both signatures and descriptions.
These are some of the ways that the PRONOM community talks about the work that still needs completing in the current database, and the breakdown is very much inspired by the aforementioned descriptathon.
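As a rough sketch of how the three categories above might be derived, the Python below groups records by the work they still need. The `FormatRecord` class and its fields are hypothetical simplifications, not the data model used by pronom-tools, and the sketch assumes that a record missing both a signature and a description is counted in all three groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatRecord:
    """Hypothetical, simplified view of a PRONOM record."""

    puid: str
    has_signature: bool
    has_description: bool


def group_outstanding_work(records):
    """Group record identifiers by the kind of work they still need."""
    groups = {"signature": [], "description": [], "both": []}
    for record in records:
        if not record.has_signature:
            groups["signature"].append(record.puid)
        if not record.has_description:
            groups["description"].append(record.puid)
        # Assumption: records missing both are also listed in the two groups above.
        if not record.has_signature and not record.has_description:
            groups["both"].append(record.puid)
    return groups


# Made-up sample records for illustration.
records = [
    FormatRecord("fmt/1", has_signature=True, has_description=True),
    FormatRecord("fmt/2", has_signature=False, has_description=True),
    FormatRecord("fmt/3", has_signature=True, has_description=False),
    FormatRecord("fmt/4", has_signature=False, has_description=False),
]

outstanding = group_outstanding_work(records)
for kind, puids in outstanding.items():
    print(f"Needs {kind}: {puids}")
```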
Additional PRONOM tooling
One of my goals over the past decade was to write some sort of alert for PRONOM releases. I haven’t quite got there with this release of the PRONOM tools but a shakedown of the code showed me what was missing and I will implement this soon.
When complete, I can set up the utility ‘pronom-cron’ to check for a PRONOM release a couple of times a day and update the statistics automatically when one becomes available.
This might have other uses too, for example, updating tooling that relies on PRONOM, such as digital preservation systems that require up-to-date information about file formats.
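The core of such a check is small: compare the latest release number with the last one seen and act only when it has increased. The sketch below is a guess at the shape of that logic and not the ‘pronom-cron’ code. The function names are hypothetical and the lookup is a stand-in that returns a fixed number rather than contacting PRONOM.

```python
from typing import Callable, Optional


def new_release(last_seen: int, fetch_latest: Callable[[], int]) -> Optional[int]:
    """Return the latest version number if it is newer than last_seen, else None."""
    latest = fetch_latest()
    return latest if latest > last_seen else None


def fake_fetch() -> int:
    """Stand-in for a real lookup (hypothetical); always reports version 116."""
    return 116


# Simulate two scheduled runs: one before and one after v116 has been seen.
for last_seen in (115, 116):
    release = new_release(last_seen, fake_fetch)
    if release is None:
        print(f"v{last_seen} is current; nothing to do")
    else:
        print(f"v{release} is available; statistics need regenerating")
```

Passing the lookup in as a function keeps the comparison testable without a network connection, and a scheduler such as cron could then run the check a couple of times a day.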
PRONOM statistics API
The PRONOM statistics API is designed to allow users to access information about PRONOM without necessarily having to download the pronom-tools suite and further copies of the PRONOM database.
The statistics try to augment existing PRONOM data and specifically try to avoid duplicating anything already available.
The API provides a convenient data-oriented view of the PRONOM database and, for the purposes of this work, allows me to easily build a website on top. You might be able to find other more interesting and exciting ways of accessing and remixing this data.
PRONOM statistics page
The PRONOM statistics page accesses various endpoints made available in the PRONOM statistics API to output a website that describes the state of PRONOM at the time the site is accessed.
Promoting the PRONOM cause
The idea is that the release summary will provide users with useful information about PRONOM today and inspire those that use the service or wish to get involved to contribute new data.
Take for example the statistics for the v116 release of PRONOM:
- Number of records: 2377
- Records with complete descriptions: 1840
- Records with signatures: 1983
- Records requiring descriptions: 537
- Records requiring signatures: 394
We can see how comprehensive PRONOM is but we can also see the work that remains outstanding.
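To put those v116 figures in proportion, the short Python calculation below turns them into percentages. The counts come from the list above; the variable and function names are my own and purely illustrative.

```python
# Counts from the v116 summary quoted above.
total = 2377
complete_descriptions = 1840
with_signatures = 1983
requiring_descriptions = 537
requiring_signatures = 394


def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Sanity check: each pair of counts accounts for every record.
assert complete_descriptions + requiring_descriptions == total
assert with_signatures + requiring_signatures == total

described = percentage(complete_descriptions, total)
signed = percentage(with_signatures, total)
print(f"Complete descriptions: {described}%")
print(f"With signatures: {signed}%")
```

That works out at roughly 77.4% of records with complete descriptions and 83.4% with signatures.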
Users of the PRONOM statistics page get easy-to-access links to the records that require descriptive information or, just as importantly, file format signatures to enable automatic identification using tools such as DROID, Fido, and Siegfried.
Open source and the capacity to remix
The website was developed in such a way as to allow anyone to host it, or to use it as a foundation for remixing what’s there and providing a different view of the data.
The API enables a lot of this too, and users are invited to build anything that inspires them on top of it. Users are also invited to submit new ideas for statistics that can be shown or other views of the data. For example, users might want a comma-separated values (CSV) summary of the PRONOM dataset.
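As one idea of what a CSV summary could look like, here is a minimal Python sketch using the standard library. The column names and sample rows are hypothetical; they are not the fields returned by the API.

```python
import csv
import io

# Hypothetical column names for illustration; not the API's actual fields.
FIELDS = ["puid", "name", "has_signature", "has_description"]


def to_csv(records) -> str:
    """Serialise a list of record dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample rows.
records = [
    {"puid": "fmt/1", "name": "Example format A", "has_signature": True, "has_description": False},
    {"puid": "fmt/2", "name": "Example format B", "has_signature": False, "has_description": True},
]

csv_text = to_csv(records)
print(csv_text, end="")
```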
GitHub repositories for this work
The following GitHub repositories are used for this work:
- PRONOM release tools and API: https://github.com/ffdev-info/pronom-release-tools
- PRONOM statistics page: https://github.com/ffdev-info/pronom-page
ffdev-info
The ffdev-info Organization is a new organization I have set up on GitHub to collect my work around file formats, including my work with Wikidata.
- ffdev-info on GitHub: https://github.com/ffdev-info
Relationship to PRONOM
I have long promoted different views of the PRONOM dataset. This work is not intended to replace anything PRONOM and the PRONOM team might want to do. In the short to medium term it attempts to fill a small gap in the PRONOM development workflow. In the long term I hope it will inspire developers of PRONOM to consider building access to these different views and statistics into the core PRONOM API.
Relationship to digipres.org
Digipres.org also provides summary statistics about PRONOM. Digipres.org is currently focused on overall coverage.
The work described in this blog, on the other hand, is there specifically to provide a view of PRONOM development gaps and a more data-oriented view of the PRONOM dataset as a whole.
Hopefully the two complement each other. In future the two could potentially converge. The pronom-tools package can be accessed as a library within Python and has a number of applications I am yet to explore.
PRONOM in summary
Thank you for reading! The PRONOM statistics work has been enjoyable to think about as always. It’s actually quite complicated to coordinate different parts of the PRONOM dataset and output something like this, but it’s fun. I also got to use a new technology in HTMX which provides this not very webby web-developer with tooling to create web-sites even with more of a back-end bias. HTMX might be something I revisit on the blog again in future.
Other PRONOM related links
More on my PRONOM work that may be of interest to readers can be found at the following links.
- PRONOM signature development utility and guides: https://ffdev.info
- Video introducing ffdev.info for PRONOM research week 2020: https://openpreservation.org/blogs/pronom-research-week-signature-development-utility-2-0-ffdev-info/
- Five Star Format Signature Development: https://openpreservation.org/blogs/five-star-file-format-signature-development/
- Using a custom Wikibase with Siegfried (talk): https://exponentialdecay.co.uk/blog/talk-using-a-custom-wikibase-with-siegfried/