Moonshine: a small part of the file format analyst’s toolkit

Today I released Moonshine 2.0.0.

Moonshine is a file format discovery tool I developed a few years ago. It is a small “hack” (hence, in part, the name moonshine) on top of the UK Web Archive’s Shine interface.

The UK Web Archive (UKWA) has indexed the first four bytes (ffb) of every file it has archived. These four bytes can be searched across the entire UK Web collection and the results returned to the caller.

For example, to search for GIFs in the UKWA you can use the following URL:

https://www.webarchive.org.uk/shine/search?page=1&query=content_ffb:47494638&sort=crawl_date&order=asc

Today you’ll be presented with 133,576,903 results. That’s a lot to choose from!
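The search URL above can be assembled programmatically. The following is a minimal Python sketch, not Moonshine's own code; the function name `shine_search_url` is hypothetical, and the parameters are simply those visible in the example URL.

```python
from urllib.parse import urlencode

SHINE_SEARCH = "https://www.webarchive.org.uk/shine/search"


def shine_search_url(ffb: str, page: int = 1) -> str:
    """Build a Shine search URL for a first-four-bytes (ffb) hex string."""
    ffb = ffb.strip().lower()
    if len(ffb) != 8 or any(c not in "0123456789abcdef" for c in ffb):
        raise ValueError("ffb must be exactly eight hexadecimal characters")
    params = {
        "page": page,
        "query": f"content_ffb:{ffb}",
        "sort": "crawl_date",
        "order": "asc",
    }
    # safe=":" leaves the colon in the query unescaped, as in the URL above.
    return f"{SHINE_SEARCH}?{urlencode(params, safe=':')}"


print(shine_search_url("47494638"))
```

Changing the `page` argument moves through the result pages in the same way as editing `page=1` in the browser.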

Provided they have been captured by the UKWA, it is possible to get sample files for any format whose first four bytes you know. Famously, many old Microsoft Office files begin with D0CF11E0.
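If you already have one example of a format to hand, its first four bytes can be read straight from the file. This is a small Python sketch; the helper name `first_four_bytes` is hypothetical and is not part of Moonshine.

```python
def first_four_bytes(path: str) -> str:
    """Return the first four bytes of a file as an uppercase hex string."""
    with open(path, "rb") as handle:
        # Files shorter than four bytes simply yield a shorter string.
        return handle.read(4).hex().upper()
```

The resulting string is in the same form as the 47494638 and D0CF11E0 values used in this post.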

If you take the UK Web Archive’s interface and think of it like an API (it is!) you can scale this approach to create a tool that can grab samples of different file formats that a digital preservationist may be researching.

Moonshine does just this, simplifying the UKWA interface to perform one thing “well” (I hope).

That GIF search again looks as follows:

moonshine --ffb 47494638

2023/09/08 15:38:55 searching Shine@UKWA
2023/09/08 15:38:55 created URL: https://www.webarchive.org.uk/shine/search?page=1&query=content_ffb:47494638&sort=crawl_date&order=asc
2023/09/08 15:38:55 pinging URL: https://www.webarchive.org.uk/shine/search?page=1&query=content_ffb:47494638&sort=crawl_date&order=asc
2023/09/08 15:38:56 files discovered: '133576903'
2023/09/08 15:38:56 pages available: '13357691'
2023/09/08 15:38:56 setting pageCount ('13357691') max to: 1000 (solrMaxPages)
2023/09/08 15:38:56 argument `-page 1` has no effect when random (default) is selected
2023/09/08 15:38:56 created URL: https://www.webarchive.org.uk/shine/search?page=474&query=content_ffb:47494638&sort=crawl_date&order=asc
2023/09/08 15:39:00 returning file: 8 from page: 474
http://web.archive.org/web/19961019001918/http://www.ch.ic.ac.uk:80/ectoc/echet96/papers/026/rai9.gif

This just returns a random file from one of the many results available.
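The numbers in the log above hint at the arithmetic involved: 133,576,903 files are reported as 13,357,691 pages, which suggests ten results per page, and the page count is then capped at 1,000. The Python sketch below reproduces that arithmetic and a random pick under those assumptions. It is not taken from Moonshine's source, and the 1-based position within a page is a guess.

```python
import random

RESULTS_PER_PAGE = 10  # assumption inferred from the log: 133,576,903 files -> 13,357,691 pages
SOLR_MAX_PAGES = 1000  # the cap the log reports as solrMaxPages


def page_count(total_results: int) -> int:
    """Number of result pages, rounding up for a partly filled last page."""
    return (total_results + RESULTS_PER_PAGE - 1) // RESULTS_PER_PAGE


def pick_random_result(total_results: int, rng: random.Random) -> tuple[int, int]:
    """Pick a random page within the cap, then a random position on that page."""
    if total_results < 1:
        raise ValueError("no results to choose from")
    pages = min(page_count(total_results), SOLR_MAX_PAGES)
    page = rng.randint(1, pages)
    # The last page may hold fewer than RESULTS_PER_PAGE results.
    on_page = min(RESULTS_PER_PAGE, total_results - (page - 1) * RESULTS_PER_PAGE)
    position = rng.randint(1, on_page)  # 1-based position is an assumption
    return page, position


print(page_count(133_576_903))  # 13357691, matching the log
print(pick_random_result(133_576_903, random.Random()))
```

Because results are sorted by crawl date ascending and only the first 1,000 pages are reachable, a pick made this way comes from the earliest part of a very large result set.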

To download this at the same time as the request to the UK Web Archive is made (on Linux) one can do the following:

moonshine --ffb 47494638 | xargs wget

There are many other options described in the README.md, including how to get the first five pages of results, using xargs wget to download those all in one go.

Sampling mode!

Moonshine 2.0.0 introduces a sampling mode that looks at the total number of resources available and then returns 20 files from across the distribution of pages, giving file format researchers a random sample of the objects with a matching ffb.

This time, let’s try an example with PNG files:

moonshine --ffb 89504E47 --sample | xargs wget

2023/09/08 16:43:59 searching Shine@UKWA
2023/09/08 16:43:59 created URL: https://www.webarchive.org.uk/shine/search?page=1&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:43:59 pinging URL: https://www.webarchive.org.uk/shine/search?page=1&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:00 files discovered: '19853902'
2023/09/08 16:44:00 pages available: '1985391'
2023/09/08 16:44:00 setting pageCount ('1985391') max to: 1000 (solrMaxPages)
2023/09/08 16:44:00 returning a sampled list...
2023/09/08 16:44:00 distribution (pages): ‖‖‖‖‖|‖‖‖‖‖‖‖‖‖‖‖|‖|‖‖‖‖||‖|‖‖‖|‖‖||‖‖‖‖‖‖‖‖‖|‖|‖‖‖‖|‖‖‖‖‖|‖‖|‖‖‖|‖‖‖‖|‖‖‖|‖‖‖‖‖‖‖‖‖‖‖‖‖‖‖|‖‖‖‖|‖‖‖|‖‖‖‖‖‖‖‖‖‖‖‖‖‖‖‖‖‖‖‖
2023/09/08 16:44:00 get page 48
2023/09/08 16:44:00 created URL: https://www.webarchive.org.uk/shine/search?page=48&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:02 get page 160
2023/09/08 16:44:02 created URL: https://www.webarchive.org.uk/shine/search?page=160&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:03 get page 164
2023/09/08 16:44:03 created URL: https://www.webarchive.org.uk/shine/search?page=164&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:05 get page 203
2023/09/08 16:44:05 created URL: https://www.webarchive.org.uk/shine/search?page=203&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:07 get page 206
2023/09/08 16:44:07 created URL: https://www.webarchive.org.uk/shine/search?page=206&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:09 get page 218
2023/09/08 16:44:09 created URL: https://www.webarchive.org.uk/shine/search?page=218&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:10 get page 245
2023/09/08 16:44:10 created URL: https://www.webarchive.org.uk/shine/search?page=245&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:12 get page 264
2023/09/08 16:44:12 created URL: https://www.webarchive.org.uk/shine/search?page=264&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:14 get page 265
2023/09/08 16:44:14 created URL: https://www.webarchive.org.uk/shine/search?page=265&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:15 get page 355
2023/09/08 16:44:15 created URL: https://www.webarchive.org.uk/shine/search?page=355&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:17 get page 370
2023/09/08 16:44:17 created URL: https://www.webarchive.org.uk/shine/search?page=370&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:19 get page 407
2023/09/08 16:44:19 created URL: https://www.webarchive.org.uk/shine/search?page=407&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:21 get page 456
2023/09/08 16:44:21 created URL: https://www.webarchive.org.uk/shine/search?page=456&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:23 get page 474
2023/09/08 16:44:23 created URL: https://www.webarchive.org.uk/shine/search?page=474&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:24 get page 501
2023/09/08 16:44:24 created URL: https://www.webarchive.org.uk/shine/search?page=501&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:26 get page 541
2023/09/08 16:44:26 created URL: https://www.webarchive.org.uk/shine/search?page=541&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:28 get page 579
2023/09/08 16:44:28 created URL: https://www.webarchive.org.uk/shine/search?page=579&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:30 get page 722
2023/09/08 16:44:30 created URL: https://www.webarchive.org.uk/shine/search?page=722&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:32 get page 764
2023/09/08 16:44:32 created URL: https://www.webarchive.org.uk/shine/search?page=764&query=content_ffb:89504e47&sort=crawl_date&order=asc
2023/09/08 16:44:34 get page 798
2023/09/08 16:44:34 created URL: https://www.webarchive.org.uk/shine/search?page=798&query=content_ffb:89504e47&sort=crawl_date&order=asc
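The log above visits 20 distinct pages in ascending order, all within the 1,000-page cap. One simple way to get a spread like that is to draw page numbers without replacement and sort them. The Python sketch below does exactly that; it is an illustration of the idea only, and Moonshine's actual selection logic may well differ.

```python
import random

SAMPLE_SIZE = 20  # the number of files the sampling mode is described as returning


def sample_pages(pages_available: int, rng: random.Random, size: int = SAMPLE_SIZE) -> list[int]:
    """Choose up to `size` distinct page numbers, returned in ascending order."""
    if pages_available < 1:
        return []
    # Never ask for more pages than exist.
    size = min(size, pages_available)
    return sorted(rng.sample(range(1, pages_available + 1), size))


print(sample_pages(1000, random.Random()))
```

Drawing without replacement guarantees that no page is fetched twice, and sorting keeps the requests in the same crawl-date order as the results themselves.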

We can make a tiny gallery of the results!

montage *.png -mode Concatenate -tile 6x5 montage.png

Or you can continue on your file format research journey!

PRONOM signature development

If one is researching new file formats, the samples to hand may tend to come from a uniform provenance. When developing new file format signatures with a view to submitting them to PRONOM, Moonshine helps provide better variability for testing. If you download 20 files with a matching ffb from the UK Web Archive and try your PRONOM-quality signature against them, any that do not match show you where the signature can be improved to cover a wider range of samples.

Additionally, you can share the UK Web Archive links with the PRONOM team so that they can look at a common sample file. This is especially useful if your own files come from a closed collection. For bonus points, you can share some of the links you find on the Just Solve It File Formats Wiki, so that users of that site can download the same samples as well.
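As a rough illustration of that testing step, the Python sketch below splits a directory of downloaded samples into files that do and do not begin with a candidate byte sequence. It is a deliberate simplification: real PRONOM signatures can include offsets, wildcards and end-of-file sequences, none of which are handled here. The function name `split_by_signature` and the `samples` directory are hypothetical.

```python
from pathlib import Path


def split_by_signature(sample_dir: str, signature_hex: str) -> tuple[list[str], list[str]]:
    """Return (matched, unmatched) file names for a beginning-of-file byte sequence."""
    signature = bytes.fromhex(signature_hex)
    matched, unmatched = [], []
    for path in sorted(Path(sample_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            header = handle.read(len(signature))
        # Only a plain byte prefix is compared; no offsets or wildcards.
        (matched if header == signature else unmatched).append(path.name)
    return matched, unmatched


if __name__ == "__main__" and Path("samples").is_dir():
    # 89504E470D0A1A0A is the full eight-byte PNG signature.
    hits, misses = split_by_signature("samples", "89504E470D0A1A0A")
    print(f"matched: {len(hits)}, unmatched: {misses}")
```

Any file that shares the four-byte ffb but lands in the unmatched list is worth a closer look: it is either a different format or a variant the candidate signature does not yet cover.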

Further notes on Moonshine 2.0.0

1.0.0 and 2.0.0 have been released in quick succession. The releases stabilize the code somewhat and increase the number of results that developers have access to. Previously there were various restrictions imposed by the UK Web Archive’s Solr index that I hadn’t got on top of. Hopefully I have done that here and the tool can flourish a little more.

Let me know how it goes if you develop using Moonshine, and perhaps share your workflow sometime in your own writing.

Check out Moonshine on GitHub and look at the releases for a release that matches your operating system: https://github.com/exponential-decay/moonshine

Moonshine can also be found on the COPTR Wiki here: https://coptr.digipres.org/index.php/Moonshine
