Client-side file format identification and reporting pipeline with Siegfried and Demystify Lite
Demystify Lite has a new feature: Siegfried! This is thanks to Richard Lehane, and to Archives New Zealand, whose sponsorship of the feature made it possible.
Richard recently posted about this work on LinkedIn.
Siegfried and WASM
WASM (WebAssembly) is a portable binary format that enables high-performance applications to run in a web browser. That last point matters: these applications run completely “client-side”, which means client data isn’t transferred to a server and remains within your host environment. This is important when you are working with files that you don’t want to give the world wider access to, whether for security, sensitivity, or any other reason.
Siegfried, as a file format identification tool, is one such application: you will be working with a lot of files that you do not want to surrender control of to anyone else. Demystify Lite was already using PyScript, which combines WASM with a handful of other technologies to make it possible to run Python scripts client-side.
Archives New Zealand’s workflow
With Demystify Lite and Siegfried WASM, Archives New Zealand can now promote a browser-only format identification and reporting pipeline to its agencies, thus reducing the amount of sign-off required for running applications on the agency’s side.
The previous denylist work for Archives New Zealand added an important missing feature in Demystify for the browser. Agencies can create reports for accessioning archivists prior to digital transfer and the report can flag files of concern, or interest, e.g. those requiring extra care.
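To make the idea of flagging files concrete, here is a minimal sketch of checking identification results against a denylist. The structure of `DENYLIST`, the record fields, and the `flag_files` helper are all hypothetical and chosen for illustration; they are not Demystify's actual configuration format or code, and the format identifiers are placeholder values.

```python
from fnmatch import fnmatch

# Hypothetical denylist layout, for illustration only. It is not
# Demystify's real configuration format, and the IDs are placeholders.
DENYLIST = {
    "ids": {"fmt/61", "x-fmt/409"},
    "filename_patterns": ["*.tmp", "Thumbs.db"],
}


def flag_files(records, denylist):
    """Return (filename, reason) pairs for records that match the denylist."""
    flagged = []
    for record in records:
        name = record["filename"]
        # A format identifier on the denylist takes priority over name checks.
        if record["id"] in denylist["ids"]:
            flagged.append((name, f"format {record['id']} is on the denylist"))
            continue
        for pattern in denylist["filename_patterns"]:
            if fnmatch(name, pattern):
                flagged.append((name, f"filename matches {pattern}"))
                break
    return flagged


# Assumed, simplified identification results: one dict per file.
records = [
    {"filename": "minutes.docx", "id": "fmt/412"},
    {"filename": "budget.xls", "id": "fmt/61"},
    {"filename": "Thumbs.db", "id": "fmt/682"},
]
flagged = flag_files(records, DENYLIST)
```

An accessioning archivist would then only need to review the entries in `flagged` rather than the whole listing.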
The initial file format report still needed to be created offline, at least up until now!
After the denylist feature, the team asked me whether it would be possible to run DROID in a similar way to Demystify.
The request made a lot of sense and I am sure with some work DROID can be reworked to do this, but I am not very adept at Java, and WASM is also still relatively new to me.
I did, however, know of Andy Jackson’s Siegfried JS, so I suggested we take a look at it and see if Siegfried could support something more formally. If it could, we could ask an agency to use Demystify Lite to first generate a Siegfried report (optionally downloading that as an important source of truth) and then run Demystify’s analysis, all in one environment, all without transferring files anywhere.
I reached out to Richard, and given I had other priorities for Archives New Zealand, we split our tasks into Richard creating a WASM implementation for Siegfried, and me integrating it into Demystify.
I recommend taking a look at Richard’s blog plus some of the other links on that page, especially the README for the WASM implementation.
The Demystify Lite implementation is pretty straightforward and follows the template in Richard’s README to allow Siegfried to produce checksums and give users the option of scanning archive files. Once we’ve done that, we need to give Siegfried WASM a list of files to scan from a top-level directory, and a report will be created and stored in a hidden <div> element. The <div> is then used to enable download of the results, or creation of the Demystify Lite report (or both!).
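As a rough illustration of what can be done with the stored report once it is available as text, the sketch below counts identifications per format. It assumes the report is JSON with a `files` list whose entries carry a `matches` list with an `id` field, which is how I understand Siegfried's JSON output to be shaped; the sample data and the `summarize` helper are made up for this example and are not Demystify Lite's code.

```python
import json
from collections import Counter

# Made-up sample shaped like a Siegfried JSON report (assumed field names).
SAMPLE_REPORT = """
{
  "files": [
    {"filename": "docs/a.pdf", "filesize": 1024,
     "matches": [{"ns": "pronom", "id": "fmt/18"}]},
    {"filename": "docs/b.pdf", "filesize": 2048,
     "matches": [{"ns": "pronom", "id": "fmt/18"}]},
    {"filename": "docs/mystery.bin", "filesize": 12,
     "matches": [{"ns": "pronom", "id": "UNKNOWN"}]}
  ]
}
"""


def summarize(report_text):
    """Count how many matches were recorded for each format identifier."""
    report = json.loads(report_text)
    counts = Counter()
    for entry in report.get("files", []):
        for match in entry.get("matches", []):
            # Fall back to UNKNOWN if a match carries no identifier at all.
            counts[match.get("id", "UNKNOWN")] += 1
    return counts


summary = summarize(SAMPLE_REPORT)
```

In the browser the same text would come from the hidden element rather than a string literal, but the counting step would look much the same.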
Although I think the page itself could be better designed, I am pretty happy with the results, and the user experience seems good enough for a first attempt.
Give it a try: https://ross-spencer.github.io/demystify-lite/ and let me know what you think!
Enabling new approaches
Format identification and client-side checksums are a potent combination. A mechanism could be created that uses Siegfried WASM as an entry point for digital transfer, e.g. “scan files” -> “click, confirm to send to agency”. With checksums included, we can easily verify that what was sent is what was received, and in fewer steps (right now we might send a sidecar file containing checksums along with the files). Compared to current approaches, where we might do this offline and then create a specific submission ingest package file, we can reduce the number of steps needed for digital transfer quite significantly.
One of the biggest advantages of Siegfried WASM (and by proxy Demystify Lite) is that we can reduce the number of tools needed by any one person or organization. With access via a web browser, these utilities can be used by more people. Fewer tools means fewer security audits in some instances. More people are also able to look at the same tools, meaning more eyes on potential issues (security and otherwise) and potentially more bug fixes and improvements.
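The verification step described here can be sketched in a few lines. This is only an illustration under stated assumptions: files are modelled as in-memory bytes keyed by name, SHA-256 is picked arbitrarily as the checksum, and `build_manifest` and `verify_transfer` are hypothetical helpers rather than part of Siegfried or Demystify Lite.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files):
    """Sender side: map each filename to the checksum of its content."""
    return {name: sha256_of(content) for name, content in files.items()}


def verify_transfer(manifest, received):
    """Receiver side: report every file that does not match the manifest."""
    problems = {}
    for name, expected in manifest.items():
        if name not in received:
            problems[name] = "missing"
        elif sha256_of(received[name]) != expected:
            problems[name] = "checksum mismatch"
    for name in received:
        if name not in manifest:
            problems[name] = "unexpected"
    return problems


# Simulated transfer: one file is altered and one extra file turns up.
sent = {"a.txt": b"hello", "b.txt": b"world"}
manifest = build_manifest(sent)
received = {"a.txt": b"hello", "b.txt": b"w0rld", "c.txt": b"extra"}
problems = verify_transfer(manifest, received)
```

An empty `problems` dictionary would mean that what was received is what was sent.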
Next steps
For Demystify, as I continue to maintain it, I become more and more convinced of the need to improve the HTML generation engine: it’s not the greatest looking HTML and it’s not very modular. I’d like to make it possible to select sections to output. I also want to incorporate better design using CSS, which is much easier to do if Demystify Lite becomes the utility’s primary entry point.
Demystify Lite’s web design can be improved; I might need to find a better crash course on web design to make it really good, though.
I’ll continue to think about ways of leveraging Siegfried WASM. I want to consider how it can be better used in decentralization efforts. I hope others will take a look as well. I can’t wait to see what’s made from this new utility.
Thanks again to Archives New Zealand
I appreciate the opportunity to help enable this integration. It is difficult for me to spend a lot of time improving these tools without sponsorship and I appreciate Archives New Zealand recognizing this utility as an important part of their workflow and finding ways to fund its improvement.
It is always great to work alongside Richard, so that’s another positive! I also feel privileged to see the work come good and be one of its first integrators.
More information
I still have it on my TODO list to look at WASM in more detail. Unfortunately the nitty gritty of this work was outside of my remit this time around but it is a topic I hope to come back to. Looking more closely at Richard’s work here will be a good way to learn.
Take a look at Andy Jackson’s blog on some novel uses of client side technology such as this and its potential to improve web-archiving.
Demystify Lite with added Siegfried is live here; please give it a whirl. Any feedback is appreciated. The Demystify Lite repository is here and is a good place to log suggested improvements or note any bugs that you have discovered.