ross spencer :: exponentialdecay.digipres :: blog

Safe Crash Emergency Glass at the Ostia Antica, Rome. The image shows the reflection of one of the Ostia Antica's statues as well as providing some textual information in Italian about how to use the glass. The text Safe Crash has been modified to say Safe Text for the purposes of this blog.

Porting SafeText and analyzing digital content with Apache Tika

Last year I wrote about pitfalls in modern journalism, especially with regard to receiving documents and information from whistleblowers without offering them adequate protection.

The tl;dr is that you, as a whistleblower, need to protect yourself; and you, as an editor or journalist, need to protect your whistleblowers.

Steganographic fingerprints might be one method adopted to detect someone leaking information. Steganographic characters replace common textual characters with unusual but hard-to-detect variants, i.e. characters that look the same to the human eye or are actually invisible. Using a tool called SafeText by David Jacobson we can identify these hidden fingerprints in the content that you share.
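To make the idea concrete, here is a minimal sketch of fingerprint detection in Python. It is not SafeText's implementation or API: the function names and the tiny `HOMOGLYPHS` table are assumptions made for illustration, and a real tool would cover far more characters.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table; real tools use far larger ones.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u2009": " ",  # THIN SPACE
    "\u00a0": " ",  # NO-BREAK SPACE
}


def find_suspects(text):
    """Return (index, character name, reason) for each suspicious character."""
    suspects = []
    for index, char in enumerate(text):
        name = unicodedata.name(char, "UNNAMED")
        if unicodedata.category(char) == "Cf":
            # Format characters (zero-width space, joiners, ...) render as nothing.
            suspects.append((index, name, "invisible"))
        elif char in HOMOGLYPHS:
            suspects.append((index, name, "homoglyph"))
    return suspects


def scrub(text):
    """Drop invisible characters and map known lookalikes to their plain form."""
    cleaned = []
    for char in text:
        if unicodedata.category(char) == "Cf":
            continue
        cleaned.append(HOMOGLYPHS.get(char, char))
    return "".join(cleaned)


# A word carrying a zero-width space and a Cyrillic "e".
marked = "le\u200bak\u0435d"
for suspect in find_suspects(marked):
    print(suspect)
print(scrub(marked))
```

The two checks mirror the two cases described above: characters that are invisible, and characters that only look like the ones you expect.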

I firmly believe we can find clues about what is important to preserve, or learn to preserve, when we analyse the content of the digital record and not just the (file) format of the digital record.

A file can contain many different features, and each of these is a challenge to its future interpretation, and thus its preservation.

I wanted to use SafeText in some of my other non-Python tooling and so I decided to port the code to Golang as a composable module and binary.

By coincidence at the time I started writing this I had also just written about revisiting tikalinkextract and so I thought I would write this small explanation about how you might combine Tika and SafeText to perform some content analysis of your own.

Who knows, maybe we will find a conspiracy. Maybe we’ll find secret codes in our own digital records. Maybe we’ll learn something new about our records…

Let's have a look at putting Tika and SafeText together and see where it goes.
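As a rough sketch of what that combination could look like, the Python below asks a Tika server for the plain text of a file and then counts the non-ASCII characters in it. The assumptions are loud ones: a Tika server is running locally on its default port (9998), the helper names are hypothetical, and the character count is only a stand-in for the SafeText step, not SafeText itself.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(payload, url=TIKA_URL):
    """Prepare a PUT request asking Tika for a plain-text rendition of payload."""
    return urllib.request.Request(
        url, data=payload, method="PUT", headers={"Accept": "text/plain"}
    )


def extract_text(path):
    """Send the file at path to the Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def non_ascii_report(text):
    """Map each non-ASCII character to the number of times it appears."""
    counts = {}
    for char in text:
        if ord(char) > 127:
            counts[char] = counts.get(char, 0) + 1
    return counts
```

With a server running, `non_ascii_report(extract_text("report.pdf"))` (a hypothetical file name) gives a first look at which unusual characters a document contains.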

A homage by Will Burrows to the Ricardian Socialism motto: "The boss needs you, you don't need them -- Labor is entitled to all it creates". The image shows Quark, O'Brien, and Leeta discussing this while sharing a drink in the bar. The poster pays homage to the poster circa 1968 with the same phrase.

Maintenance begins at creation, so why are we not creating better?

The beats are the same. You work for government, or academia (let’s face it, that’s probably where 90% of the work is); you have a deliverable; you save it; you print to PDF; you store it on an institutional repository with some metadata (or Zenodo, OSF or equivalent) and it’s done.

There’s a small chance that it’s FAIR (Findable, Accessible, Interoperable, Reusable), right? It has metadata that can be discovered by an audience looking for it and can be indexed by search engines. The data is potentially accessible if published correctly. But PDFs are not particularly interoperable or easily converted, and they aren’t really designed for reuse, even if tools like Apache Tika help ease the burden of extracting artifacts. It’s just a PDF, why are we even talking about FAIR? There begins a story…

The beats are the same, and yet we work in digital preservation and our backgrounds are in GLAM or software. Why do we want to shoot ourselves in the foot? Why are we not using our skills to create better?

Portrait of me in my Orcfax tee, featuring our company mascot, Echo

Winding down at Orcfax: a retrospective

With the recent announcement that Orcfax is heading into operational mode, it’s a bittersweet moment that means our adventure in…

Does the future look bright? Or are we entering digital dark times? Image is a photo of a poster taken in Ravensburg, September 2021. The original image is from Benni Erbsland from the Bewegung für Radikale Empathie (Movement for Radical Empathy) based out of Stuttgart: https://bewegung-fuer-radikale-empathie.de/benni-erbsland/

Digital dark times: Salaries in digital preservation

The Serpentine is one of the world’s most renowned art galleries. Their exhibitions have featured artists as varied as Gerhard Richter, Damien Hirst, and Marina Abramović. They don’t hold a permanent collection; instead, they provide a space for temporary collections and an annual pavilion, designed by luminaries such as Zaha Hadid, Frank Gehry, and Ai Weiwei.

Given a recent job posting it looks like they are looking at maintaining their memory better and branching out into digital preservation.

Here’s the kicker — its salary band is GBP 35,000 to GBP 38,000. So it must be an entry level position, especially in London, right?

Well, let’s see what they want you to do for that price tag…

Image from July 2021 depicting the fields around Ravensburg in Southern Germany. There is a sign with some graffiti on which depicts a car sliding off the road, presumably because there is very little curb and it's likely you will career into the grass if you're not careful.

When you can’t pay for things the currency of payment is psychic…

Contributing back to the commons in digital preservation hasn’t been for everyone.

We know the famous XKCD that touches on the underappreciated work of maintainers in obscurity. When you, or your institutions, or services are using free and open source software, or other information and data in the commons, and you’re not contributing back, you’re perpetuating this, and what’s more, there’s a virtuous cycle that we’re missing out on.

I read something the other day and it felt like a red flag.

The Painter Goblin becomes corporeal by having its prints converted from digital to canvas in real life. In this image, the Painter Goblin canvases are bathed in sunlight provided by a west-facing window around sunset. The grid used to display the Painter Goblin in a salon style is shadowed by the window frame onto the wall. The light in this image has been enhanced to increase its saturation to mirror the vibrancy of The Painter Goblin's original image.

The Painter Goblin: Becoming Corporeal

When you move country you have to be prepared to change quite a lot about your life. Back at the end of 2020, apart from literally everything else going on, my partner and I also moved from Canada to Germany.

For me, this was my fifth or so international move (including shorter temporary stays) in as many years.

Being able to pick up sticks and move like that means living a drastically minimized life. Most of the things you have fit in a suitcase. Most of the things you have are small, and largely not overly whimsical. Sure, you can fit a few treasures into your bag, but you learn to value small ones, not things you might otherwise use to decorate an entire apartment!!

So, what do you do when you do have an apartment to decorate?

You ask the best known painter in your family to conjure some magic, The Painter Goblin!

Image shows two layered waveforms, one a corrupt waveform and the other a good original. The corrupt form is in red and the uncorrupt one is green.

Revisiting bsdiff as a tool for digital preservation

I introduced bsdiff in a blog in 2014. bsdiff compares the differences between two files, e.g. broken_file_a and corrected_file_b and creates a patch that can be applied to broken_file_a to generate a byte-for-byte match for corrected_file_b.

On the face of it, in an archive, we probably only care about corrected_file_b, so why would we care about a technology that patches a broken file?

In all of the use-cases we can imagine the primary reasons are cost savings and removing redundancy in file storage or transmission of digital information. In one very special case we can record the difference between broken_file_a and corrected_file_b and give users a totally objective method of recreating corrected_file_b from broken_file_a providing 100% verifiable proof of the migration pathway taken between the two files.
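The verification idea can be sketched in a few lines of Python. To stay within the standard library this uses `difflib` to build a simple copy-and-insert delta rather than the bsdiff algorithm itself, so treat the patch format and function names as hypothetical. What it shares with bsdiff is the property that matters here: applying the patch to broken_file_a must reproduce corrected_file_b byte for byte, which a checksum confirms.

```python
import hashlib
from difflib import SequenceMatcher


def make_patch(broken, corrected):
    """Describe corrected as copies from broken plus literal replacement bytes."""
    patch = []
    matcher = SequenceMatcher(None, broken, corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            patch.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            patch.append(("data", corrected[j1:j2]))
        # "delete" needs no entry: those source bytes are simply never copied.
    return patch


def apply_patch(broken, patch):
    """Rebuild the corrected bytes from the broken bytes and a patch."""
    out = bytearray()
    for op in patch:
        if op[0] == "copy":
            out += broken[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def sha256(data):
    return hashlib.sha256(data).hexdigest()


# Stand-ins for the contents of broken_file_a and corrected_file_b.
broken_file_a = b"The quick brown fax jumps over the lazy dog"
corrected_file_b = b"The quick brown fox jumps over the lazy dog"

patch = make_patch(broken_file_a, corrected_file_b)
rebuilt = apply_patch(broken_file_a, patch)
print(sha256(rebuilt) == sha256(corrected_file_b))
```

Storing the patch alongside the checksum of corrected_file_b gives anyone holding broken_file_a a way to repeat the migration and check the outcome for themselves.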

Turning NASA Wake-up Calls into data

For a while back then I was into space flight again. Scientists, science communicators, and engineers were all excited for a new era of rocket launches and the potential unification of the human race as we look towards the future.

During that time I discovered Colin Fries’ work in the NASA History Division to document all NASA “Wake-up calls”. A wake-up call is simply a piece of music used to wake astronauts on missions, a different piece of music, daily, for the duration of the flight.

Take, for example, the last Space Shuttle mission (Space Transportation System) STS-135; it was in flight for 13 days, and the wake-up call on day one was Coldplay’s Viva la Vida, while on day 13 it was Kate Smith singing God Bless America.

As a huge music buff who has the radio or music television on 18 hours a day, I really wanted to delve into this further. While Colin’s work is great, it’s just a PDF file (@wtfpdf). A PDF is not an ideal file format for querying data and gleaning new insights. So, while I wanted to explore it, I first decided to turn it into a true dataset. The result was a set of resources: a website, a JSON file, a CSV, and an SQLite database, each of which is more functional and more maintainable over time.
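To illustrate why structured formats are easier to query than a PDF, here is a small Python sketch that writes the two STS-135 wake-up calls mentioned above to JSON, CSV, and SQLite. The field names, table name, and helper functions are hypothetical and are not taken from the published dataset.

```python
import csv
import io
import json
import sqlite3

# Two rows taken from the STS-135 example above; the field names are hypothetical.
calls = [
    {"mission": "STS-135", "flight_day": 1, "artist": "Coldplay", "song": "Viva la Vida"},
    {"mission": "STS-135", "flight_day": 13, "artist": "Kate Smith", "song": "God Bless America"},
]
FIELDS = ["mission", "flight_day", "artist", "song"]


def to_json(rows):
    return json.dumps(rows, indent=2)


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_sqlite(rows):
    """Load the rows into an in-memory SQLite database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wakeup_calls (mission TEXT, flight_day INTEGER, artist TEXT, song TEXT)"
    )
    conn.executemany(
        "INSERT INTO wakeup_calls VALUES (:mission, :flight_day, :artist, :song)", rows
    )
    conn.commit()
    return conn


# Once the data is in a database, a question becomes a one-line query.
conn = to_sqlite(calls)
last_song = conn.execute(
    "SELECT song FROM wakeup_calls WHERE mission = ? ORDER BY flight_day DESC LIMIT 1",
    ("STS-135",),
).fetchone()[0]
print(last_song)
```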

Let's take a look at the results and https://nasawakeupcalls.github.io below!

Crayola's 1997 Techno Brite crayon set with color names created to market the Crayola website, including names featured here such as World Wide Web Yellow, Point and Click Green, and Cyber Space Orange

Looking after your URLs: tikalinkextract eight years on

We might not have a second life, but what if I told you there was a second internet? Not the deep web, but another web that we engage with nearly every day?

Think about it, that QR code you scanned for more information? That payment link you followed on your electricity bill? The website you’re told to visit at the end of a television ad?

The antipodes of the internet are these terminal endpoints, material and not necessarily material objects that represent the end of the freely navigable web — the QR code on a concert poster is the web printed onto the physical world. There is every chance it will be scanned and followed by someone from a mobile device, but it’s a transient object, something that will exist for a short amount of time, and then disappear into the palimpsest of the poster board or wall it was pasted on until it eventually disappears.

This is part of the materiality of the internet that has long fascinated me. Perhaps it comes from being a student of material culture, but if we look around, we see the Internet everywhere!

A slide excerpt from my presentation Declarative Programming for Digital Preservationists showing how the network effect can be embraced and side-effects are reduced in the declarative paradigm.

Declarative programming for Digital Preservationists @ NTTW8

Just released on the No Time to Wait (NTTW) YouTube channel is my presentation from NTTW8 in Karlsruhe, Germany. (Slides also available here).

The presentation follows up on my proposal for iPRES 2024 and allowed me to present parts of what was, in the end, a pretty significant paper (in terms of word count).

Some of my reflections on the presentation are below.

Journal of a Digital Preservation Quartermaster
