Golang Archives - ross spencer :: exponentialdecay.digipres

Photographed at the Whitney Museum in 2018, the image shows a print by artist Edward Meneeley from his series IBM drawings. The image is bright green and shows a component of a computer, or maybe a printer, that is stamped across the page. You can find more information at The Whitney: https://whitney.org/collection/works/7225

Shortened links? Expand them and save the URLs

Shortened links are a digital preservation and web archiving nightmare. You can imagine how they need to work:

Create a unique short code for a given (target) URL (like a hash, but far, far shorter).
Pair the short code with your URL in a database.
Create a redirect rule on the URL shortening server from the new source URL to the target URL.
Send the shortened link to the caller, e.g. shortURL.com/123badf00d
In perpetuity: continue to pay for your domain; maintain the database; look after redirect rules during server migrations; ensure duplicate short codes are not created.

A URL-shortening business in five easy steps.
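To make those steps concrete, here is a minimal sketch in Go. It is not how any real shortening service works: the Shortener type, the in-memory map standing in for the database, and the choice of a truncated SHA-256 digest as the short code are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Shortener is a hypothetical in-memory stand-in for the shortening
// service's database of short code -> target URL pairs.
type Shortener struct {
	domain string
	codes  map[string]string
}

// NewShortener returns an empty Shortener for the given domain.
func NewShortener(domain string) *Shortener {
	return &Shortener{domain: domain, codes: make(map[string]string)}
}

// Shorten derives a short code from the target URL, stores the pair, and
// returns the shortened link. If a code is already taken by a different
// target, the code is lengthened until it is unique.
func (s *Shortener) Shorten(target string) (string, error) {
	sum := sha256.Sum256([]byte(target))
	digest := hex.EncodeToString(sum[:])
	for n := 10; n <= len(digest); n++ {
		code := digest[:n]
		existing, taken := s.codes[code]
		if !taken || existing == target {
			s.codes[code] = target
			return s.domain + "/" + code, nil
		}
	}
	return "", errors.New("no free short code for target")
}

// Expand resolves a short code back to its target URL, which is the job
// the redirect rule performs on the server.
func (s *Shortener) Expand(code string) (string, bool) {
	target, ok := s.codes[code]
	return target, ok
}

func main() {
	s := NewShortener("shortURL.com")
	link, err := s.Shorten("https://whitney.org/collection/works/7225")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(link)
}
```

The point of the sketch is the map: once it is gone, Expand has nothing to return, and every short link that was ever handed out stops resolving.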

But what does that mean for digital preservation?

Safe Crash Emergency Glass at the Ostia Antica, Rome. The image shows the reflection of one of the Ostia Antica's statues as well as providing some textual information in Italian about how to use the glass. The text Safe Crash has been modified to say Safe Text for the purposes of this blog.

Porting SafeText and analyzing digital content with Apache Tika

Last year I wrote about pitfalls in modern journalism, especially with regards to receiving documents and information from whistleblowers without offering them adequate protection.

The tl;dr is that you, as a whistleblower, need to protect yourself; and you, as an editor or journalist, need to protect your whistleblowers.

Steganographic fingerprints might be one method adopted to detect someone leaking information. Steganographic characters replace common textual characters with unusual but hard-to-detect variants: they either look the same to the human eye or are actually invisible. Using a tool called SafeText by David Jacobson we can identify these hidden fingerprints in the content that you share.
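As a rough illustration of the idea, and not of how SafeText itself is implemented, the sketch below scans a string for a handful of invisible or look-alike characters. The suspects table, the Finding type and the Scan function are hypothetical names of my own, and the table is nowhere near complete.

```go
package main

import "fmt"

// suspects is a small, illustrative table of characters that are invisible
// or easily mistaken for a common Latin letter. A real tool would need a
// far larger table.
var suspects = map[rune]string{
	'\u200b': "zero width space",
	'\u200c': "zero width non-joiner",
	'\u200d': "zero width joiner",
	'\u2060': "word joiner",
	'\ufeff': "zero width no-break space",
	'\u00a0': "no-break space",
	'\u0430': "Cyrillic small letter a (resembles Latin a)",
	'\u0435': "Cyrillic small letter ie (resembles Latin e)",
	'\u043e': "Cyrillic small letter o (resembles Latin o)",
}

// Finding records one suspicious character and where it was seen.
type Finding struct {
	Offset int    // byte offset of the character in the input
	Rune   rune   // the character itself
	Name   string // description from the suspects table
}

// Scan walks the text rune by rune and reports every character that
// appears in the suspects table.
func Scan(text string) []Finding {
	var findings []Finding
	for offset, r := range text {
		if name, ok := suspects[r]; ok {
			findings = append(findings, Finding{Offset: offset, Rune: r, Name: name})
		}
	}
	return findings
}

func main() {
	// "leaked" with a zero width space and a Cyrillic "e" hidden inside.
	for _, f := range Scan("le\u200bak\u0435d") {
		fmt.Printf("offset %d: U+%04X %s\n", f.Offset, f.Rune, f.Name)
	}
}
```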

I firmly believe we can find clues about what is important to preserve, or learn to preserve, when we analyse the content of the digital record and not just the (file) format of the digital record.

A file can contain many different features and these are all challenges to its future interpretation, and thus preservation.

I wanted to use SafeText in some of my other non-Python tooling and so I decided to port the code to Golang as a composable module and binary.

By coincidence at the time I started writing this I had also just written about revisiting tikalinkextract and so I thought I would write this small explanation about how you might combine Tika and SafeText to perform some content analysis of your own.

Who knows, maybe we will find a conspiracy. Maybe we’ll find secret codes in our own digital records. Maybe we’ll learn something new about our records…

Let's have a look at putting Tika and SafeText together and see where it goes.

Image of a children's learn to code textbook by Usborne Books. The page shows a snippet from a computer game in BASIC called "Escape". The illustration is of three menacing looking Cyborgs.

Informed consent: considering steganographic techniques to fingerprint Generative AI output

Artificial intelligence (AI) is a polarizing topic. For every reasoned assessment of the technology and its potential to make some of our smaller, onerous, or more repetitive tasks easier, there are probably 100 reactive pieces predicting some radical overhaul of societal norms, from the service industry receiving new intakes of out-of-work software developers to laypeople taking on roles traditionally occupied by those with a college education, if they just start “asking their AI the right questions” ¯\_(ツ)_/¯

The amount of AI propaganda is draining, and the reaction is spread across the board too: some cheerleading, some decrying, and plenty taking their time to offer skilled and nuanced rebuttals, or suggestions for improvements.

I find myself largely trying to stay out of the conversations. Much like the blockchain conversations of 8 years ago, it will take another half decade for the hype cycle to plateau before we see where it can truly complement our work.

One part of the conversation that is increasingly hard to ignore is being informed about when AI has been used in the generation of text or images. It is the property of knowing, or having the tools to know, that I feel is the most important.

How can we be better informed about when AI is used, so that we are better prepared, as consumers, to receive and understand content?

In this blog I want to explore the potential for steganography techniques to be used in the output of AI to fingerprint content and provide a way for front-end mechanisms to identify it, as we might identify file formats using magic numbers, so that users can be given the chance of informed consent: the opportunity to opt in to, or out of, engaging with AI content.
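To give the magic number analogy a shape, here is a minimal sketch in Go. Everything in it is an assumption for illustration: there is no agreed fingerprint for AI output that I know of, and the marker sequence and the Stamp, HasMarker and Strip functions are names I have made up.

```go
package main

import (
	"fmt"
	"strings"
)

// marker is a hypothetical fingerprint: a fixed run of zero-width
// characters placed at the start of generated text, in the same way a
// magic number sits at the start of a file. It is not a real standard.
const marker = "\u200b\u200c\u200b\u200d"

// Stamp prefixes text with the marker unless it is already present.
func Stamp(text string) string {
	if HasMarker(text) {
		return text
	}
	return marker + text
}

// HasMarker reports whether text begins with the marker, which is the
// check a front-end could run before deciding how to present content.
func HasMarker(text string) bool {
	return strings.HasPrefix(text, marker)
}

// Strip removes a leading marker, returning the visible text only.
func Strip(text string) string {
	return strings.TrimPrefix(text, marker)
}

func main() {
	out := Stamp("A generated paragraph.")
	fmt.Println(HasMarker(out), HasMarker("A human paragraph."))
	fmt.Printf("%q\n", Strip(out))
}
```

A prefix like this is fragile, since anything that strips zero-width characters or retypes the text removes it, so treat it as a way of thinking about detection rather than a robust scheme.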

"Bei der Buche", a landscape architectural installation by landscape architect and photographer Karina Raeck. Created in 1993 in the Wartberg area north-east of Stuttgart.

wikidata + mediawiki = wikidata + provenance == wikiprov

Today I want to showcase a Wikidata proof of concept that I developed as part of my work integrating Siegfried and Wikidata.

That work is wikiprov, a utility to augment Wikidata results in JSON with the Wikidata revision history.

For siegfried it means that we can showcase the source of the results being returned by an identification without having to go directly back to Wikidata, which might mean more exposure for individuals contributing to Wikidata. We also provide access to a standard permalink where records contributing to a format identification are fixed at their last edit. Because Wikidata is more mutable than a resource like PRONOM, this gives us the best chance of understanding differences in results if we are comparing siegfried+Wikidata results side-by-side.
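For a sense of what revision history and a permalink look like at the URL level, here is a small sketch in Go that builds a MediaWiki API revisions query and an oldid permalink for a Wikidata entity. It is not the wikiprov API: RevisionsURL and Permalink are hypothetical helper names, and the revision ID in main is a made-up value.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// wikidata is the base path for Wikidata's MediaWiki endpoints.
const wikidata = "https://www.wikidata.org/w/"

// RevisionsURL builds a MediaWiki API query asking for the most recent
// revisions of an entity page such as "Q42", limited to `limit` results.
func RevisionsURL(qid string, limit int) string {
	q := url.Values{}
	q.Set("action", "query")
	q.Set("format", "json")
	q.Set("prop", "revisions")
	q.Set("titles", qid)
	q.Set("rvlimit", strconv.Itoa(limit))
	q.Set("rvprop", "ids|timestamp|user|comment")
	return wikidata + "api.php?" + q.Encode()
}

// Permalink builds a link to an entity page fixed at one revision ID, so
// the record can be viewed as it was at that edit.
func Permalink(qid string, revID int64) string {
	q := url.Values{}
	q.Set("title", qid)
	q.Set("oldid", strconv.FormatInt(revID, 10))
	return wikidata + "index.php?" + q.Encode()
}

func main() {
	fmt.Println(RevisionsURL("Q42", 5))
	// 123456 is a placeholder revision ID, not a real one.
	fmt.Println(Permalink("Q42", 123456))
}
```

The user and comment fields returned by a query like the first one are where contributors become visible, and the second URL is the kind of fixed reference that makes side-by-side comparisons possible.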

I am interested to hear your thoughts on the results of the work. Let's go into more detail below.

A tree at sunset photographed from the train on the Bodensee in Southern Germany

Versioning as memory?

So, it turns out my theme of the moment is code hygiene (or maybe memory?).

Today I am thinking about versioning, especially in relation to its impact on digital preservation; both software preservation and the impact of versions on long-term preservation efforts in other contexts.

Client-side file format identification and reporting pipeline with Siegfried and Demystify Lite

Thanks to the sponsorship of Archives New Zealand, and to Richard Lehane for his great coding expertise and his collaboration, Demystify Lite has a new feature: Siegfried!!

Richard recently posted about this work on LinkedIn, but let's look at this effort in more detail below.

Moonshine: a small part of the file format analyst’s toolkit

Today I released Moonshine 2.0.0. Moonshine is a file format discovery tool I developed a few years ago. A…

Programming things: Giving up… (or at least getting bitten by semver and Golang’s unforgiving nature, and wanting to!)

There are good days, and there are bad days when coding, and you never stop learning. Today was not a…

Abstract representation of a Key Value store

Key :: Value Access Language (KVAL) for BoltDB and Golang

With some forced downtime as the effects of the Kaikōura earthquake are felt here on the North Island, with the shutdown of Archives New Zealand, what better way to spend it than creating a new grammar and parser for key-value databases? I have spent the last few weeks developing a specification for a Key-Value Access Language (KVAL) and implementing a binding for it for Golang’s BoltDB. I hope it will be of interest to folks, but let’s take a look at it in more detail below.

Category: Golang
