The sensitivity index: Corrupting Y2K
In December I asked “What will you bitflip today?” Not long after, Johan’s (@bitsgalore) Digital Dark Age Crew released its long-lost hidden single Y2K — well, I couldn’t resist corrupting it.
Fixity is an interesting property enabled by digital technologies. Checksums allow us to demonstrate mathematically that a file has not been changed. An often cited definition of fixity is:
Fixity, in the preservation sense, means the assurance that a digital file has remained unchanged, i.e. fixed — Bailey (2014)
It’s very much linked to the concept of integrity, which UNESCO defines as:
The state of being whole, uncorrupted and free of unauthorized and undocumented changes.
Integrity is massively important at this time in history. It gives us the guarantees we need that digital objects we work with aren’t harboring their own sinister secrets in the form of malware and other potentially damaging payloads.
These values are contingent on bit-level preservation; the field of digital preservation largely assumes that we will be able to look after our content without losing information. As feasible as this may be these days, what happens if we do lose some information? Where does authenticity come into play?
Through corrupting Y2K, I took time to reflect on integrity versus authenticity, as well as create some interesting glitched outputs. I also uncovered what may be the first audio that reveals what the Millennium Bug itself may have sounded like! Keen to hear it? Read on to find out more.
Glitching Y2K
First, let me begin, as I will end, by highlighting the musical genius that is the Digital Dark Age Crew’s Y2K — please take a look at its backstory, and give it a few more plays on YouTube.
Content warnings
For audiophiles and those sensitive to sound
Audiophiles, and those sensitive to sound, might want to avoid pressing play on some of the content below: it gets very, very glitchy!
NB. If you do, what do you feel? I found listening back to some of the audio created a feeling of foreboding and anxiety; I’m not sure if this is a property of the audio, or something deeper inside me.
For digital preservationists
We’re going to go outside the norms of the discipline towards the end of this blog (digital preservation without a net!). Please reflect and comment, I’d love to hear your thoughts; but please also keep in mind it’s an idle exploration arising from some of the experiments here today.
Content warnings done, let’s press ▶️
Primed for corruption!
After my first blog on this I was keen to follow up with some more experiments. Given the timing of Johan’s post, I felt like it might be a sign to give it a whirl.
I have spent a lot of time looking quite shallowly at a lot of formats and so I am not a specialist in any one file format.
The last time I truly played around with audio, other than formatting my music collection, ripping CDs, or downloading and converting different audio snippets, was probably at high-school, where messing with lo-fi sound bytes was in fashion for us lab nerds.
I didn’t know where to begin with Johan’s audio, but respecting fair-use for the purposes of education as much as possible, I knew I needed a small snippet of the file. I also knew that there is a lot of information in an audio file, and a lot to try and present in even just a few seconds, and so I took a small 10-second snippet of a passage that I liked.
The snippet gave me a baseline from which I could begin trying to glitch the content.
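For those wanting to follow along, a snippet like this can be cut with ffmpeg; a minimal sketch, with illustrative timestamps and filenames rather than the ones actually used:

```bash
# Cut a 10-second snippet; -ss seeks to the start point, -t sets the
# duration, and -c copy avoids re-encoding the audio.
ffmpeg -ss 00:01:00 -t 10 -i y2k.mp3 -c copy ddac-y2k-snippet.mp3
```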
A question of codec
I do not know how many codecs FFMPEG supports, but if you run `ffmpeg -codecs` you will see hundreds of audio-visual (AV) codecs available.
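If you just want to count the audio codecs among them, something like this should work; a sketch assuming a standard FFmpeg build, where an “A” in the third capability column of the codec listing marks an audio codec:

```bash
# Count the audio codecs this FFmpeg build knows about; the "listing"
# flag skips the legend printed before the "-------" separator.
ffmpeg -codecs 2>/dev/null \
  | awk 'listing && /^ [D.][E.]A/ { n++ } /^ *-+$/ { listing=1 } END { print n " audio codecs" }'
```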
As mentioned, I am not an AV specialist; this is a specialist function within the GLAM sector. What I do know is that the choice of codec will affect how glitching works, and how the result eventually gets played back through various media players like VLC.
I took an easy path to selecting codecs: I put the snippet of Y2K into Audacity for re-export and selected those I had previously heard of, or thought sounded cool. I would then take these samples and try to corrupt them one by one.
My final list looked as follows.
Formats and Codecs
{ "format": { "filename": "ddac-y2k-snippet.ac3", "format_name": "ac3", "format_long_name": "raw AC-3", } } { "format": { "filename": "ddac-y2k-snippet.flac", "format_name": "flac", "format_long_name": "raw FLAC", } } { "format": { "filename": "ddac-y2k-snippet.mka", "format_name": "matroska,webm", "format_long_name": "Matroska / WebM", } } { "format": { "filename": "ddac-y2k-snippet.ogg", "format_name": "ogg", "format_long_name": "Ogg", } } { "format": { "filename": "ddac-y2k-snippet.opus", "format_name": "ogg", "format_long_name": "Ogg", } } { "format": { "filename": "ddac-y2k-snippet.pcm", "format_name": "matroska,webm", "format_long_name": "Matroska / WebM", } } { "format": { "filename": "ddac-y2k-snippet.wav", "format_name": "wav", "format_long_name": "WAV / WAVE (Waveform Audio)", } }
A script for corruption
I created a bash script that does a number of things:
- creates a derivative file from a source input.
- flips a percentage of bits at random using bitflip.
- creates a waveform of the new derivative.
- repeats the process using the last flipped derivative of the file.
- exits when a waveform can no longer be generated.
The script looks as follows:
```bash
#! /usr/bin/bash

FNAME="$1"
DIRNAME="WAVEFORM-$1"
AUDIODIRNAME="AUDIO-$1"

rm -rf "$DIRNAME"
mkdir "$DIRNAME"
rm -rf "$AUDIODIRNAME"
mkdir "$AUDIODIRNAME"

flip() {
    # bitflip 0.001% of a file at random, then keep a copy of the new
    # derivative.
    bitflip spray percent:0.001 "$1"
    cp "$1" "$2-$1"
}

waveform() {
    # create a waveform image from an audio file.
    ffmpeg \
        -i "$1" \
        -f lavfi \
        -i color=c=black:s=1280x640 \
        -filter_complex "[0:a]showwavespic=s=1280x640:colors=white[fg];[1:v][fg]overlay=format=auto" \
        -frames:v 1 \
        -hide_banner \
        -loglevel fatal \
        "$2_out.png"
    status=$?
    return $status
}

sum() {
    # compare a waveform image's checksum against that of a known blank
    # frame. (currently unused)
    md5=$(md5sum "$1" | cut -d' ' -f1)
    if [[ "$md5" == "232069063dee33e87fe588ce0ccdc4c0" ]]
    then
        return 1
    fi
    return 0
}

status=0
idx=0
while [ $status -le 0 ]
do
    printf -v zp "%05d" $idx
    echo "$zp"
    flip "$FNAME" "$AUDIODIRNAME/$zp"
    waveform "$FNAME" "$DIRNAME/$zp"
    idx=$(( idx + 1 ))
done
```
NB. The sum() function is unused, but potentially useful, as it identifies an empty waveform frame by its checksum. We could exit the script when we first see an empty waveform, or skip the output, but we don’t do this in these experiments.
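If we did want to bail out at the first blank frame, the main loop could gate on sum(); a minimal sketch of how it might be wired in (my assumption; not something done in these experiments):

```bash
# Render the waveform, then compare it against the blank-frame checksum;
# sum() returns 1 on a match, so we stop the loop there.
waveform "$FNAME" "$DIRNAME/$zp"
if ! sum "$DIRNAME/${zp}_out.png"; then
    echo "blank waveform at cycle $zp, stopping"
    break
fi
```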
The script allows me to experiment on my seven files, and this is where it starts to get interesting.
Some formats break, some formats bend
I started by flipping a relatively high percentage of bits per iteration using bitflip. I found some formats broke quite easily, with nothing renderable from them, and so I had to find a percentage that might work better.
I settled on 0.001%. For each of the formats this looks as follows:
format | bits flipped per cycle (bitflip output) |
---|---|
pcm | flipping 12 bits out of 1.219088 Mbits (0.001% of 149 KiB) in file |
wav | flipping 153 bits out of 15.302352 Mbits (0.001% of 1.8 MiB) in file |
opus | flipping 12 bits out of 1.218616 Mbits (0.001% of 149 KiB) in file |
ogg | flipping 16 bits out of 1.667504 Mbits (0.001% of 204 KiB) in file |
mp3 | flipping 24 bits out of 2.408424 Mbits (0.001% of 294 KiB) in file |
ac3 | flipping 15 bits out of 1.59744 Mbits (0.001% of 195 KiB) in file |
flac | flipping 91 bits out of 9.196144 Mbits (0.001% of 1.1 MiB) in file |
I realized quickly I really only had two candidates robust enough to maybe get some interesting glitches out of. Even flipping 0.001% of bits, some formats would only continue to render for one or two cycles. The cycle counts are below:
format | cycles |
---|---|
pcm | 5 |
wav | 201 |
opus | 12 |
ogg | 1 |
mp3 | 14,991 |
ac3 | 730 |
flac | 187 |
Bad candidates: OGG, PCM, and OPUS
These formats didn’t really glitch, in that the audio didn’t so much corrupt, or change in quality, as just disappear, or stop playing entirely. There were relatively few iterations, and it looks like players do not have very good error-correction capabilities for these formats. Outside of this blog, I would probably have to spend a year trying to understand why this is the case, but we’ll revisit it a little in the conclusions later.
Good candidates: MP3, and AC3
These two files just kept on going! I can’t say that I loved their “glitches” (below), but the properties of these files that enabled so many derivatives to be created were truly fascinating; again, we’ll visit this later.
Mystery candidate: FLAC
I’m going to show the waveform created for FLAC. The mystery here is that the file is immediately unrecognizable to players after two iterations (almost empty in VLC, with just some burst of sound), and entirely unplayable/corrupt after four. The audio shows as completely empty in Audacity. The FFMPEG waveform, however, continues to be generated for 188 cycles.
The images suggest that it may be possible to retrieve something from this file; perhaps, if the format itself could be repaired, the codec underneath ensures some of the data remains playable?
Workhorse candidate: WAV
The WAV snippet performed admirably. That being said, its output remained very consistent, with each cycle only really adding noise to the audio, before the format itself was just too corrupt to play. You can see that this noise appears as a much denser, and in places much spikier, waveform.
So, MP3 and AC3
How do we do this?
Waveform first
I really like the waveform view of the results. I cheated (just a little) to create the MP3 image by using an earlier glitch run, corrupting 0.05% of bits at a time, resulting in ~1500 derivatives. For AC3 I use the complete set of 730:
MP3
I like how the MP3 just keeps on going.
Because I am using a bit-flipping technique to corrupt these files, data is continually being warped; byte values can shift anywhere between 0 and 255 as their bits flip, with each change upon another change having an impact on what we see. I feel like we get a clear picture of the impact of this in the MP3 example, where there are entirely blank frames: it’s like we deleted all the data in the file, but, with the flipped bits acting as placeholders for “new data”, further data seems to appear from nowhere.
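As a toy illustration (my own, not part of the experiments) of why a single flipped bit can move a byte’s value so far:

```bash
# Flip one bit in the byte 72 ('H'); the resulting value depends
# entirely on which bit position flips.
byte=72
for bit in 0 3 7; do
    printf "flip bit %d: %d -> %d\n" "$bit" "$byte" "$(( byte ^ (1 << bit) ))"
done
# flip bit 0: 72 -> 73
# flip bit 3: 72 -> 64
# flip bit 7: 72 -> 200
```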
AC3
The AC3 example is intriguing because the structure of its shape seems to remain consistent until it all of a sudden becomes very coarse and saw-tooth-like.
Recurring patterns
Across MP3 and AC3, as I continued to play with the output, I saw, at least for my relatively naive approach to glitching, that the shapes in the images above seemed to be consistent across other attempts. The eventual “saw-toothing” of the AC3, for example, occurred many times during my experiments. MP3’s robustness saw it, on different attempts, seemingly glitch out of existence before reappearing and dying.
Audio second
This is where I ask you once again to proceed with caution. The audio here gets pretty rough!

MP3
It is, perhaps, the quality of the audio that we should reflect on with both of the files created during these experiments. I took 9 samples from each output and combined them. We can ask: when does the audio start becoming too difficult to understand? Did we get any cool glitching?
MP3 is available to play below.
I feel like around the 150th derivative the MP3 starts to lose enough fidelity that it is no longer easy to understand what it is. By the 500th iteration this really is the case, and beyond there we get a lot of noise before the format fades out very quickly.
I’ve included the MP3’s dying embers in the output above.
Did we get any cool glitching? Reading around forums, folks have expressed a sense of being underwater in the way MP3 glitches, and this seems to happen here. Unfortunately I feel like the glitching we do get is really just digital noise (maybe as you might expect) and we don’t get any real emergent output, e.g. the quality of the original track doesn’t change in any significant way, it just gets worse. But what do you think?
AC3
The AC3 is quite “nice”: in the 55th and 101st iterations we can hear bleeps and clicks that feel new and robotic. The audio quickly gets worse after that, and starts to sound like a cyborg malfunctioning, and not getting any better!
For me, the way the AC3 degrades is somehow the most anxiety-inducing; I really don’t like it as it gets towards the end of its 700+ cycles.
BUT, taking the positives, what I do like, I mean, really, really like, is that both files can take this punishment and deliver recognizable audio for a good number of bitflips — this is worth reflecting on.
Time to be diplomatics
Digital preservation is tied to integrity. We work on the property of being “fixed”, i.e. fixity. We gamble on redundancy and ask our preservation systems to maintain this property for us, with lots of copies keeping stuff safe (LOCKSS) and lots of integrity checking. Yet, the world is messy. What if something goes wrong? What if we lose integrity? How are we discussing authenticity more widely in digital preservation? How do we prepare ourselves to show something is authentic even if something goes wrong, e.g. we lose data?
An example: we have information encoded in FLAC; it’s generally received wisdom that it is a good format because it uses lossless compression to store audio at its highest fidelity (read: it makes it cheaper). Yet, in the experiments above, we can lose access to its payload very easily. MP3, on the other hand, resists corruption for a number of iterations; we can still glean from the audio information that makes sense to us, we can hear speech, a bass-line, we know it’s electronic music of some sort. The MP3 no longer has any of the properties of integrity, its checksums so wildly changed, but it is analogue-like in its gradual, and somewhat graceful, decay. Do the MP3’s properties here lend themselves better to preserving information than the more “sound” FLAC?
It reminds me of the discipline of diplomatics, which provides us with the tools to understand the authenticity of official records — some of the materials, especially in the analog sense, may be decaying in their own ways: faded inks, smudged writing, sun-damaged paper, and nothing getting any younger. Yet! We can make statements about the authenticity, or possible authenticity, of something. The discipline relies on access to information. In four out of the seven scenarios above, we lose access to information entirely (at least without much deeper forensic analysis and remediation); what would we do to understand those records then?
A sensitivity index?
Given all of this, could we naively create a sensitivity (to corruption) index for the file formats in our collections? If we numbered the index high for easily corruptible data to low for difficult to corrupt data, what would we see?
If we created a family related index based on the above audio formats, what would the results of the index be?
format | cycles | audible cycles | sensitivity index |
---|---|---|---|
pcm | 5 | 5 | 97.5% |
wav | 201 | 200 | 0% |
opus | 12 | 2 | 99% |
ogg | 1 | 1 | 99.5% |
mp3 | 14,991 | 100 | 50% |
ac3 | 730 | 30 | 60% |
flac | 187 | 0 | 100% |
NB. As a very rough-and-ready method, I based the results on the largest known possible value for audible cycles (200). I would be surprised if there weren’t better ways to calculate some sort of index.
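For what it’s worth, here is a minimal sketch of the kind of naive calculation I have in mind, assuming the index is simply the share of audible cycles lost relative to that largest known value; the exact weighting used above may differ:

```bash
# sensitivity = (1 - audible_cycles / best_known_audible_cycles) * 100
best=200
for entry in "pcm:5" "wav:200" "opus:2" "ogg:1" "mp3:100" "ac3:30" "flac:0"; do
    fmt="${entry%%:*}"
    audible="${entry##*:}"
    # bash arithmetic is integer-only, so hand the division to awk.
    awk -v f="$fmt" -v a="$audible" -v b="$best" \
        'BEGIN { printf "%-5s %5.1f%%\n", f, (1 - a / b) * 100 }'
done
```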
It could be interesting to see how this shapes our theories of file formats for digital preservation; alas, there is much need for caution.
An important note about scientific integrity
Science relies on the careful control of different variables; this is ridiculously difficult to do in the format world. Formats are deeply human things, they exist to store state, or enable information interchange. They exist to support, they exist to entertain.
Importantly, formats are built upon stacks of technology that extend this human characteristic even further, from being flawed through bugs, or lack of attention, to being robust because of a great deal of care or interest.
A format that is robust today might be robust because of the amount of attention paid to the software or specifications created in the past. PDF is a classic example of a format that needs to persist; don’t get me wrong, I don’t need it to, but technologists made their bed with it, and now, to ensure important information survives into the future, we will probably always have to preserve the PDF (I mean, @wtfpdf).
More fun: there are probably ample stories about the importance of the MP3; the iPod, and Apple as we know it today, wouldn’t exist without it. Was MP3 robust before all its attention in the 90s? Or is it robust because of all its attention in the 90s? (I do know we can answer this question, but this is not a format-level analysis, and I probably need to wrap up this blog this year!)
Either way, I am quite interested in understanding what formats exist today, in today’s technological conditions, that have these analog, malleable, stretchy properties, and bend but don’t break.
Before I go…
I want to wrap up with a few footnotes from the testing above and some final thoughts developed while writing.
The Millennium bug!
What might it sound like? (I did tease it all the way back in the beginning) Well, maybe this critter, hidden in the AC3?!
Corrupting MIDI
Of course, I was disappointed I couldn’t truly glitch the audio above and create a high-quality output, and so I had the idea to convert it to MIDI — MIDI is an instruction-based file format, which means that if we can manipulate those instructions in any way via glitching, we might be able to change the output.
Of course, the first conversion pretty much corrupts the audio unrecognizably anyway. You can hear that here:
But then, as hoped, I did find MIDI had nice properties for glitching, and you can hear that here:
You can clearly hear the introduction of new instruments and changes to the pitch; it also becomes more resonant; all while the recording maintains a decent level of fidelity, i.e. it doesn’t sound noisy or full of feedback.
I have laid the two waveforms on top of each other for comparison.
It may be interesting to come back to some long-form MIDI and try some of these experiments again.
MIDI, and MIDI on Linux
- Check out this cool app from Spotify to create your own MIDI: https://basicpitch.spotify.com/
- Source code: https://github.com/spotify/basic-pitch
To play MIDI:
- Check out TiMidity on Linux to play MIDI without a hardware synthesizer: https://sourceforge.net/projects/timidity/
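For anyone wanting to try reproducing the pipeline, something like the following might work; a sketch assuming the basic-pitch CLI and TiMidity are installed, with hypothetical file and directory names:

```bash
# Transcribe the audio to MIDI; basic-pitch writes a .mid into the
# given output directory, named after the input file.
basic-pitch ./midi-out ddac-y2k-snippet.wav

# Glitch the MIDI with the same bitflip tool used earlier.
bitflip spray percent:0.001 ./midi-out/ddac-y2k-snippet_basic_pitch.mid

# Render the glitched MIDI back to audio with TiMidity's software synth.
timidity ./midi-out/ddac-y2k-snippet_basic_pitch.mid -Ow -o glitched.wav
```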
Thanks to @matteusbeus for suggesting Spotify’s tooling here, and for being so quick to do so!
And finally
Thank you all for ~~listening~~ reading. Now please do go and give Y2K a few more rewinds.
Final footnotes
Digital decay
Manfred Thaller describes two types of digital decay in the README for his shotGun tool:
“Bit rot” stands for files where the degradation of storage media leads to tiny changes in the data stored, usually individual bits shifting from zero to one or the other way around. This type of damage occurs typically as a result of aging media. Most typically it either affects individual bytes, or sequences of bytes the length of which may vary with media types, will most frequently be 512, however.
And:
“Byte loss” stands for files, where faulty copying operations have lead to the loss of individual bytes, so a file is a few bytes shorter. This is usually the result of faulty equipment and / or faulty transmission lines. Most typically such a problem will only affect very short sequences of bytes, usually individual bytes only, as larger loss of data gets discovered quickly.
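As a toy sketch of what “byte loss” might look like in practice (my own illustration, assuming GNU coreutils):

```bash
# Drop the single byte at a random offset, leaving a file that is one
# byte shorter: Thaller's "byte loss".
FILE="sample.bin"
SIZE=$(stat -c%s "$FILE")
OFFSET=$(( RANDOM % SIZE ))

head -c "$OFFSET" "$FILE" > byteloss.bin              # bytes before the victim
tail -c "+$(( OFFSET + 2 ))" "$FILE" >> byteloss.bin  # bytes after it
```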
The experiments in this article are more akin to “bit rot”, though much more artificial. I don’t believe data could be corrupted this easily (at least not naturally) in the wild.
Digital integrity is variable
Sean Martin and Malcolm Macleod’s excellent paper describes the variability of the digital image when created through analog means, i.e. through photography. Of course, it makes sense that light signals will be received and processed differently based on the digital camera and lens being used. Integrity need not be so tied to any particular checksum when the subject of the digital image is what gives it value, not its specific layout of bytes.
Eno’s notation
A final bit of bonus content: I saw this post from Ed Summers about the notation on Brian Eno’s Music for Airports.
It seemed appropriate somehow while writing this blog. The notation is described by the Smithsonian Mag here, and it is one style of a number of experimental music notations being toyed with at the time by different composers and artists. It’s a nice read for those who are interested.
@exponentialdecay @wtfpdf @matteusbeus This is some really nice work Ross! The iterative corruption process (and its results) reminds me a bit of William Basinski's disintegration loops (https://en.wikipedia.org/wiki/The_Disintegration_Loops), or Alvin Lucier's "I Am Sitting in a Room" (https://en.wikipedia.org/wiki/I_Am_Sitting_in_a_Room).
@bitsgalore @exponentialdecay I appreciate it Johan, and thanks for the inspiration to try it! These links are great too, I really enjoyed reading them and following through on some of the other related links in the articles as well. I don't recall ever watching the Marques Brownlee YouTube experiments but I just watched it and love that video. Can't believe they went through the process of downloading and uploading 1000 videos!
@beet_keeper @exponentialdecay Hadn't seen the Brownlee video, that's nuts!
This sort of stuff can really take you down a proper rabbit hole, where you end up at things like this:
https://www.youtube.com/watch?v=57PyheGftGY
@bitsgalore @exponentialdecay 😲
A wonderful read (and listen) Ross! It feels very much like a new conception, or the inverse, of the ship of Theseus; does a digital object which undergoes decay remain fundamentally the same object? Does it still remain authoritative given a loss to integrity? At the point of ingest, I think the default answer would be no. However, what of corruption which occurs to records already described within collections?
You mention diplomatics and I believe that while the science has a history pertaining to physical records, much can still be applied to those which are born-digital. Where seals, signatures, and other marks of authority are the focus of physical records, the metadata being captured with born-digital records must likewise play a critical role. And where annotations in margins or tears find their way into a catalogue description, so too must digital degradation be documented if no other solution is available.