13 thoughts on “Revisiting bsdiff as a tool for digital preservation”

    1. Hi Coucou — yes, that one looks equivalent. The original is by Colin Percival, see https://www.daemonology.net/bsdiff/ with more info here. bsdiff is packaged with FreeBSD, and I’m not sure where the sources are for the version I am using on Ubuntu, but it looks like it was packaged circa 2003, going by the man page. I hope that helps!

  1. Interesting process. Is this similar to how NZ archives handles “fixes” in their repository? Also, what would be best practice for the extension on the patch file? Retaining the source extension seems problematic, as the bsdiff file uses its own file format.

    1. Hi Tyler — thanks for reading! And good questions. About Archives NZ: I didn’t mean to imply it was, but at the time I first wrote about bsdiff, I was wrestling with the mechanics of the versioning in their digital preservation system. There was a lot to think about there, including duplication of storage and the eventual representation of versions in the METS (I ended up writing about it here, and we ended up re-digitizing and re-ingesting). In DP systems in general, I think incorporating bsdiff/delta versioning effectively would need additions to data models, and to the implementation and execution of workflows. Perhaps it’s more suited to a greenfield project? I first implemented it with colleagues at TNA for a digitization project (one we had lots of control over); I couldn’t speak to how that ended up landing in the preservation system there or whether it persisted. WRT naming, I hear you. I expect .patch would be a good baseline, but we could add more meaning in the extension if desired, i.e., something to indicate its use for preservation/repair. Interesting to think about!

        1. I forgot that we added that extension to PRONOM. I think that is fine.

          I don’t think the bsdiff docs recommend a particular convention. What I would emphasize as you write your workflow is to include enough context in the filename, as needed, to understand it now and in the future.

          Looking back on this blog a few months later, I think I would include “patch” somewhere in the naming convention (because bsdiff as a proper noun requires knowing more context on its own), and I think I would change the name “mask” above to “masked,” or maybe something else, as I’m not 100% happy with that name when trying to interpret its full meaning today.

          I’m not sure though. Any thoughts appreciated!

  2. Hi Ross, thank you so much for revisiting this excellent tool, which I did not know previously. I love it very much! I have a few questions/remarks though:
    1) From the mask and the stored file, could we get a slightly more human-readable version that would be equivalent to the output of the command diff <(hexdump originalFile.ext) <(hexdump fixedFile.ext)?
    2) It wasn't obvious to me which file you provide as input to binvis.io to get the output you show. After some time I supposed it was the mask.ac3 file. Am I right?
    3) I tested how the tool would work on an uncompressed sound in AIFF, rewrapped into WAVE. It took some seconds for a 43 MB file, but it worked just fine (the patch was ~260 bytes). I suppose it could then work for normalisation operations if they consist only of rewrapping.
    4) Regarding the fixing process, which is a more obvious use case, I'm wondering whether it could also serve to measure the impact of a fix, if one did not use a hex editor to do the fix and therefore wants to see how much it affected the file, as the diff seems very efficient. Can you confirm that?

    1. > could we get a slightly more human-readable version

      Another inspiration for revisiting this post was this Toot earlier in the year: https://digipres.club/@joshuatj/114233850477861706

      I am not working with large files at the moment, but the thesis is that, as a workaround for being unable to diff big files, you can use this text-based approach to get something genuinely helpful.
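      As a rough sketch of that text-based approach, this uses Python's difflib in place of hexdump/diff, over synthetic placeholder data, so only the changed region of the hexdump surfaces:

```python
# Sketch: diff two hexdump-style renderings so only changed regions appear,
# approximating `diff <(hexdump a) <(hexdump b)`. Data here is synthetic.
import difflib

def hex_lines(data, width=16):
    """Render bytes as hexdump-style lines: offset plus hex pairs."""
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        yield f"{off:08x}  {' '.join(f'{b:02x}' for b in chunk)}"

def hex_diff(a, b):
    """Unified diff of the two hexdumps."""
    return list(difflib.unified_diff(
        list(hex_lines(a)), list(hex_lines(b)),
        fromfile="originalFile.ext", tofile="fixedFile.ext", lineterm=""))

original = bytes(64)          # 64 null bytes stand in for the source file
fixed = bytearray(original)
fixed[20] = 0xFF              # simulate a single-byte fix
for line in hex_diff(original, bytes(fixed)):
    print(line)
```

      Only the hexdump line containing the changed byte shows up as a -/+ pair; the rest stays as context.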

      > It wasn’t obvious to me which file you provide as input to binvis.io to get the output you show. After some time I supposed if was the mask.ac3 file. Am I right?

      Yes, naming here might be improved if this workflow is to be developed further. “masked” might be more appropriate. It’s the file we create that has the same size/shape as the original, but is plugged with only the areas in binary that were different between the source and the target.
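      A minimal sketch of how such a “masked” file could be built, assuming equal-length source and target; whether the mask carries the source or the target byte at each difference is my assumption here, and the names are illustrative:

```python
# Sketch: a mask the same length as the original, null wherever the files
# agree, carrying the differing byte wherever they do not. Assumes a fix
# that swaps bytes in place (no insertions/deletions).
def build_mask(source: bytes, target: bytes) -> bytes:
    """Null wherever the files agree; the differing byte where they don't."""
    if len(source) != len(target):
        raise ValueError("mask only defined for equal-length files")
    return bytes(s if s != t else 0 for s, t in zip(source, target))

# A small "fix" leaves a mask that is null except at the fixed offsets:
mask = build_mask(b"broken header", b"mended header")
```

      Feeding a file like this to a visualiser such as binvis.io highlights just the regions the fix touched.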

      > I tested how the tool would work on an uncompressed sound in AIFF, rewrapped into WAVE.
      > I suppose it could then work for normalization operations if it consists only in a re-wrapping.

      Cool! Great to hear some more use-cases.

      Appetite is always dependent on a number of factors. How ambitiously you can use it is determined by providing assurance mechanisms that satisfy the end user. Documenting the workflow and making its tooling easily accessible to the institution and the end user is important. You might also write new utilities/interfaces, e.g. build it into the user interface of the access system.

      Mechanical and cryptographic (i.e. round-tripping and checksum-driven) proofs are important for us here.
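      Those two proofs could be sketched like this (a minimal illustration only; the actual bsdiff/bspatch invocation and any recorded digests sit outside this snippet):

```python
# Sketch of the two assurance checks: mechanical (round-trip, byte-for-byte)
# and cryptographic (checksum-driven). `expected_sha256` would be a digest
# recorded at fix time.
import hashlib

def mechanical_ok(patched: bytes, target: bytes) -> bool:
    """Round-trip proof: applying the patch reproduces the target exactly."""
    return patched == target

def cryptographic_ok(patched: bytes, expected_sha256: str) -> bool:
    """Checksum proof: the patched output hashes to the recorded digest."""
    return hashlib.sha256(patched).hexdigest() == expected_sha256
```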

      > I’m wondering whether it could also serve to get a measure of the impact of a fix,

      Feel free to clarify, but what I read is: /the patch file can help us describe a measurement of the extent of a fix?/

      Yes, I think so, but I don’t think all the pieces are there yet — the patch is bzipped and so that’s largely why it looks so efficient.

      We create the “mask” files to be the same shape and size as the original file, as described above. I might suggest “extent” is a count of all non-zero bytes in that file. Testing that file, we get the following results:

      all bytes      199680
      null bytes     198842
      non-null bytes    838

      So the fix-extent is 838 bytes.
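      Those counts can be reproduced by treating “extent” as the number of non-null bytes in the mask; the mask contents below are synthetic, sized only to match the figures quoted:

```python
# Sketch: fix-extent as the count of non-null bytes in a mask file.
def fix_extent(mask: bytes):
    """Report total, null, and non-null byte counts for a mask file."""
    non_null = sum(1 for b in mask if b != 0)
    return {"all bytes": len(mask),
            "null bytes": len(mask) - non_null,
            "non-null bytes": non_null}

synthetic_mask = bytes(199680 - 838) + b"\xff" * 838
print(fix_extent(synthetic_mask))  # the non-null count is the fix-extent
```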

      Let me know if I can clarify further, and let me know how the rest of your experimentation goes!

      1. Thanks Ross, that helps a lot. The “fix-extent” – the number of bytes swapped vs. the total number of bytes – does indeed seem to be a very good measure of the change. Sometimes there is doubt about whether we should “fix” an error, and decision factors are helpful (assessed impact of the error, impact of the fix, time and effort required to perform it, existence of another valid version, deviation from the specification). This provides a measure of the fix extent, and that’s excellent!

      2. One additional naive comment: the visualisation of diffs via the mask file works well with swapped or added bytes, but does not represent deleted bytes at all – which is probably related to the way the tool is designed…

        1. I hadn’t thought about that. Do you mean appended/truncated bytes in particular? I probably won’t be able to circle back and test this in the short term, but if you do manage to create any visualizations/comparisons it would be great to see!

          1. We have an example of a TIFF file that has a TIFF-hul-1 error (unexpected EOF). This is because one IFD entry has an offset that is beyond the file size. We decided to delete that IFD entry. The mask is the size of the fixed file, so the deleted bytes that were in the original file are not represented when displaying the mask with xxd -a.
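            As a hedged sketch, not part of the workflow above, one way to surface deletions that a same-length mask cannot represent is to align the two byte streams and list every non-equal operation with its offsets:

```python
# Sketch: report replace/insert/delete operations between two files,
# so a deleted region (e.g. a removed IFD entry) appears with its byte range.
import difflib

def byte_ops(original: bytes, fixed: bytes):
    """All non-equal opcodes: (op, orig_start, orig_end, fix_start, fix_end)."""
    sm = difflib.SequenceMatcher(None, original, fixed, autojunk=False)
    return [(op, i1, i2, j1, j2)
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

# A deletion shows up as a "delete" spanning the removed bytes:
print(byte_ops(b"IFD1IFD2IFD3", b"IFD1IFD3"))
```

            This scales poorly to very large files, but for targeted fixes it records what a mask displayed with xxd -a cannot.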
