13 thoughts on “Revisiting bsdiff as a tool for digital preservation”

    1. Hi Coucou — yes, that one looks equivalent. The original is by Colin Percival, see https://www.daemonology.net/bsdiff/ with more info here. bsdiff is packaged with FreeBSD, and I’m not sure where the sources are for the version I am using on Ubuntu, but it looks like it was packaged circa 2003, going by the man page. I hope that helps!

  1. Interesting process. Is this similar to how NZ archives handles “fixes” in their repository? Also, what would be best practice for the extension on the patch file? Retaining the source extension seems problematic, as the bsdiff file uses its own file format.

    1. Hi Tyler — thanks for reading! And good questions. About Archives NZ: I didn’t mean to imply it was, but at the time I first wrote about bsdiff, I was wrestling with the mechanics of the versioning in their digital preservation system. There was a lot to think about there, including duplication of storage and the eventual representation of versions in the METS (I ended up writing about it here, and we ended up re-digitizing and re-ingesting). In DP systems in general, I think incorporating bsdiff/delta versioning effectively would need additions to data models, and to the implementation and execution of workflows. Perhaps it’s more suited to a greenfield project? I first implemented it with colleagues at TNA for a digitization project (one we had lots of control over); I couldn’t speak to how that ended up landing in the preservation system there or whether it persisted. WRT naming, I hear you. I expect .patch would be a good baseline, but we could add more meaning in the extension if desired, i.e., something to indicate its use for preservation/repair. Interesting to think about!

        1. I forgot that we added that extension to PRONOM. I think that is fine.

          I don’t think the bsdiff docs recommend a particular convention. What I would emphasize as you write your workflow is to include enough context in the filename, as needed, to understand it now and in the future.

          Looking back on this blog a few months later, I think I would include “patch” somewhere in the naming convention (because bsdiff as a proper noun requires knowing more context on its own), and I think I would change the name “mask” above to “masked,” or maybe something else, as I’m not 100% happy with that name when trying to interpret its full meaning today.

          I’m not sure though. Any thoughts appreciated!

  2. Hi Ross, thank you so much for revisiting this excellent tool, which I did not know previously. I love it very much! I have a few questions/remarks though:
    1) From the mask and the stored file, could we get a slightly more human-readable version that would be equivalent to the output of the command diff <(hexdump originalFile.ext) <(hexdump fixedFile.ext)?
    2) It wasn't obvious to me which file you provide as input to binvis.io to get the output you show. After some time I supposed it was the mask.ac3 file. Am I right?
    3) I tested how the tool would work on an uncompressed sound in AIFF, rewrapped into WAVE. It took some seconds for a 43 MB file, but it worked just fine (the patch was ~260 bytes). I suppose it could then work for normalisation operations if they consist only of rewrapping.
    4) Regarding the fixing process, which is a more obvious use case, I'm wondering whether it could also serve to measure the impact of a fix, if one did not use a hex editor to do the fix and therefore wants to see how much it affected the file, as the diff seems very efficient. Can you confirm that?

    1. > could we get a slightly more human-readable version

      Another inspiration for revisiting this post was this Toot earlier in the year: https://digipres.club/@joshuatj/114233850477861706

      I am not working with large files at the moment, but the thesis is that, as a workaround for being unable to diff big files, you can use this text-based approach to get something genuinely helpful.
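      As a rough sketch of that text-based approach, this uses Python's difflib in place of hexdump/diff, over synthetic placeholder data, so only the changed region of the hexdump surfaces:

```python
# Sketch: diff two hexdump-style renderings so only changed regions appear,
# approximating `diff <(hexdump a) <(hexdump b)`. Data here is synthetic.
import difflib

def hex_lines(data, width=16):
    """Render bytes as hexdump-style lines: offset plus hex pairs."""
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        yield f"{off:08x}  {' '.join(f'{b:02x}' for b in chunk)}"

def hex_diff(a, b):
    """Unified diff of the two hexdumps."""
    return list(difflib.unified_diff(
        list(hex_lines(a)), list(hex_lines(b)),
        fromfile="originalFile.ext", tofile="fixedFile.ext", lineterm=""))

original = bytes(64)          # 64 null bytes stand in for the source file
fixed = bytearray(original)
fixed[20] = 0xFF              # simulate a single-byte fix
for line in hex_diff(original, bytes(fixed)):
    print(line)
```

      Only the hexdump line containing the changed byte shows up as a -/+ pair; the rest stays as context.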

      > It wasn’t obvious to me which file you provide as input to binvis.io to get the output you show. After some time I supposed if was the mask.ac3 file. Am I right?

      Yes, naming here might be improved if this workflow is to be developed further. “masked” might be more appropriate. It’s the file we create that has the same size/shape as the original, but is plugged with only the areas in binary that were different between the source and the target.
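      A minimal sketch of how such a “masked” file could be built, assuming equal-length source and target; whether the mask carries the source or the target byte at each difference is my assumption here, and the names are illustrative:

```python
# Sketch: a mask the same length as the original, null wherever the files
# agree, carrying the differing byte wherever they do not. Assumes a fix
# that swaps bytes in place (no insertions/deletions).
def build_mask(source: bytes, target: bytes) -> bytes:
    """Null wherever the files agree; the differing byte where they don't."""
    if len(source) != len(target):
        raise ValueError("mask only defined for equal-length files")
    return bytes(s if s != t else 0 for s, t in zip(source, target))

# A small "fix" leaves a mask that is null except at the fixed offsets:
mask = build_mask(b"broken header", b"mended header")
```

      Feeding a file like this to a visualiser such as binvis.io highlights just the regions the fix touched.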

      > I tested how the tool would work on an uncompressed sound in AIFF, rewrapped into WAVE.
      > I suppose it could then work for normalization operations if it consists only in a re-wrapping.

      Cool! Great to hear some more use-cases.

      Appetite is always dependent on a number of factors. How ambitiously you can use it is determined by providing assurance mechanisms that satisfy the end user. Documenting the workflow and making its tooling easily accessible to the institution and the end user is important. You might also write new utilities/interfaces, e.g. build it into the user interface of the access system.

      Mechanical and cryptographic (i.e. round-tripping and checksum-driven) proofs are important for us here.
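      Those two proofs could be sketched like this (a minimal illustration only; the actual bsdiff/bspatch invocation and any recorded digests sit outside this snippet):

```python
# Sketch of the two assurance checks: mechanical (round-trip, byte-for-byte)
# and cryptographic (checksum-driven). `expected_sha256` would be a digest
# recorded at fix time.
import hashlib

def mechanical_ok(patched: bytes, target: bytes) -> bool:
    """Round-trip proof: applying the patch reproduces the target exactly."""
    return patched == target

def cryptographic_ok(patched: bytes, expected_sha256: str) -> bool:
    """Checksum proof: the patched output hashes to the recorded digest."""
    return hashlib.sha256(patched).hexdigest() == expected_sha256
```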

      > I’m wondering whether it could also serve to get a measure of the impact of a fix,

      Feel free to clarify, but what I read is: /the patch file can help us describe a measurement of the extent of a fix?/

      Yes, I think so, but I don’t think all the pieces are there yet — the patch is bzipped and so that’s largely why it looks so efficient.

      We create the “mask” files to be the same shape and size as the original file, as described above. I might suggest “extent” is a count of all non-zero bytes in that file. Testing that file, we get the following results:

      all bytes      199680
      null bytes     198842
      non-null bytes    838

      So the fix-extent is 838 bytes.
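      Those counts can be reproduced by treating “extent” as the number of non-null bytes in the mask; the mask contents below are synthetic, sized only to match the figures quoted:

```python
# Sketch: fix-extent as the count of non-null bytes in a mask file.
def fix_extent(mask: bytes):
    """Report total, null, and non-null byte counts for a mask file."""
    non_null = sum(1 for b in mask if b != 0)
    return {"all bytes": len(mask),
            "null bytes": len(mask) - non_null,
            "non-null bytes": non_null}

synthetic_mask = bytes(199680 - 838) + b"\xff" * 838
print(fix_extent(synthetic_mask))  # the non-null count is the fix-extent
```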

      Let me know if I can clarify further, and let me know how the rest of your experimentation goes!

      1. Thanks Ross, that helps a lot. The “fix-extent” – the number of bytes swapped vs. the total number of bytes – does indeed seem to be a very good measure of the change. Sometimes there is doubt about whether we should “fix” an error, and decision factors are helpful (assessed impact of the error, impact of the fix, time and effort required to perform it, existence of another valid version, deviation from the specification). This provides a measure of the fix extent, and that’s excellent!

      2. One additional naive comment: the visualisation of diffs via the mask file works well with swapped or added bytes, but does not represent deleted bytes at all – which is probably related to the way the tool is designed…

        1. I hadn’t thought about that. Do you mean appended/truncated bytes in particular? I probably won’t be able to circle back and test this in the short term, but if you do manage to create any visualizations/comparisons it would be great to see!

          1. We have an example of a TIFF file that has a TIFF-hul-1 error (unexpected EOF). This is because one IFD entry has an offset that is beyond the file size. We decided to delete that IFD entry. The mask is the size of the fixed file, so the deleted bytes that were in the original file are not represented when displaying the mask with xxd -a.
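            As a hedged sketch, not part of the workflow above, one way to surface deletions that a same-length mask cannot represent is to align the two byte streams and list every non-equal operation with its offsets:

```python
# Sketch: report replace/insert/delete operations between two files,
# so a deleted region (e.g. a removed IFD entry) appears with its byte range.
import difflib

def byte_ops(original: bytes, fixed: bytes):
    """All non-equal opcodes: (op, orig_start, orig_end, fix_start, fix_end)."""
    sm = difflib.SequenceMatcher(None, original, fixed, autojunk=False)
    return [(op, i1, i2, j1, j2)
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

# A deletion shows up as a "delete" spanning the removed bytes:
print(byte_ops(b"IFD1IFD2IFD3", b"IFD1IFD3"))
```

            This scales poorly to very large files, but for targeted fixes it records what a mask displayed with xxd -a cannot.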
