Back in early 2022 I was finally able to get around to writing a paper that I had been thinking about for the better part of a decade.
The information in these reports may look something like the following:
siegfried : 1.10.1
scandate : 2023-07-09T16:53:44+02:00
signature : default.sig
created : 2023-05-12T09:10:13Z
  - name : 'pronom'
    details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
filename : 'sf'
filesize : 10024103
modified : 2023-07-09T16:20:20+02:00
md5 : b08e809832955674c801559f7a9adf17
  - ns : 'pronom'
    id : 'fmt/690'
    format : 'Executable and Linkable Format'
    version : '64bit Little Endian'
    basis : 'byte match at 0, 7'
Over a large enough corpus of files, this information reveals a great deal about a collection, including:
- Range of format identification.
- Unidentified file formats as an indicator of further work.
- Identified file formats as an indicator of unwanted files.
- Identified file formats as an indicator of the complexity of a collection.
- File and directory names (and their potential to be analysed).
- Encoding information.
- Empty directories.
- File sizes.
- Date files were last modified.
- Information about zero-byte files.
- Checksum analysis and duplicate detection.
- Information about system files.
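Several of the items above can be derived with a few lines of code once a report has been loaded as a list of records. The following is a minimal sketch and is not taken from the paper: the records are hypothetical, their field names simply mirror the sample report above, and treating 'UNKNOWN' as the identifier of an unidentified file is an assumption about the tool's output.

```python
from collections import Counter, defaultdict

# Hypothetical records. Field names mirror the sample report above
# (filename, filesize, md5, id); the paths and most values are invented.
records = [
    {"filename": "a/sf", "filesize": 10024103,
     "md5": "b08e809832955674c801559f7a9adf17", "id": "fmt/690"},
    {"filename": "b/sf-copy", "filesize": 10024103,
     "md5": "b08e809832955674c801559f7a9adf17", "id": "fmt/690"},
    {"filename": "notes/empty.txt", "filesize": 0,
     "md5": "d41d8cd98f00b204e9800998ecf8427e", "id": "UNKNOWN"},
    {"filename": "notes/mystery.bin", "filesize": 512,
     "md5": "0123456789abcdef0123456789abcdef", "id": "UNKNOWN"},
]

# Assumption: unidentified files carry this marker in the "id" field.
UNIDENTIFIED = "UNKNOWN"


def summarise(records):
    """Return format counts, unidentified files, zero-byte files and duplicates."""
    formats = Counter(r["id"] for r in records)
    unidentified = [r["filename"] for r in records if r["id"] == UNIDENTIFIED]
    zero_byte = [r["filename"] for r in records if r["filesize"] == 0]

    # Group filenames by checksum; any group with more than one member
    # is a set of byte-identical files.
    by_checksum = defaultdict(list)
    for r in records:
        by_checksum[r["md5"]].append(r["filename"])
    duplicates = {c: names for c, names in by_checksum.items() if len(names) > 1}

    return {
        "formats": formats,
        "unidentified": unidentified,
        "zero_byte": zero_byte,
        "duplicates": duplicates,
    }


summary = summarise(records)
for key, value in summary.items():
    print(key, value)
```

Returning plain dictionaries and lists keeps the summary easy to serialise, which helps if the same view is to be compared across several collections.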
With consistent, API-like ways of viewing this data, one can document a collection in detail, draw connections to other collections, and identify the new work programs needed to maintain a digital collection over the long term.
A file format identification report is an important artifact that often exists at the beginning of a digital transfer process and may then be recreated a number of times as the collection is processed. My paper goes into a lot more detail about how you might use the information in these reports and looks at some of the other tools out there that are already trying to do that.
The paper received a lot of positive comments at the time of publishing. Let me know what you think about it and if you have other ideas about how you might leverage format identification reports in your day-to-day work.
Read more in Code4Lib issue 53: https://journal.code4lib.org/articles/16351
Since the publication of this paper I have written a new tool that uses the checksums output in a file format identification report to create a checksum for a “folder” or directory, where checksums do not normally exist. This tool is called Sumfolder1 and I will be introducing it in more detail in a later blog post.
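To give a feel for the general idea, here is a minimal sketch of one way a directory checksum could be derived from file checksums. It is an illustration only and is not a description of how Sumfolder1 works: the function name `folder_checksums`, the choice to sort and concatenate child digests, and the use of MD5 are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from posixpath import dirname


def folder_checksums(file_checksums):
    """Map each directory to a digest built from everything beneath it.

    file_checksums maps a '/'-separated file path to its hex checksum.
    The top-level directory of relative paths is represented by ''.
    """
    # Record every parent/child relationship implied by the file paths.
    children = defaultdict(set)
    for path in file_checksums:
        child, parent = path, dirname(path)
        while parent != child:
            children[parent].add(child)
            child, parent = parent, dirname(parent)

    digests = {}

    def digest(path):
        # Files already have a checksum; directories are computed on demand.
        if path in file_checksums:
            return file_checksums[path]
        if path not in digests:
            # Sorting makes the result independent of listing order.
            combined = "".join(sorted(digest(c) for c in children[path]))
            digests[path] = hashlib.md5(combined.encode("ascii")).hexdigest()
        return digests[path]

    for directory in list(children):
        digest(directory)
    return digests


example = folder_checksums({
    "x/1.txt": "aa",
    "x/2.txt": "bb",
    "y/3.txt": "bb",
    "y/4.txt": "aa",
})
print(example["x"] == example["y"])
```

In this sketch file names do not feed into the digest, so two directories holding the same content under different names compare as equal. Whether that is desirable depends on the use case, and a different design could fold the names in as well.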