What information is in a file format identification report?

In early 2022, I finally got around to writing a paper that I had been thinking about for the better part of a decade. The paper, “Fractal in Detail: What Information Is in a File Format Identification Report?”, was published in Issue 53 of the Code4Lib Journal.

The paper takes a deep dive into the fractal contents of file format identification reports exported from tools like Siegfried and DROID.

Let’s take a brief look at the article and its contents below.

Taking an example from Siegfried, the information in a file format identification report may look something like the following:

---
siegfried : 1.10.1
scandate : 2023-07-09T16:53:44+02:00
signature : default.sig
created : 2023-05-12T09:10:13Z
identifiers :
   - name : 'pronom'
     details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'sf'
filesize : 10024103
modified : 2023-07-09T16:20:20+02:00
errors :
md5 : b08e809832955674c801559f7a9adf17
matches :
   - ns : 'pronom'
     id : 'fmt/690'
     format : 'Executable and Linkable Format'
     version : '64bit Little Endian'
     mime :
     class :
     basis : 'byte match at 0, 7'

Over a large enough corpus of files, this information reveals a great deal about a collection. That information includes:

Range of format identification.
Unidentified file formats as an indicator of further work.
Identified file formats as an indicator of unwanted files.
Identified file formats as an indicator of the complexity of a collection.
File and directory names (and their potential to be analysed).
Encoding information.
Empty directories.
File sizes.
Date files were last modified.
Information about zero-byte files.
Checksum analysis and duplicate detection.
Information about system files.
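A few of the items above can be sketched in code. The following is a minimal illustration, not how Demystify itself works: it assumes the per-file records have already been loaded into Python dictionaries whose keys mirror the Siegfried example (`filename`, `filesize`, `md5`, `matches`), and that an `id` of `UNKNOWN` marks a file that was not identified. The function name `summarise` is hypothetical.

```python
from collections import Counter, defaultdict


def summarise(files):
    """Summarise per-file records shaped like the Siegfried example above.

    Assumption: each record is a dict with "filename", "filesize", "md5"
    and "matches" keys, and an id of "UNKNOWN" means no format was found.
    """
    format_counts = Counter()
    unidentified = []
    zero_byte = []
    by_checksum = defaultdict(list)

    for record in files:
        name = record["filename"]
        ids = [match.get("id") for match in record.get("matches") or []]
        identified = [i for i in ids if i and i != "UNKNOWN"]
        if identified:
            # Range of format identification across the collection.
            format_counts.update(identified)
        else:
            # Unidentified files point to further work.
            unidentified.append(name)
        if record.get("filesize") == 0:
            zero_byte.append(name)
        checksum = record.get("md5")
        if checksum:
            by_checksum[checksum].append(name)

    # Any checksum shared by more than one file suggests duplicates.
    duplicates = {c: names for c, names in by_checksum.items() if len(names) > 1}
    return {
        "formats": format_counts,
        "unidentified": unidentified,
        "zero_byte": zero_byte,
        "duplicates": duplicates,
    }
```

Calling `summarise(records)["formats"].most_common(10)` would then list the ten most frequent identifiers in a collection, which is often a useful first look at its complexity.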

With a consistent abstraction for viewing this data, one can document a collection in great detail, draw connections with other collections, and identify new work programs involved in maintaining a digital collection over the long term.

I have been working to extract this information from file format identification reports in a consistent, repeatable way in my tools Demystify and Demystify-Lite (see also the blog about that effort).

A file format identification report is an important artifact that often exists at the beginning of a digital transfer process and may then be recreated a number of times as the collection is processed. My paper goes into a lot more detail about how you might use the information in it and looks at some of the other tools out there that are already trying to do that.

The paper received a lot of positive comments at the time of publishing. Let me know what you think about it and if you have other ideas about how you might leverage format identification reports in your day-to-day work.

Epilogue

Since the publication of this paper, I have written a new tool that uses the checksums output in a file format identification report to create a checksum for a “folder” or directory, for which checksums do not normally exist. This tool is called Sumfolder1 and I will be introducing it in more detail in a later blog.
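To give a rough idea of what a folder checksum could look like, here is a minimal sketch. It is not Sumfolder1's actual algorithm, which is not described here. It assumes records shaped like the Siegfried example with `filename` and `md5` keys and forward-slash paths, and it simply hashes the sorted checksums of every file found beneath a directory. The function name `folder_checksums` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath


def folder_checksums(files):
    """Derive one checksum per directory from the file checksums in a report.

    Assumption: each record has "filename" (forward-slash path) and "md5".
    This is an illustrative scheme only, not Sumfolder1's method.
    """
    contents = defaultdict(list)
    for record in files:
        path = PurePosixPath(record["filename"])
        # Every ancestor directory "contains" this file's checksum.
        for parent in path.parents:
            contents[str(parent)].append(record["md5"])

    # Sorting makes the result independent of the order files were scanned in.
    return {
        folder: hashlib.md5("\n".join(sorted(sums)).encode("ascii")).hexdigest()
        for folder, sums in contents.items()
    }
```

Because only file checksums feed into the digest, two directories holding the same content under different file names receive the same value in this sketch. Empty directories have no files to hash and so do not appear in the result.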
