Published: Fractal in detail: What information is in a file format identification report?
Back in early 2022 I was finally able to get around to writing a paper that I had been thinking about for the better part of a decade.
The paper, published in the Code4Lib Journal, looks at the information that can be extracted from a file format identification report, such as those output by the digital preservation tools Siegfried and DROID.
The information in these reports may look something like the following:
---
siegfried : 1.10.1
scandate : 2023-07-09T16:53:44+02:00
signature : default.sig
created : 2023-05-12T09:10:13Z
identifiers :
  - name : 'pronom'
    details : 'DROID_SignatureFile_V112.xml; container-signature-20230510.xml'
---
filename : 'sf'
filesize : 10024103
modified : 2023-07-09T16:20:20+02:00
errors :
md5 : b08e809832955674c801559f7a9adf17
matches :
  - ns : 'pronom'
    id : 'fmt/690'
    format : 'Executable and Linkable Format'
    version : '64bit Little Endian'
    mime :
    class :
    basis : 'byte match at 0, 7'
Over a large enough corpus of files, these reports reveal a great deal about a collection, including:
- Range of format identification.
- Unidentified file formats as an indicator of further work.
- Identified file formats as an indicator of unwanted files.
- Identified file formats as an indicator of the complexity of a collection.
- File and directory names (and their potential to be analysed).
- Encoding information.
- Empty directories.
- File sizes.
- Date files were last modified.
- Information about zero-byte files.
- Checksum analysis and duplicate detection.
- Information about system files.
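Several of the items in this list can be sketched in a few lines of Python. This is a minimal illustration, not how Demystify works: the records are hypothetical dictionaries shaped like the per-file entries in the sample report, `summarise` is an illustrative name, and treating an id of `UNKNOWN` as "unidentified" is an assumption about how the report marks files without a match.

```python
from collections import Counter, defaultdict

# Assumed marker for a file that no signature matched.
UNKNOWN = "UNKNOWN"

# Hypothetical per-file records, shaped like the entries in the sample report.
RECORDS = [
    {"filename": "docs/a.pdf", "filesize": 2048, "md5": "11aa",
     "matches": [{"ns": "pronom", "id": "fmt/18"}]},
    {"filename": "docs/copy-of-a.pdf", "filesize": 2048, "md5": "11aa",
     "matches": [{"ns": "pronom", "id": "fmt/18"}]},
    {"filename": "bin/sf", "filesize": 10024103,
     "md5": "b08e809832955674c801559f7a9adf17",
     "matches": [{"ns": "pronom", "id": "fmt/690"}]},
    {"filename": "misc/mystery.dat", "filesize": 512, "md5": "22bb",
     "matches": [{"ns": "pronom", "id": UNKNOWN}]},
    {"filename": "misc/empty.txt", "filesize": 0,
     "md5": "d41d8cd98f00b204e9800998ecf8427e",
     "matches": [{"ns": "pronom", "id": UNKNOWN}]},
]


def summarise(records):
    """Collect format counts, unidentified files, zero-byte files and duplicates."""
    format_counts = Counter()
    unidentified = []
    zero_byte = []
    by_checksum = defaultdict(list)

    for record in records:
        # Keep only the matches that carry a real identifier.
        ids = [m["id"] for m in record["matches"] if m["id"] != UNKNOWN]
        if ids:
            format_counts.update(ids)
        else:
            unidentified.append(record["filename"])

        if record["filesize"] == 0:
            zero_byte.append(record["filename"])

        by_checksum[record["md5"]].append(record["filename"])

    # A checksum shared by more than one file suggests duplicate content.
    duplicates = {
        checksum: names
        for checksum, names in by_checksum.items()
        if len(names) > 1
    }
    return {
        "format_counts": dict(format_counts),
        "unidentified": unidentified,
        "zero_byte": zero_byte,
        "duplicates": duplicates,
    }


if __name__ == "__main__":
    for key, value in summarise(RECORDS).items():
        print(key, value)
```

Note that every zero-byte file shares the same checksum, so on a real collection it may be worth reporting those separately from other duplicates.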
With consistent, API-like ways of viewing this data, one can document a collection in considerable detail, draw connections to other collections, and identify the new work programs needed to maintain a digital collection over the long term.
I have been working to extract this information from file format identification reports in a consistent, repeatable way with my tools Demystify and Demystify-Lite.
A file format identification report is an important artifact that often exists at the beginning of a digital transfer process and may then be recreated a number of times as the collection is processed. My paper goes into a lot more detail about how you might use the information in it and looks at some of the other tools out there that are already trying to do that.
The paper received a lot of positive comments at the time of publishing. Let me know what you think about it and if you have other ideas about how you might leverage format identification reports in your day-to-day work.
Read more in Code4Lib issue 53: https://journal.code4lib.org/articles/16351
Epilogue
Since the publication of this paper I have written a new tool that uses the checksums in a file format identification report to create a checksum for a “folder” or directory, where checksums do not normally exist. The tool is called Sumfolder1 and I will introduce it in more detail in a later blog post.
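To give a feel for the general idea, here is a minimal sketch of deriving a directory checksum from the file checksums already present in a report. It is not Sumfolder1's algorithm, which this post does not describe. The function name `folder_checksums`, the example paths, and the rule used (hash the sorted checksums of a directory's direct children, working from the deepest directories upwards) are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping of file path to checksum, as might be read from a report.
EXAMPLE = {
    "a/x.txt": "11aa",
    "a/y.txt": "22bb",
    "b/other.txt": "11aa",
    "b/renamed.txt": "22bb",
    "c/z.txt": "33cc",
}


def folder_checksums(file_checksums):
    """Return a digest for every directory implied by the file paths.

    Illustrative rule only: a directory's digest is the MD5 of the sorted
    digests of its direct children (files and subdirectories).
    """
    children = defaultdict(set)  # directory -> paths of its direct children
    for path in file_checksums:
        node = PurePosixPath(path)
        # Walk upwards, registering each node under its parent directory.
        for parent in node.parents:
            children[str(parent)].add(str(node))
            node = parent

    digests = dict(file_checksums)
    # Process the deepest directories first so that every child digest
    # exists before its parent's digest is computed.
    by_depth = sorted(
        children,
        key=lambda d: len(PurePosixPath(d).parts),
        reverse=True,
    )
    for directory in by_depth:
        combined = "".join(sorted(digests[c] for c in children[directory]))
        digests[directory] = hashlib.md5(combined.encode("ascii")).hexdigest()

    return {directory: digests[directory] for directory in children}


if __name__ == "__main__":
    for directory, digest in sorted(folder_checksums(EXAMPLE).items()):
        print(directory, digest)
```

Because this rule ignores names, directories `a` and `b` in the example receive the same digest: they hold the same content under different file names. Including names in the hash would be the alternative if renames should count as changes.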