As the story goes, I got home last Thursday, 14 November 2013, with a tweet from the previous day still on my mind:

Will be easier with Linked Registries…

Yes. A lot will be easier with Linked Registries… But it has been a year and two months since I departed The National Archives, UK; a little longer since I blogged about Linky, The Linked Data Hedgehog; and, looking back on the progress outlined on The National Archives: Labs Pages, two years, give or take, since I, along with a good colleague, helped to establish an infrastructure that might easily have become a ‘Linked Registry’… So where is it? Where are they?

Avoiding the elephant in the room, the greatest genuine achievement in the world of format registries this past year has been the Archive Team initiative: http://fileformats.archiveteam.org/

The project began in November 2012, and at the time of writing this blog, the Just Solve the File Format Problem wiki has 2,272 entries created by the community in just one year’s effort, and it’s still going strong.

But it’s not a ‘linked registry’ per se.

So, in the four days between 14 November and work beginning on 18 November, I started with a blank page in a text editor and a fresh code base, and wrote a linked registry:

the-fr.org :: The Format Registry

Four days.

Wrote it…
Open-sourced it: https://github.com/exponential-decay/the-format-registry
Released it: http://the-fr.org/

There’ll be more content on this blog in the coming days/weeks looking at the architecture and what the site does and what it can provide users.

Because I want to polish this work and make it sustainable and persistent, the disclaimer on the site reads thusly:

While this notice is up the persistence of the links cannot be guaranteed. There are mechanisms still to be built which will provide guarantee. At present the application reads the data from source, and maps it, but cannot guarantee ‘subject’ mapping. Thus far the work has focused on export, vocabulary, and link redirection.

So, given the other efforts, and given this is another project that ‘isn’t quite there’ why should you still care?

Well:

1. It’s open source and ready for you to play with the code and contribute if you like. Or just download it and host something for yourself!

But:

2. The introductory paragraphs to the project outline this as a challenge: what can actually be delivered in short order, without the burden of the frameworks that one (or others) may often have to work within?

Welcome to The Format Registry: A linked data file format registry.

The work is the result of a four-day hack during November 2013. Its goal is to challenge the status quo, to influence the rapid development of further format registries and other linked open data initiatives within the digital preservation community.

The focus of this project will be on the data and the augmenting of what is currently available.

3. Your comments matter… up to a point. I’m going to call for comments on the work (hint: this is that call): vocabulary, infrastructure, further requirements. Otherwise I’ll pretty much keep developing and hope it satisfies requirements. Ideally I’d like to actually satisfy requirements noted here, via Twitter, or by email. I’ll have to keep developing anyway, so prompt commentary is appreciated.

What else do you need to know?

Data: Currently the site maps the PRONOM dataset, which is made available by The National Archives under the Open Government Licence (OGL). The plan is to look at augmenting this work with UDFR and the Archive Team registry information where possible.

Mapping: Only a subset of PRONOM is mapped. It will probably remain a subset when it is fully curated. At the time of writing, the site uses 15 different predicates/properties.

Linking: The site is minimally linked. That is, there are a handful of seeAlso links where appropriate, e.g. http://the-fr.org/prop/format-registry/puid.

Hosting: I am currently piggybacking on my regular hosting service with https://krystal.co.uk/. The domain is paid up for two years, so links, once static, can persist for that period of time at least.

TODO: A non-exhaustive list of immediate priorities is kept on the project’s GitHub page.

Other than that, welcome to the-fr.org. Take a look at the intro page. Navigate around a bit. Let me know what you think.

2 thoughts on “the-fr.org :: The Format Registry”

Andy Jackson says:

2013-11-19 at 21:30

This looks great – well done. Some questions:

1. This reminds me of the P2 registry (http://p2-registry.ecs.soton.ac.uk/). How would you compare/contrast them?

2. What’s the scope? Do you envisage aggregating and comparing various data sources and seeing how they change over time? Or do you see this as an ‘editor’ interface rather than a ‘dashboard’? If you are also adding/editing information through this system, how do you plan on keeping everything synchronised over time, e.g. when PRONOM updates?

3. There is some structured data in the File Formats wiki – how do you need it to be modified and exposed in order to make use of, say, the known file extensions?

Ross Spencer says:
  
  2013-11-25 at 11:12
  
  Thanks Andy. Much appreciated. Hopefully there will be a few more blog posts in which I can go into more detail about this work. As for your questions:
  
  1. This, as with the work I was involved in at The National Archives, owes P2 an acknowledgement as a demonstration of how we can mark up PRONOM data and package it to be more useful to the digital preservation community. Dave Tarrant, who worked on P2, was central to helping shape some of the early work at TNA. As with any registry using PRONOM as a seed, it will always look a bit like P2. The hope is to keep the links active, which doesn’t always seem to be the case at soton.ac.uk, and to keep the records up to date, with new records created when new PUIDs are. I hope to be able to add more sources of information and add functionality as appropriate too.
  
  2. To add to 1, I think ‘dashboard’ is the only model I have the ability to sustain. How this will work in practice I will find out, but there are some reasonable sources to add from immediately, as you identified in 3.
  
  3. I’ve a few bits and pieces to get into place. Since your comment I’ve created the beginnings of an API for the registry. I also need to handle the output of properties belonging to the various record types via XSL a little better. I need to add a handful of other useful properties from PRONOM too. Crucially, I need to write unit tests and then make the registry’s URIs static and persistent. Once the foundations are completely in place I will look at the enrichment of data and then see what can be done by various resources such as the File Formats wiki to make them more useful to efforts like this.
  
  As an initial thought about the fileformats.archiveteam.org wiki specifically: we’re looking for the table with class “infobox formatinfo”, which is variably sized and currently sits at line 60, column 82 of the format source pages. Depending on the assumptions we make about its positioning, we’d probably need a flexible search for that box and a small amount of effort to parse it. It shouldn’t be too difficult, and there’ll probably be enrichment code specific to each new data source. The easier sources to handle *should* be the ones which are already structured and, even better, those not surrounded by style information and other application-based data such as JavaScript.
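To make the ‘flexible search’ idea a little more concrete, here is a minimal sketch in Python, using only the standard library. It looks for the first table whose class list includes both “infobox” and “formatinfo”, wherever it sits on the page, and reads its two-cell rows as label/value pairs. The names InfoboxParser and parse_infobox are my own, the SAMPLE markup is invented for illustration rather than copied from the wiki, and none of this is code from the registry itself.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect rows from the first table whose class list includes both
    'infobox' and 'formatinfo', regardless of its position in the page."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # table nesting depth; 0 means outside the infobox
        self.done = False  # set once the infobox has been closed
        self.row = None    # cells of the row currently being read
        self.cell = None   # text fragments of the cell currently being read
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            classes = set((dict(attrs).get("class") or "").split())
            if self.depth:
                self.depth += 1  # a table nested inside the infobox
            elif {"infobox", "formatinfo"} <= classes:
                self.depth = 1
        elif self.depth == 1 and tag == "tr":
            self.row = []
        elif self.depth == 1 and tag in ("th", "td") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "table":
            self.depth -= 1
            self.done = self.depth == 0
        elif self.depth == 1 and tag in ("th", "td") and self.cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif self.depth == 1 and tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def parse_infobox(html):
    """Return the infobox's two-cell rows as a {label: value} dict."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


# Invented markup, standing in for a format page on the wiki.
SAMPLE = """
<html><head><style>.x{color:red}</style></head><body>
<p>Intro</p>
<table class="infobox formatinfo" border="0">
<tr><th colspan="2">File Format</th></tr>
<tr><td>Name</td><td>PNG</td></tr>
<tr><td>Extension(s)</td><td><code>.png</code></td></tr>
</table>
<table class="other"><tr><td>Name</td><td>ignored</td></tr></table>
</body></html>
"""

if __name__ == "__main__":
    print(parse_infobox(SAMPLE))
```

Matching on the class names rather than on a line and column means the search should survive the box moving or changing size. Anything beyond simple label/value rows (nested tables, multi-valued cells, links) would need the source-specific enrichment code mentioned above.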
