Architecture of The-FR.org
Last week I blogged about the publication of a new linked data format registry based on the work I did previously at The National Archives, UK.
Where the work goes from here, we will have to see. Open sourcing it was an important goal of the short sprint: partly because I hope it demonstrates an architecture that can be adopted for a similar registry, and partly because it provides a code base that can be adapted for other linked open data projects. This blog provides an overview of that architecture…
First, I must acknowledge the Puelia Linked Open Data API that was released by Talis in the UK. I’ve used this API previously, and part of my understanding of how it works has translated across into this work. It’s a deceptively simple model, and some of its features may be recognisable here. I have, however, created my own platform from scratch; for various reasons that was a more desirable approach for what I wanted to achieve.
Export of data from PRONOM…
The basic seed for the registry, as with a handful of others in recent years (P2 and UDFR, to name two), is the baseline PRONOM dataset. The data is exposed as XML and can be scraped by accessing the following URLs, using an incremental numbering scheme:
- http://www.nationalarchives.gov.uk/PRONOM/fmt/{no}
- http://www.nationalarchives.gov.uk/PRONOM/x-fmt/{no}
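To illustrate the scraping step, here is a minimal Python sketch; the PUID ranges and output filenames are illustrative placeholders, not the real bounds of the registry:

```python
import urllib.error
import urllib.request

BASE = "http://www.nationalarchives.gov.uk/PRONOM/{ns}/{no}"

def scrape(namespace, start, stop):
    """Fetch the PRONOM XML record for each PUID in the given range."""
    for no in range(start, stop + 1):
        url = BASE.format(ns=namespace, no=no)
        try:
            with urllib.request.urlopen(url) as response:
                xml = response.read()
        except urllib.error.HTTPError:
            continue  # no record at this number; move on
        with open(f"pronom-{namespace}-{no}.xml", "wb") as out:
            out.write(xml)

scrape("fmt", 1, 10)    # hypothetical range; the registry goes far higher
scrape("x-fmt", 1, 10)
```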
Archiving the PRONOM dataset
Decisions made about any record in a registry warrant the maintenance of legacy data as an audit trail: should anything change, the discrepancy can be located by looking at previous versions of the stored data. The PRONOM export is therefore kept in several locations on the server, for different current and potential uses. The raw XML is maintained in private, timestamped folders; it simply costs less bandwidth to offer the public zipped versions of the same data. These snapshots could also be used in future to add legacy triples to the already available linked data. The most up-to-date PRONOM data is additionally copied to a separate folder, from which triples are created without the need for extra access logic.
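A minimal sketch of that layout in Python; the folder names are assumptions for illustration, not the server’s real paths:

```python
import shutil
import time
from pathlib import Path

EXPORT = Path("export")                      # freshly scraped XML
ARCHIVE = Path("archive") / time.strftime("%Y%m%dT%H%M%S")
CURRENT = Path("current")                    # folder the triples are built from

# Keep a private, timestamped snapshot as the audit trail.
shutil.copytree(EXPORT, ARCHIVE)

# Zip the snapshot so the public copy costs less bandwidth to serve.
shutil.make_archive(str(ARCHIVE), "zip", EXPORT)

# Refresh the folder that the conversion step reads from.
if CURRENT.exists():
    shutil.rmtree(CURRENT)
shutil.copytree(EXPORT, CURRENT)
```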
XML is converted to triples…
We search the current PRONOM data directory for any XML within and read each file individually, matching the nodes we’re interested in against a mapping routine to create triples. A URI is minted for each PUID in the PRONOM dataset, and each matched node is given its correct predicate and value.
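A simplified sketch of the conversion; the element names, predicates, and base URI here are placeholders rather than the registry’s actual mapping:

```python
from pathlib import Path
from xml.etree import ElementTree as ET

# Hypothetical element-to-predicate mapping; the real routine covers many
# more PRONOM nodes, and these predicates are placeholders.
MAPPING = {
    "FormatName": "<http://www.w3.org/2000/01/rdf-schema#label>",
    "FormatVersion": "<http://example.org/def/version>",
}

with open("pronom.nt", "w", encoding="utf-8") as out:
    for xml_file in Path("current").glob("*.xml"):
        tree = ET.parse(xml_file)
        # 'FormatID' is an assumed node name for the record's PUID.
        puid = tree.findtext(".//FormatID")
        subject = f"<http://example.org/id/file-format/{puid}>"
        for node, predicate in MAPPING.items():
            value = tree.findtext(f".//{node}")
            if value:
                literal = value.replace("\\", "\\\\").replace('"', '\\"')
                out.write(f'{subject} {predicate} "{literal}" .\n')
```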
Triples loaded to the Triplestore
The mapping process results in an N-Triples file (*.nt), which is uploaded to an ARC2 triplestore. At this point, as we have a SPARQL endpoint, we have a foundation that anyone wishing to explore the dataset can exploit, and on which we can build a web GUI to provide linked open data.
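Loading can be as simple as a SPARQL+ LOAD against the endpoint, assuming ARC2 has been configured to permit write operations; the endpoint and dump URLs below are placeholders:

```python
import urllib.parse
import urllib.request

# Assumed endpoint location; ARC2 must be configured to allow the
# SPARQL+ write operations (such as LOAD) its endpoint optionally supports.
ENDPOINT = "http://example.org/sparql"

query = "LOAD <http://example.org/dumps/pronom.nt>"
data = urllib.parse.urlencode({"query": query}).encode()
with urllib.request.urlopen(ENDPOINT, data=data) as response:
    print(response.read().decode())
```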
ASK
We’re interested in providing an HTML representation and a machine-readable representation for any of the URIs we talk about in our triplestore. When a user requests a URI, it needs to dereference to one of those forms. We begin by checking that there is data for the URI, using an ASK query.
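A sketch of the check, assuming a standard SPARQL JSON results response; the endpoint URL is a placeholder:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/sparql"  # assumed endpoint location

def has_data(uri: str) -> bool:
    """True if the triplestore holds at least one triple about the URI."""
    query = f"ASK {{ <{uri}> ?p ?o . }}"
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("boolean", False)
```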
DESCRIBE
If the ASK query returns ‘true’, we go back to the triplestore and perform a DESCRIBE query.
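Sketched in the same style, with the endpoint URL again a placeholder:

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/sparql"  # assumed endpoint location

def describe(uri: str) -> bytes:
    """Fetch everything the store knows about a URI, as RDF/XML."""
    params = urllib.parse.urlencode({"query": f"DESCRIBE <{uri}>"})
    request = urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={"Accept": "application/rdf+xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```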
Transform RDF/XML to Markdown via XSL and output to HTML
The DESCRIBE query returns RDF/XML, which is transformed into Markdown text via an XSL stylesheet and an XSLT engine.
The Markdown is then wrapped in HTML and presented back to the user.
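A sketch of this step using lxml as the XSLT engine; the stylesheet and input filenames are placeholders, and the Markdown renderer is an assumed stand-in for whatever the handler actually uses:

```python
import markdown         # third-party Markdown renderer, assumed here
from lxml import etree  # third-party XSLT engine

# "rdf2md.xsl" is a placeholder name for the stylesheet that flattens
# the RDF/XML into Markdown text.
transform = etree.XSLT(etree.parse("rdf2md.xsl"))
md_text = str(transform(etree.parse("describe-result.rdf")))

# Render the Markdown and wrap it in an HTML shell for the browser.
html = f"<!DOCTYPE html><html><body>{markdown.markdown(md_text)}</body></html>"
```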
That is the basic workflow for the publication of HTML about the data in our triplestore. If a user requests data via a /data/ URI, the request handler follows the same basic steps, but instead of translating the response to Markdown and HTML, it asks the SPARQL endpoint for a specific representation, e.g. RDF, JSON, or TSV. The response header is set accordingly to allow users to download the data with the appropriate MIME type.
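A sketch of that branch; the extension-to-MIME mapping is illustrative and the endpoint URL is a placeholder:

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://example.org/sparql"  # assumed endpoint location

# Illustrative extension-to-MIME mapping; the live handler may support a
# different set of serialisations.
MIME_TYPES = {
    "rdf": "application/rdf+xml",
    "json": "application/rdf+json",
    "tsv": "text/tab-separated-values",
}

def data_response(uri: str, extension: str):
    """Fetch the requested serialisation plus the MIME type to serve it with."""
    mime = MIME_TYPES[extension]
    params = urllib.parse.urlencode({"query": f"DESCRIBE <{uri}>"})
    request = urllib.request.Request(
        f"{ENDPOINT}?{params}", headers={"Accept": mime}
    )
    with urllib.request.urlopen(request) as response:
        # The caller sets this MIME type on its own Content-Type header.
        return response.read(), mime
```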
.htaccess
The key to making this work is the .htaccess file, which rewrites certain URIs and routes requests arriving at the Apache HTTP server through to the appropriate response handlers. There are three primary URIs of interest:
- /id/file-format/{no}
- /doc/file-format/{no}
- /data/file-format/{no}
‘id’ URIs provide a unique name for our file formats. The ‘doc’ and ‘data’ URIs return HTML and data, respectively. The type of data returned by the ‘data’ URIs is determined by the file extension attached to the request.
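A sketch of the idea in mod_rewrite terms; the handler script and its parameters are hypothetical, and the 303 redirect is the common linked-data pattern rather than a detail confirmed in this post:

```apacheconf
# A sketch of the rewrite rules, not the project's actual .htaccess.
RewriteEngine On

# 'id' URIs name the format itself; 303-redirect them to the document.
RewriteRule ^id/file-format/([0-9]+)$ /doc/file-format/$1 [R=303,L]

# Route document and data requests to the response handlers
# (handler script name and parameters are hypothetical).
RewriteRule ^doc/file-format/([0-9]+)$ /handler.php?type=doc&id=$1 [L]
RewriteRule ^data/file-format/([0-9]+)\.([a-z]+)$ /handler.php?type=data&id=$1&ext=$2 [L]
```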
There are also URIs for our classes and properties.
There are API URIs being developed to provide alternative routes into the data stored in the triplestore. These are described in the API documentation.
And that’s the architecture of the project thus far. If you have any comments or questions, feel free to ask below. Hopefully it makes some sense and hopefully the source code is useful to anyone embarking on a similar project.
Finally, in summary, here is all that information as a short animated GIF: