A homage by Will Burrows to the Ricardian Socialism motto: "The boss needs you, you don't need them -- Labor is entitled to all it creates". The image shows Quark, O'Brien, and Leeta discussing this while sharing a drink in the bar. The poster pays homage to a poster from circa 1968 bearing the same phrase.

Maintenance begins at creation, so why are we not creating better?

The beats are the same. You work for government or academia (let’s face it, that’s probably where 90% of the work is) and you have a deliverable: you save it; you print to PDF; you store it on an institutional repository (or Zenodo, OSF, or equivalent) with some metadata; and it’s done.

There’s a small chance that it’s FAIR (Findable, Accessible, Interoperable, Reusable), right? It has metadata that can be discovered by an audience looking for it and can be indexed by search engines. The data is potentially accessible if published correctly. But PDFs are not particularly interoperable or easily converted, and they aren’t really designed for reuse, even if tools like Apache Tika help ease the burden of extracting artifacts. It’s just a PDF, so why are we even talking about FAIR? There begins a story…

The beats are the same, and yet we work in digital preservation. Our backgrounds are in GLAM or software, so why do we want to shoot ourselves in the foot? Why are we not using our skills to create better?

Records are information, information is data, and data has an amazing property: it can be reused, remixed, manipulated, modeled, remodeled, and represented.

There are at least two information models that reflect these properties:

Records continuum (link)

  • Create
  • Capture
  • Organize
  • Pluralize

Research data life-cycle (link)

  • Planning
  • Propose
  • Collect, store, document
  • Preparation/analysis
  • Publishing
  • Preservation
  • Reuse
  • Planning

With the right licensing and a flexible perspective, every record we create can be looked at as data.

Yet, as I look around the world of digital preservation today, I see so many different outputs that we are happy to just publish as a PDF.

Image of a bird in homage to Portlandia's "Put a bird on it". The bird is in silhouette with the text "Put it in a PDF"

I raised a similar point about the PREMIS standard: why is this not a living website, maintained under source control and available to receive proposals and corrections from anyone (verified by the correct governance)?

Recent project outputs I have seen include results of questionnaires, interviews, guidance, and non-linear, non-static lists of requirements, all distributed by PDF and hosted on different academic or research repositories. Similarly, tables, timelines, charts, and graphic-heavy reports are rendered static inside the IT world’s version of carbonite.

I don’t even think we really like PDF… and yet…

Create to maintain

I have touched upon create-to-maintain once or twice. I learned about it from Anna M. at Archives New Zealand, and it stuck with me. It is another aspect of front-loading our efforts in digital preservation, i.e. designing it into our processes instead of being reactive, and giving records creators the information they need to create better records, be it when scanning documents at an MFD (Multi-Function Device), selecting a desktop publishing format, or selecting a document format in their local office productivity suite.

I took some of these ideas to a recent project meeting about some things I could bring to the table. Preparing for that I was reminded about the work done by Mike Morrison and the companies they work for on modern open interactive scientific publishing workflows.

And similarly a video a few years back from Tantacrul on music notation and MuseScore.

These are important videos to reflect on because, in Mike’s, the “new” formats we might be using seem to introduce complexity. In reality, however, both touch upon an important point: “representation.” Given the correct underlying “model” or view, you can rework information to suit your needs. In Mike’s examples, complex interactive storytelling can be replayed through a traditional two-column science format if you choose. With MusicXML, which MuseScore supports, it is possible to display music using many different notation schemes to suit your preferences; music notation has been in slow evolution for 600 years.

What digital substrate could we be using for the different categories of digital record out there? How can we take what we know about digital preservation and instead of restricting ourselves to one format, embrace plurality to enable the creation of rich, flexible, preservable records?

Digital substrate

It might be important to identify record types and purposes first and then our medium.

Record type

When we are creating records do we have a sense of what they will be used for?

  • Do the outputs have a data-like quality about them, e.g. timelines and survey results?
  • Is it a living or evolving resource, e.g. guidance or a tutorial?
  • Do the outputs have the potential to be teased apart and analysed?
  • Can they be reordered or remixed?
  • Would they benefit from being reordered or remixed, e.g. by providing filtering or sort functions?
  • Can the entirety of the content be reused?
  • Are there sub-artifacts that can be reused, e.g. high-quality images, charts, or diagrams?
  • Are the records designed for longevity?
  • Are they designed for impact across a community or communities?
  • Do we want to promote translation into other languages?
  • Is it likely, or preferable, that records are corrected, or updated, and potentially versioned over time?

If we answer yes to any of these questions, then maybe there’s a better medium than PDF.

A medium better than PDF?

For our foundational medium, some options are more obvious than others. I guess a lot will be web-based. Open access journals have been web-based for the longest time.

But the web also represents a field where a lot of effort has gone into creating technologies for layout that are simple, or complex, or even skeuomorphic, depending on how “real” you want something to look. Underlying web technologies promote semantic markup. You could probably still find technologies like XHTML that separate content entirely from layout.

Google-style online documents are far from perfect, but they do promote versioning and collaborative editing. As a drafting mechanism, you can also extract your content to Markdown through copy and paste, or by saving as Markdown.

Markdown, which is optionally web-based, is a popular foundation for static websites as it is easy to write and can include top-level metadata (front-matter) that is structured and easily converted by rendering tools to something useful, or simply crawled to be better understood.
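As a minimal sketch of why front-matter is easy for tools to consume, the Python below splits a Markdown file into its metadata and its body. It assumes a flat `key: value` block delimited by `---` lines (static site builders typically hand this to a full YAML parser instead), and the function name, field names, and sample document are all hypothetical.

```python
def split_front_matter(text):
    """Split a Markdown document into (metadata, body).

    Assumes flat `key: value` front-matter between two `---` lines;
    nested YAML is deliberately out of scope for this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front-matter present
    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            body = "\n".join(lines[index + 1:])
            return metadata, body.lstrip("\n")
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    return {}, text  # unterminated block: treat as plain Markdown


# Hypothetical document, for illustration only.
SAMPLE = """---
title: Example guidance
lang: en
version: 1.0
---
# Example guidance

Body text goes here.
"""

metadata, body = split_front_matter(SAMPLE)
print(metadata["title"], metadata["version"])
```

Because the metadata sits in the same plain-text file as the content, a crawler, a renderer, or a preservation workflow can each read it without specialist tooling.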

Markdown-like formats can be easily checked into source control, and new resources built from there, e.g. using Jekyll or other static website builders. Source control too can be part of the medium, with tools like Git, the common, portable component of forges like GitHub or Codeberg, enabling (mostly) append-only logs of authorship, tracked changes, transparent version control, and so on, thus enabling automation of document slips and version histories in our more traditional static reporting.
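To illustrate how a version history might be automated from source control, here is a rough Python sketch that turns `git log` output into a revision table. The `%h`, `%ad`, `%an`, and `%s` placeholders and the `--date=short` option are standard Git; the function names, the `|` separator, and the sample log are assumptions made for this example, and an author name or date containing `|` would break the naive split.

```python
import subprocess

# Standard git placeholders: %h short hash, %ad author date,
# %an author name, %s subject line.
LOG_FORMAT = "%h|%ad|%an|%s"


def read_git_log(path):
    """Return the raw history of one file (needs git and a repository)."""
    result = subprocess.run(
        ["git", "log", "--date=short", f"--pretty=format:{LOG_FORMAT}", "--", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def version_history(raw_log):
    """Turn raw log lines (newest first) into rows numbered oldest first."""
    rows = []
    for line in reversed(raw_log.strip().splitlines()):
        commit, date, author, summary = line.split("|", 3)
        rows.append({"revision": len(rows) + 1, "date": date,
                     "author": author, "summary": summary, "commit": commit})
    return rows


def as_markdown_table(rows):
    """Render the rows as a Markdown table for a report's front pages."""
    out = ["| Rev | Date | Author | Summary |", "| --- | --- | --- | --- |"]
    for row in rows:
        out.append(
            f"| {row['revision']} | {row['date']} | {row['author']} | {row['summary']} |"
        )
    return "\n".join(out)


# Hypothetical log output, so the sketch runs without a repository.
SAMPLE_LOG = """b2c3d4e|2024-02-10|B. Author|Correct typo in section 2
a1b2c3d|2024-01-05|A. Author|First draft"""

history = version_history(SAMPLE_LOG)
print(as_markdown_table(history))
```

In a real repository, `version_history(read_git_log("report.md"))` would produce the same structure from the actual commit log, where `report.md` is a stand-in file name.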

Web archiving a path of least resistance

In lieu of an immediate answer, I feel the path of least resistance for a lot of our outputs is likely to be static websites built using documentation tooling that use formats like Markdown as a base.

Websites are already fairly easily crawled, and we understand the features that make that possible.

Browsers are one of our biggest lenses on the world via desktop, phone, laptop, and so on. We have well-understood ways of signposting websites, adding metadata, and making them accessible, and assessing accessibility.

We are also able to pluralize websites for different locales, different displays, or different representations.

With websites, we have access to both content and presentation data, and so they are a natural fit for archiving in future.

Other structured formats

Briefly touched upon, we also have formats like MyST markdown for better scientific publishing.

We might be inspired by formats like MusicXML enabling presentation of music using different notation.

Tools like ReSpec exist, making it easier to write technical documentation.

Wikis are of course not uncommon and provide different ways of extracting data from them.

These formats all seem to take a largely classical approach to their design, providing an interoperability layer in a data format like XML, or something approaching a data layer like semantically marked up text, and then providing or enabling different presentation layers on top.

Do you have any other particular favorite file formats that work like this?

Can we learn from these formats and potentially deliver outreach to users promoting their use as ways of lowering the barriers to preservation in the future? Can we use these to ask agencies and potential donors to create better? Can we create better using formats like these?

Creating a medium?

Does there exist a future where we, as a (digital preservation) community interested in preserving information for the longest periods of time, sit down together and create a new medium that makes our lives easier?

Maybe it’s a stretch, but the skills are there!

Look at FFV1!

NDSA levels of digital preservation

This blog began this time last week when I was doing some other writing and tried to find an easy-to-use copy of the NDSA Levels of Digital Preservation (LoP). This feels like it should be the easiest of tasks.

Yet! I find that the NDSA levels of preservation is just a JPEG on a website that links to an OSF.io repository that links to a PDF that contains the matrix – in just one language.

For other languages, you can follow a similar process to find an OSF.io repository with a similar PDF, but translated. In those cases, the version of the translation is behind the primary English resource, and usually version 2.0 of the matrix versus 2.1, presumably because no one has had time to create a new representation yet, or maybe it is because the original was encoded in one of the least ergonomic formats possible and so it just takes time to steel oneself to make it happen.

Converting the NDSA levels of digital preservation to data

I really want to have easy access to the LoP. I don’t want to just copy and paste elements of it into whatever I am writing today and repeat it again tomorrow, I have used them in the past, and I will use them again in the future. I want easy, repeatable access to this information.

So I took it upon myself to convert the matrix into data:

The data irons out some of the noise introduced by being converted to a table in PDF (largely the way lines are broken uncleanly in cells).

The data now feeds a small script/website that can take the data and currently replay it as an HTML table, or plain JSON.

This could be easily extended to support other formats like YAML, ASCII table, Markdown Table, and so on.
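The replay described above can be sketched in a few lines of Python: one data structure, several renderers. This is not the actual script behind the site. The data shape, function names, and format keys are assumptions, and the labels and cell text are placeholders, not the wording of the NDSA matrix.

```python
import json
from html import escape

# Placeholder data: labels and cell text are illustrative only and are
# NOT the wording of the NDSA Levels of Digital Preservation.
MATRIX = {
    "columns": ["Level 1", "Level 2"],
    "rows": [
        {"area": "Area A", "cells": ["A1 text", "A2 text"]},
        {"area": "Area B", "cells": ["B1 text", "B2 text"]},
    ],
}


def to_json(matrix):
    return json.dumps(matrix, indent=2, ensure_ascii=False)


def to_html(matrix):
    head = "".join(f"<th>{escape(c)}</th>" for c in matrix["columns"])
    lines = ["<table>", f"  <tr><th></th>{head}</tr>"]
    for row in matrix["rows"]:
        cells = "".join(f"<td>{escape(c)}</td>" for c in row["cells"])
        lines.append(f"  <tr><th>{escape(row['area'])}</th>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def to_markdown(matrix):
    # The first header cell is left empty above the row labels.
    lines = ["| " + " | ".join([""] + matrix["columns"]) + " |"]
    lines.append("| " + " | ".join(["---"] * (len(matrix["columns"]) + 1)) + " |")
    for row in matrix["rows"]:
        lines.append("| " + " | ".join([row["area"]] + row["cells"]) + " |")
    return "\n".join(lines)


# Adding a new output format means adding one function and one entry here.
RENDERERS = {"json": to_json, "html": to_html, "markdown": to_markdown}


def render(matrix, fmt):
    if fmt not in RENDERERS:
        raise ValueError(f"unsupported format: {fmt}")
    return RENDERERS[fmt](matrix)


print(render(MATRIX, "markdown"))
```

The design choice worth noting is that the data never changes; only the presentation layer does, which is what makes further formats cheap to add.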

Could the original have been created better?

I asked some questions earlier; could we have decided to put more effort into the delivery of the NDSA Levels of Digital Preservation?

  • We want (need!) translations of the LoP (it shouldn’t require a new PDF or signpost every time we create one).
  • We want to update the LoP over time.
  • I think we want to promote reuse, e.g. in journal articles, other digital preservation websites; it’s also licensed as CC-BY-SA 4.0.
  • We might want the original matrix displayed as-is.
  • We might want to see people present the information differently (some translations also offer a transposed version flipping the axes).
  • We want the LoP to be accessible.
  • We want the LoP to have longevity.
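Once the matrix is held as data, the transposed presentation mentioned in the list above becomes a small transformation, not a new document. A minimal Python sketch follows, assuming a hypothetical labelled-matrix shape with placeholder labels and cell values.

```python
# Placeholder matrix: labels and cells are illustrative, not NDSA text.
MATRIX = {
    "columns": ["Level 1", "Level 2", "Level 3"],
    "rows": [
        {"label": "Area A", "cells": ["A1", "A2", "A3"]},
        {"label": "Area B", "cells": ["B1", "B2", "B3"]},
    ],
}


def transpose(matrix):
    """Swap the axes of a labelled matrix without touching the source data."""
    cells = [row["cells"] for row in matrix["rows"]]
    # zip(*cells) walks the grid column by column.
    flipped = [list(column) for column in zip(*cells)]
    return {
        "columns": [row["label"] for row in matrix["rows"]],
        "rows": [
            {"label": label, "cells": column}
            for label, column in zip(matrix["columns"], flipped)
        ],
    }


flipped = transpose(MATRIX)
print(flipped["columns"])
print(flipped["rows"][0])
```

Transposing twice returns the original structure, so both views can be generated from one maintained source.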

Like my point about PREMIS, we might also want a portal to enable discussion, offer suggestions, corrections, and so on.

My new rendition of the LoP might not be perfect, but I think it approaches a better ideal:

  • data;
  • that enables a static website to be built;
  • offering multiple representations in a single place;
  • that is easy to update;
  • to which it is easy to add new translations and make them available immediately;
  • sitting on top of source control, offering transparency, trust, versioning, clear authorship, and an audit trail;
  • and offering ways of receiving new information and making it immediately available, such as links to resources referencing different versions of the LoP.

In this instance, the website might not be the easiest to preserve. I think it needs to enable a query string to encode language, and a corresponding sitemap providing access to those URLs (this is possible, as demonstrated with the permalink function in 0xffae). I can tackle this in the near future.
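One way the query-string-plus-sitemap idea could look is sketched below in Python. The base URL, the `lang` and `format` parameter names, and the function names are hypothetical and do not describe how the site or 0xffae actually work; the XML namespace is the one defined by the sitemaps.org protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def language_urls(base_url, languages, formats):
    """Build one crawlable URL per language and representation."""
    urls = []
    for lang in languages:
        for fmt in formats:
            # Hypothetical parameter names: "lang" and "format".
            query = urlencode({"lang": lang, "format": fmt})
            urls.append(f"{base_url}?{query}")
    return urls


def build_sitemap(urls):
    """Serialize the URLs as a minimal sitemap document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# example.org is a placeholder domain.
urls = language_urls("https://example.org/lop/", ["en", "de"], ["html", "json"])
sitemap = build_sitemap(urls)
print(sitemap)
```

Each language and representation then has a stable address that a web crawler can discover and capture, instead of state that only exists after a user clicks a control on the page.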

But even if the website needs an upgrade or two, the data itself is FAIR – Findable (via website, and Codeberg); Accessible (multiple translations, easy to add to, clearly delineated as data); Interoperable (can be converted to different formats via data, or web-interface); Reusable (there to be accessed and reused with the correct licensing conditions).

If we make the introduction to this blog a rule of thumb, then our PDF representation of the LoP (roughly) provides Findable, and Accessible.

Knowing what we do in digital preservation, we have the skills to complete the FAIR principles here, provide Interoperable and Reusable, and make the LoP so much more.

PDF is not a feature

It is a barrier. Digital preservation does not need to be reactive; it can be proactive. In fact, our biggest responsibility to our future selves these days is to ensure that it is built into all of our day-to-day processes. This means as much outreach as possible to our colleagues creating the digital record. Recursively—that means us as well.

The principles of FAIR come from data, are arguably built into the research data life cycle, and closely mirror aspects of the continuum model. FAIR does not need to stop at data because we’ve said data is a special use case; the principles can be used anywhere. I think that’s important for us to consider.

When we view all of our records as FAIR we might find new ways of presenting them that promote maximum accessibility, i.e. to those with permanent or temporary physical disabilities, or even serendipitously to members of the public, or the press, who might just happen to find our work and who, without having to cut through all the trimmings of what has become “formal reporting”, are able to read our work through levels of increasing complexity.

Records that are interoperable and reusable can be translated more easily, and their intent can be understood by many more people across the world. That matters especially if we consider them important and believe they are worth preserving for their anticipated long-term influence.

Our efforts (in digital preservation) are often characterized as reactive, i.e., what are we receiving? What potential preservation risks might we encounter with a collection? How will we deal with them?

We needn’t be reactive. We have the skills to be proactive. The more we ask our teams at the front of house to “create better” and the more we learn about how to create-to-maintain for ourselves, the rosier our lives look, preserving information, and enabling reuse of information down the line.

I think we owe it to ourselves to create better, but what do you think?


Interoperability in NHS England

My friend and I recently prepared an article looking at the importance of open source and interoperability standards within NHS England, which might be of interest to readers of this post.

Digital Capability, Open-Source Use, and Interoperability Standards Within the National Health Service in England: Survey of Health Care Trusts.

Searching for a signature

If you have access to the OPF recording, David, Tyler, and I presented recently on a workshop we prepared for iPRES along with Sonia Ranade of The National Archives, UK, and previously (and hopefully in future!) Francesca Mackenzie, also of The National Archives, UK.

We talk about our desire to commit to developing a curriculum for PRONOM signature development in a more maintainable way using the Carpentries model for writing tutorials.

It takes a bit more effort, but we think the results are worth pursuing and hope it demonstrates a path for others too.

The resource is available online, and on GitHub.

Resources for documentation

If we end up creating a lot more web resources then it might start looking more like documentation. A lot of effort has gone into making software documentation accessible and usable.

Diátaxis provides a guideline for separating documentation so that it is easier to access and understand.

There are a lot of simple frameworks for producing static documentation websites. I have used Docusaurus and Just The Docs in the past. Depending on the tooling you are most comfortable with, you can find an alternative. I am looking at MkDocs in the near future for a Python-based project.

Find a static site builder that suits you via the awesome-docs list.

Will Burrows

The cover art for this blog is by Will Burrows. A tribute to the Ricardian Socialism motto: “The boss needs you, you don’t need them — Labor is entitled to all it creates”.

It may be a tangent, but I extend our efforts to the taxpayers as well. When we are funded with hundreds of thousands of euros or millions of dollars to work on important, long-lasting outputs, it is on us to make that work as available and accessible to the public as possible. Let us not become elite by unintentionally placing barriers between our work and the public. Take that extra moment to deliver content paid for by taxpayers on a platform and in a way that allows us to maximize FAIR principles for everyone who might be curious.

Burrows’ print is available at https://willburrowsart.com/product/boss-needs-you-print/.
