Maintenance begins at creation, so why are we not creating better?

The beats are the same. You work for government or academia (let’s face it, that’s probably where 90% of the work is); you have a deliverable; you save it; you print to PDF; you store it in an institutional repository with some metadata (or Zenodo, OSF, or equivalent), and it’s done.

There’s a small chance that it’s FAIR (Findable, Accessible, Interoperable, Reusable), right? It has metadata that can be discovered by an audience looking for it and can be indexed by search engines. The data is potentially accessible if published correctly. But PDFs are not particularly interoperable or easily converted, and they aren’t really designed for reuse, even if tools like Apache Tika help ease the burden of extracting artifacts. It’s just a PDF, why are we even talking about FAIR? There begins a story…

The beats are the same, yet we work in digital preservation and our backgrounds are in GLAM or software. Why do we want to shoot ourselves in the foot? Why are we not using our skills to create better?

Records are information, information is data, and data has an amazing property: it can be reused, remixed, manipulated, modeled, remodeled, and represented.

There are at least two information models that reflect these properties:

Records continuum (link)

Create
Capture
Organize
Pluralize

Research data life-cycle (link)

Planning
Propose
Collect, store, document
Preparation/analysis
Publishing
Preservation
Reuse
Planning

With the correct licensing and a flexible perspective, every record we create can be looked at as data.

Yet, as I look at the world of digital preservation today, I see so many different outputs that we are happy to just publish as a PDF.

I raised a similar point about the PREMIS standard: why is this not a living website, maintained under source control and available to receive proposals and corrections from anyone (verified by the correct governance)?

Recent project outputs I have seen include the results of questionnaires, interviews, guidance, and non-linear, non-static lists of requirements, all distributed by PDF and hosted on different academic or research repositories. Similarly, tables, timelines, charts, and graphic-heavy reports are rendered static inside the IT world’s version of carbonite.

I don’t even think we really like PDF… and yet…

Create to maintain

I have touched upon create-to-maintain once or twice. I learned about it from Anna M. at Archives New Zealand, and it stuck with me. It is another aspect of front-loading our efforts in digital preservation, i.e. designing it into our processes instead of being reactive, and giving records creators the information they need to create better records, be it when scanning documents at an MFD (Multi-Function Device), selecting a desktop publishing format, or selecting appropriate document formats in their local office productivity suite.

I took some of these ideas to a recent project meeting about some things I could bring to the table. Preparing for that, I was reminded of the work done by Mike Morrison and the companies they work for on modern, open, interactive scientific publishing workflows.

Similarly, I was reminded of a video from a few years back by Tantacrul on music notation and MuseScore.

These are important videos to reflect on because in Mike’s, there seems to be the introduction of complexity to the “new” formats that we might be using. However, in reality, both touch upon an important point: “representation.” Given the correct underlying “model” or view, you can rework information to suit your needs. In Mike’s examples, complex interactive storytelling can be replayed through a traditional two-column science format if you choose. In MuseScore’s MusicXML, it is possible to display music using many different notation schemes to suit your preferences—music notation has been in slow evolution for 600 years.

What digital substrate could we be using for the different categories of digital record out there? How can we take what we know about digital preservation and, instead of restricting ourselves to one format, embrace plurality to enable the creation of rich, flexible, preservable records?

Digital substrate

It might be important to identify record types and purposes first and then our medium.

Record type

When we are creating records do we have a sense of what they will be used for?

Do the outputs have a data-like quality about them? E.g. timelines and survey results?
Is it a living or evolving resource? e.g. guidance, or a tutorial?
Do the outputs have the potential to be teased apart and analysed?
Can they be re-ordered? or remixed?
Would they benefit from being reordered or remixed? e.g. providing filtering, or sort functions?
Can the entirety of the content be reused?
Are there sub-artifacts that can be reused? e.g. High-quality images, charts, or diagrams?
Are the records designed for longevity?
Are they designed for impact across a community or communities?
Do we want to promote translation into other languages?
Is it likely, or preferable, that records are corrected, or updated, and potentially versioned over time?

If we answer yes to any of these questions, then maybe there’s a better medium than PDF.

A medium better than PDF?

As for our foundational medium, some options are more obvious than others. I guess a lot will be web-based. Open access journals have been web-based for the longest time.

But the web also represents a field where a lot of effort has gone into creating technologies for layout that are simple, or complex, or even skeuomorphic, depending on how “real” you want something to look. Underlying web technologies promote semantic markup. You could probably still find technologies like XHTML that separate content entirely from layout.

Google-style online documents are far from perfect but they do promote versioning and collaborative editing. As a drafting mechanism, they also let you extract your content to Markdown, either through copy and paste or by saving as Markdown.

Markdown, which is optionally web-based, is a popular foundation for static websites as it is easy to write and can include top-level metadata (front-matter) that is structured and easily converted by rendering tools to something useful, or simply crawled to be better understood.
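To make the front-matter idea concrete, here is a minimal sketch of how a tool might separate metadata from body text. It assumes a flat `key: value` block between two `---` lines; real front-matter is usually YAML and would need a proper YAML parser. The function name `split_front_matter` and the example document are hypothetical.

```python
def split_front_matter(text):
    """Split a Markdown document into (metadata dict, body string).

    Assumption: front-matter is a flat 'key: value' block delimited by
    '---' lines. Nested YAML is deliberately not handled in this sketch.
    """
    lines = text.splitlines()
    # Front-matter must open on the very first line.
    if not lines or lines[0].strip() != "---":
        return {}, text
    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the document body.
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return metadata, body
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text


# Hypothetical example document, not taken from any real resource.
EXAMPLE = """---
title: Example guidance note
language: en
version: 1.0
---
# Heading

Body text.
"""

metadata, body = split_front_matter(EXAMPLE)
print(metadata)
print(body)
```

Because the metadata comes out as plain key and value pairs, the same file can feed a site generator, a catalogue record, or a crawler without anyone re-keying it.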

Markdown-like formats can be easily checked into source control, and new resources built from there, e.g. using Jekyll or other static site builders. Source control too can be part of the medium, with tools like Git, the common, portable component of forges like GitHub or Codeberg, enabling (mostly) append-only logs of authorship, tracked changes, transparent version control, and so on – thus enabling automation of document slips and version histories in our more traditional static reporting.
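As a rough illustration of automating a version history from source control, the sketch below turns `git log` output for a single file into a Markdown table. It assumes Git is installed and the file lives in a repository. The function names, the choice of columns, and the sample log line are my own assumptions and not part of any existing tool.

```python
import subprocess

# The ASCII unit separator keeps fields apart even when a commit
# subject contains spaces, tabs, or pipes.
FIELD_SEP = "\x1f"
LOG_FORMAT = FIELD_SEP.join(["%h", "%ad", "%an", "%s"])


def read_git_log(path):
    """Return raw `git log` output for one file (needs git and a repository)."""
    result = subprocess.run(
        ["git", "log", "--date=short", f"--pretty=format:{LOG_FORMAT}", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def version_history(raw_log):
    """Format raw log output as a Markdown table, in the order git returned it."""
    rows = ["| Revision | Date | Author | Change |", "| --- | --- | --- | --- |"]
    for line in raw_log.splitlines():
        if not line.strip():
            continue
        commit, date, author, subject = line.split(FIELD_SEP, 3)
        # Escape pipes so a subject cannot break the table layout.
        subject = subject.replace("|", "\\|")
        rows.append(f"| {commit} | {date} | {author} | {subject} |")
    return "\n".join(rows)


# Hypothetical log line standing in for real repository output.
SAMPLE_LOG = FIELD_SEP.join(["abc1234", "2026-01-01", "Example Author", "Correct a typo"])
print(version_history(SAMPLE_LOG))
```

In a real build, `version_history(read_git_log("report.md"))` could be run at publish time so the document slip is never written by hand; `report.md` is a placeholder file name.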

Web archiving: a path of least resistance

In lieu of an immediate answer, I feel the path of least resistance for a lot of our outputs is likely to be static websites built using documentation tooling that use formats like Markdown as a base.

Websites are already fairly easily crawled, and we understand the features that make that possible.

Browsers are one of our biggest lenses on the world via desktop, phone, laptop, and so on. We have well-understood ways of signposting websites, adding metadata, and making them accessible, and assessing accessibility.

We are also able to pluralize websites for different locales, different displays, or different representations.

We have access to both content and presentation data, and so websites are a natural fit for archiving in future.

Other structured formats

Briefly touched upon, we also have formats like MyST markdown for better scientific publishing.

We might be inspired by formats like MusicXML enabling presentation of music using different notation.

Tools like ReSpec exist, making it easier to write technical documentation.

Wikis are of course not uncommon and provide different ways of extracting data from them.

These formats all seem to take a largely classical approach to their design, providing an interoperability layer in a data format like XML, or something approaching a data layer, like semantically marked up text, and then providing or enabling different presentation layers on top.

Do you have any favorite file formats that work like this?

Can we learn from these formats and potentially deliver outreach to users promoting their use as ways of lowering the barriers to preservation in the future? Can we use these to ask agencies and potential donors to “create better”? Can we create better using formats like these?

Creating a medium?

Does there exist a future where, as a (digital preservation) community interested in preserving information for the longest periods of time, we sit down together and create a new medium that makes our lives easier?

Maybe it’s a stretch, but the skills are there!

Look at FFV1!

NDSA Levels of Digital Preservation

This blog began this time last week when I was doing some other writing and tried to find an easy-to-use copy of the NDSA Levels of Digital Preservation (LoP). This feels like it should be the easiest of tasks.

Yet! I find that the NDSA Levels of Digital Preservation is just a JPEG on a website that links to an OSF.io repository that links to a PDF that contains the matrix – in just one language.

For other languages, you can follow a similar process to find an OSF.io repository with a translated version of the PDF. In translated instances, the document version is usually behind the English resource, for example, version 2.0 of the matrix versus 2.1; presumably because no one has had time to create a new representation yet, or maybe it is because the original was encoded in one of the least ergonomic formats possible, so it just takes time to steel oneself to make it happen.

Converting the NDSA Levels of Digital Preservation to data

I really want to have easy access to the LoP. I don’t want to just copy and paste elements of it into whatever I am writing today and repeat it again tomorrow. I have used them in the past, and I will use them again in the future. I want easy, repeatable access to this information.

So I took it upon myself to convert the matrix into data:

https://codeberg.org/ross-spencer/converter.ndsa/src/branch/pages/data/ndsa.json

The data irons out some of the noise introduced by being converted to a table in PDF (largely the way lines are broken uncleanly in cells making them difficult to copy and paste without having to splice them back together).

The data now feeds a small script/website that can take the data and currently replay it as an HTML table, or plain JSON.

https://ross-spencer.codeberg.page/converter.ndsa/

This could be easily extended to support other formats like YAML, ASCII table, Markdown Table, and so on.
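As a sketch of what such an extension might look like, the snippet below renders a small matrix as a Markdown table. The schema (`columns`, `rows`, `area`, `cells`) is an assumption made for illustration and is not the structure of the published `ndsa.json`; the cell text is placeholder text, not wording from the LoP.

```python
import json

# Hypothetical schema and placeholder content, for illustration only.
DATA = json.loads("""
{
  "columns": ["Level 1", "Level 2"],
  "rows": [
    {"area": "Area A", "cells": ["placeholder a1", "placeholder a2"]},
    {"area": "Area B", "cells": ["placeholder b1", "placeholder b2"]}
  ]
}
""")


def to_markdown_table(matrix, corner="Area"):
    """Render the matrix as a Markdown table with one row per area."""
    header = [corner] + matrix["columns"]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in matrix["rows"]:
        lines.append("| " + " | ".join([row["area"]] + row["cells"]) + " |")
    return "\n".join(lines)


print(to_markdown_table(DATA))
```

The point is less the table than the direction of travel: each new representation is a short function over the same data, rather than a new document to maintain.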

Could the original have been created better?

I asked some questions earlier; could we have decided to put more effort into the delivery of the NDSA Levels of Digital Preservation?

We want (need!) translations of the LoP (it shouldn’t require a new PDF or signpost every time we create one).
We want to update the LoP over time.
I think we want to promote reuse, e.g. in journal articles, other digital preservation websites; it’s also licensed as CC-BY-SA 4.0.
We might want the original matrix displayed as-is.
We might want to see people present the information differently (some translations also offer a transposed version flipping the axes).
We want the LoP to be accessible.
We want the LoP to have longevity.
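The transposed presentation mentioned in the list above becomes cheap once the matrix is data. A minimal sketch, assuming the matrix is held as a rectangular grid of rows with the headings in the first row and first column; the grid contents here are placeholders.

```python
def transpose(grid):
    """Flip rows and columns of a rectangular grid (a list of equal-length lists)."""
    return [list(column) for column in zip(*grid)]


# Placeholder grid: headings in the first row and the first column.
GRID = [
    ["", "Level 1", "Level 2"],
    ["Area A", "a1", "a2"],
    ["Area B", "b1", "b2"],
]

flipped = transpose(GRID)
for row in flipped:
    print(row)
```

Transposing twice returns the original grid, so neither orientation has to be treated as the master copy.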

Like my point about PREMIS, we might also want a portal to enable discussion, offer suggestions, corrections, and so on.

My new rendition of the LoP might not be perfect, but I think it approaches a better ideal:

data;
that enables a static website to be built;
offering multiple representations in a single place;
is easy to update;
it is easy to add new translations and make them available immediately;
sitting on top of source control, offering transparency, trust, versioning, clear authorship, and audit trail;
and offers ways of receiving new information and making it immediately available, such as links to resources referencing different versions of the LoP.

The website might not yet be the easiest to preserve. I think it needs to enable a query string to encode language, and a corresponding sitemap providing access to those (this is possible as demonstrated with the permalink function in 0xffae). I can tackle this in the near future.
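For illustration, a sitemap that exposes one URL per language could be generated along these lines. This is a sketch under assumptions: the `lang` query parameter, the `example.org` base URL, and the language codes are hypothetical and do not describe how the current site works.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, languages, param="lang"):
    """Return sitemap XML with one <url> entry per language code."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for code in languages:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        # The query string carries the language so each view has its own address.
        loc.text = f"{base_url}?{urlencode({param: code})}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical base URL and language codes.
sitemap = build_sitemap("https://example.org/matrix/", ["en", "fr", "de"])
print(sitemap)
```

With each language reachable at a stable address, a crawler can capture every representation without having to interact with the page.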

But even if the website needs an upgrade or two, the data itself is FAIR – Findable (via website, and Codeberg); Accessible (multiple translations, easy to add to, clearly delineated as data); Interoperable (can be converted to different formats via data, or web-interface); Reusable (there to be accessed and reused with the correct licensing conditions).

If we make the introduction to this blog a rule of thumb, then our PDF representation of the LoP (roughly) provides ‘Findable’, and ‘Accessible’.

We have the skills, knowing what we do in digital preservation, to complete the FAIR principles here, provide ‘Interoperable’ and ‘Reusable’, and make the LoP so much more.

PDF is not a feature

PDF is not a feature, it is a barrier. Digital preservation does not need to be reactive; it can be proactive. In fact, our biggest responsibility to our future selves should be to ensure that it is built into all of our day-to-day processes. This means as much outreach as possible to our colleagues creating the digital record—recursively—that means us as well.

The principles of FAIR come from data, are arguably built into the research data life-cycle, and closely mirror aspects of the continuum model. FAIR does not need to stop at data just because we’ve said data is a special use case; the principles can be used anywhere. I think that is important for us to consider.

When we view all of our records as FAIR, we might find new ways of presenting them that promote maximum accessibility, i.e., to those with permanent or temporary physical disabilities, or even serendipitously to members of the public or the press, who might just happen to discover our work. Without having to cut through all the formal trappings of what has become “formal reporting,” they are given the ability to see into our work through degrees of increasing complexity.

Records that are interoperable and reusable can be translated more easily, and their intent can be understood by many more people across the world, especially if we consider them important and believe they are worth preserving for their anticipated long-term influence.

Our efforts (in digital preservation) are often characterized as reactive, i.e., what are we receiving? What potential preservation risks might we encounter with a collection? How will we deal with them?

We needn’t be reactive. We have the skills to be proactive. The more we ask our teams at the front of house to “create better” and the more we learn about how to create-to-maintain for ourselves, the rosier our lives look, preserving information, and enabling reuse of information down the line.

I think we owe it to ourselves to ‘create better’ but what do you think?

Interoperability in NHS England

My friend Matthew Bennion and I recently prepared an article looking at the importance of open source and interoperability standards within NHS England, which might be of interest to readers of this post.

Digital Capability, Open-Source Use, and Interoperability Standards Within the National Health Service in England: Survey of Health Care Trusts.

Searching for a signature

If you have access to the OPF recording, David, Tyler, and I presented recently on a workshop we prepared for iPRES along with Sonia Ranade of The National Archives, UK, and previously (and hopefully in future!) Francesca Mackenzie, also of The National Archives, UK.

We talk about our desire to commit to developing a curriculum for PRONOM signature development in a more maintainable way using the Carpentries model for writing tutorials.

It takes a bit more effort, but we think the results are worth pursuing and hope it demonstrates a path for others too.

The resource is available online, and on GitHub.

Resources for documentation

If we end up creating a lot more web resources then it might start looking more like documentation. A lot of effort has gone into making software documentation accessible and usable.

Diátaxis provides a guideline for separating documentation so that it is easier to access and understand.

There are a lot of simple frameworks for producing static documentation websites. I have used Docusaurus and Just The Docs in the past. Depending on the tooling you are most comfortable with, you can find an alternative. I am looking at MkDocs in the near future for a Python-based project.

Find a static site builder that suits you via the awesome-docs list.

Will Burrows

The cover art for this blog is by Will Burrows. A tribute to the Ricardian Socialism motto: “The boss needs you, you don’t need them — Labor is entitled to all it creates”.

It may be a tangent, but I extend our efforts to the taxpayers as well. When we are funded with hundreds of thousands of euros or millions of dollars to work on important, long-lasting outputs, it is on us to make that work as available and accessible to the public as possible. Let us not become elite by unintentionally placing barriers between our work and the public. Take that extra moment to deliver content, paid for by taxpayers, on a platform, and in a way that allows us to maximize FAIR principles for everyone who might be curious.

Burrows’ print is available at https://willburrowsart.com/product/boss-needs-you-print/.

10 thoughts on “Maintenance begins at creation, so why are we not creating better?”

Bertrand Caron says:

2026-05-13 at 09:10

@exponentialdecay wow, I have so many reactions to that blog post Ross… In no particular order:

– The PREMIS data dictionary is heading towards the approach you are describing (https://github.com/PREMIS-EC/data-dictionary), though the contribution / versioning policy is still pending – and it's probably the most important part!

1/5
1. Bertrand Caron says:
  
  2026-05-13 at 09:11
  
  @exponentialdecay I like very much the questions you are asking in the "Record type" section. It resonates with the questions I asked my colleagues about their preservation intent – what do you want to be able to do and to allow your users to do in the future with your objects? – and how it affects the file format question.
  
  2/5
  1. Bertrand Caron says:
    
    2026-05-13 at 09:11
    
    @exponentialdecay – I noticed that structured data has not yet been incorporated into the basic skillset of many librarians / archivists, but I’m surprised that it’s also true for academics, though the five-star model is on top of any piece on data quality (e.g., https://op.europa.eu/webpub/op/data-quality-guidelines/en/#chapter4).
    
    3/5
    1. Bertrand Caron says:
      
      2026-05-13 at 09:11
      
      @exponentialdecay Regarding PDFs, I’m personally fascinated by the complexity and opacity of that standard… but for a lot of my colleagues, it’s a reassuring format – it displays apparently fine on any machine, the mediation is transparent, it has pages like a book, it seems to be self-sufficient and does not span over interdependent files…
      
      4/5
      1. Bertrand Caron says:
        
        2026-05-13 at 09:12
        
        @exponentialdecay
        
        At least our visual representations embed the structured data used to generate them as an embedded file? E.g., PNG diagrams created with draw.io incorporating the XML description as an MxFile.
        
        5/5
      2. Ross Spencer says:
        
        2026-05-15 at 09:03
        
        Still catching up with a few other comments that reflect this. I feel like maybe there’s a temporal component missing in that recognition? Others are right. One, well authored static PDF looks like it helps us a lot down the line. Isn’t the PDF always there for us to create as another representation? So is there an ergonomic multi-part “present state” that could be recognized among practitioners offering dynamic maintenance and versioning AND reusability and interoperability? That doesn’t limit expression, or prevent the creation of future less-static types as a preservation fall-back? (Probably!)
      3. Bertrand Caron says:
        
        2026-05-15 at 09:20
        
        @exponentialdecay you mean "don't implement PDF, (re)present it" 😉 ?
        
        I quite agree! That makes me think about the difficulty we had in my former job at BnF to distinguish between prescriptive quality requirements ("if you create sthg in that content type, prefer that format because its expressiveness and features are richer") and risk requirements ("if you already produced something in that format, we will take it but we will not be very happy and will not be able to do much with it.").
      4. #Digital ⚓️ #Vagabond 🦈 says:
        
        2026-05-15 at 10:07
        
        @BertrandCaron @exponentialdecay love the callback! 😀 And 💯 although I guess it keeps the job interesting!
    2. Ross Spencer says:
      
      2026-05-15 at 08:51
      
      This is a great resource, thank you!
      
2. Ross Spencer says:
  
  2026-05-15 at 08:48
  
  Ah, this is great, thank you Bertrand. Good to see PREMIS going this direction. Looking forward to seeing how it progresses!

Follow ross spencer :: exponentialdecay.digipres :: blog