Crayola's 1997 Techno Brite crayon set with color names created to market the Crayola website, including names featured here such as World Wide Web Yellow, Point and Click Green, and Cyber Space Orange

Looking after your URLs: tikalinkextract eight years on

We might not have a second life, but what if I told you there was a second internet? Not the deep web, but another web that we engage with nearly every day?

Think about it: that QR code you scanned for more information? That payment link you followed on your electricity bill? The website you’re told to visit at the end of a television ad?

The antipodes of the internet are these terminal endpoints: material, and not necessarily material, objects that represent the end of the freely navigable web. The QR code on a concert poster is the web printed onto the physical world. There is every chance it will be scanned and followed by someone on a mobile device, but it is a transient object, something that will exist for a short time and then fade into the palimpsest of the poster board or wall it was pasted on.

This is part of the materiality of the internet that has long fascinated me. Perhaps it comes from being a student of material culture, but if we look around, we see the Internet everywhere!

The second web is an appendix to the world wide web, and it is the subject of a set of utilities I wrote back in 2017, described in Links all the way down: HTTPreservation of Documentary Heritage, and Hyperlinks in your files? How to get them out using tikalinkextract.

Of those utilities, tikalinkextract is perhaps the one I have the greater affinity for. It grew from the recognition that, in government at the time, we were receiving digital records, e.g. Word and Word-like documents, with hyperlinks somewhere inline in their text. The records dated from about 1998 to 2004, and of the 5633 analysed at the time, ~900 contained 1608 hyperlinks.

This is a problem. Various statistics can be found about the life of a web page; the most recent I have read is that the average lifespan of a web page is 100 days, and the average lifespan of a website only 2.7 years.

Unfortunately I lost the source of that quote but it ended with:

Online content is often at greater risk than older analogue content.

That is certainly true, but somewhere in between sits our hybrid content: digital, but not truly online.

Dropping a hyperlink into a file is easy. This blog entry already uses about a dozen to take narrative shortcuts. But because the blog exists on the web, web crawlers designed to read these pages will often read the anchor links herein and follow them to the other parts of the web they reference, allowing all of the links here to be archived.

Documents are different. They don’t really exist on the web. Even cloud-based solutions such as Google Docs aren’t easily crawled, and those records sitting on your hard disk certainly aren’t.

Many documents back in the day were printed, but they might also just live in a document management system or on an email server somewhere: somewhat passive and disconnected from the rest of the world.

Given these documents might all contain hyperlinks providing important contextual information, how do we make sure the links are archived?

tikalinkextract

I proposed tikalinkextract for archival workflows: something that could be integrated at the time of ingest into a digital preservation system, or maybe even sooner, in an appraisal workflow. tikalinkextract uses an Apache Tika server to extract plain text from all sorts of document-like files, from PDFs to Microsoft Word and everything in between. If Tika can process it, the text can be analysed.

After the text has been extracted it is analysed for hyperlinks (with some additional nuance, basically anything beginning http:// or https://), and those links are output as strings to the command line so that they can be fed into web-archiving workflows.
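To make the shape of that pipeline concrete, here is a minimal Go sketch of the same idea. It is not tikalinkextract’s actual code: it assumes an Apache Tika server is running locally on its default port (9998) and uses a deliberately naive regular expression for anything beginning http:// or https://.

```go
// links_sketch.go: an illustration of the tikalinkextract idea -- send a
// document to a local Apache Tika server, take the plain text it returns,
// and scan that text for anything that looks like a hyperlink.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"regexp"
)

// A deliberately simple approximation: anything beginning http:// or
// https:// up to the next whitespace or closing punctuation.
var linkPattern = regexp.MustCompile(`https?://[^\s<>"')]+`)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: links_sketch <document>")
		os.Exit(1)
	}
	doc, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer doc.Close()

	// Apache Tika server's plain-text extraction endpoint (default port 9998).
	req, err := http.NewRequest(http.MethodPut, "http://localhost:9998/tika", doc)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Accept", "text/plain")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	text, _ := io.ReadAll(resp.Body)
	for _, link := range linkPattern.FindAllString(string(text), -1) {
		fmt.Println(link) // one link per line, ready for a seed list
	}
}
```

Run against a single document it prints one link per line, which is exactly the shape of seed list the rest of this post assumes.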

I proposed a wget-based workflow in the tikalinkextract README at the time, but a list of hyperlinks like this could be used as seeds for any number of web-archiving workflows, from simple API calls to the Internet Archive’s SaveNow to the more complex pipelines your institution may already have in place.
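As an illustration of the “simple API calls” end of that spectrum, here is a hedged Go sketch that reads a seed list from links.txt (one URL per line; the filename is just an assumption) and pushes each link at the Internet Archive’s public save endpoint. A production workflow would want the authenticated Save Page Now API, plus proper rate-limiting, retries, and logging.

```go
// submit_sketch.go: feed a seed list of links, one per line, to the
// Internet Archive's public "save" endpoint. A sketch only.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	seeds, err := os.Open("links.txt") // e.g. the output of tikalinkextract
	if err != nil {
		panic(err)
	}
	defer seeds.Close()

	scanner := bufio.NewScanner(seeds)
	for scanner.Scan() {
		link := scanner.Text()
		resp, err := http.Get("https://web.archive.org/save/" + link)
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", link, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s -> %s\n", link, resp.Status)
		time.Sleep(5 * time.Second) // be polite to the archive
	}
}
```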

tikalinkextract just gives you a head start: it gets you the information you need to work out how much additional material you need to gather, and whether any of it has already gone, or skewed, compared with the original author’s intent.

For example, combined with one of my helper programs, HTTPreserve’s linkstat, you can analyse all of the links to determine their HTTP response codes and whether there are archive.org captures of the content from the same era as the original record.
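linkstat has its own output format, so treat the following only as a stand-in for the kind of report it gives you: a generic Go sketch that records each link’s live HTTP status and then asks the Wayback Machine’s public availability API whether a capture exists near a chosen date.

```go
// linkcheck_sketch.go: for each link, record the live HTTP status and ask
// the Wayback Machine's availability API for the closest snapshot to a
// given date. A stand-in for a linkstat-style report, not its actual code.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// availability mirrors the fields we need from the Wayback availability API.
type availability struct {
	ArchivedSnapshots struct {
		Closest struct {
			URL       string `json:"url"`
			Timestamp string `json:"timestamp"`
		} `json:"closest"`
	} `json:"archived_snapshots"`
}

func main() {
	links := []string{"http://example.com/"} // replace with your extracted links

	for _, link := range links {
		// 1. What does the live web say today?
		status := "no response"
		if resp, err := http.Get(link); err == nil {
			status = resp.Status
			resp.Body.Close()
		}

		// 2. Is there a memento from roughly the record's era (here, 2002)?
		q := "https://archive.org/wayback/available?url=" +
			url.QueryEscape(link) + "&timestamp=20020101"
		var avail availability
		if resp, err := http.Get(q); err == nil {
			json.NewDecoder(resp.Body).Decode(&avail)
			resp.Body.Close()
		}

		fmt.Printf("%s\n  live: %s\n  closest memento: %s (%s)\n",
			link, status,
			avail.ArchivedSnapshots.Closest.URL,
			avail.ArchivedSnapshots.Closest.Timestamp)
	}
}
```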

You can see an example of this kind of analysis in my look at the Million Dollar Web Page from a few years back.

httpreserve.info will also provide you with the same breakdown.

Time travel is better than archaeology

If you are already thinking that tikalinkextract is not enough, you are correct. The truth is, we need to be doing more today. We don’t know when digital records will be processed, and even a few years, maybe less, could make the difference between content skewing or disappearing altogether.

Save your hyperlinks today, your future self will thank you.

The message I want to leave with the information and records managers reading this is that, if you aren’t already, you need to promote the archiving of hyperlinks as they are committed to the page.

I try to do so myself when I am writing reports or new papers, or when saving links to parts of the web that aren’t going to be caught by a web crawler, such as the private GitHub repository I use to organize my work.

Personally, I use both perma.cc and archive.org’s SaveNow.

Perma.cc

I used perma.cc, for example, in the article I co-wrote with colleagues from Artefactual for iPRES2019, making sure all of its URLs were perma.cc ones.

I use it for things like articles that may land in print, or for more static hyperlinks such as PDFs.

Perma.cc comes from Harvard Law School and it has a clear exit strategy if the site can no longer be funded. When you save a link to perma.cc you can add notes and other information, and you get both a view from their web crawler and an optional screenshot of the site, so you can preserve a lot of context.

SaveNow

I use SaveNow from the Internet Archive every day; it’s the third bookmark in my web browser. SaveNow does exactly what it says on the tin: you get a web page captured and stored in the Internet Archive today. You immediately receive a memento-formatted hyperlink in return, i.e. it links back to the Internet Archive’s record as it was captured and contains the date and time of the capture, e.g. this link back to the first capture of example.com from 2002:

http://web.archive.org/web/20020120142510/http://example.com/

If you create an account with the Internet Archive, SaveNow gets superpowers: it will allow you to save outlinks from the site, receive notifications about the archiving, and save the link in “my (your) web archive”.

Habit or workflow

It is a good habit to get into: from a personal perspective it helps make sure you can access your content for longer, and from a professional perspective it gives folks all of your context.

If you are writing records management software then consider building tikalinkextract functionality into it, so that systems are able to extract hyperlinks and submit them for archiving somewhere like the Internet Archive, a regional internet archive, or even your own memento service. That way your users don’t have to think about it, and by the time records become archives, a lot of the work to maintain the authenticity and integrity of records is already done.
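For what that integration might look like, here is a hypothetical Go sketch. None of these names exist in tikalinkextract or in any real records management product; they are only meant to show how small the hook could be.

```go
// A hypothetical ingest hook for a records management system: when a
// record is committed, extract its links and queue them for archiving.
// Names here (LinkExtractor, Archiver, OnCommit) are illustrative only.
package recordhooks

// LinkExtractor is anything that can pull hyperlinks out of a record's
// bytes -- a tikalinkextract-style Tika pipeline would satisfy this.
type LinkExtractor interface {
	Extract(document []byte) ([]string, error)
}

// Archiver submits a link to a web archive (the Internet Archive, a
// regional archive, or an in-house memento service) and returns the
// memento URL that was created.
type Archiver interface {
	Save(link string) (memento string, err error)
}

// OnCommit is the hook the records system would call when a record is
// saved: the user never has to think about link archiving at all.
func OnCommit(doc []byte, ex LinkExtractor, ar Archiver) (map[string]string, error) {
	links, err := ex.Extract(doc)
	if err != nil {
		return nil, err
	}
	mementos := make(map[string]string, len(links))
	for _, link := range links {
		m, err := ar.Save(link)
		if err != nil {
			continue // a production system would log and retry
		}
		mementos[link] = m
	}
	return mementos, nil
}
```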

Successful preservation of our digital records will be the result of a series of good information and records management decisions, and one of those is treating digital continuity work such as this as a gateway to preservation.

Documents still rule the roost

If I recall the early noughties correctly (and it’s difficult to find old information about this), it was imagined that document-based workflows would eventually be replaced with page-less ones. Wiki-like technologies could take over from the document to create living documents. While it wasn’t an immediately recognized advantage, being more web-based would have lent these information hubs to better web-archiving workflows. This utopia never materialized, or if it did, it was skewed by vendors attempting to dominate the market, for example with technologies like SharePoint.

It might have been cool, but given it didn’t manifest, the document, the page, is still a big part of day-to-day business activities. Heck, look at the biggest digital preservation conference, iPRES, and its submission template. Even in 2025 the format is still anchored somewhere in the 20th century, using an overly page-oriented academic submission style.

Other disciplines are learning, though, and platforms such as CurveNote and MystMarkdown exist for science and technical communication.

Maybe next year for iPRES? 😉

What tools are you already using?

Unfortunately I am not working at a receiving institution such as a library or archive right now, nor am I writing a system where I can use tikalinkextract and build better, more evidence-driven tooling, so it’s hard to keep up with current trends.

I hope those of you reading this blog who haven’t yet got a workflow in place for archiving websites from your documents will take something from it that you can use.

Those already doing it, what workflows have you adopted? What tools? What APIs? How have you automated it? I’d love to hear more and amplify those efforts here.

Can you reach the end of the internet? Yes! But having read this today, hopefully not in your documents.


HTTPreserve

tikalinkextract came out of a paper I wrote for the Australian Society of Archivists, and then out of a project I created for that work called HTTPreserve.

A poster describing the HTTPreserve tooling I was creating for Archives New Zealand back in 2017.

I presented HTTPreserve at the 2017 offshoot of the Joint Conference on Digital Libraries; the poster from that effort is on GitHub, below.

Some links to HTTPreserve and other interesting tools:


IIPC Awesome Web-Archiving

An important resource for those getting started with web-archiving is the International Internet Preservation Consortium’s (IIPC) awesome list:


The Internet Archive

The BBC recently made a short explainer about the Internet Archive; check it out here:
