Porting SafeText and analyzing digital content with Apache Tika
Last year I wrote about pitfalls in modern journalism, especially with regard to receiving documents and information from whistleblowers without offering them adequate protection.
The tl;dr is that you, as a whistleblower, need to protect yourself; and you, as an editor or journalist, need to protect your whistleblowers.
Steganographic fingerprints might be one method adopted to detect someone leaking information. Steganographic characters replace common textual characters with unusual but hard-to-detect variants: they either look the same to the human eye or are actually invisible. Using a tool called SafeText by David Jacobson we can identify these hidden fingerprints in the content that you share.
I firmly believe we can find clues about what is important to preserve, or learn to preserve, when we analyse the content of the digital record and not just the (file) format of the digital record.
A file can contain many different features and these are all challenges to their future interpretation, and thus preservation.
I wanted to use SafeText in some of my other non-Python tooling and so I decided to port the code to Golang as a composable module and binary.
By coincidence at the time I started writing this I had also just written about revisiting tikalinkextract and so I thought I would write this small explanation about how you might combine Tika and SafeText to perform some content analysis of your own.
Who knows, maybe we will find a conspiracy. Maybe we’ll find secret codes in our own digital records. Maybe we’ll learn something new about our records…
Let’s have a look at putting Tika and SafeText together and see where it goes.
Tika as a Swiss Army Knife
Apache Tika has been a constant in my tooling since I learned about it. I have previously used it for:
- NASA Wakeup Calls.
- Faceted description of archival records.
- Extracting hyperlinks in documentary heritage.
Tika is an amazing tool that has the ability to read the metadata and content of many file formats, from PDF to Word documents.
All you need is Tika’s server implementation. With that, you can forward data to it and Tika will attempt to parse what it was given using one of its many modules.
Tika can return the textual content of the file, its metadata, or even its embedded objects.
You can download Tika server from the Apache website using wget, e.g.
wget https://dlcdn.apache.org/tika/3.3.0/tika-server-standard-3.3.0.jar
You can download and verify the binary’s sha512 checksum as follows:
wget https://downloads.apache.org/tika/3.3.0/tika-server-standard-3.3.0.jar.sha512
cat tika-server-standard-3.3.0.jar.sha512 && sha512sum tika-server-standard-3.3.0.jar
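The two commands above print both digests so that you can compare them by eye. If you would rather have the comparison made for you, here is a minimal Python sketch. It assumes the published .sha512 file begins with the hex digest (optionally followed by a filename); the function names are my own and not part of any Tika tooling.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(actual_hex: str, checksum_text: str) -> bool:
    """Compare a digest with the first token of a checksum file's text.

    Assumption: the text starts with the hex digest, optionally followed
    by whitespace and a filename.
    """
    tokens = checksum_text.split()
    return bool(tokens) and tokens[0].lower() == actual_hex.lower()
```

You might call it as digest_matches(sha512_of(Path("tika-server-standard-3.3.0.jar")), Path("tika-server-standard-3.3.0.jar.sha512").read_text()) and only run the jar if it returns True.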
The documentation is pretty good so take a look at that for all its features.
Running Apache Tika
You can run Tika as follows:
java -jar tika-server-standard-3.3.0.jar
This will run the service on port 9998.
You can also start Tika with nohup and a trailing & to put it in the background. This keeps Tika available on port 9998 until you restart your machine or kill the process. To kill it, I tend to use fuser, e.g. fuser -k 9998/tcp.
You can visit the service in your web-browser by visiting http://localhost:9998/ for a summary of all of its endpoints (although the documentation link above is much clearer on their function).
A command line way to see if it is running is to request Tika’s version information:
curl -X GET -s http://localhost:9998/version
I get the result Apache Tika 3.3.0.
Exploring a small selection of its tools
Before we get into the SafeText examples below, I wanted to demonstrate Tika’s more useful endpoints.
Let’s start by downloading my “fully featured PDF” from the OPF Format Corpus.
wget https://raw.githubusercontent.com/openpreserve/format-corpus/refs/heads/master/fully-featured-pdf/PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf
Extract metadata
Given the fully featured PDF we can inspect its key-value metadata through Tika’s /meta endpoint:
curl -X PUT -s http://localhost:9998/meta -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf
It should look as follows:
pdf:PDFVersion,1.7
pdf:has3D,true
xmp:CreatorTool,Acrobat PDFMaker 23 for Word
pdf:docinfo:title,Digital Preservation Testing Document
pdf:hasXFA,false
X-TIKA:Parsed-By-Full-Set,org.apache.tika.parser.DefaultParser,org.apache.tika.parser.pdf.PDFParser
pdf:num3DAnnotations,1
language,en
dc:format,application/pdf; version=1.7
xmp:pdf:Keywords,Digital Preservation; Digipres; Archives; OPF Test Corpus
pdf:docinfo:creator_tool,Acrobat PDFMaker 23 for Word
access_permission:fill_in_form,true
pdf:hasCollection,false
pdf:encrypted,false
dc:title,Digital Preservation Testing Document
pdf:containsNonEmbeddedFont,false
xmp:CreateDate,2023-05-05T09:34:28Z
pdf:docinfo:custom:SourceModified,D:20230505093127
pdf:docinfo:subject,Digital Preservation
pdf:hasMarkedContent,true
pdf:ocrPageCount,0
access_permission:can_print_faithful,true
xmp:ModifyDate,2023-05-05T10:56:56Z
pdf:docinfo:creator,@beet_keeper - github.com/ross-spencer
access_permission:extract_for_accessibility,true
pdf:hasAcroFormFields,true
xmp:dc:creator,@beet_keeper - github.com/ross-spencer
X-TIKA:Parsed-By,org.apache.tika.parser.DefaultParser,org.apache.tika.parser.pdf.PDFParser
dc:description:x-default,Digital Preservation
pdf:annotationTypes,14f2c2d8-1078-4f71-9736-f229c34993ba,20f602aa-5ddb-42ba-a935-5c4d49a8bc52,3D6,4a18ad93-6732-4989-a0aa-47061658812a,8deb2acc-b801-4004-b95c-bf257d23789e,RM4,RM7,af5915ed-74a6-4348-b7a8-fff23c413783,fdbca803-a2bc-4b35-aca8-2117d6dfb047,null
xmp:dc:description:x-default,Digital Preservation
pdf:docinfo:producer,Adobe PDF Library 23.1.175
pdf:annotationSubtypes,3D,Circle,FileAttachment,Line,Link,Polygon,Popup,RichMedia,Sound,Square,Widget
xmp:dc:description,Digital Preservation
pdf:containsDamagedFont,false
pdf:unmappedUnicodeCharsPerPage,0,0,0
Digital preservation testing property #1,Digital preservation testing value #1
dc:description,Digital Preservation
Digital preservation testing property #2,Digital preservation testing value #2
Digital preservation testing property #3,Digital preservation testing value #3
access_permission:modify_annotations,true
dc:title:x-default,Digital Preservation Testing Document
dc:creator,@beet_keeper - github.com/ross-spencer
dcterms:created,2023-05-05T09:34:28Z
dcterms:modified,2023-05-05T10:56:56Z
xmp:dc:title:x-default,Digital Preservation Testing Document
xmpMM:DocumentID,uuid:a9a30d32-5d69-4790-b5f6-1705102a20a4
xmp:dc:title,Digital Preservation Testing Document
pdf:docinfo:custom:Digital preservation testing property #4,Digital preservation testing value #4
pdf:overallPercentageUnmappedUnicodeChars,0.0
pdf:docinfo:custom:Digital preservation testing property #3,Digital preservation testing value #3
pdf:docinfo:keywords,Digital Preservation; Digipres; Archives; OPF Test Corpus
pdf:docinfo:modified,2023-05-05T10:56:56Z
pdf:docinfo:custom:Digital preservation testing property #2,Digital preservation testing value #2
pdf:docinfo:custom:Digital preservation testing property #1,Digital preservation testing value #1
Content-Length,22724407
Digital preservation testing property #4,Digital preservation testing value #4
Content-Type,application/pdf
xmp:MetadataDate,2023-05-05T10:56:56Z
pdf:producer,Adobe PDF Library 23.1.175
dc:subject,Digital Preservation; Digipres; Archives; OPF Test Corpus,Digital Preservation,Digipres,Archives,OPF Test Corpus
xmp:pdf:Producer,Adobe PDF Library 23.1.175
pdf:totalUnmappedUnicodeChars,0
access_permission:assemble_document,true
xmpTPg:NPages,3
pdf:hasXMP,true
pdf:charsPerPage,1062,1157,498
access_permission:extract_content,true
pdf:docinfo:custom:44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 70 72 6F 70 65 72 74 79 20 23 35,44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 76 61 6C 75 65
xmp:dc:subject,Digital Preservation,Digipres,Archives,OPF Test Corpus
access_permission:can_print,true
SourceModified,D:20230505093127
44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 70 72 6F 70 65 72 74 79 20 23 35,44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 76 61 6C 75 65
meta:keyword,Digital Preservation; Digipres; Archives; OPF Test Corpus
access_permission:can_modify,true
pdf:docinfo:created,2023-05-05T09:34:28Z
xmpMM:InstanceID,uuid:3fd68a29-56e9-4c06-8ffb-076d775fa4c8
Tika does a lot through content negotiation. Content negotiation is where you ask the web-server for a different representation if it is available. Maybe Tika can display the metadata as JSON? We know the MIME type for JSON is application/json, so let’s ask for that using a different Accept header:
curl -X PUT -s http://localhost:9998/meta --header "Accept: application/json" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf | jq
NB. jq is used to pretty-print the result. You can remove | jq if you don’t have it installed.
Bingo!
{
"pdf:PDFVersion": "1.7",
"pdf:has3D": "true",
"xmp:CreatorTool": "Acrobat PDFMaker 23 for Word",
"pdf:docinfo:title": "Digital Preservation Testing Document",
"pdf:hasXFA": "false",
"X-TIKA:Parsed-By-Full-Set": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser"
],
"pdf:num3DAnnotations": "1",
"language": "en",
"dc:format": "application/pdf; version=1.7",
"xmp:pdf:Keywords": "Digital Preservation; Digipres; Archives; OPF Test Corpus",
"pdf:docinfo:creator_tool": "Acrobat PDFMaker 23 for Word",
"access_permission:fill_in_form": "true",
"pdf:hasCollection": "false",
"pdf:encrypted": "false",
"dc:title": "Digital Preservation Testing Document",
"pdf:containsNonEmbeddedFont": "false",
"xmp:CreateDate": "2023-05-05T09:34:28Z",
"pdf:docinfo:custom:SourceModified": "D:20230505093127",
"pdf:docinfo:subject": "Digital Preservation",
"pdf:hasMarkedContent": "true",
"pdf:ocrPageCount": "0",
"access_permission:can_print_faithful": "true",
"xmp:ModifyDate": "2023-05-05T10:56:56Z",
"pdf:docinfo:creator": "@beet_keeper - github.com/ross-spencer",
"access_permission:extract_for_accessibility": "true",
"pdf:hasAcroFormFields": "true",
"xmp:dc:creator": "@beet_keeper - github.com/ross-spencer",
"X-TIKA:Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser"
],
"dc:description:x-default": "Digital Preservation",
"pdf:annotationTypes": [
"14f2c2d8-1078-4f71-9736-f229c34993ba",
"20f602aa-5ddb-42ba-a935-5c4d49a8bc52",
"3D6",
"4a18ad93-6732-4989-a0aa-47061658812a",
"8deb2acc-b801-4004-b95c-bf257d23789e",
"RM4",
"RM7",
"af5915ed-74a6-4348-b7a8-fff23c413783",
"fdbca803-a2bc-4b35-aca8-2117d6dfb047",
"null"
],
"xmp:dc:description:x-default": "Digital Preservation",
"pdf:docinfo:producer": "Adobe PDF Library 23.1.175",
"pdf:annotationSubtypes": [
"3D",
"Circle",
"FileAttachment",
"Line",
"Link",
"Polygon",
"Popup",
"RichMedia",
"Sound",
"Square",
"Widget"
],
"xmp:dc:description": "Digital Preservation",
"pdf:containsDamagedFont": "false",
"pdf:unmappedUnicodeCharsPerPage": [
"0",
"0",
"0"
],
"Digital preservation testing property #1": "Digital preservation testing value #1",
"dc:description": "Digital Preservation",
"Digital preservation testing property #2": "Digital preservation testing value #2",
"Digital preservation testing property #3": "Digital preservation testing value #3",
"access_permission:modify_annotations": "true",
"dc:title:x-default": "Digital Preservation Testing Document",
"dc:creator": "@beet_keeper - github.com/ross-spencer",
"dcterms:created": "2023-05-05T09:34:28Z",
"dcterms:modified": "2023-05-05T10:56:56Z",
"xmp:dc:title:x-default": "Digital Preservation Testing Document",
"xmpMM:DocumentID": "uuid:a9a30d32-5d69-4790-b5f6-1705102a20a4",
"xmp:dc:title": "Digital Preservation Testing Document",
"pdf:docinfo:custom:Digital preservation testing property #4": "Digital preservation testing value #4",
"pdf:overallPercentageUnmappedUnicodeChars": "0.0",
"pdf:docinfo:custom:Digital preservation testing property #3": "Digital preservation testing value #3",
"pdf:docinfo:keywords": "Digital Preservation; Digipres; Archives; OPF Test Corpus",
"pdf:docinfo:modified": "2023-05-05T10:56:56Z",
"pdf:docinfo:custom:Digital preservation testing property #2": "Digital preservation testing value #2",
"pdf:docinfo:custom:Digital preservation testing property #1": "Digital preservation testing value #1",
"Content-Length": "22724407",
"Digital preservation testing property #4": "Digital preservation testing value #4",
"Content-Type": "application/pdf",
"xmp:MetadataDate": "2023-05-05T10:56:56Z",
"pdf:producer": "Adobe PDF Library 23.1.175",
"dc:subject": [
"Digital Preservation; Digipres; Archives; OPF Test Corpus",
"Digital Preservation",
"Digipres",
"Archives",
"OPF Test Corpus"
],
"xmp:pdf:Producer": "Adobe PDF Library 23.1.175",
"pdf:totalUnmappedUnicodeChars": "0",
"access_permission:assemble_document": "true",
"xmpTPg:NPages": "3",
"pdf:hasXMP": "true",
"pdf:charsPerPage": [
"1062",
"1157",
"498"
],
"access_permission:extract_content": "true",
"pdf:docinfo:custom:44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 70 72 6F 70 65 72 74 79 20 23 35": "44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 76 61 6C 75 65",
"xmp:dc:subject": [
"Digital Preservation",
"Digipres",
"Archives",
"OPF Test Corpus"
],
"access_permission:can_print": "true",
"SourceModified": "D:20230505093127",
"44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 70 72 6F 70 65 72 74 79 20 23 35": "44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 76 61 6C 75 65",
"meta:keyword": "Digital Preservation; Digipres; Archives; OPF Test Corpus",
"access_permission:can_modify": "true",
"pdf:docinfo:created": "2023-05-05T09:34:28Z",
"xmpMM:InstanceID": "uuid:3fd68a29-56e9-4c06-8ffb-076d775fa4c8"
}
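If you want to make the same content-negotiated request from a script rather than curl, here is a minimal sketch using only the Python standard library. The helper names (tika_request, fetch) are hypothetical, the base URL is the default one used in this post, and sending Content-Type: application/octet-stream is my assumption about a neutral hint that leaves format detection to Tika.

```python
import urllib.request

TIKA_URL = "http://localhost:9998"  # default address used in this post


def tika_request(endpoint: str, payload: bytes, accept: str) -> urllib.request.Request:
    """Build a PUT request asking Tika for one specific representation."""
    return urllib.request.Request(
        url=f"{TIKA_URL}/{endpoint.lstrip('/')}",
        data=payload,
        method="PUT",
        headers={
            "Accept": accept,  # the representation we are negotiating for
            "Content-Type": "application/octet-stream",  # assumption: neutral hint
        },
    )


def fetch(request: urllib.request.Request) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With Tika running, json.loads(fetch(tika_request("meta", pdf_bytes, "application/json"))) should give you the metadata as a dictionary, where pdf_bytes is the content of the PDF read from disk. Swapping the endpoint and Accept value is all it takes to ask for the other representations shown in this post.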
Extract text
We can extract the textual content of the file using a different endpoint. This endpoint is just /tika (I am not sure why the name isn’t more descriptive).
curl -X PUT -s http://localhost:9998/tika --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf
Results in:
Digital preservation testing document header Courier new 11 Document to test digital preservation tooling (footer) Footer font Atkinson Hyperlegible size 8 1 Document (Title) Centred (Arial 26) This is a sample document that contains some bits and pieces for digital preservation testing. Paragraph text below should be Arial font 11. ... ... ...
We can use content negotiation to retrieve this as JSON:
curl -X PUT -s http://localhost:9998/tika --header "Accept: application/json" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf | jq
There is also a markdown endpoint listed in the documentation and we can retrieve a semantically accurate rendition of the file as markdown:
curl -X PUT -s http://localhost:9998/tika/md --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf
Results in:
... ... ... # []()Document (Title) Centred (Arial 26) This is a sample document that contains some bits and pieces for digital preservation testing. Paragraph text below should be Arial font 11. # []()Introduction (h1) (Arial 20) This is another introduction. This next sentence uses strikethrough. This next sentence is all superscripted. This next sentence is all subscripted. # []()Redacted (h1) (Arial 20) In the original version of this document → THIS IS NOT REDACTED AND CONTAINS THE WORD SUPERCALIFRAGILISTICEXPIALIDOCIOUS → but it will be using Adobe’s PDF redaction features. # []()Content (h1) (Arial 20) Some more content. ## []()Content (h2) (Arial 16) right justified Some right justified content. ... ... ...
Extract objects
My very favorite endpoint asks Tika to extract the file’s artifacts if there are any embedded objects (embedded objects are not an insignificant issue to think about in digital preservation). We know from the fully featured PDF’s documentation that we should be able to extract seven other files embedded in this PDF:
filename : 'PDF-Sample-Document-Fully-Featured-Layout.docx', 'fmt/412'
filename : 'Floppy Disks.mp3', 'fmt/134'
filename : 'Floppy Disks.wav', 'fmt/141'
filename : 'circles.png', 'fmt/11'
filename : 'lineset_anim.u3d', 'fmt/702'
filename : 'salt_lake_utah.mov', 'x-fmt/384'
filename : 'sound.png', 'fmt/13'
Let’s see if Tika can work its magic!
curl -X PUT -s http://localhost:9998/unpack --header "Accept: application/zip" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf > extracted_files.zip
List archive contents:
lsar extracted_files.zip
Outputs:
PDF-Sample-Document-Fully-Featured-Layout.docx
salt_lake_utah.mov
aebaaa94-97c8-4baf-a6e4-14862d549f5c-Floppy Disks.mp3
2023-05-05-PDF-Sample-Document.docx
Floppy Disks.mp3
Tika is able to extract all but the wav and u3d files. It’s not quite perfect, but it’s not bad.
NB. I will try to follow up with the Tika team to ask why there might be an issue with the remaining objects. Tika’s issue tracking is available online.
Also, as I was reading the documentation about this endpoint, I saw that if you change the call from /unpack to /unpack/all you will also get all the text content AND metadata! Handy!
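If lsar is not available, the zip that /unpack returns can be inspected with a few lines of Python. This is only a sketch with function names of my own: it lists the archive members and reports which expected filenames have no exact match, so a renamed or prefixed member would still be reported as missing.

```python
import io
import zipfile


def list_unpacked(zip_bytes: bytes) -> list[str]:
    """Return the sorted member names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(archive.namelist())


def not_extracted(expected: list[str], extracted: list[str]) -> list[str]:
    """List expected filenames that have no exact match in the archive."""
    return sorted(set(expected) - set(extracted))
```

For the file saved above you could call list_unpacked(Path("extracted_files.zip").read_bytes()) and pass the result to not_extracted together with the filenames you expect from the PDF’s documentation.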
Tika 🤝 SafeText
Having introduced Tika, and SafeText earlier in this blog, what happens if we put them together?
You can copy the sample text from David Jacobson’s original repository. This contains some steganographic characters (you might not be able to see them!). Save the content to a file called sample.txt in the folder you have been working in today.
The message said: "Нey, let's hang out!" LoremIpsumDolorSit Reporter: "What colour was the individual?" James: "They were grey" Subject: Βudget Ϲuts
Then download a version of SafeText from the repository: https://github.com/ross-spencer/safetext/releases/tag/v0.0.1-rc.6.
Unpack it, and then check that it works:
./safetext -version
You should see something like:
safetext/0.0.1-rc.6 (commit: '455a19' (2025-03-02T19:48:06Z))
Now take sample.txt and pipe its contents to the SafeText binary:
cat sample.txt | ./safetext
You should get the following result:
{
"count": 156,
"total_steganographic": 6,
"percent_steganographic": 3.846154,
"positives": [
"CYRILLIC CAPITAL LETTER EN :: (H)",
"ZERO WIDTH SPACE :: (DEL)",
"GREEK CAPITAL LUNATE SIGMA SYMBOL :: (C)",
"GREEK CAPITAL LETTER BETA :: (B)"
],
"original": "The message said: \"Нey, let's hang out!\"\nLoremIpsumDolorSit\nReporter: \"What colour was the individual?\"\nJames: \"They were grey\"\nSubject: Βudget Ϲuts\n",
"appearances": "The message said: \"(Н)ey, let's hang out!\"\nLorem()Ipsum()Dolor()Sit\nReporter: \"What colour was the individual?\"\nJames: \"They were grey\"\nSubject: (Β)udget (Ϲ)uts\n"
}
We’ve identified the existence of 6 steganographic characters out of 156 total characters. We’ve identified what those characters are, and we can see where they are in the original text, highlighted with ( round brackets ).
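To give a feel for the kind of check involved, here is a rough Python sketch. It is not SafeText’s algorithm or character list. It rests on a much cruder assumption of my own: in text that is expected to be Latin-script, any invisible format character (Unicode category Cf) or any letter whose Unicode name does not start with LATIN is worth a second look.

```python
import unicodedata


def suspicious_characters(text: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for characters worth a second look.

    Assumption: the text is meant to be Latin-script, so format characters
    (category Cf) and non-LATIN letters are flagged. Legitimate non-Latin
    text would be flagged too, which SafeText's curated lists avoid.
    """
    found = []
    for index, char in enumerate(text):
        category = unicodedata.category(char)
        name = unicodedata.name(char, "UNNAMED")
        is_hidden = category == "Cf"
        is_foreign_letter = category.startswith("L") and not name.startswith("LATIN")
        if is_hidden or is_foreign_letter:
            found.append((index, name))
    return found
```

Running it over "\u041dey" reports CYRILLIC CAPITAL LETTER EN at index 0, the same look-alike H that SafeText found in the sample above.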
And that’s all there is!
Seriously?
The broader point of this demonstration is that given a tool like Apache Tika you can dig deeper into the contents of a digital record without a lot of effort.
SafeText is one tool designed to receive a stream of bytes from a Linux pipe but there are many others.
What if you pipe Tika’s content output into a checksum tool?
curl -X PUT -s http://localhost:9998/tika --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf | sha256sum
9e3ed3a85fb15f036f93acce75206e657af755959a496c641c6bb242b6436117 -
Or its metadata output?
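curl -X PUT -s http://localhost:9998/meta --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf | sha256sum
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 -
NB. This particular value is the SHA-256 of empty input, so the request above returned an empty body: /meta does not seem to offer a text/plain representation. Drop the Accept header, or ask for application/json as we did earlier, to checksum the actual metadata.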
Now you have two checksums for metadata, AND content… not just the individual file. We know a checksum allows us to assert something about the state of “preservation” of an object. How might we use a checksum for either the metadata, or the content of an object in a future preservation scenario?
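A checksum over a tool’s output is only meaningful if there was output to hash. The digest of empty input is a well-known constant beginning e3b0c442, so seeing it is a sign that nothing reached the hashing tool. A small Python sketch (the function name is my own) that refuses to checksum an empty body:

```python
import hashlib

# SHA-256 of zero bytes: what a checksum tool reports when it receives no input.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def content_checksum(body: bytes) -> str:
    """Return the hex SHA-256 of a response body, rejecting an empty one."""
    if not body:
        raise ValueError("empty response body: nothing to checksum")
    return hashlib.sha256(body).hexdigest()
```

Raising an error here is a design choice: an empty body usually means the request failed, and a silently recorded checksum of nothing is worse than no checksum at all.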
You can save the content in its different forms to any filename using shell redirection (>):
curl -X PUT -s http://localhost:9998/tika --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf > newfilename.txt
Contents can be “diff”-ed against a previous version of the file if you have one, e.g. diff newfilename.txt oldfilename.txt.
You can count the words in the file using tools like wc --words newfilename.txt, or what about counting unique words (ref1, ref2, ref3)?
cat newfilename.txt | tr ' ' '\n' | tr 'A-Z' 'a-z' | sort -r | uniq -c | awk '{print $1"--"$2}' | sort -n
... ... ...
12--16)
12--but
15--20)
15--digital
15--(h2)
15--preservation
16--some
17--a
20--(h1)
20--to
21--and
21--document
21--upon
22--[bookmark:
24--billions
27--this
31--(arial
33--of
37--is
54--the
... ... ...
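The same count can be sketched in Python with collections.Counter. This is an approximation of the pipeline above rather than an exact port: it splits on any whitespace (not just spaces) and lower-cases all letters (not just A-Z), so the numbers can differ slightly.

```python
from collections import Counter


def word_counts(text: str) -> list[tuple[str, int]]:
    """Count lower-cased, whitespace-separated words, rarest first."""
    counts = Counter(text.lower().split())
    # Sort by count, then alphabetically so that ties have a stable order.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

Calling word_counts(Path("newfilename.txt").read_text()) gives a list you can slice, filter, or write out in whatever form suits you.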
A utility that I use in demystify is Cooper Hewitt’s UCD for listing the Unicode Names of characters in strings.
Try it with newfilename.txt and you will see ALL the Unicode names of ALL the characters in the file, not just the steganographic ones (if there are any).
./ucd $(cat newfilename.txt) | sort | uniq
COLON
COMMA
COMMERCIAL AT
DIGIT EIGHT
DIGIT FIVE
DIGIT FOUR
DIGIT NINE
DIGIT ONE
DIGIT SEVEN
DIGIT SIX
DIGIT THREE
DIGIT TWO
DIGIT ZERO
FULL STOP
HORIZONTAL ELLIPSIS
HYPHEN-MINUS
LATIN CAPITAL LETTER A
LATIN CAPITAL LETTER B
LATIN CAPITAL LETTER C
LATIN CAPITAL LETTER D
LATIN CAPITAL LETTER E
LATIN CAPITAL LETTER F
LATIN CAPITAL LETTER G
LATIN CAPITAL LETTER H
LATIN CAPITAL LETTER I
LATIN CAPITAL LETTER L
LATIN CAPITAL LETTER N
LATIN CAPITAL LETTER O
LATIN CAPITAL LETTER P
LATIN CAPITAL LETTER R
LATIN CAPITAL LETTER S
LATIN CAPITAL LETTER T
LATIN CAPITAL LETTER U
LATIN CAPITAL LETTER V
LATIN CAPITAL LETTER W
LATIN CAPITAL LETTER X
LATIN CAPITAL LETTER Y
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LETTER D
LATIN SMALL LETTER E
LATIN SMALL LETTER F
LATIN SMALL LETTER G
LATIN SMALL LETTER H
LATIN SMALL LETTER I
LATIN SMALL LETTER J
LATIN SMALL LETTER K
LATIN SMALL LETTER L
LATIN SMALL LETTER M
LATIN SMALL LETTER N
LATIN SMALL LETTER O
LATIN SMALL LETTER P
LATIN SMALL LETTER Q
LATIN SMALL LETTER R
LATIN SMALL LETTER S
LATIN SMALL LETTER T
LATIN SMALL LETTER U
LATIN SMALL LETTER V
LATIN SMALL LETTER W
LATIN SMALL LETTER X
LATIN SMALL LETTER Y
LATIN SMALL LETTER Z
LEFT PARENTHESIS
LEFT SQUARE BRACKET
LOW LINE
RIGHT PARENTHESIS
RIGHT SINGLE QUOTATION MARK
RIGHT SQUARE BRACKET
RIGHTWARDS ARROW
SOLIDUS
SPACE
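If you do not have UCD to hand, Python’s built-in unicodedata module can produce a similar listing. A minimal sketch follows. The names come from the Unicode database bundled with your Python version, and the U+XXXX fallback for characters that have no name (control characters such as newline) is my own convention, not something UCD does.

```python
import unicodedata


def unicode_names(text: str) -> list[str]:
    """Return the sorted, de-duplicated Unicode names of the characters in text."""
    return sorted({unicodedata.name(char, f"U+{ord(char):04X}") for char in text})
```

Something like print("\n".join(unicode_names(Path("newfilename.txt").read_text()))) gives one name per line, much like the sort | uniq output above.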
In some contexts, knowing the characters that are in a file might guide you toward additional preservation techniques. The use of Emoji, for example, might require asking a donor to describe their authoring platform so that you can determine whether Emoji were previously displayed in Android, Microsoft, or Apple style. You might want to capture paradata about the original system if you haven’t already; you might want to capture a more static representation of the original record if the option is available to you.
What tools do you use?
I obviously won’t use all of these techniques all of the time. We rarely have time to look at our digital records in such detail but I do believe in the importance of content analysis, even for preservation. It gets us out of the “file” mindset as our only way of looking at records, and more into the data, content, meaning side of preservation. In the future it might help us to think of new techniques of accessing records if we haven’t been able to achieve digital preservation perfection.
Whether you’re a digital preservationist, archivist, or digital humanist, what tools do you like to use on the command line to analyze content?
About SafeText
I ported SafeText to Go using the general idea of Jacobson’s code as a guide. I wrote a library so that I could import its features into other Golang projects of mine (such as tools like tikalinkextract or other projects like Steffen Fritz’s FileTrove).
This new implementation used a technique I learned while working on Siegfried and GOCFL called composite interfaces. Firstly an interface provides a definition of how work can be performed. You can imagine these interfaces as generic “abilities”. We can write multiple abilities and register them with a single interface with the same methods – this is our “composite interface”. Now when we call a common trigger function in the composite interface all of the registered abilities will be called at the same time.
The beauty of this approach is that we can easily add or remove abilities, making it very easy to customize. We can support other written languages and their nuances. We can also incorporate new techniques for analyzing text such as those we’ve explored elsewhere on the command line today. Wrapping them in code allows us to pluralize techniques in a more reliable manner than is possible by chaining commands on the command line.
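The pattern is easier to see in code. What follows is a Python sketch of the general idea only, not the Go code in SafeText: every class and method name is hypothetical, and the two example abilities are deliberately trivial. Here the registered abilities are called one after another.

```python
from typing import Protocol


class Analyzer(Protocol):
    """An "ability": anything that can report findings about some text."""

    def analyze(self, text: str) -> list[str]: ...


class ZeroWidthAnalyzer:
    def analyze(self, text: str) -> list[str]:
        return [f"zero width space at {i}" for i, c in enumerate(text) if c == "\u200b"]


class NonAsciiAnalyzer:
    def analyze(self, text: str) -> list[str]:
        return [f"non-ASCII character at {i}" for i, c in enumerate(text) if ord(c) > 127]


class CompositeAnalyzer:
    """Has the same method as Analyzer and fans out to every registered ability."""

    def __init__(self) -> None:
        self._analyzers: list[Analyzer] = []

    def register(self, analyzer: Analyzer) -> None:
        self._analyzers.append(analyzer)

    def analyze(self, text: str) -> list[str]:
        findings: list[str] = []
        for analyzer in self._analyzers:  # one trigger, every ability runs in turn
            findings.extend(analyzer.analyze(text))
        return findings
```

Because CompositeAnalyzer satisfies the same protocol as the abilities it holds, calling code never needs to know whether it has been handed one ability or twenty; adding a new technique is a single register call.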
SafeText is on GitHub: https://github.com/ross-spencer/safetext
And its documentation is available on the Go package registry: https://pkg.go.dev/github.com/ross-spencer/safetext@v0.0.1-rc.6
SafeText JSON
Beyond the original SafeText I felt it important to create a common resource for capturing Steganographic characters. I created SafeText JSON and made it available on GitHub.
- SafeText JSON https://github.com/ross-spencer/safetext-json
A resource such as this means that the data feeding the tool isn’t obfuscated by code. Instead it is an independent resource that can be loaded into libraries such as mine or Jacobson’s at runtime. We can version this resource differently and be more precise with the checksums of the data we are loading to make sure that the lists only change through authorized methods and releases.
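As a sketch of what being precise with checksums could look like when loading such a resource, the Python below parses a JSON document only if its bytes match a pinned SHA-256 digest. It makes no assumption about the actual structure of SafeText JSON; the function name and the idea of pinning the digest alongside a release are illustrative only.

```python
import hashlib
import json


def load_verified(raw: bytes, expected_sha256: str) -> object:
    """Parse JSON only if its bytes match a pinned SHA-256 digest."""
    actual = hashlib.sha256(raw).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch: got {actual}")
    return json.loads(raw)
```

The check happens before parsing, so a list that has been altered outside of a release never reaches the code that uses it.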
NB. I am sure there are a greater number of steganographic characters out there these days. Without an easy way to encode their potential use in something like the Unicode standard, information resources like this are important.
In summary
My conceit is that applying digital humanities techniques to digital content can one day help shape what preservation looks like. Not replacing anything that we currently do, i.e. never at the expense of the digital record, but by way of providing more metrics and clues and routes into what we actually have to preserve one day.
We kind of measure things by the “file format”, the encapsulation of the content of the record itself (at least for records that aren’t multi-part). But the record is the content, the thing we read, or interacted with, that we rendered, displayed, performed. We should always be thinking about the need to access that one day.
Questions about content, and quality of content, are even more important in today’s world where we wrestle with issues of authenticity coming not only from AI, but potentially from corrupt government regimes or corporations.
You can protect yourself by being proactive about using tools like SafeText; editors can protect those providing vital information about corruption and scandals by being proactive too.
Using SafeText we can identify whether someone is being monitored, or whether documents contain other information that is invisible to the human eye, such as non-visible characters encoded as a message to another reader not looking at the original document.
With other tools we might detect and understand other non-textual elements that we would miss when looking at something through our favorite display tools.
Using Apache Tika we have seen that it is easy to access different representations of the digital files that it supports. I hope Apache Tika has a long life and that we continue to gain new similar tools to perform similar tasks in future.
It is something I will continue to look at, and I hope it has provided food for thought for those reading. Let me know in the comments and replies below.
Importantly, let’s be careful out there.
Binary Trees
Tika was an important part of my work on Binary trees? Automatically identifying the links between born-digital records. If you have made it this far, you might also be interested in some of the topics discussed there.
GOCFL
The University of Basel’s GOCFL uses OCFL to provide an encapsulated preservation solution, incorporating different tools for format identification and creating access derivatives. Each file added to a GOCFL archive is made ready for search indexing using Tika as its text extraction mechanism. The text of every record can then be accessed by indexing tools in future enabling search and other methods of access and analysis.
Extracting binary objects with Tika
Grab Tim Allison’s presentation from 2020 on the risks and challenges of embedded objects in digital files:
- Embedded Files: Risks, Challenges and Options by Tim Allison on SlideShare.
Searching digital objects
Bertrand Caron has a nice blog looking at search in digital objects using Apache Tika and other tools.
Explainshell
Find out more about the shell commands in the post above by copying and pasting them into Explainshell.com.
Permalinks in GitHub
Some of the links I use in this text that come from GitHub are permalinks. These link back to today’s version of the code or text. This can be really useful for making sure your links continue to work when files are moved or removed from GitHub. Just hit y when you are on the page of a file on GitHub to create a permalink. Your future self (and those preserving your blog!) will thank you!
Find out more on GitHub.
Authenticity
I was thinking about authenticity when I first wrote about steganographic techniques for sign-posting AI outputs, and I found this small post on the desire for authenticity in these times. I liked it so I thought I’d amplify it here.
- ahhhthenticity by @rileylemm on Medium.