Porting SafeText and analyzing digital content with Apache Tika

Last year I wrote about pitfalls in modern journalism, especially with regard to receiving documents and information from whistleblowers without offering them adequate protection.

The tl;dr is that you, as a whistleblower, need to protect yourself; and you, as an editor or journalist, need to protect your whistleblowers.

Steganographic fingerprints might be one method adopted to detect someone leaking information. Steganographic characters replace common textual characters with unusual but hard-to-detect variants, i.e. they look the same to the human eye, or are actually invisible. Using a tool called SafeText by David Jacobson we can identify these hidden fingerprints in the content that you share.

I firmly believe we can find clues about what is important to preserve, or learn to preserve, when we analyse the content of the digital record and not just the (file) format of the digital record.

A file can contain many different features and each of these is a challenge to its future interpretation, and thus preservation.

I wanted to use SafeText in some of my other non-Python tooling and so I decided to port the code to Golang as a composable module and binary.

By coincidence at the time I started writing this I had also just written about revisiting tikalinkextract and so I thought I would write this small explanation about how you might combine Tika and SafeText to perform some content analysis of your own.

Who knows, maybe we will find a conspiracy. Maybe we’ll find secret codes in our own digital records. Maybe we’ll learn something new about our records…

Let’s have a look at putting Tika and SafeText together and see where it goes.

Tika as a Swiss Army Knife

Apache Tika has been a constant in my tooling since I learned about it.

Tika is an amazing tool that has the ability to read the metadata and content of many file formats, from PDF to Word documents.

All you need is Tika’s server implementation. With that, you can forward data to it and Tika will attempt to parse what it was given using one of its many modules.

Tika can return the textual content of the file, its metadata, or even its embedded objects.

You can download Tika server from the Apache website using wget, e.g.

wget https://dlcdn.apache.org/tika/3.3.0/tika-server-standard-3.3.0.jar

You can download and verify the binary’s sha512 checksum as follows:

wget https://downloads.apache.org/tika/3.3.0/tika-server-standard-3.3.0.jar.sha512

cat tika-server-standard-3.3.0.jar.sha512 && sha512sum tika-server-standard-3.3.0.jar
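The cat and sha512sum pairing above leaves the comparison to your eyes. A minimal sketch of an automated comparison follows. verify_sha512 is a name made up for illustration, and it assumes the .sha512 file begins with the hex digest (with or without a filename after it), which may not hold for every Apache project.

```shell
# verify_sha512 FILE CHECKSUM_FILE
# Hypothetical helper: compares the SHA-512 of FILE with the first
# whitespace-delimited field on the first line of CHECKSUM_FILE
# (assumed to be the hex digest).
verify_sha512() {
  expected=$(awk 'NR == 1 { print tolower($1) }' "$2")
  actual=$(sha512sum "$1" | awk '{ print tolower($1) }')
  if [ -n "$expected" ] && [ "$expected" = "$actual" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1" >&2
    return 1
  fi
}
```

Usage would be verify_sha512 tika-server-standard-3.3.0.jar tika-server-standard-3.3.0.jar.sha512, which returns a non-zero exit status on a mismatch so it can be used in a script.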

The documentation is pretty good so take a look at that for all its features.

Running Apache Tika

You can run Tika as follows:

java -jar tika-server-standard-3.3.0.jar

This will run the service on port 9998.

You can also use nohup (info) to put Tika in the background. This keeps Tika available on port 9998 until you restart your machine or kill the process. I tend to use fuser to kill it, e.g. fuser -k 9998/tcp.
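As a sketch of the background approach, the functions below start Tika under nohup and then poll the /version endpoint until the server answers. wait_for and start_tika are hypothetical helper names, tika.log is an arbitrary log filename chosen here, and the 30 attempts are a guess at a sensible start-up allowance.

```shell
# wait_for ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to ATTEMPTS times, one second apart,
# and succeeds as soon as COMMAND exits successfully.
wait_for() {
  attempts=$1
  shift
  while [ "$attempts" -gt 0 ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

# start_tika: launch the server in the background, then wait until it answers.
# tika.log is an arbitrary filename chosen for this sketch.
start_tika() {
  nohup java -jar tika-server-standard-3.3.0.jar > tika.log 2>&1 &
  wait_for 30 curl -s -f http://localhost:9998/version
}
```

Calling start_tika from the folder holding the jar should return once the server is ready, or fail after roughly thirty seconds if it never responds.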

You can visit http://localhost:9998/ in your web browser for a summary of all of its endpoints (although the documentation link above is much clearer on their function).

A command line way to see if it is running is to request Tika’s version information:

curl -X GET -s http://localhost:9998/version

I get the result Apache Tika 3.3.0.

Exploring a small selection of its tools

Before we get into the SafeText examples below, I wanted to demonstrate Tika’s more useful endpoints.

Let’s start by downloading my “fully featured PDF” from the OPF Format Corpus.

wget https://raw.githubusercontent.com/openpreserve/format-corpus/refs/heads/master/fully-featured-pdf/PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf

Extract metadata

Given the fully featured PDF we can inspect its key-value metadata through Tika’s /meta endpoint:

curl -X PUT -s http://localhost:9998/meta -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf

It should look as follows:

pdf:PDFVersion,1.7
pdf:has3D,true
xmp:CreatorTool,Acrobat PDFMaker 23 for Word
pdf:docinfo:title,Digital Preservation Testing Document
pdf:hasXFA,false
X-TIKA:Parsed-By-Full-Set,org.apache.tika.parser.DefaultParser,org.apache.tika.parser.pdf.PDFParser
pdf:num3DAnnotations,1
language,en
dc:format,application/pdf; version=1.7
xmp:pdf:Keywords,Digital Preservation; Digipres; Archives; OPF Test Corpus
pdf:docinfo:creator_tool,Acrobat PDFMaker 23 for Word
access_permission:fill_in_form,true
pdf:hasCollection,false
pdf:encrypted,false
dc:title,Digital Preservation Testing Document
pdf:containsNonEmbeddedFont,false
xmp:CreateDate,2023-05-05T09:34:28Z
pdf:docinfo:custom:SourceModified,D:20230505093127
pdf:docinfo:subject,Digital Preservation
pdf:hasMarkedContent,true
pdf:ocrPageCount,0
access_permission:can_print_faithful,true
xmp:ModifyDate,2023-05-05T10:56:56Z
pdf:docinfo:creator,@beet_keeper - github.com/ross-spencer
access_permission:extract_for_accessibility,true
pdf:hasAcroFormFields,true
xmp:dc:creator,@beet_keeper - github.com/ross-spencer
X-TIKA:Parsed-By,org.apache.tika.parser.DefaultParser,org.apache.tika.parser.pdf.PDFParser
dc:description:x-default,Digital Preservation
pdf:annotationTypes,14f2c2d8-1078-4f71-9736-f229c34993ba,20f602aa-5ddb-42ba-a935-5c4d49a8bc52,3D6,4a18ad93-6732-4989-a0aa-47061658812a,8deb2acc-b801-4004-b95c-bf257d23789e,RM4,RM7,af5915ed-74a6-4348-b7a8-fff23c413783,fdbca803-a2bc-4b35-aca8-2117d6dfb047,null
xmp:dc:description:x-default,Digital Preservation
pdf:docinfo:producer,Adobe PDF Library 23.1.175
pdf:annotationSubtypes,3D,Circle,FileAttachment,Line,Link,Polygon,Popup,RichMedia,Sound,Square,Widget
xmp:dc:description,Digital Preservation
pdf:containsDamagedFont,false
pdf:unmappedUnicodeCharsPerPage,0,0,0
Digital preservation testing property #1,Digital preservation testing value #1
dc:description,Digital Preservation
Digital preservation testing property #2,Digital preservation testing value #2
Digital preservation testing property #3,Digital preservation testing value #3
access_permission:modify_annotations,true
dc:title:x-default,Digital Preservation Testing Document
dc:creator,@beet_keeper - github.com/ross-spencer
dcterms:created,2023-05-05T09:34:28Z
dcterms:modified,2023-05-05T10:56:56Z
xmp:dc:title:x-default,Digital Preservation Testing Document
xmpMM:DocumentID,uuid:a9a30d32-5d69-4790-b5f6-1705102a20a4
xmp:dc:title,Digital Preservation Testing Document
pdf:docinfo:custom:Digital preservation testing property #4,Digital preservation testing value #4
pdf:overallPercentageUnmappedUnicodeChars,0.0
pdf:docinfo:custom:Digital preservation testing property #3,Digital preservation testing value #3
pdf:docinfo:keywords,Digital Preservation; Digipres; Archives; OPF Test Corpus
pdf:docinfo:modified,2023-05-05T10:56:56Z
pdf:docinfo:custom:Digital preservation testing property #2,Digital preservation testing value #2
pdf:docinfo:custom:Digital preservation testing property #1,Digital preservation testing value #1
Content-Length,22724407
Digital preservation testing property #4,Digital preservation testing value #4
Content-Type,application/pdf
xmp:MetadataDate,2023-05-05T10:56:56Z
pdf:producer,Adobe PDF Library 23.1.175
dc:subject,Digital Preservation; Digipres; Archives; OPF Test Corpus,Digital Preservation,Digipres,Archives,OPF Test Corpus
xmp:pdf:Producer,Adobe PDF Library 23.1.175
pdf:totalUnmappedUnicodeChars,0
access_permission:assemble_document,true
xmpTPg:NPages,3
pdf:hasXMP,true
pdf:charsPerPage,1062,1157,498
access_permission:extract_content,true
pdf:docinfo:custom:44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 70 72 6F 70 65 72 74 79 20 23 35,44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 76 61 6C 75 65
xmp:dc:subject,Digital Preservation,Digipres,Archives,OPF Test Corpus
access_permission:can_print,true
SourceModified,D:20230505093127
44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 70 72 6F 70 65 72 74 79 20 23 35,44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 76 61 6C 75 65
meta:keyword,Digital Preservation; Digipres; Archives; OPF Test Corpus
access_permission:can_modify,true
pdf:docinfo:created,2023-05-05T09:34:28Z
xmpMM:InstanceID,uuid:3fd68a29-56e9-4c06-8ffb-076d775fa4c8
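With this many keys it can help to pull out one value at a time. A small sketch follows: meta_value is a hypothetical helper name, and it assumes the simple key,value layout shown above. It strips plain double quotes in case the raw response is quoted, but it is not a full CSV parser.

```shell
# meta_value KEY
# Hypothetical helper: reads key,value lines on standard input and prints
# everything after the first comma for the line whose key is exactly KEY.
meta_value() {
  awk -v key="$1" '
    {
      line = $0
      gsub(/"/, "", line)          # drop simple CSV quoting, if any
      prefix = key ","
      if (index(line, prefix) == 1) {
        print substr(line, length(prefix) + 1)
      }
    }'
}
```

Piping the /meta request above into meta_value xmpTPg:NPages should, going by the listing, print 3. Multi-valued keys such as pdf:charsPerPage come back as their comma-separated remainder.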

Tika does a lot through content negotiation. Content negotiation is where you ask the web server for a different representation of a resource, if one is available. Maybe Tika can display the metadata as JSON? We know the MIME type for JSON is application/json, so let’s ask for that using a different Accept header:

curl -X PUT -s http://localhost:9998/meta --header "Accept: application/json" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf | jq

NB. jq is used to pretty-print the result. You can remove | jq if you don’t have it installed.

Bingo!

{
  "pdf:PDFVersion": "1.7",
  "pdf:has3D": "true",
  "xmp:CreatorTool": "Acrobat PDFMaker 23 for Word",
  "pdf:docinfo:title": "Digital Preservation Testing Document",
  "pdf:hasXFA": "false",
  "X-TIKA:Parsed-By-Full-Set": [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.pdf.PDFParser"
  ],
  "pdf:num3DAnnotations": "1",
  "language": "en",
  "dc:format": "application/pdf; version=1.7",
  "xmp:pdf:Keywords": "Digital Preservation; Digipres; Archives; OPF Test Corpus",
  "pdf:docinfo:creator_tool": "Acrobat PDFMaker 23 for Word",
  "access_permission:fill_in_form": "true",
  "pdf:hasCollection": "false",
  "pdf:encrypted": "false",
  "dc:title": "Digital Preservation Testing Document",
  "pdf:containsNonEmbeddedFont": "false",
  "xmp:CreateDate": "2023-05-05T09:34:28Z",
  "pdf:docinfo:custom:SourceModified": "D:20230505093127",
  "pdf:docinfo:subject": "Digital Preservation",
  "pdf:hasMarkedContent": "true",
  "pdf:ocrPageCount": "0",
  "access_permission:can_print_faithful": "true",
  "xmp:ModifyDate": "2023-05-05T10:56:56Z",
  "pdf:docinfo:creator": "@beet_keeper - github.com/ross-spencer",
  "access_permission:extract_for_accessibility": "true",
  "pdf:hasAcroFormFields": "true",
  "xmp:dc:creator": "@beet_keeper - github.com/ross-spencer",
  "X-TIKA:Parsed-By": [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.pdf.PDFParser"
  ],
  "dc:description:x-default": "Digital Preservation",
  "pdf:annotationTypes": [
    "14f2c2d8-1078-4f71-9736-f229c34993ba",
    "20f602aa-5ddb-42ba-a935-5c4d49a8bc52",
    "3D6",
    "4a18ad93-6732-4989-a0aa-47061658812a",
    "8deb2acc-b801-4004-b95c-bf257d23789e",
    "RM4",
    "RM7",
    "af5915ed-74a6-4348-b7a8-fff23c413783",
    "fdbca803-a2bc-4b35-aca8-2117d6dfb047",
    "null"
  ],
  "xmp:dc:description:x-default": "Digital Preservation",
  "pdf:docinfo:producer": "Adobe PDF Library 23.1.175",
  "pdf:annotationSubtypes": [
    "3D",
    "Circle",
    "FileAttachment",
    "Line",
    "Link",
    "Polygon",
    "Popup",
    "RichMedia",
    "Sound",
    "Square",
    "Widget"
  ],
  "xmp:dc:description": "Digital Preservation",
  "pdf:containsDamagedFont": "false",
  "pdf:unmappedUnicodeCharsPerPage": [
    "0",
    "0",
    "0"
  ],
  "Digital preservation testing property #1": "Digital preservation testing value #1",
  "dc:description": "Digital Preservation",
  "Digital preservation testing property #2": "Digital preservation testing value #2",
  "Digital preservation testing property #3": "Digital preservation testing value #3",
  "access_permission:modify_annotations": "true",
  "dc:title:x-default": "Digital Preservation Testing Document",
  "dc:creator": "@beet_keeper - github.com/ross-spencer",
  "dcterms:created": "2023-05-05T09:34:28Z",
  "dcterms:modified": "2023-05-05T10:56:56Z",
  "xmp:dc:title:x-default": "Digital Preservation Testing Document",
  "xmpMM:DocumentID": "uuid:a9a30d32-5d69-4790-b5f6-1705102a20a4",
  "xmp:dc:title": "Digital Preservation Testing Document",
  "pdf:docinfo:custom:Digital preservation testing property #4": "Digital preservation testing value #4",
  "pdf:overallPercentageUnmappedUnicodeChars": "0.0",
  "pdf:docinfo:custom:Digital preservation testing property #3": "Digital preservation testing value #3",
  "pdf:docinfo:keywords": "Digital Preservation; Digipres; Archives; OPF Test Corpus",
  "pdf:docinfo:modified": "2023-05-05T10:56:56Z",
  "pdf:docinfo:custom:Digital preservation testing property #2": "Digital preservation testing value #2",
  "pdf:docinfo:custom:Digital preservation testing property #1": "Digital preservation testing value #1",
  "Content-Length": "22724407",
  "Digital preservation testing property #4": "Digital preservation testing value #4",
  "Content-Type": "application/pdf",
  "xmp:MetadataDate": "2023-05-05T10:56:56Z",
  "pdf:producer": "Adobe PDF Library 23.1.175",
  "dc:subject": [
    "Digital Preservation; Digipres; Archives; OPF Test Corpus",
    "Digital Preservation",
    "Digipres",
    "Archives",
    "OPF Test Corpus"
  ],
  "xmp:pdf:Producer": "Adobe PDF Library 23.1.175",
  "pdf:totalUnmappedUnicodeChars": "0",
  "access_permission:assemble_document": "true",
  "xmpTPg:NPages": "3",
  "pdf:hasXMP": "true",
  "pdf:charsPerPage": [
    "1062",
    "1157",
    "498"
  ],
  "access_permission:extract_content": "true",
  "pdf:docinfo:custom:44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 70 72 6F 70 65 72 74 79 20 23 35": "44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 76 61 6C 75 65",
  "xmp:dc:subject": [
    "Digital Preservation",
    "Digipres",
    "Archives",
    "OPF Test Corpus"
  ],
  "access_permission:can_print": "true",
  "SourceModified": "D:20230505093127",
  "44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 70 72 6F 70 65 72 74 79 20 23 35": "44 69 67 69 74 61 6C 20 70 72 65 73 65 72 76 61 74 69 6F 6E 20 74 65 73 74 69 6E 67 20 76 61 6C 75 65",
  "meta:keyword": "Digital Preservation; Digipres; Archives; OPF Test Corpus",
  "access_permission:can_modify": "true",
  "pdf:docinfo:created": "2023-05-05T09:34:28Z",
  "xmpMM:InstanceID": "uuid:3fd68a29-56e9-4c06-8ffb-076d775fa4c8"
}

Extract text

We can extract the textual content of the file using a different endpoint. This endpoint is just /tika (I am not sure why the name isn’t more descriptive).

curl -X PUT -s http://localhost:9998/tika --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf

Results in:

Digital preservation testing document header 
Courier new 11 

Document to test digital preservation tooling (footer) 

Footer font Atkinson Hyperlegible size 8 

1 

Document (Title) Centred (Arial 26) 
This is a sample document that contains some bits and pieces for digital preservation 
testing. Paragraph text below should be Arial font 11.  

...
...
...

We can use content negotiation to retrieve this as JSON:

curl -X PUT -s http://localhost:9998/tika --header "Accept: application/json" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf | jq

There is also a markdown endpoint listed in the documentation and we can retrieve a semantically accurate rendition of the file as markdown:

curl -X PUT -s http://localhost:9998/tika/md --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf

Results in:

...
...
...

# []()Document (Title) Centred (Arial 26)  

This is a sample document that contains some bits and pieces for digital preservation testing. Paragraph text below should be Arial font 11.   

# []()Introduction (h1) (Arial 20)  

This is another introduction. This next sentence uses strikethrough. This next sentence is all superscripted. This next sentence is all subscripted.  

# []()Redacted (h1) (Arial 20)  

In the original version of this document → THIS IS NOT REDACTED AND CONTAINS THE WORD SUPERCALIFRAGILISTICEXPIALIDOCIOUS → but it will be using Adobe’s PDF redaction features.   

# []()Content (h1) (Arial 20)  

Some more content.  

## []()Content (h2) (Arial 16) right justified  

Some right justified content.  

...
...
...

Extract objects

My very favorite endpoint asks Tika to extract the file’s artifacts if there are any embedded objects (embedded objects are not an insignificant issue to think about in digital preservation). We know from the fully featured PDF’s documentation that we should be able to extract seven other files embedded in this PDF:

filename : 'PDF-Sample-Document-Fully-Featured-Layout.docx', 'fmt/412'
filename : 'Floppy Disks.mp3', 'fmt/134'
filename : 'Floppy Disks.wav', 'fmt/141'
filename : 'circles.png', 'fmt/11'
filename : 'lineset_anim.u3d', 'fmt/702'
filename : 'salt_lake_utah.mov', 'x-fmt/384'
filename : 'sound.png', 'fmt/13'

Let’s see if Tika can work its magic!

curl -X PUT -s http://localhost:9998/unpack --header "Accept: application/zip" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf > extracted_files.zip

List archive contents:

lsar extracted_files.zip

Outputs:

PDF-Sample-Document-Fully-Featured-Layout.docx
salt_lake_utah.mov
aebaaa94-97c8-4baf-a6e4-14862d549f5c-Floppy Disks.mp3
2023-05-05-PDF-Sample-Document.docx
Floppy Disks.mp3

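Comparing the two lists, Tika extracts the docx, mp3 and mov files, but the wav and u3d files are missing, and the two png files do not appear in the listing either. It’s not quite perfect, but it’s not bad.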
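To compare a listing like this against the expected filenames without eyeballing it, a short sketch: missing_from is a hypothetical helper, and expected.txt and actual.txt are assumed to be plain files with one filename per line.

```shell
# missing_from EXPECTED ACTUAL
# Hypothetical helper: prints every line of EXPECTED that does not appear,
# as an exact whole-line match, anywhere in ACTUAL.
# Note: grep exits non-zero when nothing is missing, because it printed nothing.
missing_from() {
  grep -v -x -F -f "$2" "$1"
}
```

One way to produce actual.txt as a bare list of names is unzip -Z1 extracted_files.zip > actual.txt. Exact matching means a prefixed name such as aebaaa94-...-Floppy Disks.mp3 will not count as Floppy Disks.mp3, which is probably what you want when auditing an extraction.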

NB. I will try to follow up with the Tika team to ask why there might be an issue with the remaining objects. Tika’s issue tracking is available online.

Also, as I was reading the documentation about this endpoint, I saw that if you change the call from /unpack to /unpack/all you will also get all the text content AND metadata! Handy!

Tika 🤝 SafeText

Having introduced Tika, and SafeText earlier in this blog, what happens if we put them together?

You can copy the sample text from David Jacobson’s original repository. This contains some steganographic characters (you might not be able to see them!). Save the content to a file called sample.txt in the folder you have been working in today.

The message said: "Нey, let's hang out!"
LoremIpsumDolorSit
Reporter: "What colour was the individual?"
James: "They were grey"
Subject: Βudget Ϲuts

Then download a version of SafeText from the repository: https://github.com/ross-spencer/safetext/releases/tag/v0.0.1-rc.6.

Unpack it, and then check that it works:

./safetext -version

You should see something like:

safetext/0.0.1-rc.6 (commit: '455a19' (2025-03-02T19:48:06Z))

Now take sample.txt and pipe its contents to the SafeText binary:

cat sample.txt | ./safetext

You should get the following result:

{
  "count": 156,
  "total_steganographic": 6,
  "percent_steganographic": 3.846154,
  "positives": [
    "CYRILLIC CAPITAL LETTER EN :: (H)",
    "ZERO WIDTH SPACE :: (DEL)",
    "GREEK CAPITAL LUNATE SIGMA SYMBOL :: (C)",
    "GREEK CAPITAL LETTER BETA :: (B)"
  ],
  "original": "The message said: \"Нey, let's hang out!\"\nLoremIpsumDolorSit\nReporter: \"What colour was the individual?\"\nJames: \"They were grey\"\nSubject: Βudget Ϲuts\n",
  "appearances": "The message said: \"(Н)ey, let's hang out!\"\nLorem()Ipsum()Dolor()Sit\nReporter: \"What colour was the individual?\"\nJames: \"They were grey\"\nSubject: (Β)udget (Ϲ)uts\n"
}

We’ve identified the existence of 6 steganographic characters out of 156 total characters. We’ve identified what those characters are, and we can see where they are in the original text, highlighted with ( round brackets ).
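Because SafeText reads from standard input, the same pipe should work with Tika’s text output in place of cat. A sketch under a few assumptions: the function names and the TIKA_URL and SAFETEXT variables are made up for illustration, Tika is running locally as described earlier, and the safetext binary sits in the current folder.

```shell
# Both settings can be overridden from the environment; the defaults match
# the setup used in this post.
TIKA_URL=${TIKA_URL:-http://localhost:9998}
SAFETEXT=${SAFETEXT:-./safetext}

# tika_text FILE: ask Tika for the plain-text content of FILE.
tika_text() {
  curl -X PUT -s "$TIKA_URL/tika" --header "Accept: text/plain" -T "$1"
}

# scan_file FILE: feed Tika's text output to SafeText on standard input.
scan_file() {
  tika_text "$1" | "$SAFETEXT"
}
```

For example, scan_file PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf would report on the text Tika extracts from the PDF rather than on the PDF's raw bytes.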

And that’s all there is!

Seriously?

The broader point of this demonstration is that given a tool like Apache Tika you can dig deeper into the contents of a digital record without a lot of effort.

SafeText is one tool designed to receive a stream of bytes from a Linux pipe but there are many others.

What if you pipe Tika’s content output into a checksum tool?

curl -X PUT -s http://localhost:9998/tika --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf | sha256sum
9e3ed3a85fb15f036f93acce75206e657af755959a496c641c6bb242b6436117 -

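Or its metadata output? One caveat: when I asked /meta for text/plain the checksum came back as e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, which is the SHA-256 of empty input and suggests Tika returned no body for that Accept header. Asking for text/csv (the key-value form shown earlier) or application/json should give sha256sum something to hash:

curl -X PUT -s http://localhost:9998/meta --header "Accept: text/csv" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf | sha256sum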

Now you have two checksums for metadata, AND content… not just the individual file. We know a checksum allows us to assert something about the state of “preservation” of an object. How might we use a checksum for either the metadata, or the content of an object in a future preservation scenario?
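An empty response still produces a perfectly valid-looking digest (the SHA-256 of zero bytes), so it is worth guarding against before trusting a content or metadata checksum. A minimal sketch, with checksum_stdin as a hypothetical helper name:

```shell
# SHA-256 of zero bytes of input; seeing it usually means nothing was hashed.
EMPTY_SHA256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

# checksum_stdin: hash standard input, refusing to report the empty-input digest.
checksum_stdin() {
  digest=$(sha256sum | awk '{ print $1 }')
  if [ "$digest" = "$EMPTY_SHA256" ]; then
    echo "no input received; check the request" >&2
    return 1
  fi
  echo "$digest"
}
```

Swapping sha256sum for checksum_stdin at the end of either curl pipeline turns a silent empty response into a visible failure.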

You can save the content in its different forms to any filename using shell redirection (>):

curl -X PUT -s http://localhost:9998/tika --header "Accept: text/plain" -T PDF-Sample-Document-Fully-Featured-Layout_Redacted.pdf > newfilename.txt

Contents can be “diff”-ed against a previous version of the file if you have one, e.g. diff newfilename.txt oldfilename.txt.

You can count the words in the file using tools like wc --words newfilename.txt, or what about counting unique words (ref1, ref2, ref3)?

cat newfilename.txt | tr ' ' '\n' | tr 'A-Z' 'a-z' | sort -r | uniq -c | awk '{print $1"--"$2}' | sort -n

...
...
...
12--16)
12--but
15--20)
15--digital
15--(h2)
15--preservation
16--some
17--a
20--(h1)
20--to
21--and
21--document
21--upon
22--[bookmark:
24--billions
27--this
31--(arial
33--of
37--is
54--the
...
...
...

A utility that I use in demystify is Cooper Hewitt’s UCD for listing the Unicode Names of characters in strings.

Try it with newfilename.txt and you will see ALL the Unicode names of ALL the characters in the file, not just the steganographic ones (if there are any).

./ucd $(cat newfilename.txt) | sort | uniq

COLON
COMMA
COMMERCIAL AT
DIGIT EIGHT
DIGIT FIVE
DIGIT FOUR
DIGIT NINE
DIGIT ONE
DIGIT SEVEN
DIGIT SIX
DIGIT THREE
DIGIT TWO
DIGIT ZERO
FULL STOP
HORIZONTAL ELLIPSIS
HYPHEN-MINUS
LATIN CAPITAL LETTER A
LATIN CAPITAL LETTER B
LATIN CAPITAL LETTER C
LATIN CAPITAL LETTER D
LATIN CAPITAL LETTER E
LATIN CAPITAL LETTER F
LATIN CAPITAL LETTER G
LATIN CAPITAL LETTER H
LATIN CAPITAL LETTER I
LATIN CAPITAL LETTER L
LATIN CAPITAL LETTER N
LATIN CAPITAL LETTER O
LATIN CAPITAL LETTER P
LATIN CAPITAL LETTER R
LATIN CAPITAL LETTER S
LATIN CAPITAL LETTER T
LATIN CAPITAL LETTER U
LATIN CAPITAL LETTER V
LATIN CAPITAL LETTER W
LATIN CAPITAL LETTER X
LATIN CAPITAL LETTER Y
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER C
LATIN SMALL LETTER D
LATIN SMALL LETTER E
LATIN SMALL LETTER F
LATIN SMALL LETTER G
LATIN SMALL LETTER H
LATIN SMALL LETTER I
LATIN SMALL LETTER J
LATIN SMALL LETTER K
LATIN SMALL LETTER L
LATIN SMALL LETTER M
LATIN SMALL LETTER N
LATIN SMALL LETTER O
LATIN SMALL LETTER P
LATIN SMALL LETTER Q
LATIN SMALL LETTER R
LATIN SMALL LETTER S
LATIN SMALL LETTER T
LATIN SMALL LETTER U
LATIN SMALL LETTER V
LATIN SMALL LETTER W
LATIN SMALL LETTER X
LATIN SMALL LETTER Y
LATIN SMALL LETTER Z
LEFT PARENTHESIS
LEFT SQUARE BRACKET
LOW LINE
RIGHT PARENTHESIS
RIGHT SINGLE QUOTATION MARK
RIGHT SQUARE BRACKET
RIGHTWARDS ARROW
SOLIDUS
SPACE
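When the full list is too long to eyeball, a quicker first pass is to ask whether the text holds anything outside plain ASCII at all. This is a crude sketch and not how SafeText or UCD work: non_ascii_bytes is a hypothetical helper that counts bytes, so one multi-byte UTF-8 character is counted several times.

```shell
# non_ascii_bytes: reads standard input and prints how many bytes fall
# outside the 7-bit ASCII range (octal 000 to 177).
non_ascii_bytes() {
  count=$(LC_ALL=C tr -d '\000-\177' | wc -c)
  echo $((count))   # arithmetic expansion trims any padding from wc
}
```

A result of 0 means there is nothing beyond ASCII to investigate. Anything higher, such as the three bytes of a single ZERO WIDTH SPACE, is a prompt to look closer with a tool that names the characters.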

In some contexts, knowing the characters that are in a file might guide you toward additional preservation techniques. The use of Emoji for example might require asking a donor to describe their authoring platform so that you can determine if Emoji were previously displayed using Android-style, Microsoft-style, or Apple-style Emoji. You might want to capture paradata about the original system if you haven’t already; you might want to capture a more static representation of the original record if the option is available to you.

What tools do you use?

I obviously won’t use all of these techniques all of the time. We rarely have time to look at our digital records in such detail but I do believe in the importance of content analysis, even for preservation. It gets us out of the “file” mindset as our only way of looking at records, and more into the data, content, meaning side of preservation. In the future it might help us to think of new techniques of accessing records if we haven’t been able to achieve digital preservation perfection.

Whether you’re a digital preservationist, archivist, or digital humanist, what tools do you like to use on the command line to analyze content?

About SafeText

I ported SafeText to Go using the general idea of Jacobson’s code as a guide. I wrote a library so that I could import its features into other Golang projects of mine (such as tools like tikalinkextract or other projects like Steffen Fritz’s FileTrove).

This new implementation used a technique I learned while working on Siegfried and GOCFL called composite interfaces. Firstly an interface provides a definition of how work can be performed. You can imagine these interfaces as generic “abilities”. We can write multiple abilities and register them with a single interface with the same methods – this is our “composite interface”. Now when we call a common trigger function in the composite interface all of the registered abilities will be called at the same time.

The beauty of this approach is that we can easily add or remove abilities, making it very easy to customize. We can support other written languages and their nuances. We can also incorporate new techniques for analyzing text such as those we’ve explored elsewhere on the command line today. Wrapping them in code allows us to pluralize techniques in a more reliable manner than is possible by chaining commands on the command line.
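The real implementation is Go, and this is not SafeText’s API. Purely as an analogy in the shell used throughout this post, the shape of the idea is a list of registered checks and one trigger that runs each of them against the same input. Every name below is hypothetical.

```shell
# Two hypothetical "abilities": each takes a file path and prints one line.
count_bytes() { printf 'bytes=%s\n' "$(wc -c < "$1" | tr -d ' ')"; }
count_lines() { printf 'lines=%s\n' "$(wc -l < "$1" | tr -d ' ')"; }

# The registry: a space-separated list of function names.
ANALYZERS="count_bytes count_lines"

# The common trigger: run every registered analyzer, in order, on one file.
analyze_all() {
  for analyzer in $ANALYZERS; do
    "$analyzer" "$1"
  done
}
```

Adding an ability is then a matter of defining another function and appending its name, e.g. ANALYZERS="$ANALYZERS count_words". In Go the registered items would be values satisfying a shared interface rather than function names in a string, which is where the reliability comes from.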

SafeText is on GitHub: https://github.com/ross-spencer/safetext

And its documentation is available on the Go package registry: https://pkg.go.dev/github.com/ross-spencer/safetext@v0.0.1-rc.6

SafeText JSON

Beyond the original SafeText I felt it important to create a common resource for capturing Steganographic characters. I created SafeText JSON and made it available on GitHub.

SafeText JSON https://github.com/ross-spencer/safetext-json

A resource such as this means that the data feeding the tool isn’t obfuscated by code. Instead it is an independent resource that can be loaded into libraries such as mine or Jacobson’s at runtime. We can version this resource differently and be more precise with the checksums of the data we are loading to make sure that the lists only change through authorized methods and releases.

NB. I am sure there are a greater number of steganographic characters out there these days. Without an easy way to encode their potential use in something like the Unicode standard, information resources like this are important.

In summary

My conceit is that applying digital humanities techniques to digital content can one day help shape what preservation looks like. Not replacing anything that we currently do, i.e. never at the expense of the digital record, but by way of providing more metrics and clues and routes into what we actually have to preserve one day.

We kind of measure things by the “file format”, the encapsulation of the content of the record itself (at least for records that aren’t multi-part). But the record is the content, the thing we read, or interacted with, that we rendered, displayed, performed. We should always be thinking about the need to access that one day.

Questions about content, and quality of content, are even more important in today’s world where we wrestle with issues of authenticity coming from AI, but also potentially from corrupt government regimes or corporations.

You can protect yourself by being proactive about using tools like SafeText; editors can protect those providing vital information about corruption and scandals by being proactive too.

Using SafeText we can identify if someone is being monitored, or if documents contain other information that is invisible to the human eye, such as non-visible characters encoded as a message to another reader not looking at the original document.

With other tools we might detect and understand other non-textual elements we might not detect when looking at something through our favorite display tools.

Using Apache Tika we have seen that it is easy to access different representations of the digital files that it supports. I hope Apache Tika has a long life and that we continue to gain new tools that perform similar tasks in future.

It is something I will continue to look at, and I hope it has provided food for thought for those reading. Let me know in the comments and replies below.

Importantly, let’s be careful out there.

Binary Trees

Tika was an important part of my work on Binary trees? Automatically identifying the links between born-digital records. If you have made it this far, you might also be interested in some of the topics discussed there.

GOCFL

The University of Basel’s GOCFL uses OCFL to provide an encapsulated preservation solution, incorporating different tools for format identification and creating access derivatives. Each file added to a GOCFL archive is made ready for search indexing using Tika as its text extraction mechanism. The text of every record can then be accessed by indexing tools in future enabling search and other methods of access and analysis.

Extracting binary objects with Tika

Grab Tim Allison’s 2020 presentation on the risks and challenges of embedded objects in digital files:

Embedded Files: Risks, Challenges and Options by Tim Allison on SlideShare.

Searching digital objects

Bertrand Caron has a nice blog looking at search in digital objects using Apache Tika and other tools.

Explainshell

Find out more about the shell commands in the post above by copying and pasting them into Explainshell.com.

Permalinks in GitHub

Some of the links I use in this text that come from GitHub are permalinks. These link back to today’s version of the code or text. This can be really useful for making sure your links continue to work when files are moved or removed from GitHub. Just hit y when you are on the page of a file on GitHub to create a permalink. Your future self (and those preserving your blog!) will thank you!

Find out more on GitHub.

Authenticity

I was thinking about authenticity when I first wrote about steganographic techniques for sign-posting AI outputs, and I found this small post on the desire for authenticity in these times. I liked it so I thought I’d amplify it here.

ahhhthenticity by @rileylemm on Medium.

2 thoughts on “Porting SafeText and analyzing digital content with Apache Tika”

Bertrand Caron says:

2026-05-20 at 12:38

@exponentialdecay I also enjoyed very much this post Ross. The suggestion of going beyond the visual rendering and / or file format to uncover deeper information layers is something I Insist on almost every day.

I also discovered SafeText. And finally the shoutout to Apache Tika as one of the best tools for #digipres is very well-deserved!

Tanguy ⧓ Herrmann says:

2026-05-21 at 21:57

@exponentialdecay not all heroes wear cape!

Thanks