New Emoji

How Exactly Does the iPhone Transmit Diverse Emoji Characters?

It started a short time ago when Cooper Hewitt released their Unicode tools for Python: https://github.com/cooperhewitt/py-cooperhewitt-unicode

Also available in Golang here: https://github.com/cooperhewitt/go-ucd

I had an immediate need to use this tool in translating out of bounds, non-ASCII characters for archivists when analysing digital collections.

How do you describe the character 0xC481 to users in a more user friendly way? – How do you highlight that the character with a macron in the word Māori may hide the slightest digital preservation risk in our not-very-Unicode-ready world right now?

You don’t. You analyse it, you respect it, you note that it falls outside ASCII bounds, and then you ascribe to it its plain-English name:

‘ā LATIN SMALL LETTER A WITH MACRON’

This is what the Unicode Consortium calls it, so that is what we should be calling it. We stay aware that we need to preserve this information, and then we move on.
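As a rough illustration of that naming step, here is a minimal sketch that uses Python’s standard `unicodedata` module as a stand-in for the Cooper Hewitt tool (whose own API is not shown in this post). The `describe` helper is a hypothetical name introduced only for this example. It decodes the UTF-8 bytes 0xC481 and pairs the resulting character with its code point and Unicode name.

```python
import unicodedata


def describe(raw: bytes) -> str:
    """Decode UTF-8 bytes and pair each character with its code point and name.

    Hypothetical helper for illustration; not part of the Cooper Hewitt tools.
    """
    text = raw.decode("utf-8")
    return ", ".join(
        "{} U+{:04X} {}".format(ch, ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    )


# 0xC4 0x81 is the UTF-8 encoding of the a-with-macron in "Māori".
print(describe(bytes.fromhex("C481")))
# prints: ā U+0101 LATIN SMALL LETTER A WITH MACRON
```

The same helper works on a whole word, so `describe("Māori".encode("utf-8"))` would list every character and make the one non-ASCII letter easy to spot.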

The result of my work to incorporate Cooper Hewitt’s tool into my own is part of the DROID CSV Analysis Engine: http://openpreservation.org/blog/2014/06/03/analysis-engine-droid-csv-export/

On the way to combining those tools, and with the help of the Golang version of the tool, I took a little detour documenting the Latin names for each of the Emoji found in the iPhone’s iOS 8 at the time: https://gist.github.com/ross-spencer/b59856dbcaa4621654f1

The combination of the glyph alongside its Latin descriptive counterpart; a curious list!

The benefit of the Gist above is that you can look at it in any browser on any platform and see how your own operating system displays the data it is asked to display. I had hoped to blog about the differences cross-platform (Android’s literal display of HEAVY BLACK HEART, for example) but, as most reverse engineering stories go, I found my way to the source proper. Unicode.org already documents Emoji and their differences wonderfully, and this information can be found here: http://unicode.org/emoji/charts/full-emoji-list.html and here: http://unicode.org/emoji/charts/index.html.

The remainder of this blog is therefore about adding a little extra technical information to the new diversity updates we saw in the draft of Unicode 8.0, now a ratified standard, and added in Apple’s iOS 8.3.

As described by Mashable, as wonderfully as Unicode documents its own work, on 9 April of this year (2015) Apple released iOS 8.3, jumping on a yet-to-be-finalized draft by the Unicode Consortium to add different skin tones to appropriate glyphs.

They point to a technical report by Unicode.org on diversity: http://www.unicode.org/reports/tr51/tr51-2.html#Diversity

The question I had when I started investigating this was whether, with the new additions to iOS, we now had (x)*5, that is hundreds, of new Unicode characters for representing Emoji with different skin tones. Running the characters through the Cooper Hewitt Unicode tools showed me the following pattern:

👵 OLDER WOMAN

👵 OLDER WOMAN

🏻

👵 OLDER WOMAN

🏼

👵 OLDER WOMAN

🏽

👵 OLDER WOMAN

🏾

👵 OLDER WOMAN

🏿

In hex:

Older Woman Emoji in Binary

First, we note all the primary Emoji names are the same, that is, they all use the same character, in this case:

0xF09F91B5 (the UTF-8 encoding of U+1F475), or: http://www.fileformat.info/info/unicode/char/1f475/index.htm

Second, the first ‘Older Woman’ stands alone, but every one after it is followed by an additional character, and that additional character is different each time. The complete set is:

0xF09F8FBB, 0xF09F8FBC, 0xF09F8FBD, 0xF09F8FBE, 0xF09F8FBF
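The hex values above are UTF-8 byte sequences rather than code points, which is why they do not match the `1f475` in the FileFormat.info URL directly. The following minimal sketch, again assuming Python’s standard library rather than the Cooper Hewitt tools, decodes each sequence back to its code point and looks up its name. The `code_point` helper and the `HEX_SEQUENCES` list are names made up for this example, and the modifier names only resolve on a Python whose Unicode data is version 8.0 or later.

```python
import unicodedata

# UTF-8 byte sequences quoted in the post, written as hex strings.
HEX_SEQUENCES = [
    "F09F91B5",  # the primary character
    "F09F8FBB", "F09F8FBC", "F09F8FBD", "F09F8FBE", "F09F8FBF",  # the extras
]


def code_point(hex_bytes: str) -> int:
    """Decode one UTF-8 sequence (given as hex) to a single code point.

    Hypothetical helper; ord() raises TypeError if the bytes decode to
    more than one character.
    """
    return ord(bytes.fromhex(hex_bytes).decode("utf-8"))


for hex_bytes in HEX_SEQUENCES:
    cp = code_point(hex_bytes)
    name = unicodedata.name(chr(cp), "<not in this Python's Unicode data>")
    print("0x{} -> U+{:04X} {}".format(hex_bytes, cp, name))
```

The first line of output maps 0xF09F91B5 to U+1F475 OLDER WOMAN, and the remaining five map to the consecutive code points U+1F3FB through U+1F3FF.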

If we look up the character numbers on FileFormat.info we find, for example, that the first different new character refers to: EMOJI MODIFIER FITZPATRICK TYPE-1-2

An odd name!

Though, if we look at the Unicode.org technical report again, describing additions to the standard, we see:

“Five symbol modifier characters that provide for a range of skin tones for human emoji are planned for Unicode Version 8.0 (scheduled for mid-2015). These characters are based on the six tones of the Fitzpatrick scale, a recognized standard for dermatology (there are many examples of this scale online, such as FitzpatrickSkinType.pdf). The exact shades may vary between implementations.

“These characters have been designed so that even where diverse color images for human emoji are not available, readers can see what the intended meaning was.

“The default representation of these modifier characters when used alone is as a color swatch. Whenever one of these characters immediately follows certain characters (such as WOMAN), then a font should show the sequence as a single glyph corresponding to the image for the person(s) or body part with the specified skin tone, such as the following…”

So the answer to my question is that we don’t have hundreds of new Emoji characters, we have five modifier characters; those modifiers change the colour tone of the characters that they are transmitted with. These modifier characters are U+1F3FB to U+1F3FF, named EMOJI MODIFIER FITZPATRICK TYPE-1-2, TYPE-3, TYPE-4, TYPE-5 and TYPE-6.

When a modifier is transmitted immediately after an applicable primary character, the device should interpret the two together and present a single glyph with the appropriate colour tone. Hence, we get hundreds of new appearances for many different pre-existing Emoji.
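To make the mechanism concrete, here is a small sketch, assuming plain Python strings and nothing specific to iOS, that builds the five skin-tone variants of OLDER WOMAN by appending each modifier to the base character. `BASE`, `MODIFIERS` and `with_skin_tones` are hypothetical names used only for this illustration. Whether each pair renders as a single glyph depends entirely on the font and platform displaying it.

```python
BASE = "\U0001F475"  # OLDER WOMAN

# The five Fitzpatrick modifiers occupy U+1F3FB through U+1F3FF.
MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]


def with_skin_tones(base: str) -> list:
    """Return the base character followed by each of the five modifiers."""
    return [base + modifier for modifier in MODIFIERS]


for sequence in with_skin_tones(BASE):
    encoded = sequence.encode("utf-8")
    # Each variant is two code points and eight UTF-8 bytes, not a new character.
    print(sequence, len(sequence), len(encoded), encoded.hex().upper())
```

Each variant is therefore stored as eight bytes, four for the base character and four for the modifier, which matches the pattern seen in the hex dump.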

If you look at any of the modifier listings on Emojipedia.org you can see the list of Emoji the modifiers apply to, e.g. http://emojipedia.org/emoji-modifier-fitzpatrick-type-6/.

If you dissect the byte sequences differently you can make your device show the modifiers in detail:

 

Dissected glyphs on iPhone, older woman

or another glyph:

Dissected glyphs on iPhone, no good gesture
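The post does not say exactly how the sequences were dissected, so the following is only one possible approach, sketched in Python under that assumption: insert a separator between a base character and its modifier, so the modifier no longer immediately follows the base and falls back to its standalone swatch. `dissect` and `MODIFIER_RANGE` are hypothetical names for this example.

```python
MODIFIER_RANGE = range(0x1F3FB, 0x1F400)  # U+1F3FB through U+1F3FF


def dissect(text: str, separator: str = " ") -> str:
    """Insert a separator before each skin-tone modifier that follows
    another character, so base and modifier are displayed apart."""
    parts = []
    for ch in text:
        if ord(ch) in MODIFIER_RANGE and parts:
            parts.append(separator)
        parts.append(ch)
    return "".join(parts)


older_woman = "\U0001F475\U0001F3FD"  # OLDER WOMAN + TYPE-4 modifier
no_good = "\U0001F645\U0001F3FF"      # FACE WITH NO GOOD GESTURE + TYPE-6 modifier

print(dissect(older_woman))
print(dissect(no_good))
```

On a platform that supports the modifiers, the undissected strings should render as one toned glyph each, while the dissected ones should show the default glyph followed by a colour swatch.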

From a digital preservation and interpretation perspective it is difficult to say whether an external modifier to an existing Unicode character makes things more complicated to preserve than if we just had hundreds of new characters to interpret. There is an elegance to encapsulation. Given that we have the instructions on how to interpret this new step forward, and as long as we preserve them, what then is the difference between eight bytes and four? Little. But we do need to ensure we preserve both the sequences and the instructions for preserving those sequences; not that different to how we’re hoping to preserve file formats, is it?

Technical report 51, #diversity: http://www.unicode.org/reports/tr51/tr51-2.html#Diversity

Unicode 8.0 standard: http://www.unicode.org/versions/Unicode8.0.0/