Emojis: how they're made

2023 December 22, 00:46

Currently, The Unicode Standard sees one major release per year (barring unusual exceptions), with an occasional additional point release outside of that normal pace. 2023 has been one of the exceptions, in that there was only a minor release: 15.1.

Both minor and major version updates can come with new emojis. These new emojis subsequently become available in various pieces of software... eventually, one hopes. To understand how new emojis make their way to actual software in use, it helps to understand what the data behind emojis is, where it comes from, and how it is used.

The Unicode Standard

What is often referred to as Unicode is, more precisely, The Unicode Standard. The Unicode Standard, published by the Unicode Consortium, consists of tables of codepoint assignments, as well as a number of documents describing how the standard's data should be used, and broader practices for handling Unicode text. The Standard includes machine-readable data files, the Unicode Character Database (UCD), which describes codepoint assignments, plus some related extra data.

A codepoint (or code point) is essentially a number, and so The Unicode Standard determines what—if any—character is assigned to a given number. Plenty of codepoints are yet unassigned, and so new assignments can be introduced in new versions of the standard. Each character also has certain additional information associated with it, in the form of character properties. These include things like the identifying name, general category, and directionality (some characters are associated with left-to-right scripts, some are right-to-left, and some are more complicated than that). The UCD consists of a number of files which contain this data.
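As a rough illustration of the "codepoint is a number" idea, here is a minimal Python sketch that looks up a few character properties. It relies on Python's standard unicodedata module, which bundles its own copy of the UCD, so the results depend on the Python version in use. The describe helper is a name made up for this example.

```python
import unicodedata

def describe(ch):
    """Return a few UCD properties for a single character (illustrative helper)."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        # The second argument is a fallback for characters that the bundled
        # UCD version does not know about.
        "name": unicodedata.name(ch, "<unassigned or unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }

# U+1F346 was assigned many versions ago, so any recent Python knows it.
aubergine = chr(0x1F346)

print("Bundled UCD version:", unicodedata.unidata_version)
print(describe(aubergine))
print(describe("\u05D0"))  # a letter from a right-to-left script
```

A character from a newer version of the standard than the one bundled with the interpreter will simply come back with the fallback name, which is the same staleness problem that the rest of this article is about.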

Some, but not all, emojis are single characters, and these get a single codepoint each. UnicodeData.txt is the file within the UCD that lists codepoint assignments, and so we can find such emojis in it:

1F979;FACE HOLDING BACK TEARS;So;0;ON;;;;;N;;;;;
1F97A;FACE WITH PLEADING EYES;So;0;ON;;;;;N;;;;;
1F97B;SARI;So;0;ON;;;;;N;;;;;
1F97C;LAB COAT;So;0;ON;;;;;N;;;;;
1F97D;GOGGLES;So;0;ON;;;;;N;;;;;
1F97E;HIKING BOOT;So;0;ON;;;;;N;;;;;
1F97F;FLAT SHOE;So;0;ON;;;;;N;;;;;

Excerpt of UnicodeData.txt, part of the UCD. The columns are separated using ;. The first two columns are the codepoint, and the name. The first line is the 🥹 emoji.
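Reading this format takes very little code. The following is a minimal sketch in Python, assuming the input is plain lines like the ones in the excerpt. The names UnicodeDataEntry and parse_unicode_data are invented for this example, and the special range entries that the real file uses for some large blocks are not handled.

```python
from typing import NamedTuple

# A few lines copied from the excerpt above.
SAMPLE = """\
1F979;FACE HOLDING BACK TEARS;So;0;ON;;;;;N;;;;;
1F97A;FACE WITH PLEADING EYES;So;0;ON;;;;;N;;;;;
1F97B;SARI;So;0;ON;;;;;N;;;;;
"""

class UnicodeDataEntry(NamedTuple):
    codepoint: int
    name: str
    general_category: str
    bidi_class: str

def parse_unicode_data(text):
    """Parse UnicodeData.txt-style lines into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        entries.append(UnicodeDataEntry(
            codepoint=int(fields[0], 16),  # first column: hexadecimal codepoint
            name=fields[1],                # second column: character name
            general_category=fields[2],    # third column, e.g. "So"
            bidi_class=fields[4],          # fifth column, e.g. "ON"
        ))
    return entries

for entry in parse_unicode_data(SAMPLE):
    print(f"U+{entry.codepoint:04X}", chr(entry.codepoint), entry.name)
```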

The UCD also includes files which specify which characters are actually emoji, and which should have emoji presentation (that is, which should show up in color). Furthermore, The Unicode Standard comes with additional emoji-related, computer-readable data tables that are not part of the UCD proper.
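As a sketch of how such a property file might be consumed, the Python below parses lines in the general shape used by emoji-data.txt: a codepoint or a range, a property name, and a trailing comment. The sample lines are shortened by hand for illustration rather than copied from the file, and parse_emoji_properties is a name made up for this example.

```python
from collections import defaultdict

# Hand-shortened sample lines; the real file carries longer comments.
SAMPLE = """\
00A9          ; Emoji                # copyright
231A..231B    ; Emoji                # watch..hourglass done
231A..231B    ; Emoji_Presentation   # watch..hourglass done
"""

def parse_emoji_properties(text):
    """Map each property name to the set of codepoints that have it."""
    props = defaultdict(set)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        cp_field, prop = (part.strip() for part in line.split(";"))
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single codepoint or range
        props[prop].update(range(start, end + 1))
    return props

props = parse_emoji_properties(SAMPLE)
for cp in (0x00A9, 0x231A):
    print(
        f"U+{cp:04X}",
        "emoji:", cp in props["Emoji"],
        "emoji presentation:", cp in props["Emoji_Presentation"],
    )
```

The copyright sign is a handy test case here, since it counts as an emoji but does not default to emoji presentation.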

Not all emojis are single characters—some are sequences. For example, the astronaut emoji (🧑‍🚀) is assembled from the adult emoji (🧑) and the rocket emoji (🚀), plus some glue characters. Similarly, the trans flag emoji (🏳️‍⚧️) involves combining the white flag emoji (🏳) and the trans symbol emoji (⚧), plus some glue characters. Such sequences do not require new codepoint assignments, and could theoretically be devised by vendors on an ad-hoc basis. The standard, however, includes a number of such sequences which are expected to be widely deployed—or, as Unicode puts it, recommended for general interchange (RGI). These sequences are one of the things specified in the emoji data tables that exist outside of the UCD, in a file that looks like this:

1F3F3 FE0F 200D 26A7 FE0F   ; RGI_Emoji_ZWJ_Sequence  ; transgender flag   # E13.0  [1] (🏳️‍⚧️)
1F3F3 FE0F 200D 1F308       ; RGI_Emoji_ZWJ_Sequence  ; rainbow flag       # E4.0   [1] (🏳️‍🌈)
1F3F4 200D 2620 FE0F        ; RGI_Emoji_ZWJ_Sequence  ; pirate flag        # E11.0  [1] (🏴‍☠️)

Excerpt of emoji-zwj-sequences.txt, with whitespace trimmed for the sake of display. The columns are separated by ;s, and #s indicate comments. The first column lists the characters in the sequence. The E-numbers in the comment say which version of emoji introduced this particular sequence.
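To make the "glue characters" concrete, here is a minimal Python sketch that turns lines like the ones above into actual strings. The sample lines are taken from the excerpt, with the trailing glyphs left out of the comments. The function name parse_sequences is invented for this example.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER, the glue between the parts
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, requests emoji presentation

SAMPLE = """\
1F3F3 FE0F 200D 26A7 FE0F   ; RGI_Emoji_ZWJ_Sequence  ; transgender flag   # E13.0
1F3F3 FE0F 200D 1F308       ; RGI_Emoji_ZWJ_Sequence  ; rainbow flag       # E4.0
1F3F4 200D 2620 FE0F        ; RGI_Emoji_ZWJ_Sequence  ; pirate flag        # E11.0
"""

def parse_sequences(text):
    """Map each sequence's short name to the string it stands for."""
    sequences = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        codepoints, _type_field, name = (f.strip() for f in line.split(";"))
        sequences[name] = "".join(chr(int(cp, 16)) for cp in codepoints.split())
    return sequences

sequences = parse_sequences(SAMPLE)
for name, sequence in sequences.items():
    print(name, sequence, [f"U+{ord(ch):04X}" for ch in sequence])

# The astronaut from the text: adult + joiner + rocket.
astronaut = "\U0001F9D1" + ZWJ + "\U0001F680"
print(astronaut, len(astronaut))
```

Whether the result shows up as one glyph or as several separate ones depends entirely on the font and rendering stack doing the displaying.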

Emoji additions to The Unicode Standard are handled by the Emoji Subcommittee (ESC), which operates under the Unicode Technical Committee. Proposals for new emojis (either codepoint or sequence) are open to the public, although they also sometimes originate from within the Consortium. For proposals that make it to the later stages of the approval process, the ESC will generally make the proposal public, and report on its progress. It is therefore possible to tell, ahead of time, what will end up in the next Unicode Standard release.

Common Locale Data Repository

The Common Locale Data Repository (CLDR) is a project aimed at maintaining a standard repository of a variety of locale data. It is a project maintained by the Unicode Consortium, though it is not part of The Unicode Standard.

The CLDR contains locale data for a wide range of language, region, and script combinations. This includes some of the more usual locale data, such as the way numbers or dates are formatted, or how the days of the week are written. The CLDR also contains some less obvious locale data, such as information on how person names in a given culture are usually collated, or how many different plural forms a language uses.

Among the less obvious CLDR data are character annotations. These are used to assign a short name and any number of keywords to every emoji. The short name is the main name that an emoji will have, and may be used by text-to-speech systems that need to read the emoji out loud. The keywords are any additional terms that may be useful when searching for that particular emoji. While The Unicode Standard gives each codepoint an English name, that name is intended more as an internal identifier, suitable for use in source code, and such names are only given to codepoints, not to emoji sequences. CLDR annotations, on the other hand, cover both single-character emojis and sequences, and are different in every language, making them more suitable for the end user.

Annotation data is in XML, and looks like this:

<!-- en.xml -->
<annotation cp="🍆">aubergine | eggplant | vegetable</annotation>
<annotation cp="🍆" type="tts">eggplant</annotation>

<!-- zh.xml -->
<annotation cp="🍆">茄子 | 蔬菜</annotation>
<annotation cp="🍆" type="tts">茄子</annotation>

<!-- hi.xml -->
<annotation cp="🍆">बेंगन | बैंगन | सब्जी</annotation>
<annotation cp="🍆" type="tts">बैंगन</annotation>

The English, Chinese, and Hindi annotations for 🍆. Note that 🍆 is called AUBERGINE in The Unicode Standard, but the CLDR annotations for English use the more common name of eggplant; aubergine is included in the keywords, so that the emoji comes up if the user types either of the names.
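A minimal Python sketch of reading such annotations, using the standard xml.etree.ElementTree module. The sample wraps the English lines from above in a bare annotations element so that it parses as a standalone document; real CLDR files have more structure around it. The function name and the shape of the resulting dictionary are choices made for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<annotations>
  <annotation cp="🍆">aubergine | eggplant | vegetable</annotation>
  <annotation cp="🍆" type="tts">eggplant</annotation>
</annotations>
"""

def parse_annotations(xml_text):
    """Collect the short name and keywords for each annotated emoji."""
    root = ET.fromstring(xml_text)
    result = {}
    for node in root.iter("annotation"):
        entry = result.setdefault(
            node.get("cp"), {"short_name": None, "keywords": []}
        )
        text = (node.text or "").strip()
        if node.get("type") == "tts":
            # The type="tts" entry carries the short name.
            entry["short_name"] = text
        else:
            # The untyped entry is a list of keywords separated by "|".
            entry["keywords"] = [k.strip() for k in text.split("|")]
    return result

for emoji, data in parse_annotations(SAMPLE).items():
    print(emoji, data["short_name"], data["keywords"])
```

An emoji picker could then match what the user types against both the short name and the keywords.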

The CLDR is managed by the CLDR Technical Committee (CLDR-TC) within the Unicode Consortium. The project is maintained by Consortium members, as well as affiliated institutions and individuals with relevant language expertise. Unaffiliated members of the public can also participate in a limited way, by filing requests for changes, to be reviewed by the members. The CLDR is released on its own six-month schedule, independent of The Unicode Standard.

Display

In order to render new emojis, software will generally require new graphics that depict them, in addition to possibly needing the latest version of the Unicode Character Database. Standards changes that would require code changes to the rendering stack are less common.

As the UCD describes which characters are emojis, updating it may be necessary to inform software about which characters it should treat as emojis. When it comes to emoji handling by the operating system, this generally means a lower-level component or library, which may either vendor the UCD, take it in at build time, or vendor an already pre-processed version of the UCD.

The other thing needed to display new emojis is the graphics to actually display them. While the Unicode Character Database does not contain any instructions on what a given emoji should look like, the proposals and other documentation on the Unicode website generally do include examples. These are also usually available prior to the formal release date for the given Unicode Standard version, so it is technically possible for vendors to have graphics ready before that date.

Two popular permissively licensed emoji sets are Google's Noto Emoji, and Twitter's Twemoji. Both of these projects often put out a release that supports the new version of The Unicode Standard within about a month after The Unicode Standard's release, though some outliers do happen. In particular, Twemoji was affected by a petulant billionaire buying Twitter, which resulted in several people involved in handling the project being fired, and an apparent halt to the release of emojis under a permissive license by Twitter (now called "X"). The Twemoji project now continues as a fork, maintained by some of the original authors.

Both Noto Emoji and Twemoji are shipped as a series of vector graphics. These graphics can be built into font files, and such fonts are generally how new emojis end up making it into desktop and mobile operating systems. Google ships the tooling needed to turn Noto Emoji into an OpenType font, and also makes pre-built font files available; Twemoji fonts can be built with a third-party tool, twemoji-color-font, with pre-built fonts likewise distributed by that project.

Other times, the graphics are used more directly. Web applications, for example, may replace emoji characters with images loaded via HTML. In such situations, there is often a dedicated library which either brings its own graphics, or can be configured to use a separately provided graphics set. It is these components which need to be updated in such a situation.

Input

It is technically possible to input emojis the same way as any other arbitrary Unicode character: through character pickers, numerical codepoint entry, or copy-and-paste from an outside source. These methods are not very practical, though, which is why emoji pickers, and other emoji input methods exist.

Emoji pickers can source their data from multiple places, but the Common Locale Data Repository is often among them. As with the UCD, the CLDR data is often vendored, and is often pre-processed in some way prior to being vendored. As such, building the picker with an updated CLDR may require an extra step to update the CLDR-derived data files within the picker's source code.

Emoji pickers also often use an intermediate dependency as a way to access emoji data. Whereas software generally gets its Unicode character data straight from the UCD, emoji pickers may use the CLDR partially, indirectly, or even not at all. Projects such as emoji-data or emojibase maintain their own lists of emoji shortcodes which may be derived from CLDR data, but are not always a one-to-one mapping. Such projects may take the CLDR into consideration, but the shortcodes and other keywords are ultimately assigned by the project, and so subject to individual editorial control. This means updates are not automated.

As with displaying emojis, emoji input also ends up implemented on multiple layers: there are OS-level emoji input methods, and individual apps or websites can have emoji input methods of their own. Of note is the fact that, while a website replacing emojis with its own images means the original glyphs are not rendered, a website providing shortcode expansion usually does not prevent an OS-level picker from inserting emojis like any other Unicode character.
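Shortcode expansion can be sketched in a few lines of Python. The table below is a made-up stand-in for what a project like the ones mentioned above would supply, and expand_shortcodes is a name invented for this example. Text that already contains emoji characters, for instance from an OS-level picker, passes through unchanged.

```python
import re

# Hypothetical shortcode table; real projects maintain much larger ones.
SHORTCODES = {
    "eggplant": "\U0001F346",
    "rocket": "\U0001F680",
}

_SHORTCODE = re.compile(r":([a-z0-9_+-]+):")

def expand_shortcodes(text):
    """Replace known :shortcode: markers, leaving everything else untouched."""
    def replace(match):
        # Unknown shortcodes are kept as typed.
        return SHORTCODES.get(match.group(1), match.group(0))
    return _SHORTCODE.sub(replace, text)

print(expand_shortcodes("to the moon :rocket: with \U0001F346 and :unknown:"))
```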

Considerations

When writing software that may require emoji data from the UCD or the CLDR, there are some things to consider.

Parsing UCD or CLDR data directly at runtime may be impractical, and so pre-processing it into a format that can be included at build time often makes sense. In this case, it is also common to vendor the data, so that the UCD or CLDR data does not have to be supplied as a build-time dependency.
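One possible shape for such a pre-processing step, sketched in Python: reduce UnicodeData.txt to just the names of the codepoints the application cares about, and store the result as a small JSON file to be vendored. The function names, the choice of JSON, and the idea of passing in a set of wanted codepoints are all assumptions made for this example.

```python
import json

SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
1F97B;SARI;So;0;ON;;;;;N;;;;;
1F97C;LAB COAT;So;0;ON;;;;;N;;;;;
"""

def preprocess_names(unicode_data_text, wanted):
    """Build step: keep only the names of the wanted codepoints, as JSON."""
    table = {}
    for line in unicode_data_text.splitlines():
        fields = line.split(";")
        if len(fields) < 2:
            continue  # blank or malformed line
        codepoint = int(fields[0], 16)
        if codepoint in wanted:
            table[f"{codepoint:04X}"] = fields[1]
    return json.dumps(table, sort_keys=True)

def load_names(generated_json):
    """Runtime step: load the small generated table instead of the full UCD."""
    return {int(cp, 16): name for cp, name in json.loads(generated_json).items()}

generated = preprocess_names(SAMPLE, wanted={0x1F97B, 0x1F97C})
print(generated)
print(load_names(generated))
```

The set of wanted codepoints could itself come from the emoji property files, so that a single script regenerates everything when a new version of the data is dropped in.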

New emojis are generally introduced in the UCD once a year, and the CLDR is updated twice a year. This represents an update that has to be performed on a regular basis, albeit rarely. When vendoring this data directly, it can be useful to set up procedures for updating it in the future. If possible, it can also be useful to set up a way for packagers to patch in the latest UCD or CLDR data at build time, even if it is otherwise vendored.

When implementing emoji pickers at the app level, it is good to remember that the user may actually have a functioning method for inputting emojis at the operating system level. Most of the time, when dealing with text input fields, this doesn't require any extra handling. There are, however, situations where the user is forced to input an emoji through the picker, like when adding a reaction to an item in a chat app. In such cases, the user may still wish to, for example, select one of the recent emojis from their phone's keyboard, which is something that the picker should facilitate.

Note on license

Both the UCD, and the CLDR, including the parts excerpted here, are available under the Unicode Data Files and Software License. More details are available on the Unicode website.