Basic concepts and building blocks

The world of i18n is a mixture of standards governed by official bodies, de facto usage pre-dating those standards, and mistakes or edge cases that need to be dealt with pragmatically. Let's get ready for this journey!

Language

The ISO 639 standard governs language nomenclature; mostly you'll use the 2-letter codes defined in ISO 639-1 (for example, English, French, and Japanese are represented as "en", "fr", and "ja").

One pitfall to note is that these 2-letter codes are not enough to identify the unique language used in a translation. For example, you may want to support different variants of English (such as American English and British English), and both would map to the same "en" language code despite requiring different translations. To differentiate these, we'll have to use the concept of "locale", explained below.
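
As a quick illustration, here is a minimal sketch using the JavaScript Intl.DisplayNames API, which resolves language codes (and fuller locale identifiers) into human-readable names; the exact strings may vary slightly between runtimes:

```ts
// Resolve ISO 639-1 language codes into display names (in English).
const languageNames = new Intl.DisplayNames(["en"], { type: "language" });

console.log(languageNames.of("en")); // "English"
console.log(languageNames.of("fr")); // "French"
console.log(languageNames.of("ja")); // "Japanese"

// A bare "en" cannot distinguish variants; fuller identifiers can.
console.log(languageNames.of("en-US")); // "American English"
console.log(languageNames.of("en-GB")); // "British English"
```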

Script

When writing down a language, we use scripts. While English is always written in the Latin script, some languages can be written in multiple scripts, and we need to specify which one we want. For example, Serbian can be written in Latin or Cyrillic. The ISO 15924 standard governs script nomenclature, assigning 4-letter codes (for example, "Latn", "Cyrl").
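
To make this concrete, here is a minimal sketch (again using the JavaScript Intl API) that formats the same date for Serbian in both scripts; the exact output depends on the runtime's locale data:

```ts
const date = new Date(2024, 0, 31);

// The same language, rendered in two different scripts.
console.log(new Intl.DateTimeFormat("sr-Latn-RS", { month: "long" }).format(date)); // "januar"
console.log(new Intl.DateTimeFormat("sr-Cyrl-RS", { month: "long" }).format(date)); // "јануар"
```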

Region

To differentiate among different cultures and customs, the ISO 3166-1 2-letter codes are normally used. These usually map to countries, but sometimes to other administrative regions (like Hong Kong). Note that the ISO standard is not identical to the country-code top-level domains you may find in DNS names: the most notable exception is the United Kingdom, whose ISO 3166-1 code is "GB" while its top-level domain is ".uk".
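
Here too, a minimal Intl.DisplayNames sketch can resolve region codes into names (outputs may vary slightly by runtime):

```ts
// Resolve ISO 3166-1 region codes into display names (in English).
const regionNames = new Intl.DisplayNames(["en"], { type: "region" });

console.log(regionNames.of("GB")); // "United Kingdom"
console.log(regionNames.of("HK")); // "Hong Kong"
```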

Locale

The combination of language + script + region is a good summary representation of culture and customs, and is usually expressed as a single identifier called a "locale". The locale identifier syntax has been standardized by the IETF as BCP 47 (RFC 5646). For example, "en-Latn-US" represents English written in the Latin script as used in the USA, while "sr-Cyrl-RS" represents Serbian written in Cyrillic as used in Serbia.
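
A minimal sketch of parsing a BCP 47 identifier into its components with the JavaScript Intl.Locale API:

```ts
// Parse a BCP 47 locale identifier into its components.
const locale = new Intl.Locale("sr-Cyrl-RS");

console.log(locale.language); // "sr"
console.log(locale.script);   // "Cyrl"
console.log(locale.region);   // "RS"
```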

The BCP 47 standard has powerful extension capabilities but is not always fully supported. Standards and de facto practices have evolved over the years, and we need a pragmatic and flexible approach to deal with the locale codes passed around by different implementations. For example, the "_" character may be used to separate components, and the script may be absent and have to be inferred: "en_US" and "sr_RS" may be used instead of the identifiers above. Even the region and script may have to be inferred: "en" should default to the "US" region, while "zh-HK" (Chinese / Hong Kong) should default to the "Hant" script (Traditional Han), as that's what's commonly used in that region. Keep also in mind that the region may not always be a single specific country but could be a macro-region. For example, "es-419" refers to Spanish in Latin America.
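
Intl.Locale can perform this kind of inference via maximize(), which fills in likely subtags from CLDR data; note that the "_" normalization below is a hypothetical helper for illustration, not part of the API:

```ts
// Hypothetical helper: normalize "_" separators to the "-" BCP 47 expects.
const normalizeTag = (tag: string): string => tag.replace(/_/g, "-");

console.log(new Intl.Locale(normalizeTag("en_US")).toString()); // "en-US"

// maximize() fills in the likely script and region using CLDR "likely subtags".
console.log(new Intl.Locale("en").maximize().toString());    // "en-Latn-US"
console.log(new Intl.Locale("zh-HK").maximize().toString()); // "zh-Hant-HK"

// Macro-regions are valid region subtags too.
console.log(new Intl.Locale("es-419").region); // "419"
```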

CLDR

The "Common Locale Data Repository" is a collection of metadata about hundreds of locales that is maintained by the Unicode organization. It is the de-facto standard for most locale-specific libraries and helps dealing with complexities like inferring and normalizing locales. Is also collects cultural aspects such as number, date, time parsing and formatting.

Encodings

A big challenge in computing has always been converting human concepts into bits that can be transformed, transmitted, and stored. In the context of writing systems, a "character encoding" maps characters (letters in the Latin alphabet, but there are many writing systems) to numbers. Over the years, many different encodings have been invented for different writing systems and sets of supported characters. These were mostly incompatible (meaning that the same number mapped to a different character in a different encoding) and caused many i18n headaches (such as "corrupted characters" when one encoding is used for encoding and a different one for decoding). In the last 20 years, a single universal standard for all writing systems called "Unicode" has been generally adopted and has greatly reduced these issues. Nevertheless, it's important to understand its different variants and idiosyncrasies, as they pose some challenges too.
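
A minimal sketch of that mismatch problem, using the standard TextEncoder/TextDecoder APIs (TextEncoder always produces UTF-8):

```ts
// Encode "café" as UTF-8 bytes, then decode with the right and the wrong encoding.
const bytes = new TextEncoder().encode("café"); // Uint8Array [99, 97, 102, 195, 169]

console.log(new TextDecoder("utf-8").decode(bytes));        // "café"
console.log(new TextDecoder("windows-1252").decode(bytes)); // "cafÃ©" (mojibake)
```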

Continue reading about Unicode...