Localization Tips, Part 3: Sorting and Collating

This is the third entry in a multi-part series about localization.

Sorting and collating data is something everyone does frequently and rarely spends much time thinking about. If you've ever seen a physical Rolodex, you've seen the essence of sorting and collating: putting a list of contacts in a particular order, collated (or grouped) in a manner which everyone agrees upon. For example:

A
Aaron Burr
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
Charles Pinckney
D
Daniel Carroll
S
Samuel Adams

Unless you happen to be Danish:

A
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
Charles Pinckney
D
Daniel Carroll
S
Samuel Adams
Å
Aaron Burr

Or Czech:

A
Aaron Burr
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
D
Daniel Carroll
CH
Charles Pinckney
S
Samuel Adams

So what's going on here? There are a few different considerations on how these characters get sorted.

Exemplar Characters

Each language has a set of "exemplar characters" which contains the commonly used letters of the modern form of that language. More simply, you can think of them as the "alphabet" or the "native characters" for a given language. Here are a few examples of these characters:

English: a b c d e f g h i j k l m n o p q r s t u v w x y z
Danish: a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å
Czech: a á b c č d ď e é ě f g h ch i í j k l m n ň o ó p q r ř s š t ť u ú ů v w x y ý z ž

Sometimes a sequence of Latin characters will map to a particular native character. In Danish, "AA" is treated as "Å", which sorts after "Z." Sometimes an exemplar character is actually expressed by multiple characters. In Czech, for example, "CH" counts as a single letter, so strings starting with "CH" sort after "H" while strings starting with just "C" sort after "B."
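A multi-character exemplar such as the Czech "CH" can be handled by preferring the longest exemplar that prefixes a string. The following TypeScript sketch illustrates the idea; the `EXEMPLARS` table and the `groupHeader` function are hypothetical names, the lists are copied from the examples above, and using "#" for unmatched text is an assumption.

```typescript
// Hypothetical exemplar lists, copied from the examples above. A real
// application would load these from its localization files.
const EXEMPLARS: { [locale: string]: string[] } = {
  en: "a b c d e f g h i j k l m n o p q r s t u v w x y z".split(" "),
  cs: "a á b c č d ď e é ě f g h ch i í j k l m n ň o ó p q r ř s š t ť u ú ů v w x y ý z ž".split(" "),
};

// Returns the header under which `text` should be grouped: the longest
// exemplar that prefixes the text, or "#" when no exemplar matches.
function groupHeader(text: string, locale: string): string {
  const exemplars = EXEMPLARS[locale] || [];
  const lower = text.toLowerCase();
  let best = "";
  for (let i = 0; i < exemplars.length; i++) {
    const candidate = exemplars[i];
    if (candidate.length > best.length && lower.indexOf(candidate) === 0) {
      best = candidate;
    }
  }
  return best === "" ? "#" : best.toUpperCase();
}

console.log(groupHeader("Charles Pinckney", "en")); // C
console.log(groupHeader("Charles Pinckney", "cs")); // CH
console.log(groupHeader("Carter Braxton", "cs"));   // C
```

This only picks the header. Ordering the headers and the entries beneath them is still the job of a locale-aware collator.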

Unicode Sorting

The Unicode Collation Algorithm defines the default sort order for the characters supported by Unicode. This order is important because it serves as the baseline that individual languages and locales customize. However, collating data based on this standard alone isn't particularly useful to end users, who expect data sorted according to their own conventions.

Non-Native Characters

Different languages define different rules for how non-native characters are sorted and collated. Frequently they appear after the native characters, in the order defined by the Unicode Collation Algorithm. However, the default Japanese collation places Latin characters first, then Japanese characters, then other non-native characters.

Tips

Here is a brief list of things to keep in mind when dealing with collation in your application:

  1. If your application's user interface displays data in a collated or "grouped" view, ensure it does so in a locale-aware manner. For example, if your application displays a list of contacts with a header row for each letter, the letters should be appropriate for that locale. Define that set of exemplar characters for each language in the appropriate localization files. There typically should also be an "other" section after the last exemplar character for text starting with "non-native" characters.
  2. Ensure your application sorts data according to the user's selected locale. Frequently this will involve passing the proper locale to the API's "sort" function. Using the Unicode collation might seem like a fair compromise but will lead to a result that is displeasing to most non-English speakers.
  3. If your application is multi-tier, ensure that all tiers of the application are sorting in the same manner. For instance, an n-tier application might use JavaScript, Java and MySQL. If MySQL queries are sorted in a locale-aware manner but then JavaScript or Java executes a subsequent locale-agnostic sort, data will be inconsistent.
  4. If one of the tiers doesn't provide support for locale-aware sorting, remove sorting code from that tier and rely on another tier to perform all sorting.
  5. If the application relies on caching of data, ensure that a locale-aware sort is performed after the data is retrieved from cache. This issue can manifest in many different places, including caching of JSON or XML responses over HTTP.
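As a rough illustration of tips 2 and 5, here is a TypeScript sketch that passes the user's locale to the standard `Intl.Collator` API and re-sorts a list after it has been retrieved. The `sortForLocale` function and the `contacts` array are hypothetical, and the exact order for a given locale depends on the collation data shipped with the runtime, so no output is shown for Danish or Czech.

```typescript
// Sorts a copy of `names` using the collation rules of `locale`.
// Calling Array.prototype.sort() with no comparator would instead compare
// UTF-16 code units, which is locale-agnostic.
function sortForLocale(names: string[], locale: string): string[] {
  const collator = new Intl.Collator(locale);
  return names.slice().sort(collator.compare);
}

// Hypothetical data, e.g. a list that was just read back from a cache.
const contacts = ["Samuel Adams", "Charles Pinckney", "Aaron Burr", "Daniel Carroll"];

console.log(sortForLocale(contacts, "en"));
// [ 'Aaron Burr', 'Charles Pinckney', 'Daniel Carroll', 'Samuel Adams' ]
console.log(sortForLocale(contacts, "da"));
console.log(sortForLocale(contacts, "cs"));
```

Sorting a copy keeps the cached list untouched, so the same cached data can be re-sorted for a different locale later.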

Testacular Quickstart for Linux

Testacular is a great test runner for JavaScript that's powered by Node.js. It has great support for Jasmine and AngularJS, makes it easy to debug tests and will automatically re-run tests as you change your application or test code.

To get Testacular, first you need to install Node.js. Node has installers for Windows and Mac and is available in some Linux package managers. Node is not yet available for installation via yum on Fedora, RedHat and CentOS but the install is still straightforward using the precompiled binaries.

  • Download the "Linux binaries" from the Node download page.
  • Use "tar -xvf" to expand the archive and move to your location of choice
  • Add $NODE_HOME/bin to your path
  • Execute "npm -g install testacular"
  • Run "testacular" to confirm installation

You can now navigate to the application's source code directory, run "testacular init" and follow the prompts. Then simply use "testacular start" to launch the browser and use "testacular run" to run the tests, whether or not you have it set up to autowatch files.

Localization Tips, Part 2: The Evils of Date Formatting

This is the second entry in a multi-part series about localization.

Every modern platform provides a mechanism for formatting dates. We've all dealt with people on the business side who prefer dates formatted as YYYY-mm-dd or dd MMM YY or countless other permutations. The slippery slope of date formatting is that once a date is formatted, it should be formatted correctly for each and every locale that the application supports.

Here is a sample of correct date formatting for 30 common locales:

Even after so many countries managed to agree on the Gregorian calendar system, it's amazing to see the wide variety of date formats. Even if an application only needed to support a few locales, the developer would need to define a few date formats in each translation file and tweak them as appropriate for that locale's conventions. I didn't know Romanians formatted their dates as 17.01.2013, did you? :)

While nearly every platform's date formatting APIs provide a way to explicitly set a date format, some also support the notion of "format styles." These styles define a short, medium and long style for formatting date and time for a given locale. Check out this example from the Flash Globalization APIs.

Specifying a date style of SHORT, MEDIUM or LONG is a very convenient way to ensure that dates are always formatted correctly for the current locale. Here's an example snippet of Flash code which prints out the MEDIUM version of the current date for the supported locales:

The output of this script is the same as the snippet of date formats a few paragraphs up. Pretty convenient stuff here. For added flexibility, a developer could also write code to support explicit date formats for certain locales and then fall back to a date style for other locales.
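The same idea exists outside of Flash. As a hedged illustration (this is not the Flash snippet referred to above), the TypeScript sketch below uses the standard `Intl.DateTimeFormat` API and requests the fields it wants rather than a fixed pattern, leaving field order and separators to the platform. The `formatMediumDate` function, the sample date and the locale list are assumptions made for the example, and the exact strings depend on the locale data shipped with the runtime. Newer runtimes also accept a `dateStyle` option that maps more directly onto the short, medium and long styles.

```typescript
// Formats `date` for `locale`, letting the platform choose the field order
// and separators. The time zone is pinned to UTC so the result does not
// depend on where the code runs.
function formatMediumDate(date: Date, locale: string): string {
  const options: Intl.DateTimeFormatOptions = {
    year: "numeric",
    month: "short",
    day: "numeric",
    timeZone: "UTC",
  };
  return new Intl.DateTimeFormat(locale, options).format(date);
}

// Hypothetical sample date and locales.
const sample = new Date(Date.UTC(2013, 0, 17));
const locales = ["en-US", "de-DE", "ja-JP", "ro-RO"];
for (let i = 0; i < locales.length; i++) {
  console.log(locales[i] + ": " + formatMediumDate(sample, locales[i]));
}
```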

So before you format another date, dig a little deeper into your platform's localization APIs to see what's available. You might be surprised what you find.

Localization Tips, Part 1: It's Okay to Repeat Yourself

This is the first entry in a multi-part series about localization.

I recently worked on a mobile application that needed to support 27 different languages and numerous locales. My team was responsible for building the app's user interface and ensuring it was properly localized. We learned a lot throughout this process and I wanted to share some of it here.

Tip 1: It’s Okay To Repeat Yourself

Most platforms provide basic localization support by exposing an API to retrieve localized strings or "resources" that have been externalized into text files (key/value pair “properties” files, JSON, XML, etc). As software developers, we tend to follow the principle of "Don't Repeat Yourself" (DRY) and identify a single unambiguous place for each bit of logic or configuration. This principle doesn't always apply when dealing with translated text.

If the same string of text appears in several places in the UI there may be a temptation to define a single key for that word and utilize it in several places in the app. Unfortunately, different languages will sometimes have variations on the same word depending on the context in which it's used. For example, the word "search" when translated to Spanish will be "buscar" in some contexts and "búsqueda" in others. If the UI contains a main navigation element, page header and several buttons that all use the text "Search", define a separate value in the translation file for each instance rather than reusing a single value called "search."
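Here is a minimal TypeScript sketch of what one key per UI location might look like. The key names, the `STRINGS` table and the `translate` function are hypothetical, and which Spanish form belongs in which spot is ultimately the translator's call.

```typescript
// Hypothetical translation tables: one key per place the text appears,
// even though the English values are identical.
const STRINGS: { [locale: string]: { [key: string]: string } } = {
  en: {
    "nav.search": "Search",
    "searchPage.header": "Search",
    "searchForm.submitButton": "Search",
  },
  es: {
    "nav.search": "Búsqueda",
    "searchPage.header": "Búsqueda",
    "searchForm.submitButton": "Buscar",
  },
};

// Looks up `key` for `locale`, falling back to English and then to the key.
function translate(locale: string, key: string): string {
  const table = STRINGS[locale] || {};
  if (table[key] !== undefined) {
    return table[key];
  }
  const fallback = STRINGS["en"][key];
  return fallback !== undefined ? fallback : key;
}

console.log(translate("es", "nav.search"));              // Búsqueda
console.log(translate("es", "searchForm.submitButton")); // Buscar
```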

Another subtle issue can arise when dealing with phrases that contain variable values such as numbers or formatted dates which are intermingled with translated text. For example, the UI may define a string that - in English - is represented as "$DATE at $TIME." It's commonplace for developers to define an entry for “at” in the resource file and then concatenate the values together in code. Some languages use a very different grammatical structure for certain phrases which may change the order of words and add certain punctuation elements not found in other languages. The best practice here is to define the entire phrase in the translation file as "$1 at $2" and use a string formatting utility provided by the language or localization API.
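To make the placeholder approach concrete, here is a small TypeScript sketch. The `DATE_AT_TIME` table and the `formatPattern` function are hypothetical stand-ins for whatever formatting utility your platform provides, and the German pattern is only an illustration of a translation that adds a word the English phrase doesn't have.

```typescript
// Hypothetical translated patterns; "$1" is the date and "$2" is the time.
const DATE_AT_TIME: { [locale: string]: string } = {
  en: "$1 at $2",
  de: "$1 um $2 Uhr",
};

// Replaces each "$n" placeholder with the n-th argument (1-based).
// Placeholders without a matching argument are left untouched.
function formatPattern(pattern: string, args: string[]): string {
  return pattern.replace(/\$(\d+)/g, function (match: string, index: string): string {
    const value = args[parseInt(index, 10) - 1];
    return value !== undefined ? value : match;
  });
}

console.log(formatPattern(DATE_AT_TIME["en"], ["Jan 17, 2013", "2:30 PM"]));
// Jan 17, 2013 at 2:30 PM
console.log(formatPattern(DATE_AT_TIME["de"], ["17.01.2013", "14:30"]));
// 17.01.2013 um 14:30 Uhr
```

Because the whole phrase lives in the translation file, a translator can reorder the placeholders or add words without any change to the code.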

So here's a quick recap of Part 1's tips:

  • Define a value for each place in the user interface where a piece of text appears. It may appear duplicative and redundant but will offer a lot of flexibility as the UI is translated into additional languages.
  • Define phrases which contain variable values inside of the translation file using placeholder elements. Don't concatenate strings and variables in code.

Integrating Sencha Touch and Hogan.js for Easy HTML Templating

I've been working with Sencha Touch lately and I've been impressed with it so far. It's a really nice full-stack solution for HTML5 mobile web development. There was, however, one pain point that I needed to solve.

Sencha Touch has a nice component called DataView that can dynamically render HTML based on a collection of data. With a DataView, you have the ability to specify a template ("itemTpl") which is a chunk of HTML rendered for each element in the Store. Unfortunately the framework only supports declaring this in JavaScript, so you are left with some pretty nasty syntax:

Fortunately I was able to hack together a quick and dirty workaround for this based off Hogan.js, Twitter's Mustache-compatible HTML templating system. First I wrote a simple HTML template named myTemplate.mustache:

Then I used Hogan's "hulk" executable to compile the template to a JavaScript file. If you have multiple templates, hulk will merge them into a single file. The template will be available in Hogan via its name, in this case "myTemplate". The output of the file isn't important nor is it very readable, so we'll skip that part.

Then I edited my main HTML file to include the Hogan library and my compiled templates:

Next I wrote a subclass of Ext's XTemplate to serve as an adapter between their template API and Hogan's. I haven't tested this very thoroughly yet; there are likely a bunch of other methods that need overriding to get this solid:

And then updated the original code snippet to use the following:

and presto, it just works :)

Adapting this for Mustache.js should be straightforward provided they expose a similar API with a synchronous render() method.
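The adapter idea itself is independent of either library and can be sketched in a few lines of TypeScript. This is not the subclass described above: `RenderableTemplate`, `TemplateAdapter` and `fakeCompiled` are hypothetical names, the stand-in template exists only so the sketch runs without Sencha Touch or Hogan, and a real adapter would likely need to cover more of the template API than a single `apply()` method.

```typescript
// Assumed shape of a compiled template: an object with a synchronous
// render(context) method.
interface RenderableTemplate {
  render(context: object): string;
}

// Hypothetical adapter exposing an apply(values) method, the call a
// template consumer is assumed to make, and delegating to render().
class TemplateAdapter {
  private template: RenderableTemplate;

  constructor(template: RenderableTemplate) {
    this.template = template;
  }

  apply(values: object): string {
    return this.template.render(values);
  }
}

// Stand-in for a compiled template so the sketch runs on its own.
const fakeCompiled: RenderableTemplate = {
  render: function (context: object): string {
    const name = (context as { name?: string }).name || "";
    return "<div class=\"contact\">" + name + "</div>";
  },
};

const adapter = new TemplateAdapter(fakeCompiled);
console.log(adapter.apply({ name: "Aaron Burr" }));
// <div class="contact">Aaron Burr</div>
```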