Localization Tips, Part 3: Sorting and Collating

published: Jan 17, 2013 4:00 PM

This is the third entry in a multi-part series about localization.

Sorting and collating data is something everyone does frequently, usually without giving it much thought. A physical Rolodex captures the essence of it: a list of contacts placed in a particular order and collated (or grouped) in a manner that everyone agrees upon. For example:

A
Aaron Burr
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
Charles Pinckney
D
Daniel Carroll
S
Samuel Adams

Unless you happen to be Danish:

A
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
Charles Pinckney
D
Daniel Carroll
S
Samuel Adams
Å
Aaron Burr

Or Czech:

A
Aaron Burr
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
D
Daniel Carroll
CH
Charles Pinckney
S
Samuel Adams

So what’s going on here? Several factors determine how these characters get sorted.

Exemplar Characters

Each language has a set of “exemplar characters”: the letters commonly used in the modern form of that language. More simply, you can think of them as the “alphabet” or the “native characters” of the language. Here are a few examples:

English: a b c d e f g h i j k l m n o p q r s t u v w x y z
Danish: a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å
Czech: a á b c č d ď e é ě f g h ch i í j k l m n ň o ó p q r ř s š t ť u ú ů v w x y ý z ž

Sometimes a sequence of Latin characters maps to a particular native character. In Danish, “AA” is treated as “Å,” which sorts after “Z.” Sometimes a single exemplar character is written with multiple characters. In Czech, for example, strings starting with “CH” sort after “H,” while strings starting with just “C” sort after “B.”
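As a rough illustration of the two cases above, here is a minimal Python sketch that splits a string into the units a locale would sort by. The tables `CZECH_UNITS` and `DANISH_ALIASES` and the function `split_units` are hypothetical names made up for this example, and the data is hand-written; a real application would normally get this behavior from a collation library rather than hard-coding it.

```python
# Hypothetical, hand-written tables for illustration only.
CZECH_UNITS = ["ch"]            # one exemplar character written as two letters
DANISH_ALIASES = {"aa": "å"}    # Latin sequence treated as a native character


def split_units(text, multi_char=(), aliases=None):
    """Split text into sortable units, preferring the longest match."""
    aliases = aliases or {}
    # Try longer sequences first so that "ch" wins over a plain "c".
    sequences = sorted(set(multi_char) | set(aliases), key=len, reverse=True)
    lowered = text.lower()
    units = []
    i = 0
    while i < len(lowered):
        for seq in sequences:
            if lowered.startswith(seq, i):
                # An alias is replaced by the native character it stands for.
                units.append(aliases.get(seq, seq))
                i += len(seq)
                break
        else:
            # No multi-character sequence starts here: take one character.
            units.append(lowered[i])
            i += 1
    return units


print(split_units("Charles", multi_char=CZECH_UNITS))  # ['ch', 'a', 'r', 'l', 'e', 's']
print(split_units("Carter", multi_char=CZECH_UNITS))   # ['c', 'a', 'r', 't', 'e', 'r']
print(split_units("Aaron", aliases=DANISH_ALIASES))    # ['å', 'r', 'o', 'n']
```

Once a name is split this way, its first unit says which section it belongs to: “ch” for Charles in Czech, “å” for Aaron in Danish.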

Unicode Sorting

The Unicode collation algorithm defines the default sorting applied to the characters supported by Unicode. The order defined here is important because it serves as the basis for how other languages and locales customize collation. However, collating data based on this standard isn’t particularly useful to end users who expect data sorted in a manner that is specific to their own conventions.

Non-Native Characters

Each language defines its own rules for how non-native characters are sorted and collated. Frequently they appear after the native characters, in the order defined by the Unicode collation algorithm. Japanese default collation, however, places Latin characters first, then Japanese characters, then any other non-native characters.
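The common “native first, everything else after” rule can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real collator: `make_sort_key` is a hypothetical helper, the alphabet is copied from the Danish exemplar list above, multi-character cases such as “AA” are ignored, and non-native characters are ordered by raw code point as a stand-in for the Unicode collation algorithm.

```python
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"


def make_sort_key(alphabet):
    """Build a sort key: native letters in alphabet order, all others after."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(text):
        # Native characters become (0, position in alphabet); anything else
        # becomes (1, code point), so it sorts after every native character.
        return [(0, rank[ch]) if ch in rank else (1, ord(ch)) for ch in text.lower()]

    return key


names = ["Øystein", "Zoe", "Ωmega", "Anna", "Åse"]

# Plain code-point order puts Å before Ø, which is wrong for Danish.
print(sorted(names))
# ['Anna', 'Zoe', 'Åse', 'Øystein', 'Ωmega']

# The Danish key puts Ø before Å and the Greek name after all native letters.
print(sorted(names, key=make_sort_key(DANISH_ALPHABET)))
# ['Anna', 'Zoe', 'Øystein', 'Åse', 'Ωmega']
```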

Tips

Here is a brief list of things to keep in mind when dealing with collation in your application:

If your application’s user interface displays data in a collated or “grouped” view, ensure it does so in a locale-aware manner. For example, if your application displays a list of contacts with a header row for each letter, the letters should be appropriate for that locale. Define the set of exemplar characters for each language in the appropriate localization files. There should typically also be an “other” section after the last exemplar character for text starting with “non-native” characters.
Ensure your application sorts data according to the user’s selected locale. Frequently this will involve passing the proper locale to the API’s “sort” function. Using the default Unicode collation might seem like a fair compromise, but the result will look wrong to most non-English speakers.
If your application is multi-tier, ensure that all tiers of the application sort in the same manner. For instance, an n-tier application might use JavaScript, Java and MySQL. If MySQL queries are sorted in a locale-aware manner but JavaScript or Java then executes a subsequent locale-agnostic sort, the order shown to the user will be inconsistent.
If one of the tiers doesn’t provide support for locale-aware sorting, remove sorting code from that tier and rely on another tier to perform all sorting.
If the application caches data, ensure that a locale-aware sort is performed after the data is retrieved from the cache, because the cached copy may have been sorted for a different locale. This issue can surface in many places, including cached JSON or XML responses over HTTP.
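To make the first tip concrete, here is a minimal Python sketch of a grouped contact view with an “other” section. Everything named here is hypothetical: `SECTION_HEADERS` stands in for per-locale localization files, `OTHER` is an arbitrary label for the catch-all section, and `group_for_display` assumes the names have already been sorted by a locale-aware tier, so it only buckets them. It matches on the first character alone and therefore does not handle multi-character cases such as Czech “CH.”

```python
# Hypothetical per-locale header tables, standing in for localization files.
SECTION_HEADERS = {
    "en": list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    "da": list("ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ"),
}
OTHER = "#"  # catch-all section for names starting with a non-native character


def group_for_display(sorted_names, locale_tag):
    """Bucket already-sorted names under locale-specific headers, OTHER last."""
    sections = {header: [] for header in SECTION_HEADERS[locale_tag]}
    sections[OTHER] = []  # added last so it is emitted after every header
    for name in sorted_names:
        first = name[:1].upper()
        sections[first if first in sections else OTHER].append(name)
    # Dicts preserve insertion order, so headers come out in locale order.
    # Empty sections are skipped, as in a Rolodex with no cards behind a tab.
    return [(header, names) for header, names in sections.items() if names]


contacts = ["Benjamin Franklin", "Samuel Adams", "Østergaard", "Ωmega Corp"]

print(group_for_display(contacts, "da"))
# [('B', ['Benjamin Franklin']), ('S', ['Samuel Adams']),
#  ('Ø', ['Østergaard']), ('#', ['Ωmega Corp'])]

print(group_for_display(contacts, "en"))
# [('B', ['Benjamin Franklin']), ('S', ['Samuel Adams']),
#  ('#', ['Østergaard', 'Ωmega Corp'])]
```

The same list produces different sections per locale: “Ø” is a header of its own in Danish but falls into the “other” section in English.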