Localization Tips, Part 3: Sorting and Collating


This is the third entry in a multi-part series about localization.

Sorting and collating data is something everyone does frequently and rarely spends much time thinking about. If you’ve ever seen a physical Rolodex, you’ve seen the essence of sorting and collating: a list of contacts put in a particular order and collated (or grouped) in a manner that everyone agrees upon. For example:

A
Aaron Burr
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
Charles Pickney
D
Daniel Carroll
S
Samuel Adams

Unless you happen to be Danish:

A
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
Charles Pickney
D
Daniel Carroll
S
Samuel Adams
Å
Aaron Burr

Or Czech:

A
Aaron Burr
Alexander Hamilton
B
Benjamin Franklin
C
Carter Braxton
D
Daniel Carroll
CH
Charles Pickney
S
Samuel Adams

So what’s going on here? Several considerations determine how these characters get sorted.

Exemplar Characters

Each language has a set of “exemplar characters” which contains the commonly used letters of a given modern form of a language. More simply, you can think of them as the “alphabet” or the “native characters” for a given language. Here are a few examples of these characters:

English: a b c d e f g h i j k l m n o p q r s t u v w x y z
Danish: a b c d e f g h i j k l m n o p q r s t u v w x y z æ ø å
Czech: a á b c č d ď e é ě f g h ch i í j k l m n ň o ó p q r ř s š t ť u ú ů v w x y ý z ž

Sometimes a sequence of Latin characters maps to a particular native character. In Danish, “AA” is treated as “Å,” which sorts after “Z.” Conversely, a single exemplar character is sometimes written with multiple characters. In Czech, for example, strings starting with “CH” sort after “H,” but strings starting with just “C” sort after “B.”
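
To make the two rules above concrete, here is a minimal Python sketch that builds a sort key from an exemplar list. It is an illustration under simplifying assumptions, not how a real collation library works: the helper name make_sort_key and the aliases parameter are hypothetical, every exemplar character is treated as its own letter in the order listed above, and anything outside the alphabet is simply pushed after the native letters.

```python
ENGLISH = "a b c d e f g h i j k l m n o p q r s t u v w x y z".split()
DANISH = ENGLISH + ["æ", "ø", "å"]
CZECH = ("a á b c č d ď e é ě f g h ch i í j k l m n ň o ó p q r ř "
         "s š t ť u ú ů v w x y ý z ž").split()


def make_sort_key(alphabet, aliases=None):
    """Return a key function for sorted() based on an exemplar list.

    alphabet: letters in sort order; an entry may be multi-character ("ch").
    aliases:  hypothetical mapping of a character sequence to the letter it
              should sort as, e.g. {"aa": "å"} for the Danish rule above.
    """
    aliases = aliases or {}
    rank = {letter: i for i, letter in enumerate(alphabet)}
    # Try longer tokens first so that "ch" wins over "c" and "aa" over "a".
    tokens = sorted(set(rank) | set(aliases), key=len, reverse=True)

    def sort_key(text):
        text = text.lower()
        key, i = [], 0
        while i < len(text):
            for token in tokens:
                if text.startswith(token, i):
                    key.append((0, rank[aliases.get(token, token)]))
                    i += len(token)
                    break
            else:
                # Simplification: anything not in the alphabet (spaces
                # included) sorts after the native letters, by code point.
                key.append((1, ord(text[i])))
                i += 1
        return key

    return sort_key


NAMES = ["Samuel Adams", "Charles Pickney", "Daniel Carroll", "Aaron Burr",
         "Carter Braxton", "Benjamin Franklin", "Alexander Hamilton"]

english_order = sorted(NAMES, key=make_sort_key(ENGLISH))
danish_order = sorted(NAMES, key=make_sort_key(DANISH, {"aa": "å"}))
czech_order = sorted(NAMES, key=make_sort_key(CZECH))

print(danish_order[-1])   # Aaron Burr ("aa" sorts as "å", after "z")
print(czech_order[4:6])   # ['Daniel Carroll', 'Charles Pickney']
```

With these inputs the sketch reproduces the three orderings shown at the top of this post. In a real application you would rely on your platform's collation support rather than hand-written tables like these.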

Unicode Sorting

The Unicode Collation Algorithm (UCA) defines a default sort order for the characters Unicode supports. This default order matters because it serves as the base that individual languages and locales tailor with their own rules. However, data collated purely by this default isn’t particularly useful to end users, who expect data sorted according to their own conventions.
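
It is worth noting that a plain string sort in many programming languages is not the Unicode default order at all; it simply compares code points. The Python sketch below shows the difference on a small list. The function rough_primary_key is a hypothetical, crude stand-in that ignores case and accents; it is not an implementation of the Unicode collation algorithm and only approximates what a real collator would do for these particular Latin names.

```python
import unicodedata

# Normalize to NFC so each accented letter is a single code point,
# regardless of how this file was saved.
names = [unicodedata.normalize("NFC", n)
         for n in ["Zoe", "aaron", "Åse", "Émile", "benjamin"]]

# Python compares strings by code point: uppercase ASCII first,
# then lowercase ASCII, then accented letters.
by_code_point = sorted(names)


def rough_primary_key(text):
    """Crude approximation: case-fold, decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


by_rough_key = sorted(names, key=rough_primary_key)

print(by_code_point)  # ['Zoe', 'aaron', 'benjamin', 'Åse', 'Émile']
print(by_rough_key)   # ['aaron', 'Åse', 'benjamin', 'Émile', 'Zoe']
```

The second ordering is closer to what most readers of Latin-script text expect, but it still knows nothing about locale rules such as the Danish and Czech ones described earlier.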

Non-Native Characters

Different languages define different rules for how non-native characters are sorted and collated. Frequently they appear after the native characters, in the order defined by the Unicode Collation Algorithm. However, Japanese’s default collation shows Latin characters first, then Japanese characters, then other non-native characters.
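
The grouping described above can be sketched in Python as a two-part sort key: first the group a string's leading character belongs to, then the string itself. This is only an illustration. The helper script_group and the group tables are hypothetical, the script is guessed from the character's Unicode name, and strings within a group are ordered by code point, which is a simplification (real Japanese collation, for instance, has its own rules for ordering kanji).

```python
import unicodedata


def script_group(text, group_order):
    """Return the index of the first group whose name prefix matches the
    Unicode name of the leading character; unmatched text goes last."""
    if not text:
        return len(group_order)
    name = unicodedata.name(text[0], "")
    for index, prefixes in enumerate(group_order):
        if name.startswith(prefixes):  # prefixes is a tuple of strings
            return index
    return len(group_order)


# Latin first, then Japanese scripts, then everything else.
JAPANESE_STYLE = [("LATIN",), ("HIRAGANA", "KATAKANA", "CJK")]
# The more common pattern: native letters first, everything else after.
CYRILLIC_FIRST = [("CYRILLIC",)]

names = ["田中", "Иван", "Alice", "さくら", "Bob"]

japanese_style = sorted(
    names, key=lambda s: (script_group(s, JAPANESE_STYLE), s))
native_first = sorted(
    names, key=lambda s: (script_group(s, CYRILLIC_FIRST), s))

print(japanese_style)  # ['Alice', 'Bob', 'さくら', '田中', 'Иван']
print(native_first)    # ['Иван', 'Alice', 'Bob', 'さくら', '田中']
```

The same list produces two different orderings depending on which group table is used, which is the point: where non-native characters land is a decision each locale makes for itself.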

Tips

Here is a brief list of things to keep in mind when dealing with collation in your application:

