Friday, December 23, 2011

A Unicode Post That Is More for Me Than You

I try to use the more world-friendly UTF-8 encoding. However, in MySQL a choice generally has to be made between the two leading UTF-8 collations: utf8_general_ci and utf8_unicode_ci.

From the MySQL Documentation

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
In a nutshell:

UTF-8 General (utf8_general_ci) is usually faster because it does not account for cases where one character is equivalent to a combination of other characters. It compares strictly one character against one character: either they match or they don't. This is the default collation for MySQL's utf8 character set.

UTF-8 Unicode (utf8_unicode_ci) reflects how language is actually used and handles the more complicated conventions: expansions, contractions, and ignorable characters. It is the newer, more correct implementation, at the cost of higher resource usage.
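
To make the one-to-one versus expansion distinction concrete for myself, here is a minimal sketch in Python. It is a toy model, not MySQL's implementation: the two lookup tables and the function names are made up for illustration, and the only behavior it tries to mimic is that a general-style comparison maps each character to exactly one character, while a unicode-style comparison may expand one character into several.

```python
# Toy model only: these tables and function names are hypothetical and are
# NOT MySQL's real collation weight tables.

# One-to-one folding, in the spirit of utf8_general_ci:
# each character maps to exactly one character.
GENERAL_FOLD = {"ß": "s"}

# Expansion table, in the spirit of utf8_unicode_ci:
# one character may map to a sequence of characters.
UNICODE_EXPAND = {"ß": "ss"}


def general_key(text: str) -> str:
    """Build a comparison key one character at a time, with no expansions."""
    return "".join(GENERAL_FOLD.get(ch, ch) for ch in text.lower())


def unicode_key(text: str) -> str:
    """Build a comparison key that lets a single character expand."""
    return "".join(UNICODE_EXPAND.get(ch, ch) for ch in text.lower())


def equal_general(a: str, b: str) -> bool:
    return general_key(a) == general_key(b)


def equal_unicode(a: str, b: str) -> bool:
    return unicode_key(a) == unicode_key(b)
```

With these tables, equal_unicode("Straße", "Strasse") is true because "ß" expands to "ss", while equal_general("Straße", "Strasse") is false because "ß" can only stand in for a single "s".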
