
Wednesday, June 11, 2014

Notes on "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

  • Joel Spolsky's blog post
  • Unicode is a "character set", which does not state how it is stored
  • Misconception: that Unicode is simply a 16-bit code where each character takes 16 bits and there are therefore only 65,536 possible characters; Unicode is not an encoding
  • In Unicode, letters map to "code points" (e.g. U+0041) rather than to specific bit patterns (e.g. 0100 0001)
  • Different fonts display code points differently
  • Unicode can define more than 65,536 characters
  • In UTF-8, an "encoding" (how a string is stored in memory, on disk, etc.), every code point from 0-127 is stored in a single byte; only code points 128 and above are stored using 2, 3, or up to 6 bytes
  • UTF-16 stores each code point in exactly two bytes, never more (see the byte-count sketch after these notes)
  • It does not make sense to have a string without knowing what encoding it uses; you simply cannot display it correctly or even figure out where it ends (see the decoding sketch after these notes)
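A minimal Java sketch (my own, not from Joel's post) of the code point vs. encoding distinction: the same three code points occupy different numbers of bytes depending on whether UTF-8 or UTF-16 is used. Class and variable names are illustrative.

    import java.nio.charset.StandardCharsets;

    public class UnicodeEncodings {
        public static void main(String[] args) {
            // "A" is code point U+0041, "é" is U+00E9, "中" is U+4E2D
            String s = "Aé中";

            // Code points are abstract numbers, independent of how they are stored
            s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));

            // The same code points take different numbers of bytes per encoding
            byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);    // 1 + 2 + 3 = 6 bytes
            byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE); // 2 + 2 + 2 = 6 bytes
            System.out.println("UTF-8 bytes:  " + utf8.length);
            System.out.println("UTF-16 bytes: " + utf16.length);
        }
    }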
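And a second small sketch (again my own) of the last point: the same bytes decoded with the wrong charset come out garbled, so a bare byte sequence without a known encoding cannot be displayed correctly.

    import java.nio.charset.StandardCharsets;

    public class EncodingMatters {
        public static void main(String[] args) {
            byte[] bytes = "café".getBytes(StandardCharsets.UTF_8); // 5 bytes: é is 0xC3 0xA9

            // Decoded with the right encoding, the original string comes back
            System.out.println(new String(bytes, StandardCharsets.UTF_8));      // café

            // Decoded with the wrong encoding, the bytes are misinterpreted
            System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // cafÃ©
        }
    }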
