
Wednesday, June 11, 2014

Notes on "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

  • Joel Spolsky's blog post
  • Unicode is a "character set", which does not state how it is stored
  • Misconception: that Unicode is simply a 16-bit code where each character takes 16 bits and there are therefore only 65,536 possible characters; Unicode is not an encoding
  • In Unicode, letters map to "code points" (e.g. U+0041) rather than to specific bit patterns (e.g. 0100 0001)
  • Different fonts display code points differently
  • Unicode can define more than 65,536 characters
  • In UTF-8, an "encoding" (how a string is stored in memory, on disk, etc.), every code point from 0-127 is stored in a single byte; only code points 128 and above are stored using 2, 3, or up to 6 bytes
  • UTF-16 stores each code point in exactly two bytes, never more (see the byte-count sketch after these notes)
  • It does not make sense to have a string without knowing what encoding it uses; you simply cannot display it correctly or even figure out where it ends (see the decoding sketch after these notes)
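A minimal Java sketch (my own, not from Joel's post) of the code point vs. encoding distinction: the same three code points occupy different numbers of bytes depending on whether UTF-8 or UTF-16 is used. Class and variable names are illustrative.

    import java.nio.charset.StandardCharsets;

    public class UnicodeEncodings {
        public static void main(String[] args) {
            // "A" is code point U+0041, "é" is U+00E9, "中" is U+4E2D
            String s = "Aé中";

            // Code points are abstract numbers, independent of how they are stored
            s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));

            // The same code points take different numbers of bytes per encoding
            byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);    // 1 + 2 + 3 = 6 bytes
            byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE); // 2 + 2 + 2 = 6 bytes
            System.out.println("UTF-8 bytes:  " + utf8.length);
            System.out.println("UTF-16 bytes: " + utf16.length);
        }
    }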
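And a second small sketch (again my own) of the last point: the same bytes decoded with the wrong charset come out garbled, so a bare byte sequence without a known encoding cannot be displayed correctly.

    import java.nio.charset.StandardCharsets;

    public class EncodingMatters {
        public static void main(String[] args) {
            byte[] bytes = "café".getBytes(StandardCharsets.UTF_8); // 5 bytes: é is 0xC3 0xA9

            // Decoded with the right encoding, the original string comes back
            System.out.println(new String(bytes, StandardCharsets.UTF_8));      // café

            // Decoded with the wrong encoding, the bytes are misinterpreted
            System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // cafÃ©
        }
    }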
