[JDEV] Writings from the Journal of TCharron

Jon A. Cruz joncruz at geocities.com
Wed Aug 4 11:38:50 CDT 1999

(Note: my terms might not be the most technically accurate, but this is to convey
a good overview)

Basically, you can think of Unicode as having a character set that contains just
about all the characters you'd want to ever use, and maybe then some (there are
contingents working hard on getting Tolkien's Tengwar and Cirth, and Star Trek
Klingon in).

You can then think of actually storing this large character set using different
encodings. UTF-8 and UTF-16 would be the two most common of these. UTF-16 has the
advantage that nearly all characters in common use are a single 16-bit unit (the
rest take a pair of units). UTF-8 is variable length, and has the advantage that
the 7-bit US-ASCII range is preserved as-is, one byte per character.
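
To make the size trade-off concrete, here is a small Python sketch. The sample
strings and the helper name encoded_sizes are just my own illustration, not part
of any proposal.

```python
# Compare how much space the same text takes in UTF-8 versus UTF-16.

def encoded_sizes(text):
    """Return (utf8_byte_count, utf16_byte_count) for text."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")  # explicit byte order, so no BOM is added
    return len(utf8), len(utf16)

# Illustrative samples: plain ASCII, one accented letter, and CJK text.
for sample in ["hello", "h\u00e9llo", "\u65e5\u672c\u8a9e"]:
    print(repr(sample), encoded_sizes(sample))

# US-ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "hello".encode("utf-8") == "hello".encode("ascii")
```

ASCII text is half the size in UTF-8, while the CJK sample comes out smaller in
UTF-16, which is roughly the trade-off described above.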

Given that it would be handy to test commands and such via telnet, that
standard English stays one byte per character, etc., it is probably best to
standardize on UTF-8 as the one encoding used over the wire. Internally,
clients can be encouraged to use UTF-16, or whatever is most efficient for them,
but only UTF-8 should be allowed on the wire. For UI input and output, the client
might convert to and from a platform-specific charset and encoding, but then go
straight to Unicode for all manipulation.
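
As a rough sketch of that boundary handling, a client might look something like
the Python below. The function names and the choice of cp1252 as the platform
charset are assumptions of mine for illustration only.

```python
# Convert at the edges, keep Unicode in the middle, send only UTF-8.

PLATFORM_CHARSET = "cp1252"  # assumption: a Western European Windows client

def ui_input_to_text(raw, charset=PLATFORM_CHARSET):
    """Decode bytes from the local UI into a Unicode string."""
    return raw.decode(charset)

def text_to_wire(text):
    """Encode a Unicode string as UTF-8 for sending."""
    return text.encode("utf-8")

def wire_to_text(data):
    """Decode UTF-8 bytes received from the wire."""
    return data.decode("utf-8")

def text_to_ui_output(text, charset=PLATFORM_CHARSET):
    """Encode for local display; unrepresentable characters become '?'."""
    return text.encode(charset, errors="replace")

typed = b"caf\xe9"               # "cafe" with an acute accent, typed in cp1252
text = ui_input_to_text(typed)   # Unicode string used for all manipulation
on_wire = text_to_wire(text)     # the only form that leaves the client
assert wire_to_text(on_wire) == text
```

The point is only that the platform charset never leaks past the UI layer; what
a real client does with characters it cannot display is a separate decision.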

One side-effect of standardizing the charset to Unicode would be that
security-related data such as passwords would be easier to handle consistently
across different systems.
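
Here is a hypothetical illustration of the password point. Hashing with SHA-256
is purely my assumption to have something to compare; nothing here says how
passwords would actually be handled.

```python
import hashlib

def wire_digest(native_bytes, platform_charset):
    """Decode platform bytes to Unicode, then hash the UTF-8 form."""
    text = native_bytes.decode(platform_charset)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

password = "p\u00e4ssword"  # contains a-umlaut, which legacy charsets disagree on

# The same password as two different platforms would store it natively.
typed_on_a = password.encode("latin-1")  # a-umlaut is byte 0xE4
typed_on_b = password.encode("cp437")    # a-umlaut is byte 0x84

# The raw bytes differ, so hashing them directly would never match.
assert typed_on_a != typed_on_b

# Going through Unicode and UTF-8 gives both clients the same digest.
assert wire_digest(typed_on_a, "latin-1") == wire_digest(typed_on_b, "cp437")
```

Without an agreed encoding, the two systems would be comparing different byte
strings for what the user considers the same password.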

On MS Windows, COM works by stating that all strings are Unicode. Period. Also,
MS Office does all its work internally as Unicode, and converts whenever it
needs to get data in or out of a Windows system call. (This is because Windows 9x
has all the Unicode versions of API calls present but stubbed to return errors.)
I mention this as an example of "gee, a company that mangles and avoids standards
as much as they do still complies in this area, so maybe we should too".

