[standards-jig] LAST CALL: JEP-0029 (JIDs)
ckaes at jabber.com
Mon May 6 17:04:52 UTC 2002
Fabrice DESRE - FT.BD/FTR&D/DTL/TAL wrote:
> § 2.3 :
> What is the "case-normalized canonical form" ?
> § 2.5 :
> UTF-8 encoded characters may theoretically be up to six bytes long. (see
> http://czyborra.com/utf/#UTF-8 for instance), so 256 bytes will provide
> at least storage for 42 characters.
Well, that link says, "Actually, UTF-8 continues to represent up to 31
bits with up to 6 bytes, but it is generally expected that the one
million code points of the 20 bits offered by UTF-16 and 4-byte UTF-8
will suffice to cover all characters and that we will never get to see
any Unicode character definitions beyond that." So while UTF-8 allowed up to
6 bytes, it was believed at the time that page was written that we'd never
need more than 4.
More recently, however, the Unicode standard (as of 3.1) was changed to
allow only sequences of 4 bytes or fewer, and non-shortest-form sequences
are disallowed. See http://www.unicode.org/unicode/reports/tr27/, especially
D36, which states:
(a) UTF-8 is the Unicode Transformation Format that serializes a Unicode
code point as a sequence of one to four bytes, as specified in Table
3.1, UTF-8 Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that does
not match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where
the first three bytes correspond to a high surrogate, and the next three
bytes correspond to a low surrogate. As a consequence of C12, these
irregular UTF-8 sequences shall not be generated by a conformant process.
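For anyone who wants to poke at this, here is a rough Python 3 sketch of what the definition above implies, using only the built-in UTF-8 codec. The helper names (utf8_len, is_legal_utf8, min_chars) are made up for illustration and do not come from the JEP or from TR27. Note that Python's strict decoder also rejects the irregular six-byte form on input, which goes beyond what clause (c) requires, so treat this as an approximation rather than a TR27 conformance check.

```python
# Sketch only: helper names are hypothetical, behaviour is that of
# Python 3's built-in strict UTF-8 codec.

def utf8_len(code_point: int) -> int:
    """Number of bytes the UTF-8 encoder emits for one code point."""
    return len(chr(code_point).encode("utf-8"))


def is_legal_utf8(data: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def min_chars(byte_limit: int, max_bytes_per_char: int) -> int:
    """Worst-case number of characters that fit in byte_limit bytes."""
    return byte_limit // max_bytes_per_char


# One to four bytes, depending on the code point.
lengths = [utf8_len(cp) for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF)]

# Non-shortest (overlong) two-byte encoding of U+002F "/".
overlong_slash = b"\xc0\xaf"

# U+10000 written as two three-byte surrogate halves (D800, DC00):
# the "irregular" six-byte form described in clause (c).
surrogate_pair = b"\xed\xa0\x80\xed\xb0\x80"

# The same code point in its proper four-byte form.
four_byte_form = "\U00010000".encode("utf-8")

if __name__ == "__main__":
    print(lengths)
    print(is_legal_utf8(overlong_slash), is_legal_utf8(surrogate_pair))
    print(is_legal_utf8(four_byte_form))
    print(min_chars(256, 6), min_chars(256, 4))
```

With a six-byte worst case, 256 bytes guarantee room for 256 // 6 = 42 characters, which is the figure Fabrice quoted. With the four-byte ceiling the guarantee rises to 256 // 4 = 64.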