[standards-jig] LAST CALL: JEP-0029 (JIDs)

Craig ckaes at jabber.com
Mon May 6 17:04:52 UTC 2002

Fabrice DESRE - FT.BD/FTR&D/DTL/TAL wrote:
>  § 2.3 :
> What is the "case-normalized canonical form" ?

See  http://www.unicode.org/unicode/reports/tr15/
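As a rough illustration of what a "case-normalized canonical form" could mean in practice, here is a minimal Python sketch. It assumes the JEP intends something like case folding followed by canonical composition (NFC, one of the forms described in TR15); that reading is an assumption, and the helper name `canonical_casefolded` is hypothetical.

```python
import unicodedata

def canonical_casefolded(s: str) -> str:
    # Hypothetical helper: fold case first, then apply canonical
    # composition (NFC). Which normalization form the JEP actually
    # requires is an assumption here.
    return unicodedata.normalize("NFC", s.casefold())

# The same visible text, stored two different ways:
composed = "Caf\u00e9"      # "é" as one precomposed code point
decomposed = "Cafe\u0301"   # "e" followed by a combining acute accent

# The raw strings differ, but their normalized forms compare equal.
raw_equal = composed == decomposed
normalized_equal = (
    canonical_casefolded(composed) == canonical_casefolded(decomposed)
)
```

Comparing the normalized forms rather than the raw strings is what would let two such spellings be treated as the same identifier.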

> § 2.5 :
> UTF-8 encoded characters may theoretically be up to six bytes long. (see
> http://czyborra.com/utf/#UTF-8 for instance), so 256 bytes will provide
> at least storage for 42 characters.

Well, that link says, "Actually, UTF-8 continues to represent up to 31 
bits with up to 6 bytes, but it is generally expected that the one 
million code points of the 20 bits offered by UTF-16 and 4-byte UTF-8 
will suffice to cover all characters and that we will never get to see 
any Unicode character definitions beyond that."  So while up to 6 bytes 
were allowed, the expectation when that page was written was that we'd 
never need more than 4.

More recently, however, the Unicode standard (as of 3.1) was changed to 
allow only sequences of 4 bytes or fewer, and non-shortest-form 
sequences are disallowed. See http://www.unicode.org/unicode/reports/tr27/, 
especially D36, which states:

(a) UTF-8 is the Unicode Transformation Format that serializes a Unicode 
code point as a sequence of one to four bytes, as specified in Table 
3.1, UTF-8 Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that does 
not match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where 
the first three bytes correspond to a high surrogate, and the next three 
bytes correspond to a low surrogate. As a consequence of C12, these 
irregular UTF-8 sequences shall not be generated by a conformant process.
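
To make the byte counts concrete, here is a small Python sketch. It shows the one-to-four-byte encodings, the worst-case character counts for a 256-byte limit, and a strict decoder rejecting non-shortest-form and longer-than-four-byte sequences. The helper name `is_legal_utf8` is hypothetical. Note that Python's decoder also rejects the six-byte surrogate-pair form, which is stricter than clause (c) above, since that clause only forbids generating such sequences.

```python
# UTF-8 length of sample code points: each takes 1 to 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U00010000"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Worst-case number of characters that fit in a 256-byte field.
min_chars_six_byte = 256 // 6    # under the old six-byte assumption
min_chars_four_byte = 256 // 4   # under the four-byte limit

def is_legal_utf8(data: bytes) -> bool:
    # Hypothetical helper: relies on Python's strict UTF-8 decoder.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

overlong_slash = b"\xc0\xaf"                  # non-shortest form of "/"
five_byte = b"\xf8\x88\x80\x80\x80"           # lead byte for a 5-byte form
surrogate_pair = b"\xed\xa0\x80\xed\xb0\x80"  # six-byte form of U+10000

results = {
    "shortest form": is_legal_utf8("/".encode("utf-8")),
    "overlong": is_legal_utf8(overlong_slash),
    "five-byte": is_legal_utf8(five_byte),
    "surrogate pair": is_legal_utf8(surrogate_pair),
}
```

Under the four-byte limit, the worst case for 256 bytes works out to 64 characters rather than 42.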

