[standards-jig] JIDs (JEP-0029)
ckaes at jabber.com
Tue Apr 30 16:24:11 UTC 2002
> 256 or so characters (not bytes - it's silly to
> penalize international users just so devs don't have to allocate 5x as
> much storage for resources
You and Temas have both brought this up, so it is clear that some
clarification of why I chose bytes is in order. Once we specify the
encoding (UTF-8), we can map characters to bytes. The XML parser
is good at doing this for us. Let's not duplicate that work in the JID
parsing routine, which must be fast for any server implementation to be
remotely scalable. By specifying a lower bound on _characters_, you
are forcing all implementers to interpret the encoding, which is silly.
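
To make the difference concrete, here is a minimal sketch in Python. The
function names and the MAX_PART_BYTES constant are illustrative
assumptions, not taken from any server source. The byte check only looks
at the length of the buffer; the character check has to decode the UTF-8
first.

```python
# Hypothetical limit, taken from the 256-byte figure discussed in this thread.
MAX_PART_BYTES = 256


def fits_byte_limit(raw: bytes, limit: int = MAX_PART_BYTES) -> bool:
    """Byte-based check: only the buffer length is needed, no decoding."""
    return len(raw) <= limit


def fits_char_limit(raw: bytes, limit: int) -> bool:
    """Character-based check: the UTF-8 must be decoded before counting."""
    return len(raw.decode("utf-8")) <= limit
```

The second function walks every byte of the input on every call, which is
the extra work a character-based limit would impose on the parser.
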
I'll tell you how I came up with 256 bytes. I started with how many Han
ideographs (our worst-case encoding) are reasonable for a username and
resource. 64 characters is more than enough, since so much more
information can be encoded in those characters than in US-ASCII. Okay,
so 64 Han ideographs translate to at most 256 bytes in UTF-8 (4 bytes
each in the worst case).
That's how I arrived at that byte number -- by asking what should be the
minimum number of characters allowed and accommodating that number.
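
The arithmetic can be checked with a short Python sketch. The helper name
utf8_len is made up for illustration. Note that Han ideographs in the
Basic Multilingual Plane take 3 bytes each in UTF-8; the 4-byte figure is
the worst case, reached by ideographs in the supplementary planes.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


bmp_han = "\u6f22"             # a Han ideograph in the Basic Multilingual Plane
supplementary_han = "\U00020000"  # a Han ideograph in a supplementary plane

print(utf8_len(bmp_han))                 # 3
print(utf8_len(supplementary_han))       # 4
print(utf8_len(supplementary_han * 64))  # 256
print(utf8_len("a" * 64))                # 64
```
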
(Incidentally, at least as of 1.2, and probably still today (I haven't
looked for a while), the lower bound on the number of characters allowed in
the open source server for a username was 8, since it validates the first 64
_bytes_ without regard to UTF-8 translation before truncating.)
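
As an aside, cutting a buffer at a fixed byte offset can split a
multi-byte character. The sketch below is not the server's code, only an
illustration of one way to truncate on a character boundary without
decoding the whole string. It assumes the input is already valid UTF-8,
and the name truncate_utf8 is hypothetical.

```python
def truncate_utf8(raw: bytes, limit: int) -> bytes:
    """Cut `raw` to at most `limit` bytes without splitting a character.

    Assumes `raw` is valid UTF-8. Continuation bytes have the bit pattern
    10xxxxxx, so we back up until the cut lands on the start of a character.
    """
    if len(raw) <= limit:
        return raw
    end = limit
    while end > 0 and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end]


raw = ("\u6f22" * 3).encode("utf-8")  # three 3-byte characters, 9 bytes total
print(len(truncate_utf8(raw, 4)))     # 3: backs up to the end of the first character
```

A plain raw[:4] on the same input would leave a dangling lead byte that
no longer decodes as UTF-8.
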
So, by saying that we should specify in terms of characters, not bytes,
you and Temas are asking for the following additional behavior. You are
requiring all JID parsing to translate the UTF-8 encoding into a character
representation so that you can _punish_ ASCII users -- effectively
hurting performance solely to further limit users of Latin scripts. Wrong.