[standards-jig] JIDs (JEP-0029)

Craig ckaes at jabber.com
Tue Apr 30 16:24:11 UTC 2002

Dave wrote:
> <snip>
> 256 or so characters (not bytes - it's silly to
> penalize international users just so devs don't have to allocate 5x as
> much storage for resources

You and Temas have both brought this up, so it is clear that some 
clarification of why I chose bytes is in order.  Once we specify the 
encoding (UTF-8), we can map characters to bytes.  The XML parser 
is good at doing this for us.  Let's not duplicate that work in the JID 
parsing routine, which must be fast for any server implementation to be 
remotely scalable.  By specifying a lower bound on _characters_, you 
are forcing all implementers to interpret the encoding, which is silly.
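
To make the difference concrete, here is a minimal Python sketch of the 
two kinds of check.  The function names and the MAX_PART_BYTES constant 
are made up for illustration; this is not code from any server.  The 
byte check only looks at the length of the raw data, while the character 
check has to decode the UTF-8 first.

```python
# Hypothetical names throughout; a sketch, not any server's actual code.
MAX_PART_BYTES = 256  # byte limit discussed in this thread


def fits_byte_limit(part: bytes, limit: int = MAX_PART_BYTES) -> bool:
    """Byte-based check: compares the raw length, no decoding needed."""
    return len(part) <= limit


def fits_char_limit(part: bytes, limit: int) -> bool:
    """Character-based check: must decode UTF-8 to count code points."""
    return len(part.decode("utf-8")) <= limit
```

The byte check never touches the encoding, which is the property I care 
about in the parsing routine.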

I'll tell you how I came up with 256 bytes.  I started with how many Han 
ideographs (our worst-case encoding) are reasonable for a username and 
resource.  64 characters is more than enough, since so much more 
information can be encoded in those characters than in US ASCII.  Okay, 
so 64 Han ideographs at up to 4 bytes each translate to 256 bytes in a 
UTF-8 encoding.  That's how I arrived at that byte number -- by asking 
what should be the minimum number of characters allowed and 
accommodating that number.  (Incidentally, at least as of 1.2, and 
probably currently (I haven't looked for a while), the lower bound on 
the number of characters allowed in the open source server for a 
username was 8, since it validates the first 64 _bytes_ without regard 
to UTF-8 translation before truncating.)
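
The arithmetic behind the 256-byte figure can be checked with a short 
Python sketch.  It assumes 4 bytes per character as the worst case; the 
two sample ideographs and all names are chosen only for illustration.

```python
# Hypothetical sample characters, picked only to show UTF-8 byte widths.
BMP_IDEOGRAPH = "\u4e2d"        # U+4E2D, Basic Multilingual Plane
EXT_B_IDEOGRAPH = "\U00020000"  # U+20000, CJK Extension B


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


MIN_CHARS = 64                  # minimum characters we want to allow
WORST_CASE_BYTES_PER_CHAR = 4   # assumption: widest character considered

# 64 characters * 4 bytes each = 256 bytes
byte_limit = MIN_CHARS * WORST_CASE_BYTES_PER_CHAR

print(utf8_len(BMP_IDEOGRAPH), utf8_len(EXT_B_IDEOGRAPH), byte_limit)
```

Under the same byte limit, a name made only of US ASCII characters can 
be up to 256 characters long, since each one takes a single byte.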

So, by saying that we should specify in terms of characters, not bytes, 
you and Temas are asking for the following additional behavior.  You are 
requiring all JID parsing to decode the UTF-8 encoding into a character 
representation so that you can _punish_ ASCII users -- effectively 
hurting performance solely to further limit Latin-script users.  Wrong.

