[standards-jig] JIDs (JEP-0029)

Dave dave at dave.tj
Tue Apr 30 16:31:38 UTC 2002


Reply inline:

 - Dave

Craig wrote:
> 
> Dave wrote:
> > <snip>
> > 256 or so characters (not bytes - it's silly to
> > penalize international users just so devs don't have to allocate 5x as
> > much storage for resources 
> 
> You and Temas have both brought this up so it is clear that some 
> clarification for why I chose bytes is in order.  Once we specify 
> encoding (UTF-8), then we can map characters to bytes.  The XML parser 
> is good at doing this for us.  Let's not duplicate that work in the jid 
> parsing routine which for any server implementation to be remotely 
> scalable must be fast.  By specifying a lower bound on _characters_, you 
> are forcing all implementers to interpret the encoding, which is silly.
I'm not asking implementors to interpret the encoding within their code.

> 
> I'll tell you how I came up with 256 bytes.  I started with how many han 
> ideographs (our worst-case encoding) is reasonable for a username and 
> resource.  64 characters is more than enough since so much more 
> information can be encoded in those characters than in US ASCII.  Okay, 
> so 64 han ideographs translates to 256 bytes in a UTF-8 encoding. 
> That's how I arrived at that byte number -- by asking what should be the 
> minimum number of characters allowed and accommodating that number. 
> (Incidentally, at least as of 1.2, and probably currently (haven't looked 
> for a while), the lower bound on the number of characters allowed in the 
> open source server for a username was 8 since it validates the first 64 
> _bytes_ without regard to UTF-8 translation before truncating.)
Incidentally, my 1.4.1 server has me as dave at dave.tj ... only 4
characters.  This has nothing to do with your argument, though, so I
won't dwell on it ;-)

> 
> So, by saying that we should specify in terms of characters, not bytes, 
> you and Temas are asking for the following additional behavior.  You are 
> requiring all jid parsing to translate UTF-8 encoding into character 
> representation so that you can _punish_ ASCII users -- effectively 
> hurting performance solely to further limit latin speaking users.  Wrong.
Nope, I'm simply asking us to specify the limit in characters rather
than bytes, and then have implementors set aside enough space for
a worst-case scenario (5*number_of_chars).  They don't have to do
any translation anyway.  However, users can deal with familiar terms
(characters), rather than being forced to encode their resources in
order to figure out whether they're legal.
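
To make the character/byte distinction concrete, here is a minimal
Python sketch. It is not taken from any Jabber server; the function
names and the 256 default are illustrative assumptions only. It shows
that a resource can be legal under a character limit while exceeding
a byte limit of the same number. Note that UTF-8 as later restricted by
RFC 3629 needs at most 4 bytes per code point, so the sketch cannot
produce the 5x worst case mentioned above.

```python
def utf8_len(resource: str) -> int:
    """Bytes the resource occupies once encoded as UTF-8."""
    return len(resource.encode("utf-8"))


def fits_char_limit(resource: str, max_chars: int = 256) -> bool:
    """Limit expressed in characters (code points); 256 is an assumed value."""
    return len(resource) <= max_chars


def fits_byte_limit(resource: str, max_bytes: int = 256) -> bool:
    """Limit expressed in UTF-8 bytes; 256 is an assumed value."""
    return utf8_len(resource) <= max_bytes


if __name__ == "__main__":
    # U+5BB6 is a Han ideograph; it takes 3 bytes in UTF-8.
    for name in ("home", "homemade-client", "\u5bb6" * 100):
        print(len(name), utf8_len(name),
              fits_char_limit(name), fits_byte_limit(name))
```

With these assumed limits, a 100-ideograph resource passes the
character check but fails the byte check, which is exactly the case a
user cannot judge without encoding the name first.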

The same rationale lies behind my proposal for C compiler-style
behavior, so users can do anything they want with resources, but only
the first X characters are guaranteed to be significant.  The only
really annoying clash between my two proposals occurs when two servers
implement different-length buffers for resource names, and a client
connected to the server with the shorter buffer tries to reply to a
client behind the server with the larger buffer.  The server with the
longer buffer will notice that the resource as truncated by the
short-buffered server differs, within its own buffer length, from the
same resource as truncated by itself (for the same reason that
dave at dave.tj/home and dave at dave.tj/homemade-client differ).  That'll
kinda screw up some stuff, since the reply will probably end up being
treated as if it were sent to a nonexistent resource.
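
The clash can be sketched in a few lines of Python. This is only an
illustration of the scenario described above: the buffer lengths (4 and
16) and the helper names are hypothetical, and truncation is done per
character for simplicity.

```python
def significant(resource: str, limit: int) -> str:
    """Keep only the first `limit` characters, compiler-identifier style."""
    return resource[:limit]


def same_resource(a: str, b: str, limit: int) -> bool:
    """Compare two resources as a server with buffer length `limit` would."""
    return significant(a, limit) == significant(b, limit)


SHORT_BUFFER = 4   # hypothetical short-buffered server
LONG_BUFFER = 16   # hypothetical long-buffered server

original = "homemade-client"
seen_by_short = significant(original, SHORT_BUFFER)  # "home"
seen_by_long = significant(original, LONG_BUFFER)    # "homemade-client"

if __name__ == "__main__":
    # The reply is addressed to what the short-buffered server stored.
    print(same_resource(seen_by_short, seen_by_long, SHORT_BUFFER))
    print(same_resource(seen_by_short, seen_by_long, LONG_BUFFER))
```

The short-buffered server considers the two names identical, while the
long-buffered server does not, so the reply addressed to the truncated
name would not match the resource it was meant for.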

You must agree anyway that telling users to count the bytes required by a
resource name they choose isn't exactly "nice," forcing them to manually
encode their chosen resource name to make sure it's not too long ... so
if somebody can find a simple way to make the standard dictate in terms
of characters without requiring implementations to utf-unencode resources,
I'd definitely vote for that.  Sadly, though, we may be out of luck :-(

> 
> --C
> 
> _______________________________________________
> Standards-JIG mailing list
> Standards-JIG at jabber.org
> http://mailman.jabber.org/listinfo/standards-jig
> 



