[Standards] XEP-0301 0.5 comments -Unicode characters

Mark Rejhon markybox at gmail.com
Sat Jul 28 19:14:04 UTC 2012


> [Change Made]
> I've now added this clarification to Summary of Attribute Values
>
> <GH>I returned to the definitions in Unicode now, and think now that
> "character" is too vague. Unicode has in its glossary 4 different meanings
> of character, and some of them certainly can result in multiple code points.
> So, I hope you have formulated something that very reliably tells that we
> count code points.
>
> Even this description is hard to evaluate the nomenclature from:
> http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf#G2212

[Change Made]
I am pleased to say 0.6 is vastly clearer about the Unicode nomenclature now.
Version 0.6 has just been published at
http://www.xmpp.org/extensions/xep-0301.html

Though "character" is often ambiguous, other Unicode-related documents
use "character" to mean a single code point:
- unicode.org FAQ -- http://www.unicode.org/faq/basic_q.html -- has
questions that use "character" and "code point" interchangeably.
- unicode.org glossary -- definition (3) of "Character" and
definition (2) of "Code Point" are compatible with a document
explicitly treating the two as equivalent.
- RFC5198 "Unicode Format for Network Interchange" --
http://tools.ietf.org/html/rfc5198 -- explicitly defines "character"
as a code point, and then continues to use the word "character".
So, these changes were made:

http://xmpp.org/extensions/xep-0301.html#attribute_values
Section 4.5.2 Attribute Values
"For the purpose of this specification, the word "character"
represents a single Unicode code point. See [[[Unicode Character
Counting(link)]]]."
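
To illustrate why the definition matters for attribute values, here is
a rough Python sketch (not taken from the spec) that compares a code
point count with a UTF-16 code unit count for the same string. The
helper names are hypothetical, and it assumes Python 3, where len() on
a str counts code points.

```python
def code_point_count(text: str) -> int:
    """Count Unicode code points (the "character" unit discussed above)."""
    # Assumption: Python 3 str, where len() counts code points.
    return len(text)


def utf16_code_unit_count(text: str) -> int:
    """Count UTF-16 code units, which is what some platforms' string
    lengths report instead of code points."""
    return len(text.encode("utf-16-le")) // 2


# 'a' followed by U+1F600, a code point outside the Basic Multilingual Plane.
sample = "a\U0001F600"

print(code_point_count(sample))       # 2 code points
print(utf16_code_unit_count(sample))  # 3 code units (surrogate pair)
```

An implementation whose native string length is in UTF-16 code units
would need to convert before using such a value as a character count.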

http://xmpp.org/extensions/xep-0301.html#accurate_processing_of_action_elements
Section 4.7 Accurate Processing of Action Elements
-- It reads a little less arduously
-- I now reference RFC5198, and clearly state what "character" means.
-- I now reference Normalization Form C (which is in widespread use
for networking including XMPP anyway, and is a default on many OS
platforms).
-- Section 4.7 is still big, but the guidelines have been vastly
clarified to reduce misunderstandings.
-- The first two sentences of "Unicode Character Counting" now act as
a quick definition: "For this specification, a "character" represents
a single Unicode code point. This is the same definition used in
section 1.1 of IETF RFC 5198 [11]."
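
As a small illustration of why Normalization Form C is relevant to
counting, the following Python sketch (an example only, not text from
the spec) shows the same visible letter having a different code point
count before and after NFC. It uses the standard unicodedata module;
the variable names are arbitrary.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
decomposed = "e\u0301"

# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))         # 2
print(len(composed))           # 1
print(composed == "\u00e9")    # True
```

If the two ends of a conversation count code points on differently
normalized text, their counts can disagree even though the text looks
identical on screen.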


Also, all this wordy Unicode-related material has now been moved to
the bottom of the Protocol section. This keeps the spec tidier and
easier to read, while the important (arduous-to-read but unfortunately
necessary) "devil-in-the-details" material stays available there for
extended reading. (The "4. Protocol" section is only 1/4 the size
of the rest of the document).  At the same time, the terminology has
been made more user-friendly and compatible with widespread usage, and
RFC5198 gives me a convenient reference.

Thanks,
Mark Rejhon


