[Standards] XEP-0301 0.5 comments -Unicode characters

Mark Rejhon markybox at gmail.com
Sat Jul 28 19:14:04 UTC 2012

> [Change Made]
> I've now added this clarification to Summary of Attribute Values
> <GH>I returned to the definitions in Unicode now, and think now that
> "character" is too vague. Unicode has in its glossary 4 different meanings
> of character, and some of them certainly can result in multiple code points.
> So, I hope you have formulated something that very reliably tells that we
> count code points.
> Even this description is hard to evaluate the nomenclature from:
> http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf#G2212

[Change Made]
I am pleased to say 0.6 is vastly clearer about the Unicode nomenclature now.
Version 0.6 has just been published at

Though "character" is often ambiguous, other Unicode documents use
"character" terminology to describe a single code point:
- unicode.org FAQ -- http://www.unicode.org/faq/basic_q.html -- has
questions that refer to characters and code points interchangeably.
- unicode.org glossary -- Definition (3) of the word "Character" and
definition (2) of "Code Point" are compatible with allowing a document
to specifically refer to the equivalence.
- RFC5198 "Unicode Format for Network Interchange" --
http://tools.ietf.org/html/rfc5198 -- specifically defines character
terminology as a code point, but continues to use the word "character".

So, these changes were made:

Section 4.5.2 Attribute Values
"For the purpose of this specification, the word "character"
represents a single Unicode code point. See [[[Unicode Character

Section 4.7 Accurate Processing of Action Elements
-- It reads a little less arduously
-- I now reference RFC5198, and clearly mention that character.
-- I now reference Normalization Form C (which is in widespread use
for networking including XMPP anyway, and is a default on many OS
-- Section 4.7 is still big, but the guidelines have been vastly
clarified to reduce misunderstandings.
-- The first two sentences of "Unicode Character Counting" now act as a
quick definition: "For this specification, a "character" represents a
single Unicode code point. This is the same definition used in section
1.1 of IETF RFC 5198 [11]."
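
To illustrate why the "character = code point" definition matters, here
is a minimal Python sketch (not taken from the spec; the function names
are hypothetical). It assumes counts are taken per code point, and it
compares that with UTF-16 code units and with the effect of
Normalization Form C:

```python
import unicodedata


def code_point_count(text: str) -> int:
    """Count Unicode code points (a Python str is indexed by code point)."""
    return len(text)


def utf16_unit_count(text: str) -> int:
    """Count UTF-16 code units, which differ from code points above U+FFFF."""
    return len(text.encode("utf-16-le")) // 2


def to_nfc(text: str) -> str:
    """Normalize to Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"          # "e" followed by a combining acute accent
emoji = "\U0001F600"            # one code point outside the BMP
flag = "\U0001F1E8\U0001F1E6"   # two regional indicators, displayed as one flag

print(code_point_count(decomposed))          # 2
print(code_point_count(to_nfc(decomposed)))  # 1 (composed to U+00E9)
print(code_point_count(emoji), utf16_unit_count(emoji))  # 1 2
print(code_point_count(flag), utf16_unit_count(flag))    # 2 4
```

The flag example shows that even after NFC, one user-perceived
character can still be several code points, so sender and receiver need
to agree on code points as the counting unit.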

Also, all this wordy Unicode-related material has now been moved to the
bottom of the Protocol section. This keeps the spec tidier and easier to
read, while the important (arduous-to-read but unfortunately necessary)
"devil-in-the-details" material stays available there as extended
reading. (The "4. Protocol" section is only 1/4 the size
of the rest of the document).  At the same time, the terminology has
been made more user-friendly and compatible with widespread usage, and
the handy RFC5198 provides me a convenient reference.

Mark Rejhon
