[Standards] Proposed XMPP Extension: Character counting in message bodies
Jonas Schäfer
jonas at wielicki.name
Wed Dec 9 15:01:23 UTC 2020
For the record:
On Tuesday, 8 December 2020 23:13:08 CET Sam Whited wrote:
> I don't understand how this is part of the XML data model. Do you mean
> that only Unicode encodings are supported by XML? If so, that's fair and
> removes one of my arguments, I did not know that was the case. However,
> I still think the data on the wire should describe the other data on the
> wire, not some higher-level "decoded" representation that many XML
> libraries may not even use.
Let me dig up the references:
https://www.w3.org/TR/REC-xml/#charsets
> [Definition: A parsed entity contains text, a sequence of characters, which
> may represent markup or character data.]
text = sequence of characters, representing markup or character data
https://www.w3.org/TR/REC-xml/#syntax
> [Definition: All text that is not markup constitutes the character data of
> the document.]
OK, so text is a sequence of characters, and whatever isn’t markup is
character data.
Now what are characters in XML? Back to:
https://www.w3.org/TR/REC-xml/#charsets
> [Definition: A character is an atomic unit of text as specified by ISO/IEC
> 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line
> feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of
> these standards cited in A.1 Normative References were current at the time
> this document was prepared. New characters may be added to these standards by
> amendments or new editions. Consequently, XML processors MUST accept any
> character in the range specified for Char.]
Char, in turn, is defined as a subset of the Unicode code point range:
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
>              | [#x10000-#x10FFFF]
>     /* any Unicode character, excluding the surrogate blocks, FFFE, and
>     FFFF. */
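To make the consequence concrete, here is a minimal Python sketch of what
"counting characters" means under that production. It is only an
illustration, not text from the XML spec or from the proposed XEP; the names
XML_CHAR_RANGES, is_xml_char and count_xml_chars are made up for this mail.
It relies on Python's str being a sequence of Unicode code points.

```python
# Ranges copied from production [2] Char of XML 1.0 (inclusive bounds).
XML_CHAR_RANGES = (
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point matches the Char production."""
    return any(lo <= code_point <= hi for lo, hi in XML_CHAR_RANGES)


def count_xml_chars(text: str) -> int:
    """Count characters (code points) in text, rejecting non-Char ones."""
    for ch in text:
        if not is_xml_char(ord(ch)):
            raise ValueError(f"U+{ord(ch):04X} is not a legal XML character")
    # A Python str is indexed by code point, so len() is the character count.
    return len(text)


if __name__ == "__main__":
    sample = "Schäfer"
    print("characters:  ", count_xml_chars(sample))
    print("UTF-8 bytes: ", len(sample.encode("utf-8")))
    print("UTF-16 bytes:", len(sample.encode("utf-16-le")))
```

The character count stays the same whichever encoding is on the wire, while
the byte count changes between UTF-8 and UTF-16.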
kind regards,
Jonas