[Standards] Proposed XMPP Extension: Character counting in message bodies

Jonas Schäfer jonas at wielicki.name
Wed Dec 9 15:01:23 UTC 2020


For the record:

On Tuesday, 8 December 2020 23:13:08 CET Sam Whited wrote:
> I don't understand how this is part of the XML data model. Do you mean
> that only Unicode encodings are supported by XML? If so, that's fair and
> removes one of my arguments, I did not know that was the case. However,
> I still think the data on the wire should describe the other data on the
> wire, not some higher-level "decoded" representation that many XML
> libraries may not even use.

Let me dig up the references:

https://www.w3.org/TR/REC-xml/#charsets

> [Definition: A parsed entity contains text, a sequence of characters, which
> may represent markup or character data.]

text = sequence of characters, representing markup or character data

https://www.w3.org/TR/REC-xml/#syntax

> [Definition: All text that is not markup constitutes the character data of
> the document.]

Ok, so we have text which is a sequence of characters, and what isn’t markup 
is character data.

Now what are characters in XML? Back to: 
https://www.w3.org/TR/REC-xml/#charsets

> [Definition: A character is an atomic unit of text as specified by ISO/IEC
> 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line
> feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of
> these standards cited in A.1 Normative References were current at the time
> this document was prepared. New characters may be added to these standards by
> amendments or new editions. Consequently, XML processors MUST accept any
> character in the range specified for Char.]

Char, in turn, is defined as a subset of the Unicode code point range:

> [2]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
>                    | [#x10000-#x10FFFF]
>       /* any Unicode character, excluding the surrogate blocks, FFFE, and
>          FFFF. */
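
To make that concrete, here is a minimal Python sketch of the Char production
and of what counting "characters" in that sense could look like. The names
is_xml_char and count_xml_chars are made up for illustration and do not come
from any XEP or library. The sketch relies on Python's str being a sequence of
Unicode code points, so len() counts code points rather than bytes on the wire.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production [2]."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def count_xml_chars(text: str) -> int:
    """Count code points in text, rejecting anything outside Char.

    Hypothetical helper: the count is in Unicode code points, i.e. the
    "characters" of the XML data model, not UTF-8 bytes.
    """
    for ch in text:
        if not is_xml_char(ord(ch)):
            raise ValueError(f"U+{ord(ch):04X} is not a legal XML character")
    return len(text)


if __name__ == "__main__":
    body = "Schäfer"
    print(count_xml_chars(body))       # code points
    print(len(body.encode("utf-8")))   # bytes in UTF-8
```

The two numbers printed differ as soon as the body contains anything outside
ASCII, which is the distinction between counting characters and counting bytes.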

kind regards,
Jonas



