[Standards] RTT, take 2

Simon McVittie simon.mcvittie at collabora.co.uk
Fri Jun 24 09:46:05 UTC 2011


On Fri, 24 Jun 2011 at 11:24:50 +0200, Remko Tronçon wrote:
> > So I'd say that we should refer to characters in a string, and deal with
> > Unicode code-points in the abstract.
> 
> I'm wondering whether 'code points' are any better than UTF-8 based
> positioning. Isn't it possible that a codepoint position also points
> inside a character/glyph/...?

A codepoint is the fundamental unit defined by Unicode, but there is a
related concept which could be called a character (Unicode's own term is a
grapheme cluster), consisting of one or more codepoints: a codepoint
representing a non-combining character, followed by zero or more codepoints
representing combining characters.

(A glyph is something different, and as far as I can tell is only interesting
if you make fonts or font-rendering algorithms.)

In UTF-8 a codepoint is one to four bytes, in UTF-16 a codepoint is either
one or two 16-bit words, and in UCS-4 a codepoint is one 32-bit word.
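
To make those sizes concrete, here is a minimal sketch, assuming Python 3
(where a str is a sequence of codepoints). The helper name encoded_sizes is
made up for illustration, and UTF-32 stands in for UCS-4:

```python
def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 words, UTF-32 words) for one codepoint."""
    if len(ch) != 1:
        raise ValueError("expected exactly one codepoint")
    utf8_bytes = len(ch.encode("utf-8"))
    # The "-le" variants avoid counting a byte-order mark.
    utf16_words = len(ch.encode("utf-16-le")) // 2
    utf32_words = len(ch.encode("utf-32-le")) // 4
    return utf8_bytes, utf16_words, utf32_words


# U+0041, U+00C1, U+20AC (inside the BMP) and U+1F600 (outside it)
for ch in ["\u0041", "\u00c1", "\u20ac", "\U0001f600"]:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

Only the last codepoint, which lies outside the Basic Multilingual Plane,
needs two 16-bit words (a surrogate pair) in UTF-16.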

Here are some codepoints:

* U+0041 LATIN CAPITAL LETTER A
* U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
* U+0301 COMBINING ACUTE ACCENT

The grapheme Á could be written either as U+0041 U+0301 (decomposed form)
or as U+00C1 (composed form). Not all graphemes have a composed form.
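
The two spellings can be compared with a short sketch, assuming Python 3 and
its standard unicodedata module. The variable names are illustrative, and
"q" plus a combining acute is my own example of a grapheme that has no
composed form:

```python
import unicodedata

decomposed = "\u0041\u0301"   # A followed by COMBINING ACUTE ACCENT
composed = "\u00c1"           # LATIN CAPITAL LETTER A WITH ACUTE

# Same grapheme, different codepoint sequences, so a plain comparison fails.
print(len(decomposed), len(composed))
print(decomposed == composed)

# Normalizing to NFC (composed) or NFD (decomposed) makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# No precomposed "q with acute" exists, so NFC leaves two codepoints.
q_acute = "q\u0301"
print(len(unicodedata.normalize("NFC", q_acute)))
```

So a position counted in codepoints can differ between two strings that
render identically, unless both sides agree on a normalization form first.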

> For example, in Qt, this would most likely be
> implemented using a QTextCursor (
> http://doc.trolltech.com/4.7/qtextcursor.html ). However, the text
> talks about 'positioning at character X', and it doesn't seem to be
> defined what this means.

That might either be counting graphemes or codepoints, depending...
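
To show how the two counts diverge, here is a rough sketch, assuming Python 3.
It uses the simplified definition above (a base codepoint plus any following
combining codepoints), not the full grapheme cluster rules from Unicode, and
it says nothing about what QTextCursor actually does. The function names are
made up for illustration:

```python
import unicodedata


def count_codepoints(text):
    """Number of codepoints (a Python 3 str is a codepoint sequence)."""
    return len(text)


def count_graphemes(text):
    """Approximate grapheme count: each codepoint with canonical combining
    class 0 starts a new grapheme; combining marks attach to the previous
    one. A combining mark at the very start counts as its own grapheme."""
    count = 0
    for index, ch in enumerate(text):
        if index == 0 or unicodedata.combining(ch) == 0:
            count += 1
    return count


text = "A\u0301b"   # decomposed A-acute, then "b"
print(count_codepoints(text), count_graphemes(text))
```

With this input, "position 1" is the combining accent if you count
codepoints, but the letter "b" if you count graphemes.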

    S


