
Right, if you ensure that only ASCII characters are used in UTF-8, which you can check using 0x80 & byte == 0x00, counting the number of characters is easy. But to check that this condition holds, you need to iterate over the whole string anyway.

I don't see how this gives any advantage to either of the encodings. Both for UTF-8 and UTF-16 you have to implement some decoding to reliably count the number of characters.
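
For what it's worth, a rough sketch of that check in C (the function name is mine, and the input is assumed to be a raw byte buffer); if every byte passes, the character count is simply the byte count:

    #include <stddef.h>
    #include <stdint.h>

    /* Returns 1 if every byte is plain ASCII (high bit clear), 0 otherwise.
     * If this returns 1, the character count equals the byte count. */
    static int is_ascii_only(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (buf[i] & 0x80)   /* the 0x80 & byte check mentioned above */
                return 0;
        }
        return 1;
    }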



Oh, and another thing: your argument only holds for ASCII, and I hardly ever encounter pure ASCII data nowadays when I work with text. On the other hand, I have never encountered a surrogate pair except in my test libraries, either. Next, if you are working with pure ASCII, who cares about UTF-8 or -16? Last, you still need to scan every byte in UTF-8 for that check, but in UTF-16 you only scan every 16-bit code unit, which is half as many iterations.
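
To make that concrete, here is a rough sketch of the UTF-16 scan in C (the function name is mine, and well-formed code units in native byte order are assumed); it touches one 16-bit unit per step instead of one byte:

    #include <stddef.h>
    #include <stdint.h>

    /* Counts code points in a UTF-16 buffer by scanning 16-bit code units.
     * A high surrogate (0xD800-0xDBFF) starts a pair, so it is counted once
     * and the following low surrogate is skipped. */
    static size_t utf16_codepoint_count(const uint16_t *units, size_t n_units)
    {
        size_t count = 0;
        for (size_t i = 0; i < n_units; i++) {
            count++;
            if (units[i] >= 0xD800 && units[i] <= 0xDBFF)  /* high surrogate */
                i++;                                       /* skip the low surrogate */
        }
        return count;
    }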


When you say 'ASCII' I guess you mean 'strings where all bytes have a value in the range [0-127]', right? If so, I agree that it's rare to encounter that, but the common (however wrong) use of 'ASCII' is 'chars are in [0-255]', i.e. all chars are one byte, and that data is very common.

Thinking about it, I don't know what code page the UTF-8 code points 128-255 map to, if any, though; could you explain? If you treat UTF-8 as ASCII data (one byte, one character, basically), does it generally work with chars in the [128-255] range?


256-character ASCII is called "8-bit", "high", or "extended" ASCII. Pure (7-bit) ASCII is the only thing you can hold in a "one byte per character" UTF-8 array. The 8th bit (or the first, depending on how you look at it) is used to mark multibyte sequences, i.e. any character that is not ASCII. So a single UTF-8 byte can only represent 128 possible symbols, and UTF-8 maps exactly those to the (US-)ASCII characters; such a byte starts with a 0 bit. In other words, you cannot encode high ASCII into one UTF-8 byte. For ALL other characters, every byte of the multi-byte sequence has its first bit set to 1. That's why it is easy to scan for the length of ASCII-only text in UTF-8, but not for anything else.
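
That lead-byte/continuation-byte structure also means counting code points in UTF-8 is a single byte scan; a rough sketch in C, assuming well-formed input (the function name is made up):

    #include <stddef.h>
    #include <stdint.h>

    /* Counts code points in a UTF-8 buffer: every byte that is not a
     * continuation byte (bit pattern 10xxxxxx) starts a character. */
    static size_t utf8_codepoint_count(const uint8_t *buf, size_t len)
    {
        size_t count = 0;
        for (size_t i = 0; i < len; i++) {
            if ((buf[i] & 0xC0) != 0x80)   /* lead byte or plain ASCII byte */
                count++;
        }
        return count;
    }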


Oh I see, thank you, it seems I was misinformed.


The important fact lies in the last sentence: in all of my apps I have so far been able to safely ignore the fact that surrogate pairs take two code units when computing character lengths. So by using UTF-16, all those apps are significantly faster than they would have been on UTF-8. And, for the same reason, it is far easier to take character slices of a UTF-16 encoded array than of a UTF-8 one.
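
A minimal sketch in C of the kind of slicing I mean; it is only valid under that same assumption that no surrogate pairs occur, so a character index equals a code unit index (names are mine):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Copies characters [start, start + count) out of a UTF-16 buffer,
     * assuming no surrogate pairs, so character index == code unit index. */
    static void utf16_slice_bmp_only(const uint16_t *src, size_t start,
                                     size_t count, uint16_t *dst)
    {
        memcpy(dst, src + start, count * sizeof(uint16_t));
    }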


How do you ensure you are not slicing a character in two? For example, how do you prevent slicing 2⁵ in half?



