
Right, if you ensure that only ASCII characters are used in UTF-8, which you can check using 0x80 & byte == 0x00, counting the number of characters is easy. But to check that this condition holds, you need to iterate over the whole string anyway.

I don't see how this gives any advantage to either of the encodings. Both for UTF-8 and UTF-16 you have to implement some decoding to reliably count the number of characters.
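
For what it's worth, a rough sketch of that check in C (the function name is mine, and the input is assumed to be a raw byte buffer); if every byte passes, the character count is simply the byte count:

    #include <stddef.h>
    #include <stdint.h>

    /* Returns 1 if every byte is plain ASCII (high bit clear), 0 otherwise.
     * If this returns 1, the character count equals the byte count. */
    static int is_ascii_only(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (buf[i] & 0x80)   /* the 0x80 & byte check mentioned above */
                return 0;
        }
        return 1;
    }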



Oh, and another thing: your argument only holds for ASCII, and I hardly ever encounter pure ASCII data nowadays when I work with text. On the other hand, I have never encountered a surrogate pair except in my test libraries, either. Next, if you are working with pure ASCII, who cares about UTF-8 or -16? Last, you still need to scan every byte in UTF-8 for that check, but in UTF-16 you only scan every 16-bit code unit, which is half as many iterations.
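
To make that concrete, here is a rough sketch of the UTF-16 scan in C (the function name is mine, and well-formed code units in native byte order are assumed); it touches one 16-bit unit per step instead of one byte:

    #include <stddef.h>
    #include <stdint.h>

    /* Counts code points in a UTF-16 buffer by scanning 16-bit code units.
     * A high surrogate (0xD800-0xDBFF) starts a pair, so it is counted once
     * and the following low surrogate is skipped. */
    static size_t utf16_codepoint_count(const uint16_t *units, size_t n_units)
    {
        size_t count = 0;
        for (size_t i = 0; i < n_units; i++) {
            count++;
            if (units[i] >= 0xD800 && units[i] <= 0xDBFF)  /* high surrogate */
                i++;                                       /* skip the low surrogate */
        }
        return count;
    }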


When you say 'ASCII' I guess you mean 'strings where all bytes have a value in the range [0-127]', right? If so, I agree that it's rare to encounter that, but the common (however wrong) use of 'ASCII' is 'chars are in [0-255]', i.e. all chars are one byte, and that data is very common.

Thinking about it, I don't know what code page the UTF-8 code points 128-255 map to, if any, though; could you explain? If you treat UTF-8 as ASCII data (one byte, one character, basically), does it generally work with chars in the [128-255] range?


256-character ASCII is called "8-bit", "high", or "extended" ASCII. Pure (7-bit) ASCII is the only thing you can hold in a "one byte per character" UTF-8 array. The 8th bit (or the first, depending on how you look at it) is used to mark multibyte sequences, i.e. any character that is not ASCII. So a single UTF-8 byte can only represent 128 possible symbols, and UTF-8 maps exactly those to the (US-)ASCII characters; such a byte starts with a 0 bit. In other words, you cannot encode high ASCII into one UTF-8 byte. For ALL other characters, every byte of the multi-byte sequence has its first bit set to 1. That's why it is easy to scan for the length of ASCII-only text in UTF-8, but not for anything else.
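
That lead-byte/continuation-byte structure also means counting code points in UTF-8 is a single byte scan; a rough sketch in C, assuming well-formed input (the function name is made up):

    #include <stddef.h>
    #include <stdint.h>

    /* Counts code points in a UTF-8 buffer: every byte that is not a
     * continuation byte (bit pattern 10xxxxxx) starts a character. */
    static size_t utf8_codepoint_count(const uint8_t *buf, size_t len)
    {
        size_t count = 0;
        for (size_t i = 0; i < len; i++) {
            if ((buf[i] & 0xC0) != 0x80)   /* lead byte or plain ASCII byte */
                count++;
        }
        return count;
    }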


Oh I see, thank you, it seems I was misinformed.


The important fact lies in the last sentence: in all of my apps I have so far been able to safely ignore the fact that surrogate pairs take two code units when computing character lengths. So by using UTF-16, all those apps are significantly faster than they would have been on UTF-8. And, for the same reason, it is far easier to take character slices of a UTF-16 encoded array than of a UTF-8 one.
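
A minimal sketch in C of the kind of slicing I mean; it is only valid under that same assumption that no surrogate pairs occur, so a character index equals a code unit index (names are mine):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Copies characters [start, start + count) out of a UTF-16 buffer,
     * assuming no surrogate pairs, so character index == code unit index. */
    static void utf16_slice_bmp_only(const uint16_t *src, size_t start,
                                     size_t count, uint16_t *dst)
    {
        memcpy(dst, src + start, count * sizeof(uint16_t));
    }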


How do you ensure you are not slicing a character in two? For example, how do you prevent slicing 2⁵ in half?



