
>In general, internally using unicode and converting to and from bytes when doing i/o is the right way to go.

I'm not sure what "internally using unicode" means. Python's internal representation of strings has changed a lot; it hasn't even been stable within Python 3. Now they are apparently using an internal representation that varies depending on the "widest" character stored.

The only solution that isn't driving me insane is to use UTF-8 everywhere. The Python 3 unicode situation is actually the main reason why I'm not using Python much these days.



In Python 3, you don't care about what they use internally. You don't need to.

If you want to work with strings, you work with strings. If you want to work with bytes, you work with bytes. If you want to convert bytes into strings (maybe because it's user input that you want to work with), then you tell Python what encoding those bytes are in and you have it create a string for you. You don't care what Python uses internally, because the string API is correct and operates on characters.

The noël example from the original article consists of 4 characters in Python 3, which is exactly what you want.
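For instance (a minimal sketch; this assumes the ë is the precomposed U+00EB and the input happens to be UTF-8):

    raw = b"no\xc3\xabl"          # bytes as they might arrive from I/O (UTF-8)
    s = raw.decode("utf-8")       # tell Python 3 what encoding the bytes are in
    len(s)                        # 4 -- characters, not bytes
    s[::-1]                       # 'lëon' -- string operations work on codepoints
    s.encode("utf-8")             # back to bytes on the way out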

I know that just using UTF-8 everywhere would be cool, but that's not how the world works, for various reasons. One is that UTF-8 is a variable-length encoding, which has performance issues for some operations (like getting the length of a string, or finding the n-th character).
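A rough illustration of why (not how any particular runtime actually implements it): UTF-8 continuation bytes look like 0b10xxxxxx, so even counting characters means scanning the whole buffer.

    def utf8_length(data: bytes) -> int:
        # Continuation bytes have the form 0b10xxxxxx; only leading bytes are counted.
        return sum(1 for b in data if b & 0xC0 != 0x80)

    utf8_length("noël".encode("utf-8"))   # 4 characters, but all 5 bytes had to be scanned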

UTF-8 also isn't widely used by current operating systems (Mac OS and Windows use UCS-2). It's also not what's used by way too many legacy systems still around.

So as long as the data you work with likely isn't in UTF-8, the encoding and decoding steps will be needed if you want to be correct. Otherwise, you risk mixing strings in different encodings together, which is an irrecoverable error (short of using heuristics based on the language of the content).
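A quick sketch of that failure mode (the café strings are just for illustration): once bytes in two encodings end up in one buffer, no single decode gets you back to correct text.

    a = "café".encode("utf-8")    # b'caf\xc3\xa9'
    b = "café".encode("latin-1")  # b'caf\xe9'
    blob = a + b                  # two encodings mixed in one buffer
    blob.decode("utf-8")          # UnicodeDecodeError: 0xe9 isn't valid UTF-8 here
    blob.decode("latin-1")        # 'cafÃ©café' -- mojibake, and no error to warn you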


>In Python 3, you don't care about what they use internally. You don't need to.

I do need to know and I always care. My requirements may be different from those of most others because I write text analysis code and I need to optimize the hell out of every single step. I shiver at the thought that any representation could be chosen for me automatically.

Of course, nothing is stopping me from simply using the bytes type instead of str, but clearly the Python community has decided to go down a road I feel is entirely wrong so I'm not coming along.

>I know that just using UTF-8 everywhere would be cool, but that's not how the world works, for various reasons. One is that UTF-8 is a variable-length encoding, which has performance issues for some operations (like getting the length of a string, or finding the n-th character).

I'm bound to live in a variable length character world unless I decide to use 4 byte code points everywhere, which is prohibitive in terms of memory usage. Memory usage is absolutely critical. Iterating over a few characters now and then to count them is almost always negligible.

The need to index into a string to find the nth character only comes up when I know what I'm looking for. Things like parsing syntax or protocol headers come to mind, and they are always ASCII. I don't remember a situation where I needed to know the nth character of some arbitrary piece of text and repeat that operation in a tight loop.
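For example (a hypothetical header line, just to illustrate): with ASCII framing, byte offsets and character offsets coincide, so the parsing can stay on raw bytes.

    line = b"Content-Length: 42\r\n"                       # hypothetical protocol header
    name, _, value = line.rstrip(b"\r\n").partition(b": ")
    int(value)                                              # 42 -- no codepoint indexing needed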

If one day I find myself in such a situation I will have to convert to an array of code points anyway.


So in your one, specific, performance-limited situation, Python 3's implementation of unicode doesn't work for you. Mostly because you are trying to optimize based on implementation details.

I don't see how this equates to a general purpose language failing at strings, especially when the language isn't particularly focused on performance and optimization. And if memory usage is a concern, I would certainly think languages like Python and Ruby would be out of the running?


>I don't see how this equates to a general purpose language failing at strings

And I don't see where I said it did.

I used to favor a dual Python/C++ strategy, but Python's multithreading limitations and the decisions around unicode have convinced me to move on. It's not like anything has gotten worse in Python 3, it's just that there has been a major change and the opportunity to do the right thing was missed.

I happen to think that UTF-8 everywhere is the right way to go, not just for my particular requirements, but for all applications, because it reduces overall complexity.


I strongly disagree

and I'd like to know what you think the "right thing" would be

I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"... the problem is: every language, and every developer, expects to be able to have random access to codepoints in their strings...

there are some weird exceptions, like Haskell's Data.Text (I think that's due to Haskell's laziness)

would you prefer to have O(n) indexing and slicing of strings... or would you prefer to get rid of these operations altogether?

if the latter, what would you prefer to do? Force developers to use .find() and handle such things manually... or create some compatibility string type restricted to non-composable codepoints?

Getting an implementation out to see it used in the wild might be an interesting endeavor... it'd probably be easier to do in a language that allows you to customize its reader/parser... like some Lisp... Clojure


>I agree that only using UTF-8 would be the right thing, but only if you don't want to have "array of codepoints"

Then we agree entirely. I want all strings to be UTF-8. Period. What I said about an array of codepoints was that I would create one separately from the string if I ever had a requirement to access individual code point positions repeatedly in a tight loop.

>the problem is: every language, and every developer, expects to be able to have random access to codepoints in their strings

If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

>would you prefer to have O(n) indexing and slicing of strings

I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.
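Something like this, say (a hypothetical helper operating directly on UTF-8 bytes, just to sketch the idea):

    def nth_codepoint(data: bytes, n: int) -> str:
        """Return the n-th codepoint (0-based) of UTF-8 `data` in O(n) time."""
        count = -1
        for i, b in enumerate(data):
            if b & 0xC0 != 0x80:                 # a leading byte starts a new codepoint
                count += 1
                if count == n:
                    end = i + 1
                    while end < len(data) and data[end] & 0xC0 == 0x80:
                        end += 1                 # swallow the continuation bytes
                    return data[i:end].decode("utf-8")
        raise IndexError(n)

    nth_codepoint("noël".encode("utf-8"), 2)     # 'ë'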


> If by random access you mean constant time access then those developers would be very disappointed to learn that they cannot do that in Java, C#, C++, JavaScript or Python, unless they happen to know that their string cannot possibly contain any characters outside the ASCII or BMP range.

Actually, you can in Python... and obviously most developers ignore such issues [citation needed]

My point is that most developers don't know these details, a lot of idioms are ingrained... getting them to work with string types properly won't be easy (but a good stdlib would obviously help immensely in this regard)

> I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.

OK, so with your proposal, a hypothetical slicing method on a String class in a Java-like language would have this signature?

    byte[] slice(int start, int end);

I've been fancying the idea of writing a custom String type/protocol for Clojure that deals with the shortcomings of Java's strings... I'll probably give your idea a try as well :)


> Actually, you can in Python...

No, you can only get random access to codepoints, which will break text as soon as combining characters are involved. Even normalizing everything beforehand (which most people don't do) doesn't help, as not all possible combinations have precomposed forms.

Unicode makes random access useless at anything other than destroying text.
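You can see it directly in Python 3 with the decomposed spelling:

    s = "noe\u0308l"                   # e + COMBINING DIAERESIS: 5 codepoints, 4 visible characters
    s[3]                               # '\u0308' -- indexing hands back half a grapheme

    import unicodedata
    unicodedata.normalize("NFC", s)    # 'noël' as 4 codepoints -- but only because this particular
                                       # combination happens to have a precomposed form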

> but a good stdlib would obviously help immensely in this regard

Which is extremely rare, and which Python does not have.


>Actually, you can in Python

You are right (apart from combining characters, as masklinn explained), but as I said, that's only possible if an array of 32-bit ints is used to hold the string data, or if it can be guaranteed that there are no characters from outside ASCII or the BMP. If I understand PEP 393 correctly, what Python 3.3 does is use 32-bit ints to hold the entire string if even one such code point occurs. So if you load a (possibly large) text file into a string and one such code point exists, then the file's size is going to quadruple in memory. All of that is done just to implement one very rare operation efficiently. http://www.python.org/dev/peps/pep-0393/#new-api
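You can see the effect with sys.getsizeof on CPython 3.3+ (exact numbers vary by build; the point is the roughly 4x jump caused by a single astral code point):

    import sys
    ascii_only = "a" * 1000
    with_astral = "a" * 999 + "\U0001F600"   # one emoji forces the 4-bytes-per-char layout
    sys.getsizeof(ascii_only)                # on the order of 1 KB
    sys.getsizeof(with_astral)               # on the order of 4 KB for the same amount of text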


Sounds like you want to use Go. Feels like Python, but with technically correct implementations of concepts.


> Mac OS and Windows use UCS-2

Which parts of Mac OS? You'd have a lot of problems with Emoji support if that were true. To the best of my knowledge, it's UTF-16 everywhere.

Or do you actually mean Mac OS as in Mac OS 9, and not OS X?


Agreed with most of it except:

"because their string API is correct"

Apparently they have a bug in their UTF-7 parser that can lead to invalid unicode strings. I don't know if it's been fixed yet.


It was a bug in the decoding: it raised an unexpected exception, nothing that couldn't be worked around with a check (afaik it didn't crash the interpreter)

and it was fixed over a month ago, just 2 days after it was reported:

http://bugs.python.org/issue19279

Let's avoid spreading FUD, shall we? :)


That would be an implementation flaw, not an API issue.


Indeed.



