The biggest performance issue Clojure has, which isn't mentioned in the article and is fundamentally unsolvable, is that it misses the CPU cache - a lot.
The data structure that drives the immutable variables, the Hash Array Mapped Trie, is efficient in terms of time and space complexity, but it's incredibly CPU cache-unfriendly because it's fragmented by nature - it's not contiguous in RAM. Any operation on a HAMT will be steeped in cache misses, which slows you down by a factor of hundreds to thousands. The longer the trie has been around, the more fragmented it is likely to become. Looping over a HAMT is slower than looping over an array by two or three orders of magnitude.
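To illustrate the fragmentation, a toy REPL sketch (mine, just to show the mechanism):

;; each assoc copies only the handful of nodes on the path to the
;; changed key; everything else is shared with the old map, so
;; successive versions end up as small nodes scattered across the heap
(def m0 (into {} (map (fn [i] [i i])) (range 100000)))
(def m1 (assoc m0 42 :changed))  ; shares almost every node with m0
(def m2 (dissoc m1 99999))       ; shares almost every node with m1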
I don't think there is really a solution to that. It's inherent. Clojure will always be slow, because it's not cache friendly. The architecture of your computer means you physically can't speed it up without sacrificing immutability.
It is mentioned in the article; one of the last optimizations done there was switching to arrays. Also, Clojure's HAMT was designed with the CPU cache in mind and its performance characteristics don't degrade over time. Immutable data structures will be slower than arrays - that's true, but Clojure's standard library works on arrays just fine, as demonstrated in the article.
You're right - I missed that, the author does mention arrays having better access characteristics, although he doesn't really explain why HAMTs specifically are slow.
How does Clojure's HAMT avoid fragmenting over time?
> Clojure standard library works on arrays just fine, as demonstrated in the article.
Right, but then you don't have immutability - so you lose all the guarantees that you originally had with immutable-by-default.
> although he doesn't really explain why HAMTs specifically are slow
Hello, author here :)
HAMTs are certainly slower for "churning" operations, i.e. lots of updates, which is why Clojure exposes the transient API, which gives you limited, localized mutability (some terms and conditions may apply).
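For illustration, a minimal sketch of the transient pattern (a toy example of mine, not from the article):

;; batch-build with localized mutation, then freeze back to a
;; persistent map; the transient must stay confined to one thread
(defn build-index [pairs]
  (persistent!
    (reduce (fn [m [k v]] (assoc! m k v))
            (transient {})
            pairs)))

(build-index [[:a 1] [:b 2]]) ;=> {:a 1, :b 2}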
Where iteration is concerned, the standard library implementation is pretty good. It relies on chunks of 64-element arrays which store the keys and values contiguously. Thus, the APIs which expose direct iteration, Iterator and IReduce(Init), operate on these chunks one at a time. It isn't as fast as primitive arrays, but it's pretty fast.
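For example (a rough sketch of mine, exact timings will vary), the fast path is the direct reduce entry point rather than seq traversal:

;; reduce-kv walks the internal node arrays directly, while
;; first/rest-style traversal allocates a seq step by step
(def m (into {} (map (fn [i] [i (* i i)])) (range 1000000)))

(reduce-kv (fn [acc _ v] (+ acc v)) 0 m) ; direct IKVReduce path
(reduce + (map val m))                   ; seq-based, noticeably slower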
It only requires the input to be an array, but it will return an immutable persistent vector of vectors. So it is very easy to selectively go down to an array in the performance-critical sections while staying immutable and idiomatic in most places.
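For instance, a minimal sketch of that shape (hypothetical function, not the one from the article): the hot loop runs over a primitive array, and callers get back a plain persistent vector:

;; hypothetical hot spot: square every element in a tight array loop,
;; then hand back an ordinary persistent vector at the boundary
(defn squares [^doubles xs]
  (let [n   (alength xs)
        out (double-array n)]
    (dotimes [i n]
      (aset out i (let [x (aget xs i)] (* x x))))
    (vec out)))

(squares (double-array [1.0 2.0 3.0])) ;=> [1.0 4.0 9.0]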
> he doesn't really explain why HAMTs specifically are slow
Take this solution using HAMT vectors with execution time mean : 23.567174 ms
It is now 14 times slower than my prior solution, which iterates over an array, but it is still pretty fast. That's OP's point: things are often sufficiently fast, and when they are not you can selectively optimize those parts, and easily too - see how similar my two solutions are to one another.
Edit: These assume you've run `(set! *unchecked-math* true)` prior to compiling the functions.
Are there any other key/value data structures where insertion and retrieval are less than O(n) in complexity, but where the memory layout is better ordered for cache hits during searches? Maybe good old red-black trees?
I don't think there's an immediately available answer here. The wide branch factor in Clojure's implementation is very iteration-friendly, less so for updates.
B-trees and Hitchhiker trees are worth checking out, but I think a definitive answer will be implementation-dependent even in those cases, i.e. one might win out over the other for a specific branch factor or other tunable parameters.
I wonder if their wide branching factor gives you some cache friendliness.
However, not every use case can benefit from cache optimization and you can use other data structures. It’s not super useful to make generalizations about performance that way.
While I do agree (pointer chasing down a persistent data structure is basically the worst-case use of a CPU), the ease of threading Clojure programs means you can often claw a lot of that penalty back.
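As a rough illustration (my sketch, actual speedups depend on the workload), reducers give you a fork/join-parallel reduce over the same persistent vector with almost no ceremony:

(require '[clojure.core.reducers :as r])

;; fold splits the vector's tree into partitions, reduces them on
;; separate threads, and combines the partial results with +
(def xs (vec (range 10000000)))

(r/fold + xs)   ; parallel reduce
(reduce + xs)   ; single-threaded baseline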
Again, generalisations about performance like that are just not very useful. There are use cases where you get significant benefits from using persistent data structures.
Aeron comes to mind as an example of a high-performance solution that uses them. But a more fundamental reason is that immutability can be part of your domain logic. Think video games like Braid, or business applications where you need to derive a lot of different views and go back and forth between them, or domains where data just is inherently immutable, such as accounting, and so on.
I don't really agree. Cache friendliness is almost always a relevant factor as soon as performance becomes an issue. I get what you're saying but as I see it immutability gives you architectural efficiency and in some cases space efficiency, but rarely processing speed.
> Think video games like braid
Braid doesn't use immutable data structures, it uses a historical record (and immutability would be incompatible with some of the mechanics).[1] The author of Braid is actually quite famous for his dislike of functional languages and immutability. He doesn't even like garbage collection because of the overhead it introduces.
Interestingly, he was talking about data structures for codebase representation (in the context of a compiler) a while back, and someone mentioned HAMTs. I'm definitely curious if they would work well there.
You’re picking out an admittedly bad example in order to miss the point. Different data structures have different tradeoffs. Not every problem lends itself to the same type of optimization, so it’s not helpful to make these generalizations.
Continuing with the example regardless, change records and persistent data structures have different performance characteristics. The former is fast if you move incrementally between states; the latter enables arbitrary access, comparison, and efficient in-memory caching of multiple views.
It would be interesting to explore and measure the trade offs under certain use cases.
I understand your point. I'm saying: the subset of problems that benefit in a performance sense from immutability is very small. The vast majority of the time, cache misses slow down algorithms. That's a perfectly reasonable generalisation and I don't really understand why you think it's untrue.
Re: change records, I believe Braid uses a historical record of absolute states, not a series of diffs. The state at a particular time is recreated from a minimal representation (a couple of arrays). That's much more efficient than throwing multiple iterations of the state in a HAMT.
> The vast majority of the time, cache misses slow down algorithms. That's a perfectly reasonable generalisation and I don't really understand why you think it's untrue.
I'm not saying it's untrue. But not every use case lends itself to CPU cache optimization. Your access patterns might just happen to be arbitrary or fragmented from a layout perspective.
I would argue that this is a very common case, especially for IO-bound applications that operate on tiny pieces of data and have deeply nested dispatch rules, each of which queries some section of memory you can't predict in advance.
Or in other words: The clearer your computation pipeline can be, the more you can benefit from CPU cache optimization.
> Re: change records, I believe Braid uses a historical record of absolute states, not a series of diffs. The state at a particular time is recreated from a minimal representation (a couple of arrays). That's much more efficient than throwing multiple iterations of the state in a HAMT.
You caught my interest, I will have to study this. I can't really imagine how it works from your explanation (my bad, not yours). I assumed when you said "change records" it meant that it would be suboptimal to access an arbitrary state (linear as opposed to constant).
Yes, it is. I'd probably ballpark Clojure at 100x slower than the kinds of low level C# I usually write (gamedev). But threading C#, especially low level imperative C#, is so awful I often don't bother unless it's very important or there's an embarrassingly parallel element (genetic algorithms and simulating sound waves on a 2D grid are two cases I've pulled the trigger where both were true).
This leaves Clojure as 1/4 the overall speed, which seems about right. However that's based on a hugely unscientific gut feeling, because generally I don't touch Clojure if performance matters and I don't touch C# if it doesn't.
By the way, I've implemented persistent data structures on disk for a problem they were particularly useful for. If stalling for a cache miss feels bad, try waiting for an SSD :)
Having to use 24 cores to get to 1/4th of the performance of a single threaded C# application seems particularly awful.
Especially as C# is a relatively pleasant language to use.
It doesn't matter how good threading in Clojure is if you don't even need threading to beat Clojure in another language (also: great scaling is pointless if you start so far behind).
Yes, hammering in a nail with a screwdriver is particularly awful. I love using Clojure where it's appropriate, but I'm not in the least bit tempted to use it for game dev.
I don't have access to the Zulip chat, but the other benchmarks are basically testing allocating in a hot loop. I'm not surprised that doesn't scale linearly, and it's certainly not representative of real world code I've ever written.
If you have code you wrote to achieve something and hit a wall with scaling, I'm happy to take a look.
But code primarily slowed down by pointer chasing down big maps, which is MillenialMan's complaint and fits my own experience, will absolutely be sped up linearly.
A bunch of cores sitting around waiting for the next HAMT node to come in will not interfere with each other in the slightest.
This is entirely hypothesising but I don’t see how these wide trees are awful for the cache. In particular I don’t think they would be much worse than a flat array of objects—the problem is, I think, that objects are references and iterating the things in an array of pointers usually sucks.
For a HAMT with depth (e.g.) 2, you should be able to prefetch the next node in your iteration while you iterate over the current node of 64 elements. Actually iterating through the objects is going to suck, but hopefully you can prefetch enough of them too. (Or maybe HotSpot could inline things enough to improve memory access patterns a bit and take advantage of ILP.) There's still the problem that you can't read main memory sequentially - except that often all the objects in a HAMT were allocated sequentially, so you can: allocation is typically just bumping a pointer, so allocation-time locality tends to correspond to address locality.
Yes, this is a general problem with functional data structures. They have to be fragmented in order to share data. There's also the more nebulous issue that they encourage nesting to leverage the architectural benefits of immutability, which is a complete disaster for cache friendliness.
Replacing the critical path is an option, but that only works for numpy-style situations where you can cleanly isolate the computational work and detach it from the plumbing. If your app is slow because the inefficiencies have added up across the entire codebase (more common, I would argue), that's not an easy option.
You seem to be missing the point lukashrb made regarding using Java data structures. Your claim was "It's inherent. Clojure will always be slow", which is demonstrably false, as you can use interop with your host language (Java, JavaScript, Erlang) to use data structures that don't suffer from this. Wrap how you're using them in Clojure functions and the calling code stays idiomatic while also using cache-friendly data structures.
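A minimal sketch of that wrapping (hypothetical names): keep the mutable host structure behind a couple of plain functions so the call sites stay idiomatic:

(import 'java.util.ArrayList)

;; the ArrayList never leaks past these functions
(defn make-samples ^ArrayList []
  (ArrayList.))

(defn add-sample! [^ArrayList samples x]
  (.add samples x)
  samples)

(defn sample-count [^ArrayList samples]
  (.size samples))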
I understand the point, I'm not arguing that you can't do that or speed up your code by doing that.
Python is also slow, but you can still go fast (in some cases) by calling into C libraries like numpy - but the performance is coming from C, not Python. Python is still slow, it's just not an issue because you're only using it for plumbing.
But Clojure is immutable-by-default, that's the point of the language - it gives you various guarantees so you don't have to worry about mutability bubbling up from lower levels. In order to wrap your heavy path you have to go outside the Clojure structure and sacrifice that guarantee. You do lose structural integrity when you do that, even if you attempt to insulate your mutable structure. The system loses some provability.
Calling C from Python is very different from calling Java code from Clojure. Clojure always runs on its host (Java/JavaScript), so calling functions via interop usually speeds things up with zero overhead, compared to calling C code from Python, which does introduce overhead.
Everything in Clojure comes from Java, it's unfair to compare it to "performance is coming from C, not Python" as Clojure works differently.
That wasn't really my point. There is still a translation process you have to apply to cross the boundary between immutable and native data structures though, and that has its own overhead.
In general I would agree, but a significant part of Clojure's appeal is that it's immutable by default, because that allows you to make certain assumptions about the codebase. Introducing mutable datastructures means you can no longer make those assumptions, so it potentially has much wider ramifications than e.g. calling into C code from Python.
> If your app is slow because the inefficiencies have added up across the entire codebase (more common, I would argue), that's not an easy option.
This is where I would have to disagree, in my experience, that is less common. Generally there are specific places that are hot spots, and you can just optimize those.
It could depend on what kind of application you are writing. I tend to write backend services and web apps, and for those I've not really seen the "inefficiencies add up"; generally, if you profile, you'll find a few places that are your offenders.
"Slow" is also very relative.
(cr/quick-bench (reduce (constantly nil) times-vector))
Execution time mean : 3.985765 ms
(cr/quick-bench (dotimes [i (.size times-arraylist)]
                  (.get times-arraylist i)))
Execution time mean : 775.562574 µs
(cr/quick-bench (dotimes [i (alength times-array)]
                  (aget times-array i)))
Execution time mean : 590.941280 µs
Yes, iterating over a persistent vector of Integers is slower than an ArrayList or an array, about 5 to 8 times slower. But for a vector of size 1000000 it only takes 4 ms to do so.
In Python:
> python3 -m timeit "for i in range(1000000): None"
20 loops, best of 5: 12.1 msec per loop
It takes 12 ms, for example.
So I would say for most uses, persistent vectors serve as a great default mix of speed and correctness.