The biggest performance issue Clojure has, which isn't mentioned in the article and is fundamentally unsolvable, is that it misses the CPU cache - a lot.
The data structure that drives the immutable variables, the Hash Array Mapped Trie, is efficient in terms of time and space complexity, but it's incredibly CPU cache-unfriendly because it's fragmented by nature - it's not contiguous in RAM. Any operation on a HAMT will be steeped in cache misses, which slows you down by a factor of hundreds to thousands. The longer the trie has been around, the more fragmented it is likely to become. Looping over a HAMT is slower than looping over an array by two or three orders of magnitude.
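To illustrate the fragmentation, a toy REPL sketch (mine, just to show the mechanism):

;; each assoc copies only the handful of nodes on the path to the
;; changed key; everything else is shared with the old map, so
;; successive versions end up as small nodes scattered across the heap
(def m0 (into {} (map (fn [i] [i i])) (range 100000)))
(def m1 (assoc m0 42 :changed))  ; shares almost every node with m0
(def m2 (dissoc m1 99999))       ; shares almost every node with m1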
I don't think there is really a solution to that. It's inherent. Clojure will always be slow, because it's not cache friendly. The architecture of your computer means you physically can't speed it up without sacrificing immutability.
It is mentioned in the article; one of the last optimizations done there was switching to arrays. Also, Clojure's HAMT was designed with the CPU cache in mind and its performance characteristics don't degrade over time. Immutable data structures will be slower than arrays - that's true, but Clojure's standard library works on arrays just fine, as demonstrated in the article.
You're right - I missed that, the author does mention arrays having better access characteristics, although he doesn't really explain why HAMTs specifically are slow.
How does Clojure's HAMT avoid fragmenting over time?
> Clojure standard library works on arrays just fine, as demonstrated in the article.
Right, but then you don't have immutability - so you lose all the guarantees that you originally had with immutable-by-default.
> although he doesn't really explain why HAMTs specifically are slow
Hello, author here :)
HAMTs are certainly slower for "churning" operations, i.e. lots of updates, which is why Clojure exposes the transient API, which gives you limited, localized mutability (some terms and conditions may apply).
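For illustration, a minimal sketch of the transient pattern (a toy example of mine, not from the article):

;; batch-build with localized mutation, then freeze back to a
;; persistent map; the transient must stay confined to one thread
(defn build-index [pairs]
  (persistent!
    (reduce (fn [m [k v]] (assoc! m k v))
            (transient {})
            pairs)))

(build-index [[:a 1] [:b 2]]) ;=> {:a 1, :b 2}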
Where iteration is concerned, the standard library implementation is pretty good. It relies on chunks of 64-element arrays which store the keys and values contiguously. Thus, the APIs which expose direct iteration, Iterator and IReduce(Init), operate on these chunks one at a time. It isn't as fast as primitive arrays, but it's pretty fast.
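For example (a rough sketch of mine, exact timings will vary), the fast path is the direct reduce entry point rather than seq traversal:

;; reduce-kv walks the internal node arrays directly, while
;; first/rest-style traversal allocates a seq step by step
(def m (into {} (map (fn [i] [i (* i i)])) (range 1000000)))

(reduce-kv (fn [acc _ v] (+ acc v)) 0 m) ; direct IKVReduce path
(reduce + (map val m))                   ; seq-based, noticeably slower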
It only requires the input to be an array, but it will return an immutable persistent vector of vectors. So it is very easy to selectively go down to an array in the performance-critical sections while staying immutable and idiomatic in most places.
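For instance, a minimal sketch of that shape (hypothetical function, not the one from the article): the hot loop runs over a primitive array, and callers get back a plain persistent vector:

;; hypothetical hot spot: square every element in a tight array loop,
;; then hand back an ordinary persistent vector at the boundary
(defn squares [^doubles xs]
  (let [n   (alength xs)
        out (double-array n)]
    (dotimes [i n]
      (aset out i (let [x (aget xs i)] (* x x))))
    (vec out)))

(squares (double-array [1.0 2.0 3.0])) ;=> [1.0 4.0 9.0]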
> he doesn't really explain why HAMTs specifically are slow
Take this solution using HAMT vectors with execution time mean : 23.567174 ms
It is now 14 times slower than my prior solution, which iterates over an array, but it is still pretty fast. That's OP's point: things are often sufficiently fast, and when they are not you can selectively optimize those parts, and easily too - see how similar my two solutions are to one another.
Edit: These assume you've run `(set! *unchecked-math* true)` prior to compiling the functions.
Are there any other key/value data structures where insertion and retrieval are less than O(n) in complexity, but where the memory layout is better ordered for cache hits during searches? Maybe good old red-black trees?
I don't think there's an immediately available answer here. The wide branch factor in Clojure's implementation is very iteration-friendly, less so for updates.
B-trees and Hitchhiker trees are worth checking out, but I think a definitive answer will be implementation-dependent even in those cases, i.e. one might win out over the other for a specific branch factor or other tunable parameters.
I wonder if their wide branching factor gives you some cache friendliness.
However, not every use case can benefit from cache optimization and you can use other data structures. It’s not super useful to make generalizations about performance that way.
While I do agree (pointer chasing down a persistent data structure is basically the worst-case use of a CPU), the ease of threading Clojure programs means you can often claw a lot of that penalty back.
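As a rough illustration (my sketch, actual speedups depend on the workload), reducers give you a fork/join-parallel reduce over the same persistent vector with almost no ceremony:

(require '[clojure.core.reducers :as r])

;; fold splits the vector's tree into partitions, reduces them on
;; separate threads, and combines the partial results with +
(def xs (vec (range 10000000)))

(r/fold + xs)   ; parallel reduce
(reduce + xs)   ; single-threaded baseline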
Again, generalisations about performance like that are just not very useful. There are use cases where you get significant benefits from using persistent data structures.
Aeron comes to mind as an example of a high-performance solution that uses them. But a more fundamental reason is that immutability can be part of your domain logic. Think video games like Braid, or business applications where you need to derive a lot of different views and go back and forth between them, or domains where data just is inherently immutable, such as accounting, and so on.
I don't really agree. Cache friendliness is almost always a relevant factor as soon as performance becomes an issue. I get what you're saying but as I see it immutability gives you architectural efficiency and in some cases space efficiency, but rarely processing speed.
> Think video games like braid
Braid doesn't use immutable data structures, it uses a historical record (and immutability would be incompatible with some of the mechanics).[1] The author of Braid is actually quite famous for his dislike of functional languages and immutability. He doesn't even like garbage collection because of the overhead it introduces.
Interestingly, he was talking about data structures for codebase representation (in the context of a compiler) a while back, and someone mentioned HAMTs. I'm definitely curious if they would work well there.
You’re picking out an admittedly bad example in order to miss the point. Different data structures have different tradeoffs. Not every problem lends itself to the same type of optimization, so it’s not helpful to make these generalizations.
Continuing with the example regardless, change records and persistent data structures have different performance characteristics. The former is fast if you move incrementally between states; the latter enables arbitrary access, comparison, and efficient in-memory caching of multiple views.
It would be interesting to explore and measure the trade offs under certain use cases.
I understand your point. I'm saying: the subset of problems that benefit in a performance sense from immutability is very small. The vast majority of the time, cache misses slow down algorithms. That's a perfectly reasonable generalisation and I don't really understand why you think it's untrue.
Re: change records, I believe Braid uses a historical record of absolute states, not a series of diffs. The state at a particular time is recreated from a minimal representation (a couple of arrays). That's much more efficient than throwing multiple iterations of the state in a HAMT.
> The vast majority of the time, cache misses slow down algorithms. That's a perfectly reasonable generalisation and I don't really understand why you think it's untrue.
I'm not saying it's untrue. But not every use case lends itself to CPU cache optimization. Your access patterns might just happen to be arbitrary or fragmented from a layout perspective.
I would argue that this is a very common case, especially for IO-bound applications that operate on tiny pieces of data and have deeply nested dispatch rules, each of which queries some section of memory you can't predict in advance.
Or in other words: The clearer your computation pipeline can be, the more you can benefit from CPU cache optimization.
> Re: change records, I believe Braid uses a historical record of absolute states, not a series of diffs. The state at a particular time is recreated from a minimal representation (a couple of arrays). That's much more efficient than throwing multiple iterations of the state in a HAMT.
You caught my interest, I will have to study this. I can't really imagine how it works from your explanation (my bad, not yours). I assumed when you said "change records" it meant that it would be suboptimal to access an arbitrary state (linear as opposed to constant).
Yes, it is. I'd probably ballpark Clojure at 100x slower than the kinds of low level C# I usually write (gamedev). But threading C#, especially low level imperative C#, is so awful I often don't bother unless it's very important or there's an embarrassingly parallel element (genetic algorithms and simulating sound waves on a 2D grid are two cases I've pulled the trigger where both were true).
This leaves Clojure as 1/4 the overall speed, which seems about right. However that's based on a hugely unscientific gut feeling, because generally I don't touch Clojure if performance matters and I don't touch C# if it doesn't.
By the way, I've implemented persistent data structures on disk for a problem they were particularly useful for. If stalling for a cache miss feels bad, try waiting for an SSD :)
Having to use 24 cores to get to 1/4th of the performance of a single threaded C# application seems particularly awful.
Especially as C# is a relatively pleasant language to use.
It doesn't matter how good threading in Clojure is if you don't even need threading to beat Clojure in another language (also: great scaling is pointless if you start so far behind).
Yes, hammering in a nail with a screwdriver is particularly awful. I love using Clojure where it's appropriate, but I'm not in the least bit tempted to use it for game dev.
I don't have access to the Zulip chat, but the other benchmarks are basically testing allocating in a hot loop. I'm not surprised that doesn't scale linearly, and it's certainly not representative of real world code I've ever written.
If you have code you wrote to achieve something and hit a wall with scaling, I'm happy to take a look.
But code primarily slowed down by pointer chasing down big maps, which is MillenialMan's complaint and fits my own experience, will absolutely be sped up linearly.
A bunch of cores sitting around waiting for the next HAMT node to come in will not interfere with each other in the slightest.
This is entirely hypothesising but I don’t see how these wide trees are awful for the cache. In particular I don’t think they would be much worse than a flat array of objects—the problem is, I think, that objects are references and iterating the things in an array of pointers usually sucks.
For a HAMT with depth (e.g.) 2, you should be able to prefetch the next node in your iteration while you iterate over the current node of 64 elements. Actually iterating through the objects is going to suck, but hopefully you can prefetch enough of them too. (Or maybe HotSpot could inline things enough to improve memory access patterns a bit and take advantage of ILP.) There's still the problem that you can't read main memory sequentially - except that often all the objects in a HAMT were allocated sequentially, so you can: allocation is typically just bumping a pointer, so allocation-time locality tends to correspond to address locality.
Yes, this is a general problem with functional data structures. They have to be fragmented in order to share data. There's also the more nebulous issue that they encourage nesting to leverage the architectural benefits of immutability, which is a complete disaster for cache friendliness.
Replacing the critical path is an option, but that only works for numpy-style situations where you can cleanly isolate the computational work and detach it from the plumbing. If your app is slow because the inefficiencies have added up across the entire codebase (more common, I would argue), that's not an easy option.
You seem to be missing the point lukashrb made regarding using Java data structures. Your claim was "It's inherent. Clojure will always be slow", which is demonstrably false, as you can use interop with your host language (Java, JavaScript, Erlang) to use data structures that don't suffer from this. Wrap how you're using them in Clojure functions and the calling code stays idiomatic while also using cache-friendly data structures.
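A minimal sketch of that wrapping (hypothetical names): keep the mutable host structure behind a couple of plain functions so the call sites stay idiomatic:

(import 'java.util.ArrayList)

;; the ArrayList never leaks past these functions
(defn make-samples ^ArrayList []
  (ArrayList.))

(defn add-sample! [^ArrayList samples x]
  (.add samples x)
  samples)

(defn sample-count [^ArrayList samples]
  (.size samples))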
I understand the point, I'm not arguing that you can't do that or speed up your code by doing that.
Python is also slow, but you can still go fast (in some cases) by calling into C libraries like numpy - but the performance is coming from C, not Python. Python is still slow, it's just not an issue because you're only using it for plumbing.
But Clojure is immutable-by-default, that's the point of the language - it gives you various guarantees so you don't have to worry about mutability bubbling up from lower levels. In order to wrap your heavy path you have to go outside the Clojure structure and sacrifice that guarantee. You do lose structural integrity when you do that, even if you attempt to insulate your mutable structure. The system loses some provability.
Calling C from Python is very different from calling Java code from Clojure. Clojure always runs on its host (Java/JavaScript), so calling functions via interop usually speeds things up with zero overhead, compared to calling C code from Python, which does introduce overhead.
Everything in Clojure comes from Java, it's unfair to compare it to "performance is coming from C, not Python" as Clojure works differently.
That wasn't really my point. There is still a translation process you have to apply to cross the boundary between immutable and native data structures though, and that has its own overhead.
In general I would agree, but a significant part of Clojure's appeal is that it's immutable by default, because that allows you to make certain assumptions about the codebase. Introducing mutable datastructures means you can no longer make those assumptions, so it potentially has much wider ramifications than e.g. calling into C code from Python.
> If your app is slow because the inefficiencies have added up across the entire codebase (more common, I would argue), that's not an easy option.
This is where I would have to disagree, in my experience, that is less common. Generally there are specific places that are hot spots, and you can just optimize those.
It could depend on what kind of application you are writing. I tend to write backend services and web apps, and for those I've not really seen the "inefficiencies add up"; generally, if you profile, you'll find a few places that are your offenders.
"Slow" is also very relative.
(cr/quick-bench (reduce (constantly nil) times-vector))
Execution time mean : 3.985765 ms
(cr/quick-bench (dotimes [i (.size times-arraylist)]
                  (.get times-arraylist i)))
Execution time mean : 775.562574 µs
(cr/quick-bench (dotimes [i (alength times-array)]
                  (aget times-array i)))
Execution time mean : 590.941280 µs
Yes, iterating over a persistent vector of Integers is slower than an ArrayList or an array, about 5 to 8 times slower. But for a vector of size 1000000 it only takes 4 ms to do so.
In Python:
> python3 -m timeit "for i in range(1000000): None"
20 loops, best of 5: 12.1 msec per loop
It takes 12 ms, for example.
So I would say for most uses, persistent vectors serve as a great default mix of speed and correctness.