It depends on what you consider a "complex kernel". Futhark is only for regular non-recursive data parallelism, but I'll argue that something like a genetic algorithm that does calibration of market parameters in the Heston model[0] is pretty complex. It comprises multiple levels of parallelism and several kernels (last I checked, the core work is done in four kernels which are invoked in a loop).

But more importantly, this benchmark is written as a composition of two reusable parts (a genetic algorithm that is parametric in its objective function, and a specific objective function that does option pricing) that are then put together in an efficient and automatic way by the compiler. You literally could not write it this way in OpenCL or CUDA (modulo extreme amounts of template metaprogramming in the latter). While you could certainly write a specialised GPU program that did exactly this calibration, and probably outperform Futhark, you would not be able to structure it as reusable components without significant performance loss. This, I think, is the main advantage of using a high-level language together with an optimising compiler.
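
To make the "two reusable parts" structure concrete, here is a rough sketch in plain Python rather than Futhark. It is not the benchmark's code: `genetic_minimise`, `sphere`, and all parameter values are hypothetical names and numbers picked for illustration, and the toy objective stands in for the option-pricing error. Only the shape matters: the optimiser knows nothing about the objective it is handed.

```python
import random
from typing import Callable, List, Sequence

Objective = Callable[[Sequence[float]], float]


def genetic_minimise(objective: Objective, dim: int, pop_size: int = 40,
                     generations: int = 60, seed: int = 0) -> List[float]:
    """Toy genetic algorithm, parametric in `objective` (lower is better)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Scoring every candidate is the part a GPU compiler would parallelise.
        ranked = sorted(population, key=objective)
        survivors = ranked[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Uniform crossover followed by a small Gaussian mutation.
            child = [rng.choice(pair) + rng.gauss(0.0, 0.1)
                     for pair in zip(a, b)]
            children.append(child)
        population = survivors + children
    return min(population, key=objective)


def sphere(params: Sequence[float]) -> float:
    """Stand-in objective (assumption); the benchmark's would be a pricing error."""
    return sum(x * x for x in params)


best = genetic_minimise(sphere, dim=3)
```

In Python this just runs sequentially, so it says nothing about performance. The point is that the equivalent higher-order structure in Futhark gets fused and specialised by the compiler, whereas in hand-written OpenCL or CUDA the objective would typically have to be baked into the kernels.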

[0]: https://github.com/diku-dk/futhark-benchmarks/tree/master/mi...