
The question isn't clearly written down anywhere, that's why. Presumably actual candidates would have been given more info over the phone or email. Part of the "challenge" is reverse engineering their Python; unclear if that's intentional.

If you look at the top of perf_takehome.py there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel; it's not an OS kernel:

    Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
    available time, as measured by test_kernel_cycles on a frozen separate copy
    of the simulator.
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line:

https://github.com/anthropics/original_performance_takehome/...

This function is described only as:

    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be, because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. The comment also says this is an assembly version of the Python function reference_kernel2, which is found in problem.py.
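
To make that pattern concrete, here's a made-up toy version of the structure. This is not the actual API in perf_takehome.py, just an illustration of the idea: a builder appends instruction tuples to a list, and an interpreter executes that list while counting cycles.

    # Hypothetical sketch, not the repo's real classes or opcodes.
    class ToyKernelBuilder:
        def __init__(self):
            self.instrs = []                  # the "program": a list of tuples

        def emit(self, *instr):
            self.instrs.append(instr)

        def build_kernel(self):
            # scalar load / add / store, one instruction at a time
            self.emit("load", 0, 100)         # scratch[0] = mem[100]
            self.emit("addi", 1, 0, 7)        # scratch[1] = scratch[0] + 7
            self.emit("store", 100, 1)        # mem[100] = scratch[1]
            return self.instrs

    def run(instrs, mem, scratch):
        # toy interpreter: one instruction per cycle, no parallelism
        cycles = 0
        for op, *args in instrs:
            if op == "load":
                dst, addr = args
                scratch[dst] = mem[addr]
            elif op == "addi":
                dst, src, imm = args
                scratch[dst] = scratch[src] + imm
            elif op == "store":
                addr, src = args
                mem[addr] = scratch[src]
            cycles += 1
        return cycles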

What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.

At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago (Itanium) and it never took off; since then the concept has been largely dead, though I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA, so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that work is done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine: it's also multi-core and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple pieces of data simultaneously.
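
The cycle-count consequence is easy to show with made-up data structures (again, not this simulator's real encoding): the unit of execution is a bundle of independent instructions, the cost is roughly the number of bundles, so packing is everything.

    # Made-up illustration of the VLIW idea, not this simulator's format.
    scalar_program = [                         # 4 instructions -> ~4 cycles
        ("load", 0, 100),
        ("load", 1, 101),
        ("add",  2, 0, 1),
        ("store", 200, 2),
    ]

    vliw_program = [                           # 3 bundles -> ~3 cycles
        [("load", 0, 100), ("load", 1, 101)],  # independent loads share a bundle
        [("add",  2, 0, 1)],
        [("store", 200, 2)],
    ]

It's the programmer's (or compiler's) job to guarantee that the instructions in a bundle don't depend on each other; the hardware just trusts the schedule.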

This machine doesn't have registers or cache but it does have "scratch space", so you can use the vector instructions to load data into a series of 32-bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
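
As a made-up illustration (not the simulator's actual instructions), broadcasting and a vector add might look like this, treating a "vector" as a run of consecutive 32-bit scratch slots:

    VLEN = 8
    scratch = [0] * 64

    def vbroadcast(dst, value):
        # copy one scalar into VLEN consecutive 32-bit scratch slots
        for lane in range(VLEN):
            scratch[dst + lane] = value & 0xFFFFFFFF

    def vadd(dst, a, b):
        # one "instruction" that operates on all VLEN lanes at once
        for lane in range(VLEN):
            scratch[dst + lane] = (scratch[a + lane] + scratch[b + lane]) & 0xFFFFFFFF

    vbroadcast(0, 0xFF)   # scratch[0..7] all become 0x000000FF
    vadd(8, 0, 0)         # scratch[8..15] = 0x1FE in every lane, in a single step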

And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.

So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.

Does that help? Sounds like a fun exercise :)

Edit: I just checked and Google TPUs are much more VLIW-like, so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.





It does seem a bit of a strange challenge - a bit reminiscent of high school math problems where understanding the question was as much a part of it as actually solving the problem once you understood it.

Since the focus of the challenge appears(?) intended to be optimization, not reverse engineering, it's a bit odd that they don't give a clear statement of what the kernel is meant to be computing. Perhaps the challenge is intended to be a combination of the two, but then getting the reverse-engineering part right becomes a gate for the optimization part, or else you'll be solving the wrong problem.

Given the focus on results achieved by Opus 4.5, maybe that's the main point - to show how well Opus can reverse engineer something like this. If they gave the actual clear problem statement, then maybe you could brute force an optimal solution using tree search.


This isn't "reverse engineering" it's merely "being able to read fairly simple code you didn't write". A much simpler version of the kernel is provided at the end of problem.py as reference_kernel2.

If you can't make sense of such a small codebase or don't immediately recognize the algorithm that's being used (I'm guilty of the latter) then you presumably aren't someone that they want to hire.


Fair enough, and there are clues in the comments too, but why not just provide the specification of the kernel (inputs and outputs) as part of the problem?

They do. They provide reference_kernel which shows the algorithm itself, build_mem_image which shows the data format you will be working with, and finally reference_kernel2 which implements said algorithm on said data format.

They then provide you with a very naive implementation that runs on their (very simple) VLIW architecture that you are to optimize.

If at the end of that someone is still lost, I think it is safe to say the intent was that such a person should fail.


Well, yes, they have a reference implementation as documentation, just as they have the simulator as documentation for the ISA ...

The problem is about pipelining memory loads and ALU operations, so why not just give clear documentation and state the task rather than "here's a kernel - optimize it"? ¯\_(ツ)_/¯
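
For anyone who hasn't seen the term: pipelining here just means overlapping a load's latency with useful ALU work. A rough sketch with a made-up 2-cycle load latency (nothing specific to this simulator):

    LOAD_LATENCY = 2   # cycles before a loaded value is usable (made up)
    ALU_LATENCY = 1

    def naive_cycles(n):
        # each iteration issues a load, waits for it, then does the add
        return n * (LOAD_LATENCY + ALU_LATENCY)

    def pipelined_cycles(n):
        # the load for element i+1 is issued while element i is being added,
        # so after the first load its latency hides behind the ALU work
        return LOAD_LATENCY + n * ALU_LATENCY

    print(naive_cycles(1000))      # 3000
    print(pipelined_cycles(1000))  # 1002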


Presumably that is only one of two purposes, with the other being to test your ability to efficiently read, understand, and edit low level code that you didn't write. I imagine you'd regularly run into raw PTX if you worked for them in the relevant capacity.

And perhaps a third purpose is to use the simulator to test your ability to reason about hardware that you are only just getting familiar with.


I just threw this prompt at Gemini, and it seems (I haven't analyzed the problem to see if it is correct) to be able to extract a clear understanding of the problem, and a specification for the kernel.

"Can you "reverse engineer" what the kernel in this optimization exercise is actually doing - write a specification for it?

https://github.com/anthropics/original_performance_takehome"

Gemini says it's doing inference on a random forest - taking a batch of inputs, running each one through each decision tree, and for each input outputting the sum of these decision tree outputs - the accumulated evidence.
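
If that's right (I haven't verified it either), the computation would be roughly the following in plain Python. This is a sketch of the algorithm as described, not the repo's reference_kernel, and the node layout is made up:

    def eval_tree(tree, x):
        # tree: list of nodes, each either ("split", feature, threshold, left, right)
        # or ("leaf", value); left/right indices point back into the same list
        i = 0
        while True:
            node = tree[i]
            if node[0] == "leaf":
                return node[1]
            _, feat, thresh, left, right = node
            i = left if x[feat] <= thresh else right

    def forest_kernel(forest, batch):
        # for each input in the batch, sum the outputs of every tree
        return [sum(eval_tree(tree, x) for tree in forest) for x in batch]

    forest = [
        [("split", 0, 0.5, 1, 2), ("leaf", -1.0), ("leaf", 1.0)],
        [("split", 1, 2.0, 1, 2), ("leaf", 0.5), ("leaf", -0.5)],
    ]
    print(forest_kernel(forest, [[0.3, 3.0], [0.9, 1.0]]))   # [-1.5, 1.5]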


This is a nice writeup, thanks. Another commenter said it would've taken them 2h just to sketch out ideas; sans LLMs it would've taken me more than 2h just to collect all this info, let alone start optimizing it.

It took me about 10 minutes to generate that writeup the old fashioned 100% organic way, because one of the things that's unspecified is whether you're allowed to use AI to help solve it! So I assumed as it's a job interview question you're not allowed, but now I see other comments saying it was allowed. That would let you get much further.

I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.


I've not written a VM before, but the comments in perf_takehome.py and problem.py explain the basics of this.

I gleaned about half of this comment in a few minutes of just skimming the code and reading the comments on the functions and classes. There's only 500 lines of code really (the rest is the benchmark framework).


Same thought. I doubt they provided additional explanation to candidates - it seems that basic code literacy within the relevant domain is one of the first things being tested.

On the whole I don't think I'd perform all that well on this task given a short time limit but it seems to me to be an extremely well designed task given the stated context. The reference kernel easily fits on a single screen and even the intrinsic version almost does. I think this task would do a good job filtering the people they don't want working for them (and it seems quite likely that I'm borderline or maybe worse by their metric).


I think calling VLIW "an abandoned design" is somewhat of an exaggeration; such architectures are pretty common for embedded audio processing.

Worth adding on that note:

From JAX to VLIW: Tracing a Computation Through the TPU Compiler Stack, https://patricktoulme.substack.com/p/from-jax-to-vliw-tracin...

Google’s Training Chips Revealed: TPUv2 and TPUv3, HotChips 2020, https://hc32.hotchips.org/assets/program/conference/day2/Hot...

Ten Lessons From Three Generations Shaped Google’s TPUv4i, ISCA 2021, https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf


x86-64 SSE and AVX are also SIMD

Sure. I did mention DSPs. But how many people write code for DSPs?

    Sounds like a fun exercise :)
I'll be honest, that sounds like the opposite of fun since the worst parts of my job are touching the parts of a Python codebase that are untyped. The sad part is this work codebase isn't even that old, maybe a few years, and the developers definitely should have known better if they had anyone capable leading them. Alas, they're all gone now.

Harder than figuring out the instruction set for some exotic CPU are definitely the giant untyped dicts/lists common in data science code.


"Performance can be optimized by not using python."

On the one hand, this exercise probably reflects a realistic task. Daily engineering work comprises a lot of reverse engineering and debugging of messy code. On the other hand, this does not seem very suitable as an isolated assignment. The lack of code base-specific context has a lot of potential for frustration. I wonder what they really tested on the candidates, and whether this was what they wanted to filter for.

> The lack of code base-specific context has a lot of potential for frustration.

I think that's one of the intentional points. Being able to quickly understand what the provided source code is doing.


> but which requires all parallelism to be statically declared ahead of time

this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a generic CPU since you can "waste" 30 min figuring out the perfect routing/sequencing of operations, instead of doing it in the CPU in nanoseconds/cycles

another benefit is you can throw away all the CPU out-of-order/branch prediction logic and put useful matrix multipliers in its place


Wow! Thanks for the explanation :)


