
The question isn't clearly written down anywhere, that's why. Presumably actual candidates would have been given more info over the phone or email. Part of the "challenge" is reverse engineering their Python; unclear if that's intentional.

If you look at the top of perf_takehome.py there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel; it's not an OS kernel:

    Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
    available time, as measured by test_kernel_cycles on a frozen separate copy
    of the simulator.
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line:

https://github.com/anthropics/original_performance_takehome/...

This function is described only as:

    Like reference_kernel2 but building actual instructions.
    Scalar implementation using only scalar ALU and load/store.
The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be, because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. The comment also says this is an assembly version of the Python function reference_kernel2, which is found in problem.py.
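
To make that pattern concrete, here's a made-up toy version of the structure. This is not the actual API in perf_takehome.py, just an illustration of the idea: a builder appends instruction tuples to a list, and an interpreter executes that list while counting cycles.

    # Hypothetical sketch, not the repo's real classes or opcodes.
    class ToyKernelBuilder:
        def __init__(self):
            self.instrs = []                  # the "program": a list of tuples

        def emit(self, *instr):
            self.instrs.append(instr)

        def build_kernel(self):
            # scalar load / add / store, one instruction at a time
            self.emit("load", 0, 100)         # scratch[0] = mem[100]
            self.emit("addi", 1, 0, 7)        # scratch[1] = scratch[0] + 7
            self.emit("store", 100, 1)        # mem[100] = scratch[1]
            return self.instrs

    def run(instrs, mem, scratch):
        # toy interpreter: one instruction per cycle, no parallelism
        cycles = 0
        for op, *args in instrs:
            if op == "load":
                dst, addr = args
                scratch[dst] = mem[addr]
            elif op == "addi":
                dst, src, imm = args
                scratch[dst] = scratch[src] + imm
            elif op == "store":
                addr, src = args
                mem[addr] = scratch[src]
            cycles += 1
        return cycles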

What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.

At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago (Itanium) and it never took off; since then the concept has been largely dead, though I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA, so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that work is done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine: it's also multi-core and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple pieces of data simultaneously.
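
The cycle-count consequence is easy to show with made-up data structures (again, not this simulator's real encoding): the unit of execution is a bundle of independent instructions, the cost is roughly the number of bundles, so packing is everything.

    # Made-up illustration of the VLIW idea, not this simulator's format.
    scalar_program = [                         # 4 instructions -> ~4 cycles
        ("load", 0, 100),
        ("load", 1, 101),
        ("add",  2, 0, 1),
        ("store", 200, 2),
    ]

    vliw_program = [                           # 3 bundles -> ~3 cycles
        [("load", 0, 100), ("load", 1, 101)],  # independent loads share a bundle
        [("add",  2, 0, 1)],
        [("store", 200, 2)],
    ]

It's the programmer's (or compiler's) job to guarantee that the instructions in a bundle don't depend on each other; the hardware just trusts the schedule.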

This machine doesn't have registers or cache but it does have "scratch space", so you can use the vector instructions to load data into a series of 32-bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
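
As a made-up illustration (not the simulator's actual instructions), broadcasting and a vector add might look like this, treating a "vector" as a run of consecutive 32-bit scratch slots:

    VLEN = 8
    scratch = [0] * 64

    def vbroadcast(dst, value):
        # copy one scalar into VLEN consecutive 32-bit scratch slots
        for lane in range(VLEN):
            scratch[dst + lane] = value & 0xFFFFFFFF

    def vadd(dst, a, b):
        # one "instruction" that operates on all VLEN lanes at once
        for lane in range(VLEN):
            scratch[dst + lane] = (scratch[a + lane] + scratch[b + lane]) & 0xFFFFFFFF

    vbroadcast(0, 0xFF)   # scratch[0..7] all become 0x000000FF
    vadd(8, 0, 0)         # scratch[8..15] = 0x1FE in every lane, in a single step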

And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.

So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.

Does that help? Sounds like a fun exercise :)

Edit: I just checked and Google TPUs are much more VLIW-like, so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.





It does seem a bit of a strange challenge - a bit reminiscent of high school math problems where understanding the question was as much a part of it as actually solving the problem once you understood it.

Since the focus of the challenge appears(?) intended to be optimization, not reverse engineering, it's a bit odd that they don't give a clear statement of what the kernel is meant to be computing. Perhaps the challenge is intended to be a combination of the two, but then getting the reverse-engineering part right becomes a gate for the optimization part, or else you'll be solving the wrong problem.

Given the focus on results achieved by Opus 4.5, maybe that's the main point - to show how well Opus can reverse engineer something like this. If they gave the actual clear problem statement, then maybe you could brute force an optimal solution using tree search.


This isn't "reverse engineering" it's merely "being able to read fairly simple code you didn't write". A much simpler version of the kernel is provided at the end of problem.py as reference_kernel2.

If you can't make sense of such a small codebase or don't immediately recognize the algorithm that's being used (I'm guilty of the latter) then you presumably aren't someone that they want to hire.


Fair enough, and there are clues in the comments too, but why not just provide the specification of the kernel (inputs and outputs) as part of the problem?

They do. They provide reference_kernel which shows the algorithm itself, build_mem_image which shows the data format you will be working with, and finally reference_kernel2 which implements said algorithm on said data format.

They then provide you with a very naive implementation that runs on their (very simple) VLIW architecture that you are to optimize.

If at the end of that someone is still lost, I think it is safe to say the intent was that such a person should fail.


Well, yes, they have a reference implementation as documentation, just as they have the simulator as documentation for the ISA ...

The problem is about pipelining memory loads and ALU operations, so why not just give clear documentation and state the task rather than "here's a kernel - optimize it"? ¯\_(ツ)_/¯
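
For anyone who hasn't seen the term: pipelining here just means overlapping a load's latency with useful ALU work. A rough sketch with a made-up 2-cycle load latency (nothing specific to this simulator):

    LOAD_LATENCY = 2   # cycles before a loaded value is usable (made up)
    ALU_LATENCY = 1

    def naive_cycles(n):
        # each iteration issues a load, waits for it, then does the add
        return n * (LOAD_LATENCY + ALU_LATENCY)

    def pipelined_cycles(n):
        # the load for element i+1 is issued while element i is being added,
        # so after the first load its latency hides behind the ALU work
        return LOAD_LATENCY + n * ALU_LATENCY

    print(naive_cycles(1000))      # 3000
    print(pipelined_cycles(1000))  # 1002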


Presumably that is only one of two purposes, with the other being to test your ability to efficiently read, understand, and edit low level code that you didn't write. I imagine you'd regularly run into raw PTX if you worked for them in the relevant capacity.

And perhaps a third purpose is to use the simulator to test your ability to reason about hardware that you are only just getting familiar with.


I just threw this prompt at Gemini, and it seems (I haven't analyzed the problem to see if it is correct) to be able to extract a clear understanding of the problem, and a specification for the kernel.

"Can you "reverse engineer" what the kernel in this optimization exercise is actually doing - write a specification for it?

https://github.com/anthropics/original_performance_takehome"

Gemini says it's doing inference on a random forest - taking a batch of inputs, running each one through each decision tree, and for each input outputting the sum of these decision tree outputs - the accumulated evidence.
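
If that's right (I haven't verified it either), the computation would be roughly the following in plain Python. This is a sketch of the algorithm as described, not the repo's reference_kernel, and the node layout is made up:

    def eval_tree(tree, x):
        # tree: list of nodes, each either ("split", feature, threshold, left, right)
        # or ("leaf", value); left/right indices point back into the same list
        i = 0
        while True:
            node = tree[i]
            if node[0] == "leaf":
                return node[1]
            _, feat, thresh, left, right = node
            i = left if x[feat] <= thresh else right

    def forest_kernel(forest, batch):
        # for each input in the batch, sum the outputs of every tree
        return [sum(eval_tree(tree, x) for tree in forest) for x in batch]

    forest = [
        [("split", 0, 0.5, 1, 2), ("leaf", -1.0), ("leaf", 1.0)],
        [("split", 1, 2.0, 1, 2), ("leaf", 0.5), ("leaf", -0.5)],
    ]
    print(forest_kernel(forest, [[0.3, 3.0], [0.9, 1.0]]))   # [-1.5, 1.5]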


This is a nice writeup, thanks. Another commenter said it would've taken them 2h just to sketch out ideas; sans LLMs it would've taken me more than 2h just to collect all this info, let alone start optimizing it.

It took me about 10 minutes to generate that writeup the old fashioned 100% organic way, because one of the things that's unspecified is whether you're allowed to use AI to help solve it! So I assumed as it's a job interview question you're not allowed, but now I see other comments saying it was allowed. That would let you get much further.

I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.


I've not written a VM before, but the comments in perf_takehome.py and problem.py explain the basics of this.

I gleaned about half of this comment in a few minutes of just skimming the code and reading the comments on the functions and classes. There's only 500 lines of code really (the rest is the benchmark framework).


Same thought. I doubt they provided additional explanation to candidates - it seems that basic code literacy within the relevant domain is one of the first things being tested.

On the whole I don't think I'd perform all that well on this task given a short time limit but it seems to me to be an extremely well designed task given the stated context. The reference kernel easily fits on a single screen and even the intrinsic version almost does. I think this task would do a good job filtering the people they don't want working for them (and it seems quite likely that I'm borderline or maybe worse by their metric).


I think calling VLIW "an abandoned design" is somewhat of an exaggeration; such architectures are pretty common for embedded audio processing.

Worth adding on that note:

From JAX to VLIW: Tracing a Computation Through the TPU Compiler Stack, https://patricktoulme.substack.com/p/from-jax-to-vliw-tracin...

Google’s Training Chips Revealed: TPUv2 and TPUv3, HotChips 2020, https://hc32.hotchips.org/assets/program/conference/day2/Hot...

Ten Lessons From Three Generations Shaped Google’s TPUv4i, ISCA 2021, https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf


x86-64 SSE and AVX are also SIMD

Sure. I did mention DSPs. But how many people write code for DSPs?

    Sounds like a fun exercise :)
I'll be honest, that sounds like the opposite of fun since the worst parts of my job are touching the parts of a Python codebase that are untyped. The sad part is this work codebase isn't even that old, maybe a few years, and the developers definitely should have known better if they had anyone capable leading them. Alas, they're all gone now.

Harder than figuring out the instruction set for some exotic CPU are definitely the giant untyped dicts/lists common in data science code.


"Performance can be optimized by not using python."

On the one hand, this exercise probably reflects a realistic task. Daily engineering work comprises a lot of reverse engineering and debugging of messy code. On the other hand, this does not seem very suitable as an isolated assignment. The lack of code base-specific context has a lot of potential for frustration. I wonder what they really tested on the candidates, and whether this was what they wanted to filter for.

> The lack of code base-specific context has a lot of potential for frustration.

I think that's one of the intentional points. Being able to quickly understand what the provided source code is doing.


> but which requires all parallelism to be statically declared ahead of time

this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a generic CPU since you can "waste" 30 min figuring out the perfect routing/sequencing of operations, instead of doing it in the CPU in nanoseconds/cycles

another benefit is you can throw away all the CPU out-of-order/branch prediction logic and put useful matrix multipliers in its place


Wow! Thanks for the explanation :)


