But why would I want to reliably produce similar code? The underspec is deliberate. Maybe I don't care about the name of the column as long as it's reasonable.
How to represent prices: same. This is computer nonsense. There's one right way to do it, the LLM knows that way, it should do it.
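(For concreteness, the usual right way is decimal or integer-cent arithmetic rather than binary floats; a minimal Python illustration, nothing more:)

    from decimal import Decimal

    # Binary floats pick up rounding error on money values; Decimal stays exact.
    print(0.1 + 0.2)             # 0.30000000000000004
    print(Decimal("19.99") * 3)  # 59.97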
How to do it scalably: same. Since the file is named, the agent can just look at its size to decide on the best implementation.
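As a sketch of what I mean, the whole "scalably" decision can hinge on one number (the cutoff and the filename here are arbitrary examples):

    import os

    # Stream big files row by row, load small ones into memory; 100 MB is an arbitrary cutoff.
    stream = os.path.getsize("products.csv") > 100 * 1024 * 1024
    print("stream row by row" if stream else "load into memory")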
Your alternative spec is too detailed and has many details that can be easily inferred by the AI, like defaulting to UTF-8 and comma delimited. This is my point. There are many possible implementations in code, some better and some worse, and we shouldn't need to spell out all that detail in English when so much of it is just about implementation quality.
>But why would I want to reliably produce similar code?
If you're doing a one-shot CSV then an LLM or a custom program is the wrong way to do it. Any spreadsheet editor can do this task instantly with 4 symbols.
Assuming you want a repeatable process you need to define that repeatable process with enough specificity to make it repeatable and reliable.
You can do this in a formal language created for this, or you can invent your own English-like specification language.
You can create a very loose specification and let someone else, a programmer or an LLM, define the reliable, repeatable process for you. If you go with a junior programmer or an LLM though, you have to verify that the process they designed is actually reliable and repeatable. Many times it won't be and you'll need to make changes.
It's easier to write a few lines of Python than to go through that process--unless you don't already know how to program, in which case you can't verify the output anyway.
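For this task the few lines look roughly like this (the column names, filenames and plain comma-separated UTF-8 are assumptions; no currency symbols handled):

    import csv
    from decimal import Decimal

    # Stream the input and write a copy with a "total" column appended.
    with open("products.csv", newline="") as src, \
         open("products_with_total.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["total"])
        writer.writeheader()
        for row in reader:
            row["total"] = str(Decimal(row["price"]) * Decimal(row["quantity"]))
            writer.writerow(row)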
That's not to say that I don't see beneficial use cases for AI, this just isn't one of them.
>This is my point. There are many possible implementations in code, some better and some worse, and we shouldn't need to spell out all that detail in English when so much of it is just about implementation quality.
If you don't actually care about implementation quality or correctness, sure. You should, and LLMs cannot reliably pick the correct implementation details. They aren't even close to being able to do that.
The only people who are able to produce working software with LLMs are either writing very, very detailed specifications, to the point where they aren't operating at a much higher level than Python, or generating something small like a function at a time.
Btw I had a Claude Sonnet 4 agent try your prompt.
It produced a 90 line python file in 7 minutes that reads the entire file into memory, performs floating point multiplication, doesn't correctly display the money values, and would crash if the price column ever had any currency symbols.
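The floating point and currency-symbol problems are easy to demonstrate in isolation:

    print(0.1 * 3)   # 0.30000000000000004 -- floating point is the wrong tool for money
    float("$19.99")  # raises ValueError: a stray currency symbol kills a naive parser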
> I had a Claude Sonnet 4 agent try your prompt. It produced a 90 line python file in 7 minutes that reads the entire file into memory, performs floating point multiplication, doesn't correctly display the money values, and would crash if the price column ever had any currency symbols.
OK, that ups the stakes :)
I'm working on my own agent at the moment and gave it this task. I first had it generate a 10M row CSV with randomized product codes, prices and quantities.
It has two modes: fast and high quality. In fast mode I gave it the task "add to products.csv a column containing the multiple of the price and quantity columns". In 1m21s it wrote an AWK script that processed the file in a streaming manner and used it to add the column, with a backup file. So the solution did scale but it didn't avoid the other edge cases.
Then I tried the higher quality mode with the slightly generalized prompt "write a program that adds a column to a CSV file containing the multiple of the price and quantity columns". In this mode it generates a spec from the task, then reviews its own spec looking for potential bugs and edge cases, then incorporates its own feedback to update the spec, then implements the spec (all in separate contexts). This is with GPT-5.
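The shape of that loop, as a rough sketch (call_model here is just a stand-in for a fresh-context model call, not my agent's actual code):

    # Rough outline of the quality-mode flow; call_model is a stand-in, not a real API.
    def call_model(prompt: str) -> str:
        raise NotImplementedError("stand-in for a fresh-context LLM call")

    def quality_mode(task: str) -> str:
        spec = call_model(f"Write a spec for this task, including edge cases:\n{task}")
        review = call_model(f"Review this spec for bugs and missing edge cases:\n{spec}")
        spec = call_model(f"Revise the spec to address this review:\n{spec}\n\n{review}")
        return call_model(f"Implement this spec:\n{spec}")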
The spec it settled on takes into account all those edge cases and many more, e.g. it thought about byte order marks, non-float math, safe overwrite, scientific notation, column name collisions, which exit codes to use, and more (a rough sketch of a couple of these follows the timings below). It considered dealing with currency symbols but decided to put that out of scope (I could have edited the spec to override its decision here, but didn't). Time elapsed:
1. Generating the test set, 1m 9sec
2. Fast mode, 1m 21sec (it lost time due to a header quoting issue it then had to circle back and fix)
3. Quality mode, 48sec on initial spec, 2m on reviewing the spec, 1m 30sec on updating the spec (first attempt to patch in place failed, it tried again by replacing the entire file), 4m on implementing the spec - this includes time in which it tested its own solution and reviewed the output.
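To give a flavour of two of those edge cases (BOM handling and safe overwrite), this is roughly the kind of thing the spec calls for -- the sketch is mine, not the generated program:

    import csv, os, tempfile

    # utf-8-sig strips a leading BOM if present; writing to a temp file and then
    # os.replace() means a crash can't leave the original half-overwritten.
    def rewrite_csv(path: str) -> None:
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".csv")
        with open(path, newline="", encoding="utf-8-sig") as src, \
             os.fdopen(fd, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            for row in csv.reader(src):
                writer.writerow(row)  # the real program would append the computed column here
        os.replace(tmp, path)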
I believe the results to be correct and the program to tackle not only all the edge cases raised in this thread but others too. And yet the prompt was no more complex than the one I gave originally, and the results are higher quality than I'd have bothered to write myself.
I don't know which agent you used, but right now we're not constrained by model intelligence. Claude is a smart model, I'm sure it could have done the same, but the workflows the agents are implementing are full of very low-hanging fruit.
Your spec isn’t actually a spec because it doesn’t produce the same software between runs.
The prompt is fantasy, all the “computer stuff” is reality. The computer stuff is the process that is actually running. If it’s not possible to look at your prompt and know fairly accurately what the final process is going to look like, you are not operating at a higher level of abstraction, you are asking a Genie to do your work for you and maybe it gets it right.
Your prompt produces a spec: the actual code. Now that code is the spec, but you have to spend the time reading it closely enough to understand what that spec actually says, since you didn't write it.
Then you need to go through the new spec and make sure you're happy with all of the decisions the LLM made. Do they make sense? Are there any requirements you need that it missed? Do you actually need to handle all of the edge cases it did handle?
>many more
The resulting code is almost certainly over-engineered if it's handling many more. Byte order marks, name collisions, etc. What you should do is settle on the column names beforehand.
This is a very common issue with junior developers. I call it "what-if driven development". Which again is why the only people having success with LLM coding are writing highly detailed specs that are very close to a programming language, or they are generating something small like a function at a time.