>However, I should note: without access to the actual crash file, the specific curl version, or ability to reproduce the issue, I cannot verify this is a valid vulnerability versus expected behavior (some tools intentionally skip cleanup on exit for performance). The 2-byte leak is also very small, which could indicate this is a minor edge case or even intended behavior in certain code paths.
Even when biased towards positivity, it still gives me the correct answer.
Given a neutral "judge this report" prompt we get "This is a low-severity, non-security issue being reported as if it were a security vulnerability," along with a lot more detail as to why.
So positively, neutrally, and negatively biased prompts all result in the correct answer: this report is bogus.
Yet this is not reproducible. This is the whole issue with LLMs: they are random.
You cannot trust that it'll do a good job on all reports, so you'll have to manually review the LLM's verdicts anyway, or hope that no real issue got dismissed (a false negative) and no fake one got waved through (a false positive).
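To make that reproducibility problem concrete, here's a rough sketch of the kind of test I mean: run the exact same judging prompt repeatedly under each framing and tally how often the verdict flips. The model name, prompt wording, report placeholder, and use of the OpenAI Python SDK are all just assumptions for illustration, not what was actually used.

```python
# Sketch: re-run the same "judge this report" prompt several times per
# framing and tally the verdicts. Model name, prompt text, and report
# contents are placeholders; assumes the OpenAI Python SDK is installed
# and OPENAI_API_KEY is set in the environment.
from collections import Counter
from openai import OpenAI

client = OpenAI()

REPORT = "...the HackerOne report text would go here..."

PROMPTS = {
    "positive": "This looks like a promising security report. Judge it:\n",
    "neutral":  "Judge this report:\n",
    "negative": "This report is probably bogus. Judge it:\n",
}

def judge(framing: str) -> str:
    """Ask the model for a one-word verdict under a given framing."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer with exactly one word: VALID or BOGUS."},
            {"role": "user", "content": PROMPTS[framing] + REPORT},
        ],
    )
    return resp.choices[0].message.content.strip().upper()

# 10 runs per framing; any spread in the counts is the reproducibility problem.
for framing in PROMPTS:
    verdicts = Counter(judge(framing) for _ in range(10))
    print(framing, dict(verdicts))
```

Nothing guarantees all ten runs under a framing agree with each other, let alone across framings, which is exactly why the output still needs a human review either way.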
This is what I've seen most LLM proponents do: they gloss over the issues and tell everyone it's all fine. Who cares about the details?
They don't review the gigantic pile of slop code/answers/results they generate. They skim it and say YOLO. "It worked for my narrow set of anecdotal tests, so it must work for everything!"
IIRC DOGE did something like this to analyze which government jobs were needed and which weren't, and then fired people based on that. Guess how good the result was?
This is a very similar scenario: making a judgement call based on a small set of data. LLMs absolutely suck at it. And I'm not even going to get into the issue of liability, which is another can of worms.