> Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.

Curator, answer key, Finder, shell steps, structured report, sink hints… I understand nothing. Did you use an LLM to generate this HN submission?

It looks like a standard LLM-as-a-judge approach. Do you manually validate or verify some of the results? Done poorly, the results can be very noisy and meaningless.

yorwba | 10 hours ago | parent | on: 47759166
Yeah, the LLM judge is a bit too gullible. GLM 5.1 here https://ndaybench.winfunc.com/traces/trace_585887808ff443cca... claims that onnx/checker.cc doesn't reject hardlinks, even though it does (and the model output even quotes the lines that perform the check). The actual patch https://github.com/onnx/onnx/commit/4755f8053928dce18a61db8f... instead uses std::filesystem::weakly_canonical to catch path traversal through symlinks. It also adds a Python function that performs the same (?) checks when saving files. Honestly, even that patch seems LLM-generated to me, the way it duplicates code in a bunch of places instead of channeling all file accesses through a single hardened function.

Anyway, GLM 5.1 gets a score of 93 for its incorrect report.

rohansood15 | 18 hours ago | parent | on: 47759166
I worked in AppSec in the past, made sense to me. Maybe you aren't the target audience?

You don't really need manual verification for these; the CVEs (vulnerabilities) are public and can be programmatically validated.

muldvarp | 12 hours ago | parent | on: 47759668
Manual verification that the "judge" judges correctly.

Also, how exactly do you programmatically validate CVEs?

johnfn | 17 hours ago | parent | on: 47759166
Is this really that hard to parse?

Curator and Finder are the names of the agents. "answer key" - haven't you ever taken a test in high school? It's an explanation of the answer. "shell steps" I presume means it gets to run 24 commands on the shell. "structured report" - do I really need to explain to you what a report is? "sink hints" - I admit I didn't know this one, but a bit of searching indicates that it's a hint at where the vulnerability lies.

peyton | 19 hours ago | parent | on: 47759166
> Did you use an LLM to generate this HN submission?

Must have.

> The Finder will never see the patch.

I wasn’t worried that this eval would show the answer to the model before evaluating it. Seems requirements leaked into this post.