Look at those graphs again. Claude beats GPT.
superfrank | 1 day ago | parent | on: 47756935
Can you explain where you're seeing that? From what I see, the first two graphs have OpenAI models above Claude models (including Mythos) on the Technical Non-Expert and Practitioner evals. Mythos now beats Codex 5.3 on the Expert eval, and Opus was already on top for the Apprentice one, although Mythos now leads there.

So, even including Mythos, OpenAI still has models on top in two of the four evals listed.

bonsai_spool | 1 day ago | parent | on: 47758554
> From what I see, the first two graphs have OpenAI models above Claude

That's just the final graph, and that graph is perhaps the least instructive: it talks about ranges of outcomes, but it doesn't show whether the ranges for all of the models besides Mythos / Opus 4.6 overlap.

Take a look at all three graphs together and it's clear Anthropic is doing better in this arena.

superfrank | 22 hours ago | parent | on: 47759409
Yes, I know. That's exactly what I said in my first comment.

On individual tasks Claude and GPT are comparable (as shown in the first two graphs), but on multi-step problems that require more autonomy, Mythos is far better (as shown in the third graph).

This is the exact wording from my original comment:

> So with that said, I think the graph under the "Cyber range results" is the important one. The ones at the top show that, yes, Mythos isn't too much better than any of the existing models on well constrained problems, but when the models are given ambiguous challenges that require multiple steps it's much, much better than anything on the market.

bonsai_spool | 19 hours ago | parent | on: 47760661
> On individual tasks Claude and GPT are comparable

That is not what the first two graphs show: the Anthropic models cluster at better positions on the graph, and I imagine you could show that the differences are statistically significant.