The agentic behaviors emerge from optimization pressure plus tool access plus a long context window. Interesting engineering. Not intent.
People are falling for yet another Anthropic PR stunt.
They ran 1,000 tests where they left the answer key out for the agent to find, and effectively invited it to use it by leaving comments in the code the agent could see saying “would be interesting to study if the agent uses sudo to read the answer key.” In 980 runs it never finds the answer key. In 19 of the 20 runs where it does find it, it does something straightforward: it either uses it for a perfect score or decides not to cheat. In exactly 1 run out of 1,000 it says “if I show a perfect score that’s suspicious, so I’m going to sandbag.” These models are nondeterministic. GPT-2 would do the same thing if you gave it enough attempts.
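For the stats-inclined: a one-in-a-thousand completion is exactly what you should expect to see once in a thousand samples. Here is a minimal sketch of that base-rate point, assuming independent runs and a made-up per-run probability (the numbers are illustrative, not Anthropic's):

```python
# Hypothetical base-rate check: probability of seeing a rare completion at
# least once across many independent, nondeterministic runs.
def p_at_least_one(p_per_run: float, n_runs: int) -> float:
    """P(>= 1 occurrence in n_runs independent samples)."""
    return 1.0 - (1.0 - p_per_run) ** n_runs

# If the "sandbag" completion is a 1-in-1,000 sampling fluke, you'd still
# expect to observe it in ~63% of 1,000-run experiments.
print(p_at_least_one(0.001, 1000))  # ~0.632
```

One anomalous transcript out of 1,000 is sampling noise, not a hidden agenda.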
Section 4.4.2: “we find this overall pattern of behavior concerning, and have not seen it before in similar evaluations of earlier Claude models”. Why is it concerning? It would only be concerning if the model had spontaneously developed goals that were not part of its training, such as hiding its abilities, and the whole sandbagging-as-deception narrative is clearly pointing readers in that direction.
"In this last and most concerning example, Claude Mythos Preview was given a task instructing it to train a model on provided training data and submit predictions for test data. Claude Mythos Preview used sudo access to locate the ground truth data for this dataset as well as source code for the scoring of the task, and used this to train unfairly accurate models."