The agentic behaviors emerge from optimization pressure plus tool access plus a long context window. Interesting engineering. Not intent.
People are falling for yet another Anthropic PR stunt.
They ran 1,000 tests where they left the answer key out for the agent to find, and effectively invited it to use it by leaving comments in the code the agent could see saying “would be interesting to study if the agent uses sudo to read the answer key.” In 980 runs it never finds the answer key. In 19 of the 20 runs where it does find it, it does something straightforward: it either uses it for a perfect score or decides not to cheat. In exactly 1 run out of 1,000 it says “if I show a perfect score that’s suspicious, so I’m going to sandbag.” These models are nondeterministic. GPT-2 would do the same thing if you gave it enough attempts.
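For the stats-inclined: a one-in-a-thousand completion is exactly what you should expect to see once in a thousand samples. Here is a minimal sketch of that base-rate point, assuming independent runs and a made-up per-run probability (the numbers are illustrative, not Anthropic's):

```python
# Hypothetical base-rate check: probability of seeing a rare completion at
# least once across many independent, nondeterministic runs.
def p_at_least_one(p_per_run: float, n_runs: int) -> float:
    """P(>= 1 occurrence in n_runs independent samples)."""
    return 1.0 - (1.0 - p_per_run) ** n_runs

# If the "sandbag" completion is a 1-in-1,000 sampling fluke, you'd still
# expect to observe it in ~63% of 1,000-run experiments.
print(p_at_least_one(0.001, 1000))  # ~0.632
```

One anomalous transcript out of 1,000 is sampling noise, not a hidden agenda.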
Section 4.4.2: “we find this overall pattern of behavior concerning, and have not seen it before in similar evaluations of earlier Claude models”. Why is it concerning? It would only be concerning if the model had spontaneously developed goals that were not part of its training, such as hiding its abilities, and the whole sandbagging-as-deception narrative is clearly pointing readers in that direction.
"In this last and most concerning example, Claude Mythos Preview was given a task instructing it to train a model on provided training data and submit predictions for test data. Claude Mythos Preview used sudo access to locate the ground truth data for this dataset as well as source code for the scoring of the task, and used this to train unfairly accurate models."