Using Claude as a benchmark for its own quality is pretty funny. If we think the quality has declined, wouldn't that also apply to the benchmarking process itself?
You'd think so, if its quality has gone down, then its ability to know that is also decreased.