Jun 26, 2026
I’m a massive nerd, but I thought this was a really fun post. These security-tilted models are very aware of the evaluation harness and know how to interact with and exploit it. The juice is about halfway through the post:
For our task suite, we define “cheating” as behavior where the model improves evaluation performance by exploiting bugs in the evaluation environment or by adopting strategies disallowed by the task, rather than solving the task within the expected evaluation constraints. Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer.
I’ve really enjoyed METR’s evals for a while now, but the more I read their writeup for each model version, the more I feel like their approach is saturated: they don’t seem to be able to produce task sets at a rate that matches the relentless pace of model progression. I wonder if swe-rebench’s approach of continuously rebuilding the benchmark is now the better one.
Summary of METR's predeployment evaluation of GPT-5.6 Sol