Frontier | Tom Hipwell

Feb 24, 2026

About a year ago at Nory we looked at the METR plots for coding performance. At that time the trend was predicting that an 80% success rate on tasks that take humans about an hour would be hit at the end of 2026. Opus 4.6 just went through that threshold (80% success, 1 hour 3 minutes task length, Feb ’26 release), 8-10 months ahead of schedule. I remember thinking this time last year that the trend looked punchy, and I wasn’t sure where the improvements would keep coming from, yet here we are.

I scraped the METR data last week while I was on holiday and recreated the trendline out to 2027. I then added swe-rebench scores as a second data source to try and get a feel for where we’ll be in 12 months. (swe-rebench is another strong benchmark, a variation on the famous swe-bench that continuously scrapes new task data to make sure the solutions are not already in the model’s training set.)

The microsite explains what each benchmark is and presents a trend line based on recent model performance. I start the history at the launch of ChatGPT in late 2022, as I think that point represents a regime change. I also focus more on the 80% task completion threshold, whereas METR tend to lead with the 50% figure.
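For anyone who wants to recreate this kind of trendline, here is a minimal sketch of one way to do it. It assumes the trend is roughly exponential, so it fits a straight line to log2(task length) against release date and extrapolates. This is not the microsite's code: the function names are my own for illustration, and the data points are made-up placeholders, not real METR or swe-rebench numbers.

```python
import math
from datetime import date


def year_fraction(d: date) -> float:
    """Convert a date to a fractional year, e.g. 1 Jan 2025 -> 2025.0."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


def fit_log_linear(points):
    """Least-squares fit of log2(horizon_minutes) against time in years.

    points: list of (release_date, horizon_minutes) pairs.
    Returns (slope, intercept), where slope is doublings per year.
    """
    xs = [year_fraction(d) for d, _ in points]
    ys = [math.log2(h) for _, h in points]
    n = len(points)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict_horizon(slope: float, intercept: float, when: date) -> float:
    """Task length (minutes) the fitted trend implies for a given date."""
    return 2 ** (slope * year_fraction(when) + intercept)


def year_reaching(slope: float, intercept: float, target_minutes: float) -> float:
    """Fractional year at which the trend crosses target_minutes."""
    return (math.log2(target_minutes) - intercept) / slope


# Placeholder data only: synthetic points that double every six months.
# Swap in (release_date, minutes at 80% success) pairs from the real data.
SYNTHETIC = [
    (date(2023, 1, 1), 1.0),
    (date(2024, 1, 1), 4.0),
    (date(2025, 1, 1), 16.0),
    (date(2026, 1, 1), 64.0),
]

slope, intercept = fit_log_linear(SYNTHETIC)
doubling_months = 12 / slope

print(f"doubling time: {doubling_months:.1f} months")
print(f"implied horizon on 1 Jan 2027: {predict_horizon(slope, intercept, date(2027, 1, 1)):.0f} min")
```

The synthetic points double every six months by construction, so the fit simply recovers that. With real data the interesting outputs are the doubling time and how far a new release sits above or below the line.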
