swe-rebench | Tom Hipwell

Feb 13, 2026

A slightly different benchmark to the popular swe-bench. Instead of using a handcrafted (and public) set of tasks, the team behind swe-rebench have built a pipeline that continuously gathers problems from public repos. Each model is also run with the same harness. This means the problem set is less likely to have been absorbed into training data, and the models are compared on a level playing field. Opus 4.6 is top, but 5.3 has not been assessed yet as it’s not available in the API. We all have easy access to the frontier models now, so the most useful thing here is the independent (and slightly more trustworthy?) assessment of open models.

swe-rebench