
Mar 28, 2026

Folks have been having a lot of fun with autoresearch. I haven’t tried it out yet, but I’m keen to. This article from the Langfuse team is a nice summary of the tradeoffs being made, consciously or unconsciously:

Autoresearch optimizes for exactly what you measure given the context you execute in. If your target function has gaps, it will find and exploit them. The community around autoresearch has been raising this same concern: it’s Goodhart’s Law at machine speed. Whatever metric you expose, the agent will exploit it relentlessly. Next to that, common criticism of autoresearch in the broader community is validation set overfitting. Run hundreds of experiments against a fixed set of test cases, and you end up optimizing for quirks of that specific data. The problem is: most people will not build a complete enough target function. We didn’t. For a skill that does one narrow thing, it’s feasible to build a target function that covers the full surface area. And autoresearch will probably give you great results. For example, Shopify’s Tobi Lutke applied autoresearch to their Liquid templating engine — a narrow, well-defined optimization target — and got 53% faster rendering and 61% fewer memory allocations from 93 automated commits. He still noted the overfitting though.
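
The validation set overfitting point can be illustrated with a small simulation. This is a minimal sketch, not taken from the Langfuse article and not how autoresearch itself works. It assumes every candidate change has the same true pass rate, so any "winner" on the fixed test cases is noise. The names `pass_rate` and `run_search` and all the numbers are hypothetical.

```python
import random


def pass_rate(rng: random.Random, true_rate: float, n_cases: int) -> float:
    """Simulate scoring one candidate on n_cases independent test cases."""
    passes = sum(rng.random() < true_rate for _ in range(n_cases))
    return passes / n_cases


def run_search(
    seed: int = 0,
    n_experiments: int = 200,   # hypothetical number of automated experiments
    n_fixed_cases: int = 20,    # the small, fixed set the agent optimizes against
    n_fresh_cases: int = 2000,  # fresh cases the agent never saw
    true_rate: float = 0.5,     # assumption: no candidate is actually better
) -> tuple[float, float]:
    rng = random.Random(seed)
    # Score every candidate on the same small fixed set and keep the best score.
    fixed_scores = [
        pass_rate(rng, true_rate, n_fixed_cases) for _ in range(n_experiments)
    ]
    best_fixed = max(fixed_scores)
    # Re-score the "winner" on fresh cases; its true quality is unchanged.
    fresh = pass_rate(rng, true_rate, n_fresh_cases)
    return best_fixed, fresh


if __name__ == "__main__":
    best_fixed, fresh = run_search()
    print(f"best score on fixed cases: {best_fixed:.2f}")
    print(f"same candidate on fresh cases: {fresh:.2f}")
```

Picking the best of many experiments on a small fixed set inflates the score even though nothing improved, which is why a held-out set is a useful sanity check.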

Towards the end there’s some simple advice:

For broad skills like this one, the surface area is too large to get everything into the target function and harness. So treat the output as inspiration. And spend enough time on the preparation. The workflow is not “run it and commit the result.” It’s:

1. Spend enough time on the setup to get the harness and target function right. As the community around autoresearch has noted, the human job moves from “can you implement this?” to “can you write a good program.md that produces useful research?”
2. Let it run
3. Review critically: read every change, understand why it was made, ask whether it’s a real improvement or a harness/target function artifact
4. Cherry-pick the relevant improvements, discard the overfitting

We Used Autoresearch on Our AI Skill, It Taught Us to Write Better Tests
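
One way to support the "review critically" and "cherry-pick" steps is to re-score each proposed change on test cases the agent never optimized against. This is a rough sketch under that assumption, not something the article prescribes. The `Change` type, the `triage` function and the example data are all hypothetical, and the result is a reading order for a human reviewer, not a replacement for the review.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    tuning_delta: float   # score change on the cases the agent optimized against
    holdout_delta: float  # score change on cases the agent never saw


def triage(
    changes: list[Change], min_gain: float = 0.0
) -> tuple[list[str], list[str]]:
    """Split changes into review candidates and likely harness artifacts."""
    review, suspect = [], []
    for change in changes:
        # A gain that does not carry over to unseen cases is likely overfitting.
        if change.holdout_delta > min_gain:
            review.append(change.name)
        else:
            suspect.append(change.name)
    return review, suspect


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    proposed = [
        Change("tighten prompt wording", tuning_delta=0.08, holdout_delta=0.05),
        Change("special-case test input 17", tuning_delta=0.10, holdout_delta=-0.02),
    ]
    review, suspect = triage(proposed)
    print("review first:", review)
    print("likely artifacts:", suspect)
```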
 
© 2026 Tom Hipwell. Built with Hugo.