Quoting Ankit Maloo
Mar 24, 2025
As with "The Model is the Product" from a couple of weeks ago, the bitter lesson here is that brute-forcing problems with compute wins over clever solutions. Scaling compute at inference time with RL is the latest application of the bitter lesson, and we’re already seeing it move the needle in production use cases (customer support and, soon, coding). This has big ramifications in the AI application layer:
While many companies are focused on building wrappers around generic models, essentially constraining the model to follow specific workflow paths, the real breakthrough would come from companies investing in post-training RL compute. These RL-enhanced models wouldn’t just follow predefined patterns; they would discover entirely new ways to solve problems. Take OpenAI’s Deep Research or Claude’s computer-use capabilities: they demonstrate how investing in compute-heavy post-training yields better results than intricate orchestration layers. It’s not that the wrappers are wrong; they just know one way to solve the problem. RL agents, with the freedom to explore and massive compute resources, find better ways we hadn’t even considered.