Apr 13, 2025
Another AI prediction, but I think this one pinpoints some of the blockers much more clearly. In summary:
Roughly: generalist scaling does not work or, at least, not well enough to make meaningful sense for material deployment. Instead, most development, including agentification, happens in the smaller size range with specialized, opinionated training. Any actual “general intelligence” has to take an entirely different direction, one that is almost discouraged by formal evaluation. Simply put: the first AGI will be bad, also amazing but bad.
The key insight of the piece is that it’s not the capacity of the model that matters (e.g. which benchmarks it can beat and how much of the time); it’s the accuracy that matters (e.g. hallucination rate). I think anyone who has been building with AI can relate to that one:
While safe and accurate deployment can be achieved with a human in the loop, this process only makes economic sense once an AI system is already resilient enough. You don’t want to manually correct 20% of the time. In short, the dynamic that matters is not a timeline of capacities but a timeline of accuracies.
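To make that concrete, here’s a rough back-of-the-envelope sketch of why a 20% correction rate can eat most of the savings from automation. Every number is a made-up placeholder of mine, not a figure from the piece:

```python
# Back-of-the-envelope economics of human-in-the-loop deployment.
# All costs below are illustrative assumptions, not figures from the piece.

task_cost_human = 10.00   # doing the task fully by hand
task_cost_model = 0.50    # inference cost per task
review_cost = 3.00        # a human skimming every model output
correction_cost = 25.00   # diagnosing and fixing a bad output, plus downstream cleanup

def expected_cost(error_rate: float) -> float:
    """Expected per-task cost with a human reviewing every output."""
    return task_cost_model + review_cost + error_rate * correction_cost

for error_rate in (0.20, 0.05, 0.01):
    print(f"error rate {error_rate:>4.0%}: ${expected_cost(error_rate):5.2f}"
          f"  vs ${task_cost_human:.2f} fully manual")

# With these assumptions, a 20% error rate leaves almost no margin over doing the
# work by hand; the economics only become obvious once accuracy is much higher.
```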
It’ll be hard for RL models to transfer across domains, since defining a reward function for knowledge work is hard, but it will be possible:
The unfortunate reality is that, for now, RL is not transferring well across domains, and most domains are not covered by existing reward functions, which are mostly designed around code and math. Actually, most domains are not even “verifiable” in the sense that there won’t be a universal unique answer but a gradient of better ones. So it takes a village to move toward vertical RL: operationalized rewards, rubric engineering, classifiers, llm-as-a-judge.
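For a sense of what “rubric engineering + llm-as-a-judge” could look like in practice, here’s a minimal sketch of a weighted rubric reward for a non-verifiable task. The rubric, weights, prompt, and judge model name are my own illustration, not anything from the piece:

```python
# Minimal sketch of a rubric-based "LLM as a judge" reward for a non-verifiable
# domain (here, drafting a client email). Rubric, weights, and judge model name
# are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = {
    "factual_grounding": 0.4,   # claims are supported by the provided context
    "tone": 0.3,                # matches the requested register
    "actionability": 0.3,       # the reader knows what to do next
}

JUDGE_PROMPT = """Score the candidate answer on each criterion from 0 to 1.
Return JSON like {{"factual_grounding": 0.8, "tone": 0.9, "actionability": 0.5}}.

Task: {task}
Candidate answer: {answer}"""

def rubric_reward(task: str, answer: str) -> float:
    """Weighted rubric score in [0, 1], usable as a (noisy) RL reward."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, answer=answer)}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(resp.choices[0].message.content)
    return sum(RUBRIC[k] * float(scores.get(k, 0.0)) for k in RUBRIC)
```

A reward like this is noisy and gameable, which is exactly why the quote frames vertical RL as a village of operationalized rewards and classifiers rather than a single judge.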
Nice insight on model size at Pleias:
At this point, even tiny reasoning models could suddenly become good in a large variety of fields. At Pleias, we’re doing a series of early experiments on semantic data in regulated industries (especially banking and telecommunications). So far a gpt-2 sized model can be leveraged to get a deeper and more accurate understanding of industry standards than frontier models. Logical reasoning performs well even at the gpt-2 size.
The other major blocker to accuracy is model interpretability; we’re going to need new upstream tools to trace weaknesses back before they cascade into failures:
if you see generative models as products (I do), they prove lacking in one aspect: providing feedback metrics and failure modes. One of the best examples is OCR. Vision Language Models now perform considerably better on challenging tasks and yet fail to properly warn when things go south. Noisy texts become undetectable hallucinations. There is no standardized approach yet for accuracy estimates, which would be at the token level in the best-case scenario, while LSTMs provide character-level metrics.
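The kind of token-level signal being asked for isn’t exotic: most inference stacks already expose per-token log probabilities, which can serve as a crude confidence proxy. A minimal sketch, where the threshold and the toy transcription are assumptions of mine for illustration:

```python
# Sketch of a token-level accuracy estimate: flag spans of an OCR/VLM
# transcription where the model's own per-token probability drops, as a crude
# proxy for likely hallucination. Input format and threshold are assumptions.
import math

def flag_low_confidence(tokens_with_logprobs, threshold=0.5):
    """Yield (token, probability, suspect) triples.

    tokens_with_logprobs: iterable of (token_str, logprob) pairs.
    threshold: probability below which a token is marked suspect.
    """
    for token, logprob in tokens_with_logprobs:
        prob = math.exp(logprob)
        yield token, prob, prob < threshold

# Toy usage: a noisy region of the scan shows up as a run of suspect tokens.
transcription = [("Invoice", -0.05), (" total", -0.10), (" 1", -0.9), ("42", -1.6), (".", -0.2)]
for token, prob, suspect in flag_low_confidence(transcription):
    marker = " <-- low confidence" if suspect else ""
    print(f"{token!r:12} p={prob:.2f}{marker}")
```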
The rest of it is speculative, as you’d expect. I’m personally most interested in the 12-month predictions at this point - I think these are far more useful.
A Realistic AI Timeline