Hot takes on o3
Dec 22, 2024
Everywhere seems to be full of hype around o3 since Friday’s announcement from OpenAI, so I thought I’d summarise a few points I’ve seen shared in various places but not yet gathered together. We’re going to zoom in mostly on the ARC-AGI results, as I think that is the most interesting part. Before we do that, let’s introduce the ARC challenge.
ARC (the Abstraction and Reasoning Corpus) was designed and created by François Chollet, author of Deep Learning with Python, creator of the Keras framework, and ex-Google. The intent behind the benchmark was to set a “North Star” milestone for AGI. The test is not intended to answer “AGI achieved y/n?” but instead plots a waypoint on the course towards AGI.
To do this, the big idea behind the test was not to measure skill (completing a bar exam, for example) but instead to measure intelligence. Skill is not a good proxy for intelligence as it’s heavily influenced by prior knowledge and experience. If our benchmarks measure skill it is quite easy to create an illusion of intelligence by collecting loads of training data and then training the model as a compression of that data. If we focus too much on skill we end up with a form of AI that generalises poorly - the AI will not be able to acquire new skills outside of its training data. A lot of benchmarks used to test humans are easily breached by AI because they’re really targeting knowledge retention. As an example, a human domain expert on the MMLU benchmark gets 89%, whereas Claude 3.5 Sonnet gets 88%.
To measure intelligence more effectively, the ARC test was designed to be hard for AI to pass but easy for humans. To do this, it relies on very little prior knowledge (mostly a bit of geometry and the ability to count) and instead focuses on flexibility and adaptability: the ability to acquire a new skill rather than use an existing one - something humans are very good at (worth noting, ARC is also a human-orientated benchmark). The ARC tasks have high generalisation difficulty - there’s a lot of uncertainty about how to solve each question because all of the knowledge required to solve a task is contained only in the task itself.
What this means in practice is that the tasks are highly abstract games where you need to fill in grids of different sizes with colour-based symbols, based on a few examples (typically three) - but if you’re human you probably already have a very good intuition of what the solution is.
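For reference, the tasks themselves are distributed as small JSON files: a handful of training input/output grid pairs plus one or more test inputs, where each grid is a 2D array of integers 0-9 standing for colours. Here’s a made-up toy example in that layout (the rule, “swap colours 1 and 2”, is mine for illustration, not a real task):

```python
import json

# A toy ARC-style task in the JSON layout used by the public repo
# (github.com/fchollet/ARC): grids are lists of lists of ints 0-9,
# each integer standing for a colour. The rule here - copy the grid
# and swap colours 1 and 2 - is invented for illustration.
toy_task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[2, 2], [1, 0]], "output": [[1, 1], [2, 0]]},
        {"input": [[0, 1], [2, 0]], "output": [[0, 2], [1, 0]]},
    ],
    "test": [
        {"input": [[1, 2], [2, 1]]},  # the solver must produce the output grid
    ],
}

print(json.dumps(toy_task["train"][0], indent=2))
```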
To reinforce the point about human intuition, a comprehensive survey from a team at NYU with ~1800 participants puts the estimated average human pass rate between 73% and 77%. If you want, you can try the tasks out here and prove to yourself that you are still more intelligent than AI. If you check the results for o3 advanced you can also have a look at the ones o3 failed (adjust the task id param in the URL of the player) - here’s an example - and assure yourself that you are still far smarter than o3. Clearly we are not dealing with superhuman intelligence here.
I say all this because I think it’s really important to understand in a bit more detail what ARC is before trying to make predictions about what these results mean for the path to AGI. The other thing that’s important to understand is how the competition itself is constructed.
The ARC tasks are split into a training set and an evaluation set. The training set features 400 tasks, while the evaluation set features 600 tasks. The evaluation set is further split into a public evaluation set (400 tasks) and a private evaluation set (200 tasks). The competition itself has a few rules: you have to take the private test set, you have to complete the exercise in 12 hours or less, and you need to do so while spending under $10,000 in compute (to win the prize you have to run in Kaggle, so the cost of compute is standardised). You also get two attempts to answer each question, and your code has to be open sourced (I think this last stipulation is (sadly) a neat way of ensuring that the prize goes to a lab and not a closed-source model operator).
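The “two attempts” rule is stricter than it might sound: for each test grid you may submit up to two candidate outputs, and the task only counts if one of them matches the expected grid cell-for-cell, with no partial credit. A rough sketch of that rule (my own illustration, not the official Kaggle scorer):

```python
def task_solved(attempts, expected):
    """A test output counts as solved if either of the (up to) two
    submitted grids matches the expected grid exactly. No partial credit.
    Rough sketch of the rule, not the official scorer."""
    return any(attempt == expected for attempt in attempts[:2])

# First guess wrong, second guess right -> still counts as solved.
expected = [[2, 0], [0, 1]]
print(task_solved([[[1, 0], [0, 2]], [[2, 0], [0, 1]]], expected))  # True
```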
If we start to put this together, and if you’ll allow me to lean into the excitement a little, the 75% score of o3 “high efficiency” is really cool - for $20 per task you can get an answer from o3 in 1.3 minutes. Now, we’re talking about a task here that you can get completed on Amazon Mechanical Turk for about $0.10 per minute, but these tasks are really hard for a model to solve. The low-efficiency approach got 85% in 13.8 minutes per task for roughly 172x the cost ($3,440 per task) - looking at the time taken and the ramp in cost, I’m guessing this involved consensus voting by a lot of parallel runners, but who knows!?! Yes, the model was trained on the training set, but that is allowed by the rules and the training set is quite small. Yes, we should probably discount the low-efficiency score, as that is likely a more brute-forcey type of solution to try and win the prize. And yes, it looks like the model performs (comparatively) poorly on another iteration of the benchmark that is still easy for humans to solve. BUT what is really cool is that it looks like we have an architecture that can do well on this type of task - and that opens up a lot of interesting doors.
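If you’re wondering what that consensus speculation might look like in practice, the simplest version is just: sample lots of candidate answers in parallel and submit the most common ones. A purely illustrative sketch (my guess at the shape of it, not anything OpenAI has described):

```python
from collections import Counter

def consensus_pick(candidate_grids, n_attempts=2):
    """Majority vote over many sampled answers: return the n_attempts most
    frequent distinct grids. Purely illustrative - a guess at what a
    'low efficiency' mode might be doing, not a description of o3."""
    counts = Counter(tuple(tuple(row) for row in g) for g in candidate_grids)
    top = counts.most_common(n_attempts)
    return [[list(row) for row in grid] for grid, _ in top]

# e.g. many parallel samples for one task, keep the two most common answers
samples = [[[1, 2]], [[1, 2]], [[3, 4]], [[1, 2]], [[3, 4]], [[5, 6]]]
print(consensus_pick(samples))  # [[[1, 2]], [[3, 4]]]
```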
Here are some more links to further reading if you want to look into this a bit more:
- The original paper from François Chollet introducing the ARC benchmark: On the Measure of Intelligence
- Mike Knoop talks a little about his motivations around the ARC prize on the No Priors pod here
- Llama-Berry, an early open-source attempt at getting o1-like performance on mathematical reasoning tasks
- Technical report for DeepSeek R1, the top open-source equivalent to o1.
- A survey paper on Reinforcement Learning
- Nathan Lambert’s blog
- Melanie Mitchell’s blog from earlier this year on some early attempts at winning the prize