Feb 18, 2025
I thought this post was interesting, not so much for its conclusion about Grok 3 as for the range of tests that Andrej performs to get a feel for the model's capabilities in roughly two hours or less. It’s all there: recall and reasoning, without search, about the GPT-2 training FLOPs; a few varied dev tasks; research tasks; search tasks (including a gut feel for hallucinations); ethics; personality; and then a battery of standard LLM assessments (counting the ‘r’s in strawberry, whether 9.11 > 9.9, drawing a pelican as an SVG, etc.). To wrap up, he sense-checks his findings against LM Arena.
A lot of writing on LLMs is of varying quality (at best!), so it is worth calling out a model teardown that makes sense, is done in public, is accessible, and is easy to emulate or tailor to your own context. That is no surprise coming from Karpathy, but it is still worth paying attention to.
Karpathy's Vibes Check