AGI Predictions

I really enjoyed the nonint post on timelines to AGI. Obviously James Betker is better placed than I am to make an informed prediction, and he has inside information (he works for OpenAI), but there are a couple of things that jump out at me when I read this prediction critically.

Firstly, given that transformers are great general approximators of behaviour, it’s very difficult to falsify any prediction about AGI without a specific, testable definition of AGI that everyone agrees on. I don’t think we have that yet. Until we do, it’s very easy to move the goalposts to make your prediction correct in the future (or appear more plausible today; witness the demo disease we’re currently experiencing). It’s clear from consumer feedback that the bar is likely to be insanely high: whatever you are trying to do, you need to be much better than a human doing the role. See self-driving cars, or Google’s recent travails. This is likely to make the timelines to production-grade, scaled applications of AGI for general use cases very long.

Secondly, if we do get AGI on a 3-5 year horizon by scaling what we have today, then clearly it will be insanely verbose and we’ll be interrupting it a lot without some changes to the architecture we’re using. I think Yann LeCun is probably right here: the energy-based learning stuff is important. We probably need an innovation that allows the amount of compute per predicted token to vary, so we don’t need to hack reasoning ability by increasing verbosity to use more compute per prompt (compute is constant per predicted token in the architecture of today’s models, which is why increasing the length of the response using chain-of-thought etc. increases response quality).
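To make the “constant compute per token” point concrete, here’s a back-of-envelope sketch (my own illustration, not anything from James’s post). It uses the common approximation that a forward pass costs roughly 2 FLOPs per parameter per generated token; the model size and token counts are hypothetical.

```python
# Back-of-envelope: per-token compute is fixed by the architecture, so the only
# lever for "thinking harder" on a prompt is emitting more tokens.
# Uses the rough ~2 * parameters FLOPs-per-token approximation for a forward
# pass; real inference cost also depends on context length, KV caching and
# batching, so treat these numbers as illustrative only.

PARAMS = 70e9  # hypothetical 70B-parameter model


def generation_flops(num_output_tokens: int, params: float = PARAMS) -> float:
    """Approximate FLOPs spent generating a response of the given length."""
    flops_per_token = 2 * params
    return flops_per_token * num_output_tokens


terse_answer = generation_flops(50)        # short, direct reply
chain_of_thought = generation_flops(1500)  # verbose step-by-step reasoning

print(f"terse answer:     {terse_answer:.2e} FLOPs")
print(f"chain of thought: {chain_of_thought:.2e} FLOPs")
print(f"extra compute from verbosity: {chain_of_thought / terse_answer:.0f}x")
```

There is no knob in today’s architecture for spending more compute on a hard token and less on an easy one; response length is the knob, which is why verbosity ends up standing in for reasoning.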

Thirdly, on a 3-5 year horizon, given the training timelines, I’d assume we don’t have tonnes of time to innovate on the architecture, so we’d need to be close to those breakthroughs today. I don’t think that’s the case, at least from what’s in the public domain.

Fourthly, I think the language we’ve used around the Chinchilla Scaling Laws might be causing us (perhaps unconsciously) to make comparisons to Moore’s law. Moore’s law is about the number of transistors on a chip doubling every two years. It does not make predictions about what we can do with that compute. Moore’s law is a great prediction because it is specific, verifiable and has a precise timeframe, and it is expressed in a way that makes the level of certainty clear. I’ve not seen many great AGI predictions yet because I don’t think they meet these properties, and this type of rigour is noticeably missing from a lot of AI-related discussion at the moment.
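For contrast, here’s roughly what a Chinchilla-style scaling law actually predicts: training loss as a function of parameter count and training tokens, not capabilities. The functional form and constants below are the approximate fitted values reported in Hoffmann et al. (2022), reproduced from memory, so treat them as illustrative.

```python
# Chinchilla-style parametric fit: L(N, D) = E + A / N**alpha + B / D**beta
# Approximate fitted constants from Hoffmann et al. (2022). The output is a
# predicted pre-training loss, not a statement about downstream capabilities.

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(params: float, tokens: float) -> float:
    """Predicted training loss for a model of `params` parameters trained on `tokens` tokens."""
    return E + A / params**ALPHA + B / tokens**BETA


# Roughly Chinchilla-optimal models: about 20 training tokens per parameter.
for n in (1e9, 10e9, 70e9, 400e9):
    print(f"{n:.0e} params, {20 * n:.0e} tokens -> predicted loss {chinchilla_loss(n, 20 * n):.3f}")
```

Like Moore’s law, this is specific, verifiable and falsifiable, but the quantity it predicts is loss on a training budget. It says nothing about whether that loss corresponds to AGI, which is exactly the gap that AGI predictions need to close.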

A 3-5 year timeline feels unlikely to me, but multi-year predictions are obviously hard, and it’s a fun exercise to try!

Anyway, worth a read -> (I’m a fan of the blog as well, so I highly recommend reading back through James’s past posts).