February '24 Roundup

February feels like it’s gone in a blur. Hofy had a brilliant company retreat in Peniche, Portugal. Sora looks insane. Google returned to open-source AI with the Gemma series while Mistral released a hosted, closed-source model. Here are a few other things that caught my eye:

Self-Discover, Google DeepMind

Can we improve LLM reasoning just by adjusting the way we prompt? Google DeepMind demonstrate an up to 32% uplift in performance that transfers across LLMs (GPT-4, GPT-3.5, Llama2-70B, PaLM 2-L) and most types of reasoning problem (it outperforms Chain of Thought on 21/25 BIG-Bench Hard tasks) by using a simple two-stage reasoning process built from four prompts. In the first stage you SELECT which reasoning modules to use (e.g. “break the problem down into sub-tasks”, “critical thinking”), ADAPT those modules to the context of the prompt (“calculate each arithmetic operation in order”) and IMPLEMENT the approach by building a structured reasoning plan (in JSON, as this has been shown to boost the reasoning capability of LLMs). In the second stage you SOLVE the problem using the plan created by the first three meta-prompts.

If this feels familiar, that’s because it’s based on a standard human reasoning pattern; intuitively it fits how we solve problems. Take engineering, for example: we might start by thinking through how known patterns or prior art could be used to solve a problem, work out how they’d need to be tweaked for our specific challenge, then write up a technical design doc to explain it to others. We’d use that document to split the work, crack on, and execute.

This is a neat reminder of how early we are in our journey - a research team from a world-leading AI organisation takes a model of human reasoning first documented in 1958 and intuitive to all of us, adjusts their prompts to follow that approach, and gets a very large uptick in performance over what was previously considered state of the art (6% better than Chain-of-Thought, 23% better than not suggesting a problem-solving approach at all, with GPT-4).

If you want to try it out, all the prompts and “reasoning modules” are given in Appendix A of the paper, or there’s a rough and ready implementation on GitHub here.
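To give a flavour of the shape of it, here’s a minimal sketch of the two-stage flow. The complete() callable, the module list and the prompt wording are all illustrative stand-ins (any chat-completion API and the full Appendix A module list would slot in); it’s not the paper’s exact text.

```python
# Minimal sketch of the SELF-DISCOVER two-stage flow.
# `complete` is a placeholder for whatever LLM call you use; the module list
# and prompt wording are paraphrased for illustration, not the paper's text.

REASONING_MODULES = [
    "How can I break this problem down into smaller sub-tasks?",
    "How can I apply critical thinking to evaluate each option?",
    "How can I simplify the problem so it is easier to solve?",
]

def self_discover(task: str, complete) -> str:
    # Stage 1a: SELECT the reasoning modules relevant to this task.
    selected = complete(
        f"Task: {task}\nFrom the modules below, select the ones useful for "
        f"solving this task:\n" + "\n".join(REASONING_MODULES)
    )
    # Stage 1b: ADAPT the selected modules to the specifics of the task.
    adapted = complete(
        f"Task: {task}\nRephrase these modules so they are specific to the "
        f"task:\n{selected}"
    )
    # Stage 1c: IMPLEMENT a structured, step-by-step reasoning plan as JSON.
    plan = complete(
        f"Task: {task}\nTurn these adapted modules into a step-by-step "
        f"reasoning plan in JSON:\n{adapted}"
    )
    # Stage 2: SOLVE the task by following the plan.
    return complete(
        f"Follow this reasoning plan step by step and give the final "
        f"answer.\nPlan: {plan}\nTask: {task}"
    )
```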

Chess-GPT, Adam Karvonen

What’s impressive here is getting to 1300 Elo (playing strength) with nanoGPT and a day on 4 RTX 3090 GPUs; it shows how well the transformer architecture generalises to different problems, and how much that performance is driven by the quality of the data. Google dropped a paper on a similar theme in February - Grandmaster-Level Chess without Search - which doubles down on the same point:

Our work shows that it is possible to distill a good approximation of Stockfish 16 into a feed-forward neural network via standard supervised learning at sufficient scale.

The insight from both is that the architecture generalises across very different problems, and the quality of the result depends on the scale and quality of the dataset. Does the data matter far more than the compute?
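For a sense of how little machinery is involved, here’s a rough sketch of the data-prep side: games as plain text strings, a character-level vocabulary, and integer sequences ready for a nanoGPT-style next-token training loop. The example games and delimiter are illustrative, not the actual Chess-GPT pipeline.

```python
# Rough sketch of the idea behind Chess-GPT: treat each game as a plain text
# string of moves and train a character-level transformer to predict the next
# character. The games below are illustrative; the real project trains on
# millions of games in a transcript-style format.

games = [
    ";1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Ba4 Nf6",
    ";1.d4 d5 2.c4 e6 3.Nc3 Nf6 4.Bg5 Be7",
]

# Build a character-level vocabulary over the whole corpus.
chars = sorted(set("".join(games)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(text: str) -> list[int]:
    return [stoi[ch] for ch in text]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

# Each training example is just (context, next character) pairs taken from the
# encoded game strings - exactly the setup a nanoGPT-style training loop expects.
data = [encode(g) for g in games]
x, y = data[0][:-1], data[0][1:]   # inputs and next-char targets
print(decode(x[:12]), "->", decode(y[:12]))
```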

Gemini 1.5 Pro, Gemini Team

Gemini 1.5 Pro launched with a staggering 1 million token context window (10 million in research 🤯). What makes this impressive is not just the size of the window but the recall across it - in a Needle in a Haystack evaluation it can pull back a required fact accurately 99.7% of the time (in Google’s research, it maintains this performance in text mode up to a 7 million word context length). The technical report also outlines another interesting reasoning task - learning Kalamang:

With only instructional materials (500 pages of linguistic documentation, a dictionary, and ≈ 400 parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a language spoken by fewer than 200 speakers in western New Guinea in the east of Indonesian Papua, and therefore almost no online presence. Moreover, we find that the quality of its translations is comparable to that of a person who has learned from the same materials.

Now, this is marketing, so add salt as required, but it hints at the capability of the model: the ability to reason about a new domain in depth while data poor. It makes you wonder if the performance gap versus fine-tuning (especially in weak-data scenarios) will keep decreasing. It also makes you wonder whether those AI-strategy data moats are going to be as strong as advertised. I wrote a bit more about this in February here.
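If you want to poke at long-context recall yourself, the Needle in a Haystack idea is easy to reproduce. Here’s a minimal sketch, assuming you bring your own model call - the needle text, filler corpus and ask_model callable below are all illustrative stand-ins, not Google’s harness:

```python
# Minimal sketch of a Needle-in-a-Haystack recall check. The needle, filler
# and `ask_model` callable are illustrative; a real harness would use a long
# distractor corpus and a real LLM API, sweeping context length and depth.

NEEDLE = "The secret passphrase is 'blue pelican'."
QUESTION = "What is the secret passphrase mentioned in the document?"
FILLER = "The committee reviewed the quarterly figures in detail. " * 50

def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return filler[:cut] + " " + needle + " " + filler[cut:]

def recall_score(ask_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    hits = 0
    for depth in depths:
        prompt = build_haystack(FILLER, NEEDLE, depth) + "\n\n" + QUESTION
        answer = ask_model(prompt)
        hits += "blue pelican" in answer.lower()
    return hits / len(depths)

# Example with a trivially 'perfect' stand-in model, just to show the shape:
print(recall_score(lambda prompt: "blue pelican"))
```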

Tiny LLMs, Xue-Yong Fu, Md Tahmid Rahman Laskar

Do smaller fine-tuned LLMs outperform larger LLMs with zero-shot prompting? Spoiler: no - with one exception, FLAN-T5. FLAN-T5 Large can run with about 14GB of VRAM (so it could run on a single RTX A4000 - spot instances are around $0.15/hour) versus 24GB of VRAM for LLaMA-2-7B (about $0.19/hour). On the benchmarks used in the paper (meeting summarisation) the fine-tuned FLAN-T5 outperforms every LLaMA-2 flavour and Mixtral-8x7B running zero-shot. Fine-tuning a larger LLM gives you a performance uptick again, particularly on long-format examples, where the size of the context window starts to come into play (4k tokens versus 2k for FLAN-T5).
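For a sense of how lightweight that is to serve, here’s a minimal inference sketch with Hugging Face transformers. The prompt and transcript are illustrative, and the paper’s results come from a version fine-tuned on meeting transcripts rather than this off-the-shelf checkpoint:

```python
# Minimal sketch of running FLAN-T5 Large for summarisation with Hugging Face
# transformers - the point being how cheap it is to serve compared with a 7B+
# model. Prompt and transcript are illustrative; the paper's numbers come from
# a checkpoint fine-tuned on meeting transcripts.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

transcript = "Alice: we're slipping on the Q2 launch. Bob: let's cut scope..."
prompt = f"Summarise the following meeting transcript:\n{transcript}"

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```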

Is Something Bugging You, Will Wilson

Will recounts building an event-based network simulation for testing FoundationDB (there’s also a Strange Loop talk here). The test suite they built let them simulate an entire cluster of interacting database processes, all within a single-threaded, single-process application, and all driven by the same random number generator. If they found a bug they could run the same test case over and over again with the same random seed, and the exact same series of events would happen in the exact same order. This is impressive, but perhaps niche for most of us. What jumps out is this line:

We had built this sophisticated testing system to make our database more solid, but to our shock that wasn’t the biggest effect it had. The biggest effect was that it gave our tiny engineering team the productivity of a team 50x its size.

Which chimes with experience. Now, obviously that’s a chunky investment in a particular class of testing, but it leads to an interesting thought experiment: WHAT IF we could all get to this point? How would you go about doing it?
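To make the idea concrete, here’s a toy sketch of the pattern: several “nodes” exchanging messages inside one single-threaded event loop, with every random choice drawn from a single seeded RNG, so any failing run can be replayed exactly. It’s an illustration of the technique, not the actual harness described in the post:

```python
import heapq
import random

# Toy sketch of deterministic simulation testing: a whole "cluster" of nodes
# runs inside one single-threaded event loop, and every random choice (message
# delay, packet loss) comes from one seeded RNG. Re-running with the same seed
# replays exactly the same sequence of events, so any failure is reproducible.

class Node:
    def __init__(self, name):
        self.name = name
        self.received = []

    def on_message(self, msg, sim):
        self.received.append(msg)
        if msg == "ping":                      # respond, generating more traffic
            sim.send("pong", to=sim.nodes[0])

class Simulation:
    def __init__(self, seed):
        self.rng = random.Random(seed)         # the ONLY source of randomness
        self.clock = 0.0
        self.events = []                       # heap of (time, seq, node, msg)
        self.seq = 0
        self.nodes = [Node("a"), Node("b"), Node("c")]

    def send(self, msg, to):
        if self.rng.random() < 0.1:            # simulated packet loss
            return
        delay = self.rng.uniform(0.01, 0.5)    # simulated network latency
        self.seq += 1
        heapq.heappush(self.events, (self.clock + delay, self.seq, to, msg))

    def run(self, initial_messages):
        for to, msg in initial_messages:
            self.send(msg, to)
        while self.events:                     # single-threaded event loop
            self.clock, _, node, msg = heapq.heappop(self.events)
            node.on_message(msg, self)
        return [(n.name, n.received) for n in self.nodes]

# Same seed -> identical event order; a failing case can be replayed exactly.
sim = Simulation(seed=42)
print(sim.run([(sim.nodes[1], "ping"), (sim.nodes[2], "ping")]))
```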

Rule of X, Bessemer Venture Partners

I’m a couple of months behind on this one (it was published in December), but an oft-cited heuristic is the Rule of 40 (growth rate + free cash flow [FCF] margin should be > 40%). This of course gives equal weighting to free cash flow and growth, which Bessemer find isn’t backed by the data: a business with 30% growth and 15% FCF margins should be valued more highly than a business with 15% growth and 30% FCF margins. Bessemer propose the Rule of X (growth rate × a growth multiplier + FCF margin) to accommodate this. It’s fundamental that re-investing retained earnings can have a compounding impact on value, but this is a nudge to decision making on managing growth vs burn - a timely reminder that we shouldn’t starve growth for the sake of FCF. There’s a good/better/best benchmark at the bottom of the article (spoiler: best is ~25% × 2 + ~20% = 70%+).
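To see why the weighting matters, here’s a quick back-of-the-envelope comparison of the two heuristics for the example businesses above, using the roughly 2x growth multiplier Bessemer suggest:

```python
# Back-of-the-envelope comparison of the Rule of 40 vs the Rule of X for the
# two example businesses above. The 2x growth multiplier is the rough figure
# Bessemer suggest; tune to taste.

GROWTH_MULTIPLIER = 2.0

def rule_of_40(growth, fcf_margin):
    return growth + fcf_margin

def rule_of_x(growth, fcf_margin, multiplier=GROWTH_MULTIPLIER):
    return growth * multiplier + fcf_margin

for name, growth, fcf in [("growth-tilted", 0.30, 0.15),
                          ("fcf-tilted",    0.15, 0.30)]:
    print(f"{name}: Rule of 40 = {rule_of_40(growth, fcf):.0%}, "
          f"Rule of X = {rule_of_x(growth, fcf):.0%}")

# Both businesses score 45% on the Rule of 40, but the Rule of X separates
# them: 75% for 30% growth / 15% FCF vs 60% for 15% growth / 30% FCF.
```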