March '24 Roundup

March was the month we got Grok, OpenAI confirmed their strategy, and we no longer needed to run on vibes alone as GPT-4 was displaced at the top of the leaderboards. An experiment was also kicked off to learn about the pricing power of the major LLM providers.

One of the things I most enjoyed this month was the explosion of interest in LLM agents with the launch of Devin, the AI software engineer. So this month I’ve pulled out four papers which expand on agent-based workflows and show how things might evolve over the next little while.

Design2Code: How Far Are We From Automating Front-End Engineering?

Let’s start by looking at the limits of LLMs in a zero-shot (i.e. single-prompt) context. Based on this paper I would say we’re quite a long way from automating front-end engineering (Betteridge’s law applies again). The paper outlines the results of getting a bunch of today’s top models to generate HTML/CSS (note: no JS) that imitates websites, given an input screenshot of the site and a small prompt (actually two prompts, as there’s a review step as well). The researchers evaluated the outputs and, reportedly, the GPT-4V-generated output in particular was preferred more than 50% of the time. I would say this is still a long way from automating the work of front-end devs, but I can see the power for some basic site-building use cases. I suspect an agent-based approach would show a lot more promise here - what makes a big difference to the end result is how you zoom in on the individual components of the page and iteratively refine. More on this shortly.
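
For a feel for the mechanics, here’s a rough sketch of that two-prompt flow - generate, then self-review. To be clear, this is my own reconstruction rather than the paper’s code, and call_vlm is a placeholder you’d wire up to whichever vision-capable model you’re using.

```python
# My own sketch of the screenshot -> HTML/CSS two-prompt flow, not the paper's implementation.
import base64
from pathlib import Path

GENERATE_PROMPT = (
    "Here is a screenshot of a webpage. Reproduce it as a single HTML file "
    "with embedded CSS only - no JavaScript. Use placeholder.jpg for any images."
)
REVIEW_PROMPT = (
    "Here is the original screenshot and the HTML you produced. Compare them "
    "and return a revised HTML file that matches the screenshot more closely."
)


def call_vlm(prompt: str, image_b64: str, extra_context: str = "") -> str:
    """Placeholder for a vision-capable LLM call (e.g. GPT-4V) - wire up your own client here."""
    raise NotImplementedError


def screenshot_to_html(screenshot_path: str) -> str:
    image_b64 = base64.b64encode(Path(screenshot_path).read_bytes()).decode()
    draft = call_vlm(GENERATE_PROMPT, image_b64)                    # prompt 1: generate
    return call_vlm(REVIEW_PROMPT, image_b64, extra_context=draft)  # prompt 2: review and refine
```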

A slight aside - I had a lot of fun playing around with the prompts from the paper, mostly because the no-JS constraint made me try to get some animations working in pure CSS, which I hadn’t tried before. A nice example of the type of fun side quest that LLMs enable. I find working with LLMs helps me get into that optimal fast-feedback-with-learning, rewarding, game-like state pretty quickly, and this is one of the joys I get from using them - so I’d like more tools that feel less intrusive and let me drop into that cycle quickly.

MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution

The MAGIS paper describes a multi-agent setup which gets similar performance to Devin on SWE-bench tasks (it resolves about 14% of the provided GitHub issues). It’s a multi-agent model in that there are different roles being played here - that of manager (responsible for forming the plan and farming the tasks out to the workers), repo custodian (finds the right files in the repo), developer (makes the changes) and QA (performs code review on the developer’s changes).
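
As a rough sketch (my own framing, not the authors’ code), the role split looks something like this - one LLM callable wearing four different hats:

```python
# Loose sketch of a MAGIS-style role split; the prompts and the review loop are my own guesses.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    role: str
    system_prompt: str


MANAGER = Agent("manager", "Read the GitHub issue, form a plan and split it into tasks for the workers.")
CUSTODIAN = Agent("repo custodian", "Given a task, locate the relevant files in the repository.")
DEVELOPER = Agent("developer", "Apply the requested change to the given files and return a diff.")
QA = Agent("qa", "Review the developer's diff against the plan and either approve it or list problems.")


def resolve_issue(issue: str, llm: Callable[[Agent, str], str]) -> str:
    plan = llm(MANAGER, issue)                 # manager: plan + task breakdown
    files = llm(CUSTODIAN, plan)               # custodian: find the right files
    diff = llm(DEVELOPER, f"{plan}\n{files}")  # developer: make the changes
    review = llm(QA, diff)                     # QA: code review of the changes
    if "approve" in review.lower():
        return diff
    return llm(DEVELOPER, f"{diff}\nReviewer feedback:\n{review}")  # one revision pass on rejection
```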

I do wonder if one of the things we’ll get from all this is a better understanding of our craft - for example, the agent’s struggles are positively correlated with the number of files and functions touched, but the number of lines of code changed has less of an impact. Intuitive perhaps, but if we can start to better define what makes a changeset complex, I think that would be useful feedback when developing. I also think a better understanding of change complexity will enable us to apply agents to the right problems earlier as we adopt these tools as part of our flow.

The limits of the technology today are working across 2 files and 190 lines in the changeset (the maximum size of a successful issue resolution that the agent was able to perform), and the paper tries to expand on what it is about the harder problems that makes them difficult for the agent framework to solve. It’s not spelt out in the paper, but I suspect the bottleneck on raising these limits is the planning step, as this needs a bit more reasoning power. If I grokked it correctly, the framework works in two phases: plan -> deliver. This doesn’t really reflect how software gets crafted, so I wonder if, with a better feedback loop in the planning step (e.g. getting multiple agents to debate the plan, using a prompting technique like self-reflection - or hey, just allowing spikes), you’d get better results.
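
To sketch what I mean by a feedback loop in the planning step - a simple self-reflection cycle where the plan gets critiqued and revised before anything is delivered. The prompts and stopping rule here are entirely my own assumptions, not how MAGIS actually works:

```python
# Hypothetical reflection loop around the planning phase; illustrative only.
from typing import Callable


def plan_with_reflection(issue: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    plan = llm(f"Draft a step-by-step plan to resolve this issue:\n{issue}")
    for _ in range(max_rounds):
        critique = llm(
            "Critique this plan for the issue below. Reply DONE if it needs no changes.\n"
            f"Issue:\n{issue}\nPlan:\n{plan}"
        )
        if critique.strip().upper().startswith("DONE"):
            break  # the plan survived review - hand it to the delivery phase
        plan = llm(f"Revise the plan using this critique.\nPlan:\n{plan}\nCritique:\n{critique}")
    return plan
```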

More Agents Is All You Need

Let’s look at agent-based models of reasoning in a bit more detail to build on this point. This paper sets out to explore whether, if we just brute-force a problem by adding more and more agents to attempt a solution, we get better results. We’re looking for a general phenomenon here - a scaling property of the “raw” application of additional agents that gives us a better chance of solving any problem. The paper explores this question by tackling reasoning problems using a majority-voting approach. First, a prompt is fed repeatedly into a single LLM (or a multi-agent collaboration framework) until we get a sample set of responses (the paper also explores using different LLMs in an ensemble style at this point). We then remove the dupes and use majority voting to select a winner.
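
The core loop is small enough to sketch - sample the same prompt many times, then let the most common answer win. A minimal version, assuming short answers that can be compared after a bit of normalisation:

```python
# Minimal sample-and-vote sketch; real implementations need smarter answer matching.
from collections import Counter
from typing import Callable


def sample_and_vote(prompt: str, llm: Callable[[str], str], n_samples: int = 20) -> str:
    answers = [llm(prompt) for _ in range(n_samples)]      # query the LLM (or ensemble) repeatedly
    normalised = [a.strip().lower() for a in answers]      # collapse trivially duplicated answers
    winner, _count = Counter(normalised).most_common(1)[0] # majority vote picks the final answer
    return winner
```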

The paper finds that it does work, at least across three domains: general reasoning (5-11% improvement), coding (4-9%) and arithmetic reasoning (12-24%). Also, smaller LLMs can outperform larger ones just by scaling up the ensemble size (the Llama2-13B model achieves 59% accuracy on the GSM8K dataset, outperforming the Llama2-70B model, which scores 54%). What’s nice is that the paper then keeps expanding this analysis, breaking apart the different reasoning problems and isolating what makes each reasoning task difficult, to try and find where the performance gain is coming from. It looks at a few different factors - inherent difficulty (where we see constant gains until we hit a ceiling in the reasoning ability of LLMs), number of steps (where gains increase with the number of steps [the limit in the paper is 8 steps]) and problem decomposition (using hierarchical voting to break harder reasoning problems into simpler ones that the LLM has a strong prior probability of solving, applying sample-and-vote all the way down).
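
The hierarchical-voting idea can be sketched the same way - decompose, then sample-and-vote at every level. The decomposition prompt below is my own guess at how that might be set up, reusing sample_and_vote from above:

```python
# Illustrative decomposition + hierarchical voting, building on sample_and_vote above.
def hierarchical_vote(problem: str, llm, n_samples: int = 20) -> str:
    sub_questions = llm(
        f"Break this problem into simpler sub-questions, one per line:\n{problem}"
    ).splitlines()
    context = ""
    for sub in filter(None, (s.strip() for s in sub_questions)):
        answer = sample_and_vote(f"{context}\nQuestion: {sub}", llm, n_samples)  # vote on each sub-problem
        context += f"\nQ: {sub}\nA: {answer}"
    return sample_and_vote(f"{context}\nNow answer the original problem: {problem}", llm, n_samples)
```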

So, if we put all this together, I’d say we can probably get low single-digit improvements on the Devin/MAGIS performance with the current generation of LLMs - at least for classes of problem which don’t hit the inherent-difficulty ceiling. Joining the two papers together, I’d say in a software engineering context that ceiling is probably the number of files/functions in the changeset (whereas the number of lines of code changed/number of steps needed is unlikely to impact the agent).

AIOS: LLM Agent Operating System

If we’re going to have all these agents knocking about, something probably needs to change at the OS layer. This paper describes an attempt to build an operating system “with soul”, basically putting an LLM at its heart and charging it with agent resource allocation, context switching, concurrency, access controls and the toolchain. The splash that Devin caused this month tells us that agentic workflows could be here to stay, and this paper outlines an architecture for an Agent OS. At a high level, there’s an LLM-specific kernel with a few components: a scheduler, a context manager (which supports snapshotting for pause/resume), a memory manager (handy for lazily managing the context window size required by each agent) and storage, tool and access managers. Agent developers then deal with the OS through an SDK, and all this heavy lifting is handed off to the LLM OS layer. There’s a level of indirection as well, as there’s a separate OS kernel for managing everything else, and the LLM kernel can only interact with hardware through that layer. Neat. Note that this is a development of an idea that Andrej Karpathy shared in November of last year.
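
To make the layering concrete, here’s a structural sketch of how I read the design - an LLM kernel sitting above the normal OS, with agents only touching it through an SDK. The class and method names are illustrative, not the project’s actual API:

```python
# Rough structural sketch of the AIOS layering as I understand it; names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class ContextManager:
    """Snapshots an agent's context so the kernel can pause and later resume it."""
    snapshots: dict = field(default_factory=dict)

    def snapshot(self, agent_id: str, context: str) -> None:
        self.snapshots[agent_id] = context

    def resume(self, agent_id: str) -> str:
        return self.snapshots.pop(agent_id, "")


@dataclass
class LLMKernel:
    """LLM-specific kernel: schedules agent requests and owns context/memory/tool/access managers."""
    context_manager: ContextManager = field(default_factory=ContextManager)
    run_queue: list = field(default_factory=list)

    def submit(self, agent_id: str, request: str) -> None:
        self.run_queue.append((agent_id, request))  # the scheduler decides which agent runs next


# Agent developers talk to this kernel through an SDK rather than calling the LLM directly,
# and the LLM kernel itself goes through the ordinary OS kernel for any hardware access.
kernel = LLMKernel()
kernel.submit("research-agent", "summarise the open issues in the backlog")
```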

Stealing Part of a Production Language Model

A bonus paper on AI security to finish. DeepMind set out how to reverse-engineer the hidden dimensionality and final output projection matrix of a language model. The hidden dimensionality gives you the size of each model layer, which tells you how much capacity for processing information each layer has. The final projection matrix describes how the model’s final hidden state is combined and transformed into a useful output for end users. The exploit works by generating a large number of random prompts and then analysing the statistical properties of the model’s responses (specifically, using singular value decomposition). The result is important: if you understand how the final projection works, then your chances of, say, a successful prompt injection or jailbreak may increase. It’s also the first time a paper has been published which describes how to steal a single layer from a production LLM.
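
The intuition is easy to demo on a toy model: the logits you see are the final hidden state multiplied by the projection matrix, so a matrix of logits collected over many prompts has rank roughly equal to the hidden dimension, and the singular values give it away. Here’s a simulated version - the real attack works against API logprobs and needs much more care around numerics:

```python
# Toy simulation of the rank/SVD idea behind the attack, faking the model with numpy.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab_size, n_queries = 64, 1000, 256

W_out = rng.normal(size=(hidden_dim, vocab_size))         # the secret final projection matrix
hidden_states = rng.normal(size=(n_queries, hidden_dim))  # stand-in for per-prompt final hidden states
logits = hidden_states @ W_out                             # what the API effectively exposes

singular_values = np.linalg.svd(logits, compute_uv=False)
estimated_dim = int((singular_values > 1e-6 * singular_values[0]).sum())
print(estimated_dim)  # ~64: the sharp drop in singular values reveals the hidden dimensionality
```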