Things I'd like to learn in 2024
Jan 17, 2024
I guess, like everyone in 2023, I’ve thought a lot about LLMs, LMMs and all the rest of it. As an interested bystander and casual observer, I thought I’d stake out three things that I’m curious to learn more about during the course of 2024 as I try to get that bit closer to the edge. If you have similar thoughts, can correct the gaps in my reasoning, or are further along the curve and can signpost me to some good reading on these themes, I’d love to hear from you.
What protocol will large models use to interact?
It was fun to see in 2023 that under the hood ChatGPT is just prompting DALL·E 3 in natural language (and SHOUTING while doing it, of course). It also shows how far we have to go: surely there is a denser, more precise way for large models to communicate. If open source does win the model battle over the coming years, a federated world where lots of models trained (or just fine-tuned) on slightly different datasets interact with each other seems like a likely outcome. How, then, will they communicate? Two things come to mind here: (1) if the protocol for the exchange of information itself allowed for learning, a bit like how ants use pheromones to convey information about trails or spread alarm, that would probably be quite a useful property for the network of models to have; (2) making the communication between large models open and inspectable feels like an important choke point and control (sigh, like a blockchain). To me this feels like a nice complement to the focus on aligning the models themselves: if we live in a federated world, aligning the network will matter more than aligning the model. Or we’ll probably just stick with json. This line of thought is not new - it has a long history - so that’s something I’d like to learn more about in ‘24.
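To make the shape of the question a bit more concrete, here is a toy sketch of what an open, inspectable message envelope between two models might look like. Everything in it is invented for illustration - it isn’t a real protocol, and none of the field names come from anywhere - but it tries to capture the two properties above: a trail and feedback field the network could learn from, and a payload that stays readable by anyone who wants to look.

    import json
    import uuid
    from dataclasses import dataclass, field, asdict

    @dataclass
    class ModelMessage:
        # All field names are hypothetical; this is not an existing protocol.
        sender: str        # identifier of the producing model
        recipient: str     # identifier of the consuming model
        intent: str        # e.g. "generate_image", "verify_claim"
        payload: dict      # the actual request, kept human-readable
        trail: list = field(default_factory=list)  # hop-by-hop trace, the "pheromone" idea
        feedback: float = 0.0                       # how useful the last exchange on this trail was
        message_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    msg = ModelMessage(
        sender="chat-model-a",
        recipient="image-model-b",
        intent="generate_image",
        payload={"prompt": "a container ship at dusk", "style": "photo"},
        trail=["chat-model-a"],
        feedback=0.8,
    )

    # Because it is just structured text, the exchange stays open and inspectable;
    # today's honest answer really is "we'll probably just stick with json".
    print(json.dumps(asdict(msg), indent=2))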
Why are we regulating the models when we could be regulating the datasets?
One way to look at LLMs is that they are a compressed form of the data they have been trained on - the process of training the network compiles the dataset into the binary. As a quick recap, the performance of neural language models is governed by scaling laws relating the number of model parameters, the size of the training dataset, the number of training steps and the amount of compute used for training. Larger models trained on more data with more compute outperform smaller models, and the relationship follows something a bit like a power law. To get to AGI, we’re going to need a lot of data (and a bigger boat). A regulatory framework that followed the scaling laws would make sense, and the training data feels (to me) like the obvious bit to regulate.
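As a rough illustration of the shape of those scaling laws, the parametric fit from the Chinchilla paper (Hoffmann et al., 2022) models loss as a function of parameter count N and training tokens D. The constants below are the published fitted values, quoted approximately from memory, and the snippet is only meant to show the diminishing-returns curve, not to be a serious calculator.

    # L(N, D) ~= E + A / N**alpha + B / D**beta   (Chinchilla parametric form)
    # Constants are the approximate fitted values reported by Hoffmann et al. (2022).

    def predicted_loss(n_params: float, n_tokens: float,
                       E: float = 1.69, A: float = 406.4, B: float = 410.7,
                       alpha: float = 0.34, beta: float = 0.28) -> float:
        """Predicted training loss for a model with n_params parameters
        trained on n_tokens tokens."""
        return E + A / n_params ** alpha + B / n_tokens ** beta

    # More parameters or more data keeps helping, but with diminishing returns:
    print(predicted_loss(70e9, 1.4e12))    # roughly Chinchilla-scale
    print(predicted_loss(70e9, 2.8e12))    # same model, twice the data
    print(predicted_loss(140e9, 1.4e12))   # twice the model, same data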
So I wondered if the recent EU regs got things the wrong way around: rather than regulating the type of data used for training or requiring disclosure of copyrighted data only, force the description of the entire training dataset to be made public. A bit like referencing in academia. It feels like this would open up the datasets to the right level of scrutiny, which in turn would contribute towards the other aims set out in the regulations we’ve seen so far - a bit like how financial regulation often focuses on price transparency to avoid discriminatory practices. One follow-on from this is that you can imagine a secondary market springing up in the datasets themselves (this is so obvious that I’m sure it exists already and I just don’t know about it), and this would be a fair way to compensate copyright holders (and gig-economy dataset labellers) for their input (perhaps an analogue here is the difference between Napster and Spotify for rights holders). So this is something I’m interested in learning about in ‘24. Not so much the regulatory frameworks themselves (yuk!) but thinking forwards about the new markets that will be created as things scale up, and how those markets will evolve.
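To make the “referencing in academia” idea concrete, here is a toy sketch of what a public training-dataset manifest could contain. The field names and figures are entirely made up; nothing here comes from the EU text, any real model card, or any real dataset.

    import json

    # Entirely hypothetical manifest; every field and number is invented.
    training_data_manifest = {
        "model": "example-model-v1",
        "total_training_tokens": 1_400_000_000_000,
        "sources": [
            {"name": "web crawl snapshot",      "tokens": 9.0e11, "licence": "mixed / scraped"},
            {"name": "licensed news archive",   "tokens": 2.0e11, "licence": "commercial licence"},
            {"name": "public-domain books",     "tokens": 1.5e11, "licence": "public domain"},
            {"name": "human-labelled examples", "tokens": 5.0e10, "licence": "contracted labelling"},
        ],
        "copyrighted_material_included": True,
        "labelling_workforce_disclosed": True,
    }

    print(json.dumps(training_data_manifest, indent=2))

A manifest like this is also the obvious unit of account for the secondary market mentioned above: if every source has to be listed, every source can be priced.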
Will working with the parameters of a large model ever feel like programming?
I really like this quote from Andrej Karpathy’s essay Software 2.0:
Software 2.0 is written in much more abstract, human unfriendly language, such as the weights of a neural network. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard (I tried).
The casual “I tried” in brackets at the end there is what gets me. I mean, being able to clearly express how a model works, and then partner in developing it further, would be an incredibly powerful mix of Software 1.0 and 2.0. There’s a rich seam of research around interpretability, but (based on my limited knowledge) the scale and complexity we’re talking about here make it unlikely that this is a realistic outcome. At the same time, it feels like the API and the set of operations done over the data could soon stabilise (or at least slow down), so using an imperative-style language like Python for this work - hiding the algorithms and statistical methods we want to apply behind a bunch of Python APIs (and the Python ecosystem) AND getting the user to adjust the flags for different compute architectures - feels like it will get old quickly. What I wonder is whether a SQL-like dialect with a narrower, more declarative style, one that gave you a super fast and meaningful feedback loop on the results of your actions, might give us both a usability and a performance uptick from the abstraction. Imagine being able to adjust GPT-4 on the fly using a DSL - that would be super cool, and much more expressive than the kind of imprecise fudge we have today with system prompts and all the rest of it (a rough sketch of what I mean is below). This is a speculative thought from an amateur, but one I’ll be looking to learn more about in ‘24.
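For what it’s worth, here is the kind of thing I imagine - a purely invented sketch, not an existing library or syntax. The SQL-ish statement, the ModelSession class and its execute method don’t exist anywhere; they’re just a way of showing how a narrow, declarative dialect plus a fast, inspectable feedback loop might feel to use.

    # Hypothetical declarative adjustment of a trained model.
    # Nothing here is real: the dialect, the class and the method are invented.

    class ModelSession:
        """Imagined wrapper that parses a SQL-ish statement, applies it to a
        checkpoint, and reports back which parameters moved and by how much."""

        def __init__(self, checkpoint: str):
            self.checkpoint = checkpoint

        def execute(self, statement: str) -> dict:
            # A real implementation would locate the relevant weights and edit
            # them; here we just echo a stub report to show the feedback loop.
            return {"statement": statement, "parameters_touched": 0, "loss_delta": 0.0}

    session = ModelSession("some-large-model")   # stand-in name, not a real endpoint
    report = session.execute(
        "UPDATE model SET verbosity = 'low' WHERE task = 'summarisation';"
    )
    print(report)

The interesting part isn’t the syntax so much as the contract: a narrow set of verbs, a declarative scoping of which behaviour you’re touching, and an immediate, inspectable report of what actually changed.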