January '24 Roundup

End of month one already, so it's time for a roundup. Here are a few different bits I found interesting in January:

Supervised fine-tuning (SFT), Niels Rogge

All the steps in going from a base model to a useful assistant using supervised fine-tuning. A slightly deeper end-to-end run-through of fine-tuning Mistral-7B, with the details coloured in - from hardware sizing to the different PEFT approaches (PEFT is parameter-efficient fine-tuning, e.g. LoRA, QLoRA). There's a handy Jupyter Notebook here if you want to follow along at home.
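
If you want a feel for what the PEFT step looks like in code, here's a minimal sketch of attaching a LoRA adapter to Mistral-7B with Hugging Face's peft library - the model name and hyperparameters are illustrative placeholders of my own, not lifted from Niels' notebook:

```python
# Minimal sketch: wrap Mistral-7B with a LoRA adapter using the peft library.
# Model name and hyperparameters are illustrative, not taken from the notebook.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA trains small low-rank adapter matrices instead of the full 7B parameters.
lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Only the small adapter matrices get trained, which is what makes fine-tuning a 7B model feasible on modest hardware - combine it with quantisation and you're roughly at the QLoRA approach covered in the post.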

TinyML, Dan Situnayake

Great Hacker News post where Dan Situnayake, one of the founders of the TinyML/Edge AI field, speaks with passion about how edge AI is exploding, then links out to a tonne of interesting resources to help others get involved. I love the enthusiasm and openness more than anything.

Sentence embeddings, Omar Sanseviero

More than you ever wanted to know about sentence embeddings. In short, sentence embeddings are a numerical way of capturing the meaning of sentences, which then lets you make quick comparisons between sentence pairs - do two sentences mean the same thing? This is a useful task: for example, you can use it to match user queries to FAQ question/response pairs efficiently, or to conduct semantic search. If you wanted a comprehensive survey of the field, this is it.
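
As a rough sketch of that FAQ-matching example (the model name and sentences below are placeholders of my own, not from Omar's post), the sentence-transformers library makes the whole pattern a few lines of Python:

```python
# Rough sketch of sentence-pair comparison with sentence-transformers.
# Model name and example sentences are placeholders, not from the post.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

faq_questions = [
    "How do I reset my password?",
    "What is your refund policy?",
]
user_query = "I forgot my password, how can I change it?"

# Encode everything into fixed-length vectors that capture sentence meaning.
faq_embeddings = model.encode(faq_questions, convert_to_tensor=True)
query_embedding = model.encode(user_query, convert_to_tensor=True)

# Cosine similarity scores each FAQ question against the query; pick the best match.
scores = util.cos_sim(query_embedding, faq_embeddings)[0]
best = scores.argmax().item()
print(faq_questions[best], scores[best].item())
```

Each sentence becomes a fixed-length vector, so once the FAQ is encoded every new query only costs one encoding plus some cheap cosine similarities.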

Mixtral of Experts

A deep dive on the Mixtral 8x7B model and an introduction to the Sparse Mixture of Experts (SMoE) architecture from the Mistral AI team. The SMoE architecture adds a routing layer which decides which pair of "experts" needs to be active for each token prediction in a sequence - think of the experts as individual heads on a multi-headed hydra. Because the routing layer sends each prediction to the right pair, fewer experts need to be active at the same time, which makes the model more compute efficient while also generating better results. You get another advantage as well - you can distribute the experts across GPUs, giving you a boost in parallelism, as each expert can crank through its own queue of tokens (there's some neat load balancing that has to happen here too, to stop any one GPU from being overwhelmed). The downside is that you pay the cost of the routing layer itself, and the memory overheads are higher, as each expert has its own set of parameters and state which needs to be held in memory. If you want to know more, there's a reading list here which gathers the key papers in the evolution of this approach.
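
To make the routing idea concrete, here's a toy top-2 mixture-of-experts layer in PyTorch - the dimensions, expert count and structure are illustrative, not the actual Mixtral implementation:

```python
# Toy sketch of top-2 routing in a Sparse Mixture of Experts layer.
# Sizes and expert count are illustrative; this is not the Mixtral code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # the routing layer
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        # Score every expert for every token, then keep only the top two per token.
        logits = self.router(x)
        weights, chosen = logits.topk(2, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        # Only the chosen pair of experts runs for each token; the rest stay idle.
        for i, expert in enumerate(self.experts):
            for slot in range(2):
                mask = chosen[:, slot] == i
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)           # ten token embeddings
print(TopTwoMoELayer()(tokens).shape)  # torch.Size([10, 64])
```

The key bit is the forward pass: every token is scored against all eight experts, but only the two winners do any work, so the number of active parameters per token is a fraction of the total.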

Venture Capital is Dead, Long live Venture Capital

The one that got passed around by Sami at Hofy this month, with that lovely-sounding word: ZIRP (zero interest rate policy). A backwards-looking description of what's just happened which I think captures the consensus well. I'd love to see more forward thinking about what happens next for both venture capital and the startup ecosystem more generally as we adapt to what (to me) feels like a new, long-term reality.

Decision Making at Top VCs, Fred Destin

“Thesis usually beats products. Teams with strong product chops will iterate their way to success around a strong thesis. There is a tendency in venture firms to over-index on early product and roadmap versus considering how much of a learning organisation we are dealing with.” This chimes with my experience: I was told to read The Switch on joining Bulb, and the vision of that book was central to a lot of product decisions over the following years. Execution is a lot faster when there’s a strongly defined, shared vision for the product. As Richard Hamming puts it: “It is well known the drunken sailor who staggers to the left or right with n independent random steps will, on the average, end up about sqrt(n) steps from the origin. But if there is a kebab shop in one direction, then his steps will tend to go in that direction and he will go a distance proportional to n.” A couple of killer insights from Fred in this one, looking forward to part two.