OpenAI Email Archives (from Musk v. Altman)

There’s been a tranche of emails released as part of the Musk v. Altman lawsuit around OpenAI, and they make for some interesting reading.

One of the big things that jumps out is how much focus there is on crafting the narrative and mission for OpenAI.

They’re obsessed with getting the best talent (cheaply, it seems), using the mission as the motivator:

Sam Altman to Elon Musk - Jun 24, 2015

The mission would be to create the first general AI and use it for individual empowerment—ie, the …

... [... 935 words]

The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers

TL;DR: A study of ~5,000 engineers across Microsoft, Accenture, and a Fortune 100 company finds GitHub Copilot boosts weekly PRs by 26.08% (SE: 10.2%) - but the effect varies widely, with a 95% confidence interval from 5.88% to 46.28%. Adoption patterns show junior and newly tenured engineers are more likely to use Copilot (adoption rates up to 9.5% higher). 30-40% of engineers didn’t use it at all.
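
Those bounds are just the point estimate plus or minus roughly two standard errors. Here’s a quick sketch of the normal-approximation arithmetic, if you want to sanity-check it (the paper’s quoted interval differs in the decimals, presumably down to rounding in the reported figures):

```python
# 95% confidence interval via the normal approximation: estimate ± 1.96 × SE.
# With the rounded figures above this gives [6.09, 46.07]; the paper quotes
# [5.88, 46.28], the small gap presumably due to rounding in the reported SE.
estimate, se = 26.08, 10.2   # % uplift in weekly pull requests
z = 1.96                     # two-sided 95% critical value
lower, upper = estimate - z * se, estimate + z * se
print(f"95% CI: [{lower:.2f}%, {upper:.2f}%]")
```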

This is a paper I saw posted about a bit during the summer that looks at the productivity impact of GitHub …

... [... 934 words]

Quoting Google Big Sleep team

Pattern matching with LLMs used to find security vulns in the wild:

A key motivating factor for Naptime and now for Big Sleep has been the continued in-the-wild discovery of exploits for variants of previously found and patched vulnerabilities. As this trend continues, it’s clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach.

We also feel that this variant-analysis task is a better fit for current …

... [... 135 words]

Quoting Graham Paterson

Love a good data-driven product feedback loop; the Jitty folk have found a nice pattern with natural language search:

Over the weekend we quietly released a highly requested feature on Jitty: search by travel time 🚌🚶🚗🚴‍♀️🚂

We’ve partnered with the good people of the aptly named TravelTime to let homebuyers search by time rather than just distance.

Since we launched natural language search, we can see what people search for. Loads of people were searching for “15 minutes cycle to …

... [... 94 words]

AI-Assisted Assessment of Coding Practices in Modern Code Review

Nice paper on AI-assisted code review at Google. Three call-outs that I thought were interesting (as I imagine that we’re about to be hit by a tidal wave of commercial applications of this idea):

(1) One of the issues is that the required training dataset varies by best practice - the currency of knowledge really matters. So, for example, the underlying model was trained on data prior to ’22, but the canonical source of Python type definitions has shifted a fair bit from Python …
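
To make that drift concrete (my illustration, not the paper’s): the idiomatic way to write the very same annotation has changed across recent Python releases, so a model trained on pre-’22 code will happily recommend the older style.

```python
# The same type hint, a couple of Python releases apart - the kind of
# "currency of knowledge" drift the paper is worried about.
from typing import Dict, List, Optional  # pre-3.9 canon: import from typing

def tally_old(items: List[str]) -> Optional[Dict[str, int]]:
    ...

# 3.9+ (PEP 585) allows built-in generics; 3.10+ (PEP 604) allows | unions.
def tally_new(items: list[str]) -> dict[str, int] | None:
    ...
```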

... [... 499 words]

AI, Ad Dollars

I liked Ethan Mollick’s post on ad dollars earlier this week; here it is if you missed it:

No one has figured out how you integrate advertising with LLM replies. If it is contextual ads around the LLM, then a good LLM answer should provide more guidance to the product you want than ads, making the ads useless. If ads are integrated into the prompt, with the instructions that the advertiser be recommended, that will lead to inaccurate, bad answers. This is sort of a big deal, given that …

... [... 1052 words]

AGI Predictions

I really enjoyed the nonint post on timelines to AGI. Obviously James Betker is better placed than me to make an informed prediction, and he has inside information (he works for OpenAI), but there are a couple of things that jump out at me if I read this prediction critically.

Firstly, given that transformers are great general approximators of behaviour, it’s very difficult to falsify any predictions about AGI without having a very specific and testable definition of what AGI is that …

... [... 501 words]

dspy unpacked: continuous prompt optimisation

Omar Khattab, Chris Potts, Matei Zaharia | 2023 | Paper | GitHub | Docs

A lot of work with LLMs today involves working through a loop where you break a problem into steps, write a prompt for each step, then put the whole thing together by adjusting each prompt to feed into the next one.

dspy simplifies this process. It gives you a framework to structure your pipeline - forcing you to architect the application so your program flow is split from the variable stuff - the prompts and model weights that …
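
A minimal sketch of that split using dspy’s signature strings (the module and field names here are mine, and the compiler/optimiser step that actually tunes the prompts is left out):

```python
import dspy

# Point dspy at a model once; the program below is model-agnostic.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

class SummariseThenAnswer(dspy.Module):
    """Fixed program flow; the prompts behind each step are dspy's to optimise."""

    def __init__(self):
        super().__init__()
        # Declarative signatures instead of hand-written prompts.
        self.summarise = dspy.ChainOfThought("document -> summary")
        self.answer = dspy.ChainOfThought("summary, question -> answer")

    def forward(self, document, question):
        summary = self.summarise(document=document).summary
        return self.answer(summary=summary, question=question)

qa = SummariseThenAnswer()
```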

... [... 990 words]

How many customer interviews are enough?

Counts of customer interviews seem to have become a bit of a vanity metric of late. A shorthand for product or decision quality, as if one automatically implies the other.

I appreciate your sacrifice at the temple of customer research, but I worry that you may have wasted your time.

Working out the right number of interviews, wireframe tests or customers in the alpha phase of your project is quite similar to an optimal stopping problem. You’re trying to work out how much learning you …
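
One way to see the shape of the problem is a toy discovery model: if each interview independently surfaces any given insight with probability p, the returns diminish geometrically, and the question becomes when the marginal interview stops paying for itself. A sketch (p = 0.3 is illustrative, not a universal constant):

```python
# Expected share of discoverable insights found after n interviews, assuming
# each interview surfaces a given insight independently with probability p.
p = 0.3  # illustrative per-interview discovery rate
for n in range(1, 11):
    found = 1 - (1 - p) ** n
    print(f"{n:2d} interviews -> ~{found:.0%} of insights")
# The marginal gain of interview n+1 is p * (1 - p)**n - shrinking fast,
# which is why "more interviews" stops being worth it surprisingly early.
```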

... [... 328 words]

llm.c: The genius of Andrej Karpathy

What’s awesome about Andrej Karpathy’s llm.c isn’t just that it’s a bare-metal, from-scratch implementation of GPT-2 (safety wink definitely required!).

If you take a step back, you’ll see he’s also educating us on how one of the very best in the world hones their craft. He’s stripped away the intermediate layer of libraries - there’s no PyTorch here. Instead, we’re taken back to basics: an attempt to implement a simple C and CUDA version …

... [... 227 words]

March '24 Roundup

March was the month we got Grok, OpenAI confirmed their strategy, and we no longer needed to run on vibes alone as GPT-4 was displaced at the top of the leaderboards. An experiment was also kicked off to learn about the pricing power of the major LLM providers.

One of the things I most enjoyed this month was the explosion of interest in LLM agents with the launch of Devin, the AI software engineer. So this month I’ve pulled out four papers which expand on agent-based workflows and show how …

... [... 1456 words]

Hot takes on Devin, the AI software engineer

I thought Devin from Cognition looked super cool this week; the UX feels like a glimpse of a new era.

I wonder how deep the moat is though? 🤔

From staring a little too closely at the screenshots and videos I’ve seen so far, a hot take would be that most of the performance lift in the SWE benchmarks could come from a switch in prompting technique, i.e. the size of the performance lift in the benchmark looks similar to that of shifting from chain-of-thought to something …

... [... 296 words]

Grandmaster-Level Chess Without Search

Anian Ruoss, Grégoire Delétang, Sourabh Medapati, Jordi Grau-Moya, Li Kevin Wenliang, Elliot Catt, John Reid and Tim Genewein | Grandmaster Level Chess without Search | 2024 | Paper

Walter Isaacson | Elon Musk | 2023 | Book

Towards the end of Walter Isaacson’s biography of Elon Musk, there’s a description of a breakthrough with Tesla Autopilot:

For years, Tesla’s Autopilot system relied on a rules-based approach. It took visual data from a car’s cameras and identified such things as …

... [... 1982 words]

February '24 Roundup

February feels like it’s gone in a blur. Hofy had a brilliant company retreat in Peniche, Portugal. Sora looks insane. Google returned to open-source AI with the Gemma series while Mistral released a hosted, closed-source model. Here are a few other things that caught the eye:

Self-Discover, Google DeepMind

Can we improve LLM reasoning by adjusting the way in which we prompt? Google DeepMind demonstrate an uplift in performance of up to 32% that transfers across LLMs (GPT-4, GPT-3.5, …

... [... 1227 words]

LLMs as classifiers

Lefteris Loukas, Ilias Stogiannidis, Odysseas Diamantopoulos, Prodromos Malakasiotis, Stavros Vassos | 2023 | Paper

When I’ve heard folks talking about AI strategy recently, a common trope has been that things are moving so fast that it is better to hold off on product investments until the pace of change slows and the stack starts to stabilise. Instead, we should be focussing on the low-hanging fruit from the productivity lifts of using chat assistants or GPTs. Another common argument is to …
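
On the paper’s actual subject, here’s a minimal sketch of what “LLM as classifier” looks like in practice - the labels, model name and prompt are illustrative, not the paper’s setup:

```python
from openai import OpenAI

client = OpenAI()
LABELS = ["billing", "bug report", "feature request", "other"]  # illustrative

def classify(ticket: str) -> str:
    """Zero-shot classification: constrain the model to answer with one label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the support ticket into exactly one of: "
                        f"{', '.join(LABELS)}. Reply with the label only."},
            {"role": "user", "content": ticket},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify("I was charged twice this month"))  # -> billing
```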

... [... 2616 words]

January '24 Roundup

End of month one already, so it’s time for a roundup. Here are a few different bits I found interesting in January:

Supervised fine-tuning (SFT), Niels Rogge

All the steps in going from a base model to a useful assistant using supervised fine-tuning. A slightly deeper run-through of fine-tuning Mistral-7B end to end, with the details coloured in - from hardware sizing to a tour of the different PEFT approaches (PEFT is parameter-efficient fine-tuning, e.g. LoRA, QLoRA). There’s a …
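
For a flavour of the PEFT idea, here’s a minimal LoRA setup with the Hugging Face peft library - a sketch of the shape of it, not Niels’s exact recipe:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model, then wrap it so only small low-rank adapter
# matrices are trained while the 7B base weights stay frozen.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=16,                                 # rank of the low-rank updates
    lora_alpha=32,                        # scaling applied to the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```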

... [... 708 words]

Things I'd like to learn in 2024

I guess like everyone in 2023, I’ve thought a lot about LLMs, LMMs and all the rest of it. As an interested bystander and casual observer, I thought I’d stake out three things that I’m curious to learn more about during the course of 2024 as I try and get that bit closer to the edge. If you have similar thoughts, can correct the gaps in my reasoning or are further along the curve and can signpost me to some good reading on these themes, I’d love it.

What protocol will …

... [... 1011 words]

2023 In Review

2023 was an incredible year in our industry, so I thought I’d look back and share the things I’ve loved reading, watching, learning and doing this year.

Blogs

  • The GitHub one on Copilot, a slow and high-level reveal of how Copilot is put together. Also, Jaccard similarity ftw (quick sketch after this list)!
  • LLM Patterns, Eugene Yan’s summary post from back in the summer, which described a bunch of reference architectures for an emerging field for the first time.
  • How To Do Great Work, Paul Graham. The one you wish …
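
The Jaccard aside from the Copilot bullet, sketched: similarity between two snippets as the overlap of their token sets - cheap enough to run over lots of candidate context windows.

```python
# Jaccard similarity of two snippets over their token sets:
# |intersection| / |union| - 1.0 for identical sets, 0.0 for disjoint ones.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

print(jaccard("def parse json file", "def parse yaml file"))  # 0.6
```
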
... [... 470 words]

Modern code review: a case study at Google

Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, Alberto Bacchelli | 2018 | Paper

Benchmarks for Code Review

It’s handy as we talk about code review to use some benchmarks that anchor our expectations for review performance in data. The best that I know of are in a 2018 paper from Google: “Modern Code Review: A Case Study at Google”. I find these helpful when breaking down qualitative feedback about reviews, because if you can get the data, you can start to get a feel for …

... [... 1237 words]
