Good list, I’ve read a few of these but lots more to work through. The framing here is useful; though the list of what to read shifts pretty much every week, I think it’s a good guide to the areas to sample from. I would add What are embeddings, Yi Model Series and Yann LeCun’s talk on Objective-Driven AI.
... [... 60 words]I’ve long thought consistency is king - I think this applies in codebases of all sizes, not just those in the single-digit millions of lines that Sean describes. Here’s the summary, though the full article is worth a read:
Large codebases are worth working in because they usually pay your salary
By far the most important thing is consistency
Never start a feature without first researching prior art in the codebase
If you don’t follow existing patterns, you better have a very good reason for it …
... [... 133 words]DeepSeek-v3 dropped on Christmas Day (!): a gigantic mixture-of-experts model (671B total parameters) which sets new SOTA performance for open-source models. Why should I care? What does this even mean? Well, the big news here is the training efficiency.
Firstly, the total training cost was ~$5.5m (2.78m GPU hours, so roughly $2 per GPU hour). Now, this is the GPU cost of the training run only, not the total outlay (i.e. stuff like R&D and staffing costs are not included), but that’s still a big efficiency gain. By way of comparison, …
... [... 198 words]Everywhere seems to be full of hype around o3 since Friday’s announcement from OpenAI, so I thought I’d summarise a few points I’ve seen shared in various places but not yet gathered in one place. We’re going to zoom in mostly on the ARC-AGI results, as I think that is the most interesting part. Before we do that, let’s introduce the ARC challenge.
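Each ARC task is just a handful of input→output grid pairs plus a held-out test input; the solver has to infer the transformation. Here’s a made-up toy task in roughly the JSON shape ARC uses (real tasks are far harder than this):

```python
# A made-up toy task in roughly the JSON shape ARC uses: a few train
# input/output grid pairs plus a test input whose output is withheld.
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 5], [0, 7]], "output": [[5, 3], [7, 0]]},
    ],
    "test": [{"input": [[4, 0], [0, 9]]}],
}

# The hidden rule in this toy example is "mirror each row", so the
# expected test output would be:
expected = [[0, 4], [9, 0]]
```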
ARC (the Abstraction and Reasoning Corpus) was designed and created by François Chollet, author of both Deep Learning with Python and …
... [... 1040 words]Webdev Arena builds on the Chatbot Arena concept but provides a coding-specific benchmark that offers an extremely fast and cheap way for you to evaluate the vibes of the different models out there.
Given a prompt and two anonymised LLMs, the arena builds two React/TypeScript/Tailwind apps side by side for you to evaluate - serving them up in an E2B sandbox.
I suspect that as the frontier keeps moving it’s worth refining the prompt you use to test models (spend a bit of time making …
... [... 143 words]Some interesting ideas from Will on using generative AI to either manage the set of UI components shown to the user or generate the UI in raw pixels on the fly, as we’re starting to see in gaming (i.e. Genie 2). I think a pixel-based approach would be very complicated to do reliably, but an approach where a model dynamically generates the UI from a set of pre-defined components would be very interesting. Worth a read and a ponder about where we’re headed:
In place of a single …
... [... 156 words]Interesting paper from Meta that has been generating some buzz:
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented dynamically based on the entropy of the next byte, allocating more compute and model …
... [... 329 words]OpenAI | 2025 | Technical Report
Sarah Guo, Elad Gil, Aditya Ramesh, Tim Brooks, Bill Peebles | 2025 | Podcast
This post has been sat in my drafts for well over 6 months now, but with yesterday’s release of Sora in GA I thought I’d have a go at explaining how Sora might be working under the hood, and in particular a breakthrough that OpenAI made (and I assume competitors have now replicated) called Latent Space Time Patches.
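For the more technically curious, here’s a toy numpy sketch of the core patching idea as I understand it: a video (or its compressed latent) gets chopped into small space-time blocks, and each block becomes one token for the transformer. This is my own illustration, not OpenAI’s code, and all the shapes are made up.

```python
# Toy illustration of space-time patching: a video latent of shape
# (frames, height, width, channels) is split into fixed-size blocks across
# time and space, and each block is flattened into one token vector.
import numpy as np

T, H, W, C = 16, 32, 32, 4   # a small compressed video latent (made-up sizes)
pt, ph, pw = 2, 4, 4         # patch size in time, height and width

latent = np.random.randn(T, H, W, C)

# Carve the latent into a grid of patches, then flatten each patch.
patches = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)   # (t, h, w, pt, ph, pw, C)
tokens = patches.reshape(-1, pt * ph * pw * C)     # one row per space-time patch

print(tokens.shape)  # (512, 128): 512 tokens for the transformer to attend over
```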
I’ve tried to do this in simple, non-technical …
... [... 1750 words]I thought this was a brilliant, thought-provoking piece from Amelia Wattenberger on how zooming could apply to text in the LLM era. Worth it for the fish animations alone in my book (make sure to keep clicking as you scroll), but there’s a tonne of nice ideas here 👀
Insightful piece from Marc Brooker on Aurora DSQL, which was announced at AWS re:Invent this week. DSQL stands for “distributed SQL”. The idea is to get ACID semantics at gigantic scale with Postgres compatibility (psql works with Aurora DSQL as a backend):
We built a team to go do something audacious: build a new distributed database system, with SQL and ACID, global active-active, scalability both up and down (with independent scaling of compute, reads, writes, and storage), …
... [... 649 words]Decent quality semantic search has got much easier and cheaper to ship yourself in the last couple of years. I thought I’d try and write a super quick guide that gets a search backend up and running as quickly and cheaply as possible.
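To give a flavour of how little code this can involve, here’s a minimal sketch using sentence-transformers and plain cosine similarity. The model name and the Hugo-style content path are placeholders, and the guide itself goes further than this:

```python
# Minimal semantic search sketch: embed each post once, then embed the
# query and rank posts by cosine similarity.
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, free, runs on a CPU

# Hypothetical static-site layout; long posts get truncated by the model,
# so chunking is the obvious next step.
posts = {p.name: p.read_text() for p in Path("content/posts").glob("*.md")}
names = list(posts)
doc_vecs = model.encode([posts[n] for n in names], normalize_embeddings=True)

def search(query: str, k: int = 5):
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec              # cosine similarity (vectors are unit length)
    top = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in top]

print(search("posts about vector databases"))
```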
The guide assumes that you have a toy use case - you’re building as a hobbyist. The example I’ve chosen is writing search for a blog - specifically a blog built using a static site generator like Hugo, Jekyll, Gatsby etc (like this one!). To do …
... [... 1097 words]This S1 analysis from Meritech went viral due to the (compounding!) IPO ratchet that ServiceTitan are subject to after the Series H funding they took 18 months ago. About halfway down there are some handy benchmarks for median/top-decile pre-IPO performance in vertical SaaS; I’ve pocketed them for reference (maybe they’ll come in handy one day!), so I thought I’d reproduce them here:
| Performance by EV / ARR Percentile | Top Decile | Median | ServiceTitan |
| --- | --- | --- | --- |
| Financial Metrics … | | | |
... [... 219 words]End-to-end tutorial of function calling with Llama-3.2-3B-Instruct, building gradually from string templating, to using Jinja, to implementing web search with Brave.
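As a taste of the templating step (before any model calls), here’s a rough Jinja sketch. The tool schema, prompt wording and tool name are my own illustration, not the tutorial’s, and the real Llama 3.2 chat template with its special tokens lives in the model’s tokenizer config:

```python
# Sketch of the templating step only: build a prompt that advertises a
# hypothetical search tool, rendered with Jinja.
from jinja2 import Template

TOOLS = [
    {
        "name": "brave_search",  # illustrative tool name
        "description": "Search the web and return the top results.",
        "parameters": {"query": "string"},
    }
]

PROMPT = Template(
    "You are a helpful assistant with access to these tools:\n"
    "{% for tool in tools %}"
    "- {{ tool.name }}: {{ tool.description }} (parameters: {{ tool.parameters }})\n"
    "{% endfor %}"
    "\nIf a tool is needed, reply only with JSON like "
    '{"tool": "<name>", "arguments": {"query": "..."}}.\n\n'
    "User: {{ question }}"
)

prompt = PROMPT.render(tools=TOOLS, question="Who won the 2024 Tour de France?")
print(prompt)  # this string is what you'd feed to Llama-3.2-3B-Instruct
```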
The often forgotten first rule of ML is that you might be able to get a good enough result without ML:
Rule #1: Don’t be afraid to launch a product without machine learning.
Machine learning is cool, but it requires data. Theoretically, you can take data from a different problem and then tweak the model for a new product, but this will likely underperform basic heuristics. If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.
For …
... [... 171 words]This cheat sheet works as a handy 30-second introduction to the Python package/project management tool if you’ve not met it already. I’ve been teetering on the brink; I’m boring and have stuck with pip and pip-tools for a long time, but it is so incredibly fast that I am tempted to move over.
Brilliant throughout, with lots of small, golden nuggets. PR/FAQs like this are, for me, a bit wordy, but you can see how effectively the technique is used here to explain to a wide audience the impact of not just a groundbreaking new technical approach but also a major shift in AWS’s compute billing model:
When we launched Lambda, security was not negotiable – and we knew that there would be trade-offs. So, until Firecracker, we used single tenant EC2 instances. No two customers shared …
... [... 180 words]There’s been a tranche of emails released as part of the Musk vs Altman stuff around OpenAI and it makes for some interesting reading.
One of the big things that jumps out is how much focus there is on crafting the narrative and mission for OpenAI.
They’re obsessed with getting the best talent (cheaply it seems), using the mission as the motivator:
Sam Altman to Elon Musk - Jun 24, 2015
The mission would be to create the first general AI and use it for individual empowerment—ie, the …
... [... 935 words]TL;DR: A study of ~5,000 engineers across Microsoft, Accenture, and a Fortune 100 company finds GitHub Copilot boosts weekly PRs by 26.08% (SE: 10.2%) - but the effect varies widely, with a 95% confidence interval from 5.88% to 46.28%. Adoption patterns show junior and newly tenured engineers are more likely to use Copilot (up to 9.5% higher). 30-40% of engineers didn’t use it at all.
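That wide interval follows almost directly from the point estimate and standard error; a quick sanity check (small differences from the reported bounds are likely rounding or a slightly different critical value):

```python
# Check that the reported 95% CI lines up with estimate ± 1.96 × SE.
# Figures are taken from the TL;DR above.
estimate, se = 26.08, 10.2  # % increase in weekly PRs, and its standard error
lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
print(f"~95% CI: {lower:.2f}% to {upper:.2f}%")  # ≈ 6.09% to 46.07%, vs the reported 5.88% to 46.28%
```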
This is a paper I saw posted about a bit during the summer that looks at the productivity impact of GitHub …
... [... 934 words]Pattern matching with LLMs used to find security vulns in the wild:
A key motivating factor for Naptime and now for Big Sleep has been the continued in-the-wild discovery of exploits for variants of previously found and patched vulnerabilities. As this trend continues, it’s clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach.
We also feel that this variant-analysis task is a better fit for current …
... [... 135 words]Love a good data-driven product feedback loop; the Jitty folk have found a nice pattern with natural language search:
Over the weekend we quietly released a highly requested feature on Jitty: search by travel time 🚌🚶🚗🚴♀️🚂
We’ve partnered with the good people of the aptly named TravelTime to let homebuyers search by time rather than just distance.
Since we launched natural language search, we can see what people search for. Loads of people were searching for “15 minutes cycle to …
... [... 94 words]