Grandmaster-Level Chess Without Search

Anian Ruoss, Grégoire Delétang, Sourabh Medapati, Jordi Grau-Moya, Li Kevin Wenliang, Elliot Catt, John Reid and Tim Genewein | Grandmaster Level Chess without Search | 2024 | Paper

Walter Isaacson | Elon Musk | 2023 | Book

Towards the end of Walter Isaacson’s biography of Elon Musk, there’s a description of a breakthrough with Tesla Autopilot:

For years, Tesla’s Autopilot system relied on a rules-based approach. It took visual data from a car’s cameras and identified such things as lane markings, pedestrians, vehicles, and traffic signals in range of the eight cameras. Then the software applied a set of rules. … Tesla’s engineers manually wrote and updated hundreds of thousands of lines of C++ code to apply these rules to complex situations.

The neural network planner project that Shroff was working on would add a new layer. “Instead of determining the proper path of the car based only on rules,” Shroff says, “we determine the car’s proper path by also relying on a neural network that learns from millions of examples of what humans have done.”

This all leads to a moment where Musk tests the technology for himself for the first time, and to what the book describes as a parting-of-the-clouds realisation (I think this is unlikely, but let’s roll with it):

For twenty-five minutes, the car drove on fast roads and neighborhood streets, handling complex turns and avoiding cyclists, pedestrians, and pets. Musk never touched the wheel. At one point the car conducted a maneuver that he thought was better than he would have done. “Oh wow,” he said, “even my human neural network failed here, but the car did the right thing.”

Tesla was going to be not just a car company and not just a clean-energy company. With Full Self-Driving and the Optimus robot and the Dojo machine-learning supercomputer, it was going to be an artificial intelligence company.

I love this - I find it so exciting and it’s probably my favourite moment in the book. As a software engineer, this is the type of breakthrough you dream of: years of toil followed by a 10x level-up in performance that feels magical. When you read the story, though, you’re left wondering: how exactly did they do it?

There are a couple of clues in the book. First, it’s a labelled, high-quality dataset:

(T)he neural network planner project had analyzed 10 million frames of video collected from the cars of Tesla customers. Does that mean it would merely be as good as the average of human drivers? “No, because we only use data from humans when they handled a situation well,” Shroff explains. Human labelers, many of them based in Buffalo, New York, assessed the videos and gave them grades. Musk told them to look for things a five-star Uber driver would do.

The dataset is deep and wide:

Musk latched on to a key fact the team had discovered: the neural network did not work well until it had been trained on at least a million video clips, and it started getting really good after one-and-a-half million clips. This gave Tesla a huge advantage over other car and AI companies. It had a fleet of almost two million Teslas around the world collecting billions of video frames per day.

And Tesla have access to a lot of compute:

Musk’s goal for 2023 was to transition to using Dojo, the supercomputer that Tesla was building from the ground up, to use video data to train the AI system. With nearly eight exaflops (10^18 operations per second) of processing power, it has the chips to make it the world’s most powerful computer for that purpose.

So far, so familiar. Neural networks are great approximators of behaviour, and this is what makes them so exciting: show them enough examples and you can train them to generate a function that roughly replicates the behaviour. (Formally, neural networks can approximate any Borel measurable function. Think of a rule that helps a self-driving car predictably understand and react to different road signs and obstacles, so it makes decisions based on clear, predefined categories of what it encounters on the road.)
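To make that concrete, here’s a toy sketch in PyTorch (all sizes and names are mine, nothing from the book or paper): a small MLP learning to imitate a “behaviour” - here a noisy sine wave standing in for (observation, action) pairs - purely from examples.

```python
import torch
import torch.nn as nn

# Toy "behaviour" to imitate: a noisy sine wave standing in for
# (observation -> action) pairs collected from human drivers.
x = torch.linspace(-3, 3, 512).unsqueeze(1)
y = torch.sin(2 * x) + 0.05 * torch.randn_like(x)

# A small MLP: given enough examples (and capacity), it can approximate
# any reasonable (Borel measurable) function of its inputs.
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

print(f"final MSE: {nn.functional.mse_loss(model(x), y).item():.4f}")
```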

A recent paper, Grandmaster Level Chess without Search, makes a similar point: generic ingredients like high-quality data and deep transformer-type neural networks are enough to nearly match the performance of ad-hoc, non-generalisable techniques like the game-tree search algorithms that have classically been used to build state-of-the-art chess bots. Getting a model working this way means we’re not explicitly searching for the answer with a brute-force approach; we’re trying to pattern-match and reason strategically to get there (to be clear, we don’t know how the model works internally, but a search-based approach is unlikely).

The paper follows the current trend: (1) find a scalable architecture, (2) combine a human-labelled dataset with a model-labelled one, (3) scale it not only with params and GPUs but also with data coverage, and (4) distill/quantize.

In simple terms, what’s becoming clear is that the data is what really matters: if we have a good enough dataset and wide enough data coverage (i.e. high-quality, meaningful edge cases are covered), then we can get to that function pretty efficiently with a neural network and some chunky compute.

Let’s use the paper to step through each of these and dive into what they mean in a bit more detail.

Find a scalable architecture

The first step is all about finding or designing a neural network architecture that can efficiently scale. “Scalable” here means that as you increase the size of the dataset, the number of parameters, or the number of GPUs, the architecture continues to perform well without hitting significant bottlenecks in processing time or resource consumption.

In DeepMind’s Grandmaster Chess paper, this means adopting a transformer-based architecture (these are good at handling sequential information, as they understand order and the relationships between data points) and adapting the model to output log probabilities - the probability here being that of a chess move being a winning one (log probabilities are used to avoid floating-point issues from multiplying tiny numbers together). The model then searches for the most efficient function to generate these probabilities by performing a couple of different tasks in training: action-value prediction (assess the winning value of a move given a particular board state) and behavioural cloning (learn to play like an expert). It’s then ablated - the number of parameters is adjusted to see how model size impacts performance, like finding the efficient frontier - and this gives us our scalable architecture. Note that the paper also ablates the dataset size and finds that grandmaster-level performance only emerges at sufficient scale, similar to how Elon found better performance once the dataset size passed 1.5 million video clips.
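To give a feel for the shape of such a model, here’s a minimal sketch in PyTorch. It is not the paper’s actual architecture (their tokenisation, layer counts and head sizes differ); it just shows the general pattern of a transformer reading a tokenised board state and emitting log probabilities over discrete win-probability bins.

```python
import torch
import torch.nn as nn

class ActionValueModel(nn.Module):
    """Minimal sketch: tokenised (board state, move) in, log-probs over
    K win-probability bins out. All sizes are illustrative, not the paper's."""

    def __init__(self, vocab_size=128, seq_len=80, d_model=256,
                 n_heads=8, n_layers=8, n_bins=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(seq_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_bins)  # one logit per value bin

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        h = self.tok(tokens) + self.pos(torch.arange(tokens.size(1),
                                                     device=tokens.device))
        h = self.encoder(h)
        # Classify from the final position; log-softmax gives the log
        # probabilities the model is trained on (cross-entropy against
        # the engine-derived bin label).
        return torch.log_softmax(self.head(h[:, -1]), dim=-1)

model = ActionValueModel()
dummy = torch.randint(0, 128, (2, 80))  # stand-in for a tokenised FEN + move
print(model(dummy).shape)               # torch.Size([2, 128])
```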

Combine a human labelled dataset with a model labelled one

This step involves enhancing the dataset used to train ML models by combining data labelled by humans with data labelled by models (a technique known as semi-supervised learning). Human-labelled data is often accurate but expensive and time-consuming to produce. In contrast, model-labelled data can be generated more quickly and at a lower cost, although it might be less accurate. By using both, we can create a more comprehensive and varied dataset that improves model performance while mitigating the costs and time associated with data labelling.

The DeepMind team do this by grabbing 10 million chess games played in February 2023 from Lichess. Every board state (the position of pieces on the board at any given turn) is then extracted, and each state is encoded with a probability of winning using another, search-based chess engine: Stockfish 16. Stockfish also generates all possible legal moves from the board state and gives each one an action-value (winning value). Each probability is value-binned (dumped into a discrete class), which simplifies the prediction task, as we can focus on choosing a class rather than dealing with continuous probabilities. Stockfish is being used to enhance the data quality here: we’re generating a semi-synthetic dataset by taking a huge number of real-world chess games and then augmenting that data with the predictive insights of the Stockfish engine.
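The binning step itself is simple. A sketch of what it could look like in numpy (the bin count here is illustrative, not the paper’s):

```python
import numpy as np

K = 32  # number of value bins (illustrative)
edges = np.linspace(0.0, 1.0, K + 1)[1:-1]  # interior bin edges

def win_prob_to_bin(p):
    """Map an engine-derived win probability in [0, 1] to a class label,
    turning value regression into a classification problem."""
    return int(np.digitize(p, edges))

# e.g. a move Stockfish scores as 73% winning lands in one of K classes.
print(win_prob_to_bin(0.73))                        # -> bin index in [0, K-1]
print(win_prob_to_bin(0.01), win_prob_to_bin(0.99)) # -> 0, K-1
```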

Scale it not only with params and GPUs but with data coverage

Scaling an ML model traditionally involves increasing the number of parameters (making the model larger and theoretically capable of capturing more complex patterns) and utilizing more powerful GPUs to handle the increased computational load. However, what we’re also doing here is expanding the “data coverage”, meaning the diversity and range of data the model is trained on.

The DeepMind team get this through the sheer scale of the dataset: about 10 million games. This dataset isn’t just large in volume but also rich in variety, encompassing a wide spectrum of game scenarios, strategies, and outcomes. Such extensive data coverage exposes the model to a huge range of board states and move sequences, mimicking the vast experience a human grandmaster accumulates over years of practice and competition.

Moreover, to test the model’s understanding and application of chess strategies in complex situations, the team uses 10,000 chess puzzles for evaluation. The puzzles are challenging board states that require a specific sequence of moves to solve, demanding not just basic game-playing competence but advanced problem-solving skills. These puzzles serve as a high-quality benchmark, testing the model’s ability to apply its learned strategies creatively and effectively.
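Puzzle evaluation is conceptually a rollout check: replay the puzzle and see whether the model reproduces the solution line. Here’s a hedged sketch using the python-chess library, with `predict_move` as a stand-in for whatever interface the trained model exposes (the paper’s exact protocol may differ):

```python
import chess  # pip install python-chess

def predict_move(board: chess.Board) -> str:
    """Stand-in for the trained model. The real thing would score every
    legal move's win-probability bins and take the argmax; here we just
    pick the first legal move so the sketch runs."""
    return next(iter(board.legal_moves)).uci()

def solves_puzzle(fen: str, solution: list[str]) -> bool:
    """A puzzle counts as solved only if the model reproduces the whole
    line. We assume the solver moves first; opponent replies are taken
    as given from the puzzle data."""
    board = chess.Board(fen)
    for i, expected in enumerate(solution):
        if i % 2 == 0 and predict_move(board) != expected:
            return False          # model deviated from the solution line
        board.push_uci(expected)  # play the expected move and continue
    return True

# Toy usage from the starting position (moves illustrative, not a real puzzle):
start = chess.Board().fen()
print(solves_puzzle(start, ["e2e4", "e7e5", "g1f3"]))  # likely False for the stub
```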

This leaves us with our deep (10 billion action-values across board states), wide (10 million games, plus the chess-puzzle benchmark) and high-quality (the winning-move values are generated by the best engine out there) dataset.

Distill / Quantize

After constructing a model that’s both deep in understanding and wide in knowledge, the next step is about making it run smoothly in the real world. This is where distillation and quantization come into play.

Distillation is like teaching a smaller, newer model (let’s call it the “student”) to mimic the big, experienced model (the “teacher”). The idea is to transfer the essence of what the larger model knows into a more compact form. It’s akin to distilling complex concepts into simpler, more digestible pieces of knowledge. The goal here is to retain the teacher model’s prowess while making the student model faster and more efficient, especially handy when deploying models to environments where computing power is limited.
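A standard way to do this - the classic Hinton-style recipe, rather than anything specific to this paper - is to train the student on a blend of hard labels and the teacher’s temperature-softened output distribution:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=2.0, alpha=0.5):
    """Classic knowledge-distillation objective: a soft loss (match the
    teacher's temperature-softened distribution) blended with the usual
    hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: 4 examples, 10 classes.
s = torch.randn(4, 10, requires_grad=True)  # student outputs
t = torch.randn(4, 10)                      # frozen teacher outputs
y = torch.randint(0, 10, (4,))              # hard labels
print(distillation_loss(s, t, y))
```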

Quantization, on the other hand, is about efficiency of representation. It takes the model’s understanding, originally stored in high-precision floating-point numbers, and converts it into integer-only operations that use fewer bits. Think of it as compressing a detailed image so it takes up less space but still keeps the picture clear. This process reduces the model’s size and speeds up its operation, making it more suitable for real-time applications or devices with less computational horsepower.
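At its simplest, post-training quantization maps each float tensor to 8-bit integers plus a single scale factor. A minimal symmetric sketch in numpy (real schemes are per-channel and more careful, and any given team’s setup may differ):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric post-training quantization: store int8 values plus one
    float scale per tensor instead of full 32-bit floats."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"4x smaller, max abs reconstruction error: {err:.5f}")
```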

Both strategies are about refinement and efficiency. They ensure that the model isn’t just smart and knowledgeable but also practical to deploy, whether in a server handling millions of queries or a smartphone assisting with real-time decisions.

Wrapping Up

The conclusion we can draw from all this is the same as Musk’s. Data wins, and we need to move quickly:

The ability to collect and analyze vast flows of real-time data would be crucial to all forms of AI, from self-driving cars to Optimus robots to ChatGPT-like bots. And Musk now had two powerful gushers of real-time data, the video from self-driving cars and the billions of postings each week on Twitter. He told the Autopilot meeting that he had just made a major purchase of 10,000 more GPU data-processing chips for use at Twitter, and he announced that he would hold more frequent meetings on the potentially more powerful Dojo chips being designed at Tesla.