Sora: An idiot's guide
Dec 10, 2024
OpenAI | 2024 | Technical Report
Sarah Guo, Elad Gil, Aditya Ramesh, Tim Brooks, Bill Peebles | 2024 | Podcast
This post has been sitting in my drafts for well over six months now, but with yesterday’s release of Sora in GA I thought I’d have a go at explaining how Sora might be working under the hood, and in particular a breakthrough that OpenAI made (and that I assume competitors have now replicated) called latent space time patches.
I’ve tried to do this in simple, non-technical language that anyone can understand. Perhaps this is my inner nerd talking, but there’s so much to marvel at here that deserves to be shared widely and understood by everyone, so let’s have a go at making that possible.
A disclaimer though - I am not a deep expert and I may have got some things wrong (or oversimplified a bit), in which case feel free to correct me. The reason I held off posting this for so long is that this blog represents my own learning process as I tried to break down and explain everything I was reading. Anyway, disclaimers and health warnings aside, let’s get on with it.
…
The big advances in AI models over the last few years have largely been driven by the process of self-supervised learning. For example, a model like BERT is trained to understand natural language by taking sentences, blanking out some of the words, and then guessing what the missing words are - given “the cat sat on the ___”, the model learns to guess “mat”.
With images and video we apply a twist to the same concept - we teach the model how to go from white noise (like the static screen of a de-tuned analog TV back in the day) to a structured image. Imagine we’ve given it something like a very simple line drawing of a person’s face as the required output. The model is forced to learn all of the gradual changes that need to be made to create that output image - the shapes of a face, where the nose and eyes go and so on. The model learns how to iteratively remove noise, many times over, until it can eventually recreate the sample image reliably. This process is known as diffusion.
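If you like to see things in code, here’s a minimal sketch of a single diffusion training step. This is not OpenAI’s actual training code - the `model`, the noise schedule and the loss here are all simplified assumptions - but it shows the core trick: add noise to a clean image, then ask the model to predict the noise that was added.

```python
import torch

# A minimal sketch of one diffusion training step. `model` is assumed to be
# any network that takes a noisy image plus a timestep and predicts the noise.
def diffusion_training_step(model, images, num_steps=1000):
    # Pick a random noise level (timestep) for each image in the batch.
    t = torch.randint(0, num_steps, (images.shape[0],))
    # How much of the original image survives at each timestep
    # (a deliberately simple linear schedule).
    alpha = 1.0 - t.float() / num_steps
    alpha = alpha.view(-1, 1, 1, 1)  # broadcast over channels, height, width
    noise = torch.randn_like(images)
    # Corrupt the clean images: mostly image at low t, mostly static at high t.
    noisy_images = alpha.sqrt() * images + (1 - alpha).sqrt() * noise
    # The model's job is to guess the noise we just mixed in.
    predicted_noise = model(noisy_images, t)
    return torch.nn.functional.mse_loss(predicted_noise, noise)
```

At sampling time the same model is run in reverse: start from pure static and strip the predicted noise away step by step until an image emerges.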
This meant that when the OpenAI Sora team started out trying to tackle the problem of creating a generative model for high quality video, they already had a self-supervised learning technique that would scale across large datasets - the diffusion process is core to how image generation with models like DALL·E works. From OpenAI’s research on the GPT series of models the team also had a scalable architecture: the transformer.
The self-attention mechanism in transformers allows the model to weigh the importance of different parts of the input data differently. When we blank out words in our sentences, the model learns how words relate to each other. Self-attention enables the model to develop an understanding of the inter-dependencies and relationships within our input data. We can then take this process and scale it up. Think what would happen if you repeated this training process for all of Shakespeare’s plays, then all of Wikipedia, then all the text on the internet and so on. This is important because, in order to create realistic videos of any significant length, the model we’re building has to have an appreciation for details like how people move or what the trajectory of a ball might be.
A deep neural network like that used in the transformer architecture consists of a series of stacked layers. Each layer contains units called neurons that are connected to the previous layer’s units through a set of weights. Transformers can increase their capacity by adding more or wider layers, and this allows them to learn more and more detailed and nuanced patterns as compute and data scale up.
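To make self-attention a little less abstract, here’s a minimal single-head version. Real transformers use many attention heads, learned projection matrices and a lot of other machinery, so treat this purely as an illustrative sketch of the “every token looks at every other token” idea.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention over a sequence of token vectors.

    x: (seq_len, dim) token embeddings; w_q, w_k, w_v: (dim, dim) weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token scores every other token: how relevant is token j to token i?
    scores = (q @ k.T) / (x.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    # Each token's new representation is a weighted mix of all the others.
    return weights @ v

# Toy usage: 5 tokens with 16-dimensional embeddings and random weights.
dim = 16
x = torch.randn(5, dim)
out = self_attention(x, torch.randn(dim, dim), torch.randn(dim, dim), torch.randn(dim, dim))
print(out.shape)  # torch.Size([5, 16])
```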
As we have seen with LLMs over the last couple of years, this transformer architecture scales well with larger and larger datasets. One of the critical successes for LLMs has been the notion of tokens. If you look at the internet, there are all kinds of text data on it - books, code, maths. What’s beautiful about language models is that they have this singular notion of a token (roughly analogous to a word), which enables them to be trained on this vast swathe of very diverse data. All of the different types of text on the internet can be broken up into a series of tokens and then fed through the same self-supervised learning process. Because any type of text can be fed into this process, the models can build huge training datasets. It’s the amount of training that this process enables that makes the models we’re using today so amazingly capable.
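As a toy illustration - real tokenizers use byte-pair encoding over sub-word units, and the tiny vocabulary below is entirely made up - here’s the basic idea of text becoming a sequence of integer token IDs:

```python
# Toy tokenizer: real models use byte-pair encoding over sub-word pieces,
# but the principle is the same - any text becomes a sequence of integer IDs.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "<unk>": 5}

def tokenize(text):
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [0, 1, 2, 3, 0, 4]
```

Whatever the source - a novel, a Python file, a maths proof - once it is turned into IDs like these, it can flow through exactly the same training machinery.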
Historically, there hasn’t really been an analog to this token concept for the visual generative models needed for video. Prior to Sora, an image or video generative model would train on 256 by 256 resolution images or short duration (4 second) 256 by 256 video. This is very limiting because it restricts the types of data you can use - meaning you have to throw away much of the visual data that exists on the internet. This lack of scale in the training dataset limits the capabilities of the model.
So this brings us to the breakthrough the Sora team made. The key innovation that enabled Sora was the introduction of space time patches, which allow you to represent small pieces of the many different types of visual data you can find online (an image, a really long video, a tall vertical video shot on a mobile phone). This is done by cutting cubes out of the video - taking the same region of the frame (space) across a run of consecutive frames and arranging those pieces by time. The cubes we’re talking about here are not direct samples of pixels (that would be too computationally expensive); instead, the video is first compressed down to a latent representation. A latent representation is a compact summary produced by a compression network - rather than keeping every pixel of every frame, it keeps the key features of each frame and how they change from frame to frame. The aim is to capture the patterns in the data as efficiently as possible.
Imagine a video as a stack of plates, with each plate representing a frame in our video (a frame is just the technical term for a single still image). Rather than lifting off plates one by one (entire frames), we cut down through the stack, taking the same fragment out of each plate. Each little stack of fragments is a cube, and this is our space time patch - a fragment of video (space) arranged over consecutive frames (time). If, rather than taking these cubes from the raw stack of frames, we first compress the video to its latent representation, we end up with the latent space time patch. This is a fast and efficient representation of video that has the neat property of being able to work with the vast array of video data, in all its various formats, that’s already out there on the internet - giving us a much larger dataset to train on.
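Here’s a rough sketch of the cutting-into-cubes step. The patch sizes and the dimensions of the “latent video” below are made up for illustration - the real compression network and patch shapes are OpenAI’s own - but the reshaping shows how a video becomes a bag of patch tokens.

```python
import torch

def extract_spacetime_patches(latent_video, t=4, h=2, w=2):
    """Cut a (compressed) video into spacetime patches.

    latent_video: (T, H, W, C) - frames x height x width x channels, assumed
    to already be the output of a compression network (dimensions made up).
    Returns: (num_patches, t * h * w * C) - one flat vector per cube.
    """
    T, H, W, C = latent_video.shape
    patches = (
        latent_video
        .reshape(T // t, t, H // h, h, W // w, w, C)  # carve the volume into cubes
        .permute(0, 2, 4, 1, 3, 5, 6)                 # group the cube indices together
        .reshape(-1, t * h * w * C)                   # flatten each cube into one token
    )
    return patches

# Toy latent video: 16 "frames" of an 8x8 grid with 4 channels.
latent_video = torch.randn(16, 8, 8, 4)
tokens = extract_spacetime_patches(latent_video)
print(tokens.shape)  # torch.Size([64, 64]) - 64 patch tokens, 64 numbers each
```

Because the video is just carved up into however many cubes it contains, a ten-second widescreen clip, a vertical phone video and a single image all come out as the same kind of thing: a sequence of patch tokens of varying length.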
The space time patch is the token that is fed into the diffusion transformer, just like a word (or sub-word) is the token that is fed into a language transformer. We use the diffusion process to train on the relationships between those space-time-patch tokens, just like blanking out words in our sentences helps the model understand how words shape meaning. Sora uses a diffusion transformer architecture, neatly combining two strands of research that OpenAI had running for their DALL·E and GPT series of models.
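The details of Sora’s diffusion transformer aren’t public, but the shape of the idea can be sketched with a stock transformer standing in as the denoiser: a sequence of noisy patch tokens goes in, and a noise prediction for each patch comes out.

```python
import torch
import torch.nn as nn

# Stand-in denoiser: a stock transformer encoder reads a sequence of patch
# tokens and predicts the noise in each one. This is an illustration of the
# idea, not Sora's actual architecture.
dim = 64  # must match the flattened patch size (64 in the sketch above)
denoiser = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)

noisy_patches = torch.randn(1, 64, dim)    # batch of 1 video, 64 patch tokens
predicted_noise = denoiser(noisy_patches)  # one noise prediction per patch token
print(predicted_noise.shape)               # torch.Size([1, 64, 64])
```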
If you pair all this with a decoder that knows how to go back from latent space to the pixel space used in our browsers then you end up with a very powerful video generation model. This means that Sora can do a lot more than generate, say, 720p video for a fixed duration. You can generate vertical videos or widescreen videos. You can vary the aspect ratio or generate images. The new approach to tokenization made Sora the first generative model of visual content that had breadth in the way that language models have breadth, and this is the innovation that unlocks the ability to create these incredible videos that we’re now seeing.
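Tying the pieces together, generation might look something like the sketch below: start from pure noise in latent space on a canvas of whatever shape you want (vertical, widescreen, or a single frame for an image), iteratively denoise it, then decode back to pixels. The `denoiser` and `decoder` here are hypothetical placeholders, not real Sora components, and the denoising update is heavily simplified.

```python
import torch

def generate_video(denoiser, decoder, frames, height, width, channels=4, steps=50):
    """Sketch of sampling: noise in latent space -> denoise -> decode to pixels.

    `denoiser` and `decoder` are placeholder callables; the update rule below
    is a crude simplification of a real diffusion sampler.
    """
    # Because the model works on patches, the latent canvas can be any shape:
    # tall and narrow, wide, or a single frame if we want a still image.
    latents = torch.randn(frames, height, width, channels)
    for step in range(steps):
        predicted_noise = denoiser(latents, step)
        latents = latents - predicted_noise / steps  # strip a little noise each step
    return decoder(latents)  # back from latent space to viewable pixels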
It’s worth noting here that working out how to tokenize different kinds of data at scale, and then feeding them into the transformer architecture, seems to be key to unlocking a lot of the different AI-related innovations we’re seeing. For example, the main innovation behind Suno is a way of tokenizing audio that allows for the creation of realistic songs. What’s cool about being able to tokenize video, of course, is that there is so much you can learn from it that you don’t necessarily get from other modalities like language - things like the minutiae of how legs move and make contact with the ground in a physically accurate way. One of the big areas of excitement (or hype, season as you wish) around these models is seeing how this plays out in a field like, for example, robotics. I’ll leave that for another time though.
…
A brief final note for lovers of BHAGs. It seems that the journey to Sora started with a singular, ambitious goal - how to get a minute of HD footage out of a transformer-based architecture. The team chose this goal because it forced them to think about the problem from first principles - getting to a minute of footage meant they couldn’t just extend existing image generation approaches. They knew they’d need a model that was scalable and that broke the data down in a really simple way. What’s quite nice here is that the team talk about stepping into the future and looking back to force themselves to think bigger. Here’s Tim from the Sora team talking about it on the No Priors podcast:
“.. we were like, OK, we’d rather pick a point further in the future and just work for a year on that. There is this pressure to do things fast because AI is so fast. And the fastest thing to do is, oh, let’s take what’s working now and let’s add on something to it… But sometimes it takes taking a step back and saying, what will the solution of this look like in three years? Let’s start building that.”