Dec 13, 2024
Interesting paper from Meta that has been generating some buzz:
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented dynamically based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it.
So the architectural shift here is that bytes are grouped into patches dynamically; there’s no fixed vocabulary like the one a tokenizer uses. This allows compute to be allocated more efficiently than in tokenizer-based models.
Tokenizer-based models allocate the same amount of compute to every token, regardless of how difficult the prediction is. This is wasteful: we don’t need to spend as much compute predicting the end of a word (easy) as the start of a word (hard), much like we shouldn’t spend as much compute predicting whitespace in code as predicting the code itself.
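To make "hard" vs "easy" concrete: the uncertainty at each position can be measured as the Shannon entropy of the model’s next-byte distribution. The sketch below uses made-up distributions (not numbers from the paper) to show why a mid-word position scores low and the start of a new word scores high:

```python
import math

def entropy_bits(probs: dict[str, float]) -> float:
    """Shannon entropy (in bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Made-up next-byte distributions for illustration.
mid_word   = {"g": 0.97, "k": 0.02, "s": 0.01}             # after "patchin"
word_start = {"t": 0.20, "a": 0.15, "w": 0.15, "s": 0.10,  # after "patching "
              "o": 0.10, "i": 0.10, "b": 0.10, "m": 0.10}

print(f"mid-word:   {entropy_bits(mid_word):.2f} bits")    # ~0.22 bits (easy)
print(f"word start: {entropy_bits(word_start):.2f} bits")  # ~2.95 bits (hard)
```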
The patching process is dynamic because it is entropy-based: a small byte-level language model estimates, at each position, how uncertain the prediction of the next byte is. So rather than splitting text with a fixed stride (or on a space delimiter), patch boundaries are placed where that uncertainty is highest (high-entropy points). Effectively, this means compute is allocated based on the complexity of the input data, since the patch size varies with how predictable the surrounding bytes are.
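The paper builds its boundary rules on top of this entropy signal, the simplest being a global threshold: start a new patch wherever the estimated next-byte entropy exceeds some value. Here’s a minimal sketch of that idea; the function name, threshold, and entropy values are illustrative, not the paper’s code (in the real system the entropies come from the small byte-level entropy model).

```python
def segment_into_patches(data: bytes, entropies: list[float], threshold: float) -> list[bytes]:
    """Start a new patch whenever the next-byte entropy exceeds the threshold."""
    patches: list[bytes] = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:  # high uncertainty -> start a new patch here
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


text = b"patching text"
# Hypothetical per-byte entropies: high where the next byte is hard to
# predict (the first byte of a new word), low mid-word.
entropies = [3.9, 0.4, 0.3, 0.2, 0.3, 0.2, 0.1, 0.1, 0.2, 3.6, 0.5, 0.3, 0.2]
print(segment_into_patches(text, entropies, threshold=2.0))
# [b'patching ', b'text']
```

Note how the easy, low-entropy bytes get folded into longer patches, while a single hard prediction opens a new patch and so gets more of the big model’s compute.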
What’s eye-catching in all this is the improvement in inference efficiency: the BLT model performs similarly to Llama 3 at the 8B parameter scale while using up to 50% fewer FLOPs at inference. FLOPs are the number of floating point operations performed, so a model using this encoding approach is considerably faster and cheaper to run.
Byte Latent Transformer: Patches Scale Better Than Tokens