Apr 20, 2025

Insightful post from J Betker on the MoE architecture. Here are a few grabs:

The fact that MoE has great scaling properties indicates that something deeper is amiss with this architectural construct. This turns out to be sparsity itself – it is a new free parameter to the scaling laws for which sparsity=1 is suboptimal. Put another way – Chinchilla scaling laws focus on the relationship between data and compute, but MoEs give us another lever: the number of parameters in a neural network. Previously compute and quantity of parameters were proportional, but sparsity allows us to modulate this ratio.

This is the true magic behind MoE Transformers, and why every big lab has been moving to them.
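
To make that lever concrete, here is a back-of-the-envelope sketch (my own illustrative dimensions and a generic top-k routed FFN, not numbers from Betker's post) of how total parameters and per-token compute come apart once you add experts:

```python
# Illustrative only: how top-k routing decouples total parameter count
# from per-token compute in an MoE feed-forward layer.

def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters in a standard two-matrix feed-forward block."""
    return 2 * d_model * d_ff

def moe_stats(d_model: int, d_ff: int, n_experts: int, top_k: int):
    """Total parameters vs. active parameters when only top_k experts fire."""
    total_params = n_experts * ffn_params(d_model, d_ff)
    active_params = top_k * ffn_params(d_model, d_ff)
    return total_params, active_params

d_model, d_ff = 4096, 16384

dense = ffn_params(d_model, d_ff)
print(f"dense FFN: {dense/1e6:.0f}M params, ~{2*dense/1e6:.0f}M FLOPs/token")

total, active = moe_stats(d_model, d_ff, n_experts=64, top_k=2)
print(f"MoE FFN:   {total/1e9:.1f}B params, ~{2*active/1e6:.0f}M FLOPs/token")

# Sparsity (top_k / n_experts) is the extra lever: total parameters grow 64x
# over a single expert, while per-token compute stays at two experts' worth.
```

Compute per token tracks only the experts that fire, so parameter count becomes a knob you can turn independently of the FLOPs budget.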

However, this comes at a cost, and that cost is the ability to run a performant local version of the model:

At inference time with low batch sizes on the hardware that is currently commercially available, Transformers are notoriously memory bound. That is to say that we are often only using a small fraction of the compute cores because they are constantly waiting for network weights to be loaded from VRAM. This problem gets much worse with MoE transformers – there are simply more weights to load. You need more VRAM and will be more memory bound (read: worse performance). The open source community is starting to see this with the DeepSeek and Llama 4 models.

Instead, MoE transformers lend themselves best to highly distributed serving environments with ultra-large batch sizes and correspondingly high latency per token. This makes me sad – both because I like low-latency systems and because I’m a fan of local models.
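
A rough roofline estimate makes the memory-bound point, and the large-batch trade-off, visible. The hardware and model sizes below are my own illustrative assumptions (roughly a 1 TB/s GPU and 8-bit weights), not figures from the post: each decode step has to stream the resident weights from VRAM, and a bigger batch only amortises that cost at the price of more compute time per step.

```python
# Back-of-the-envelope roofline sketch (hypothetical hardware and model sizes):
# a decode step costs roughly max(time to stream the weights from VRAM,
# time to do the batch's worth of FLOPs).

def decode_step_time(weight_bytes: float, flops_per_token: float, batch: int,
                     mem_bw: float = 1.0e12,       # ~1 TB/s VRAM bandwidth
                     peak_flops: float = 100e12) -> float:
    """Seconds per decode step under a simple roofline model."""
    memory_time = weight_bytes / mem_bw            # read the weights once
    compute_time = batch * flops_per_token / peak_flops
    return max(memory_time, compute_time)

# A dense 70B model vs. an MoE with ~400B total parameters but the same active
# compute per token, both in 8-bit weights. (With a tiny batch an MoE touches
# only its routed experts, but realistic batches hit most experts, so we
# charge the full footprint; a simplification.)
configs = {
    "dense 70B":      (70e9,  2 * 70e9),
    "MoE 400B total": (400e9, 2 * 70e9),
}

for batch in (1, 8, 256):
    for name, (weight_bytes, flops) in configs.items():
        step = decode_step_time(weight_bytes, flops, batch)
        print(f"batch={batch:3d}  {name:14s}  "
              f"{step*1e3:6.0f} ms/token  ~{batch/step:5.0f} tok/s aggregate")
```

On these assumed numbers, per-token latency is pinned at the weight-streaming time until the batch reaches the hundreds, and the MoE's larger footprint raises that floor – which is exactly the push toward big-batch, high-latency serving described above.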

Mixture of Experts
 