Dec 27, 2024
DeepSeek-v3 dropped on Christmas day (!): a gigantic mixture-of-experts model (671b total parameters) that sets a new SOTA for open-source models. Why should I care? What does this even mean? Well, the big news here is the training efficiency.
Firstly, the total training cost was ~$5.5m (2.78m GPU hours). This is the GPU cost of the training run only, not the all-in cost (i.e. R&D and staffing costs are not included), but it’s still a big gain. By way of comparison, Llama 3.1 (405b parameters) was roughly 9.5x more expensive (~$52m, 26m GPU hours).
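Part of the efficiency story is the mixture-of-experts architecture itself: although the model holds 671b parameters in total, only a small top-k set of experts runs for each token, so per-token compute scales with the active fraction rather than the full parameter count. Below is a minimal, generic top-k routing sketch in PyTorch to show the idea - an illustration under my own assumptions (toy sizes, a plain softmax router), not DeepSeek-v3’s actual MoE design, which has its own expert layout and load-balancing scheme.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k mixture-of-experts layer (illustrative, not DeepSeek's design)."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert, but only the
        # top-k experts actually run, so per-token compute scales with k,
        # not with n_experts.
        scores = self.router(x)                                   # (tokens, n_experts)
        weights, idx = torch.topk(scores.softmax(dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

# Toy demo: 64 experts, only 4 active per token.
layer = TopKMoE(d_model=128, d_ff=512, n_experts=64, k=4)
tokens = torch.randn(10, 128)
print(layer(tokens).shape)  # torch.Size([10, 128])
```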
The other interesting thing is the feedback loop between DeepSeek R1 (DeepSeek’s equivalent of o1) and v3: reasoning capability from the R1 model is “distilled” into v3 during post-training, and it’s this shift in approach that underpins the strong performance on maths/coding benchmarks (coding benchmark performance sits between 4o and Sonnet). The paper suggests the maths/coding results are a sign that a similar post-training approach will probably work well for other classes of problems that require complex reasoning - a potentially interesting new direction, especially given the progress with recent RL-style models (e.g. o3).
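For intuition, here’s a rough sketch of what reasoning distillation during post-training can look like: sample long chain-of-thought completions from a strong reasoning “teacher” (R1-style) and use them as supervised fine-tuning data for the “student” base model (v3-style). This is a generic illustration, not DeepSeek’s actual pipeline - the model identifiers are placeholders, and a real setup would verify/filter the traces, mask prompt tokens, and train at scale.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model identifiers - stand-ins for a strong reasoning "teacher"
# (R1-style) and the base "student" being post-trained (v3-style).
TEACHER = "teacher-reasoning-model"
STUDENT = "student-base-model"

teacher_tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER)
student_tok = AutoTokenizer.from_pretrained(STUDENT)
student = AutoModelForCausalLM.from_pretrained(STUDENT)

def generate_trace(prompt: str, max_new_tokens: int = 512) -> str:
    """Sample a long-form reasoning trace + answer from the teacher."""
    inputs = teacher_tok(prompt, return_tensors="pt")
    out = teacher.generate(**inputs, max_new_tokens=max_new_tokens,
                           do_sample=True, temperature=0.7)
    return teacher_tok.decode(out[0], skip_special_tokens=True)

# 1) Build a small synthetic SFT dataset of teacher reasoning traces.
prompts = ["Prove that the sum of two odd integers is even."]  # toy example
sft_examples = [generate_trace(p) for p in prompts]

# 2) Standard supervised fine-tuning of the student on those traces.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
for text in sft_examples:
    batch = student_tok(text, return_tensors="pt", truncation=True, max_length=2048)
    # Causal-LM loss on the whole sequence; a real pipeline would mask the
    # prompt tokens and filter/verify the traces before training on them.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```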
DeepSeek-v3