(2024-12-31) ZviM DeepSeek v3 The Six Million Dollar Model
Zvi Mowshowitz: on DeepSeek v3: The Six Million Dollar Model. What should we make of DeepSeek v3? DeepSeek v3 seems to clearly be the best open model, the best model at its price point, and the best model with 37B active parameters, or that cost under $6 million.
According to the benchmarks, it can play with GPT-4o and Claude Sonnet. Anecdotal reports and alternative benchmarks tell us it’s not as good as Claude Sonnet, but it is plausibly on the level of GPT-4o.
Table of Contents
- What is DeepSeek v3 Technically?.
- Our Price Cheap.
- Run Model Run.
- Talent Search.
- The Amazing Incredible Benchmarks.
- Underperformance on AidanBench.
- Model in the Arena.
- Other Private Benchmarks.
- Anecdata.
- Implications and Policy.
What is DeepSeek v3 Technically?
The big thing they did was use only 37B active parameters, against 671B total parameters, via a highly aggressive mixture-of-experts (MoE) structure.
This still lets them train on roughly the same 15.1 trillion tokens as everyone else.
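The core MoE idea above can be sketched in a few lines: a gate scores the experts for each token, only the top-k experts actually run, so the active parameter count per token is a small fraction of the total. This is a minimal illustrative sketch; the expert count, dimensions, and top-k here are toy values, not DeepSeek's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts, gate_w, top_k=2):
    """Route one token through only its top-k experts (toy sketch)."""
    logits = x @ gate_w                       # gate score per expert
    top = np.argsort(logits)[-top_k:]         # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over selected experts only
    # Only the chosen experts execute, so active parameters << total parameters.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

d, num_experts = 8, 16
experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]
gate_w = rng.standard_normal((d, num_experts))
x = rng.standard_normal(d)
y = moe_forward(x, experts, gate_w)
# Per token, 2 of 16 experts run, so only 1/8 of the expert parameters are active.
```

The same logic, scaled up, is how 671B total parameters can cost only 37B parameters' worth of compute per token.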
They used their internal o1-style reasoning model for synthetic fine tuning data. Essentially all the compute costs were in the pre-training step.
Our Price Cheap
It was a scarily cheap model to train, and is a wonderfully cheap model to use.
Their estimate of $2 per hour for H800s is if anything high, so their headline training cost estimate of $5.5 million is fair, if you exclude non-compute costs, which is standard.
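The arithmetic behind the headline number is simple. DeepSeek's technical report cites roughly 2.788M H800 GPU-hours for the full training run; at the assumed $2/hour rental rate, that gives the ~$5.5M figure:

```python
# Reported figures: ~2.788M H800 GPU-hours, at an assumed $2/hour rental rate.
gpu_hours = 2.788e6
rate_per_hour = 2.0
compute_cost = gpu_hours * rate_per_hour
print(f"${compute_cost / 1e6:.2f}M")  # ≈ $5.58M, excluding salaries, data, and prior research
```

Note that this counts only the final training run's compute, not research, ablations, or staff.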
Inference with DeepSeek v3 costs only $0.14/$0.28 per million tokens, similar to Google Gemini Flash, versus on the high end $3/$15 for Claude Sonnet. This is as cheap as worthwhile models get.
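To make the price gap concrete, here is the cost of a single hypothetical request (the 10k-input / 2k-output token counts are illustrative assumptions, not from the source) at the per-million-token prices quoted above:

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Dollar cost of one request; prices are per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1e6

# Prices per million tokens from the text: DeepSeek v3 $0.14/$0.28, Claude Sonnet $3/$15.
deepseek = request_cost(10_000, 2_000, 0.14, 0.28)   # ≈ $0.00196
sonnet = request_cost(10_000, 2_000, 3.00, 15.00)    # ≈ $0.06
print(f"{sonnet / deepseek:.0f}x cheaper")
```

On this workload DeepSeek v3 comes out roughly 30x cheaper than Claude Sonnet per request.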
Run Model Run
The active parameter count of 37B is small, but with so many different experts it does take a bit of work to get this thing up and running.