I mean, it is!
But the whole story about the stock market reacting to the news about DeepSeek V3 and R1 is a fine example of the knee-jerk nature of mass consciousness in the era of clickbait economics.
Briefly, point by point:
- No, DeepSeek isn’t “head and shoulders above” every other model.
The results vary across benchmarks, but on average, GPT-4o and Gemini-2 are better. You can see this on Chatbot Arena, for example (Reddit thread). Even in the results published by DeepSeek’s authors themselves (benchmark graph), you can see that in several tests the model lags behind GPT-4o from May 2024, which, mind you, is currently ranked 16th on Chatbot Arena.
- No, training DeepSeek didn’t cost $6 million, “100 times less than GPT-4.”
The $6 million figure refers only to the final training run of the published model. It is the raw computational cost of that one run and doesn’t include any prior experiments, earlier versions, or R&D costs. And guess what? That figure is pretty much in line with models of the same class.
- No, Nvidia did not deserve this hit.
Not that we’re shedding tears for them; they could use a push to lower hardware prices. And let's not forget that DeepSeek was still trained on Nvidia’s own hardware. And no, their GPUs aren’t suddenly obsolete. DeepSeek’s computational budget is fairly standard for training, and inference for such a massive model (reminder: it’s an MoE with 671 billion parameters, 37 billion of which are active per generated token) requires a ton of hardware. Inference costs are roughly on par with a 70B dense model. Naturally, they’ll scale this success by throwing even more hardware at it and making the model bigger. Not to mention that DeepSeek makes LLMs more accessible for on-prem customers. Which means smaller businesses will buy more GPUs, which is still good for NVDA, am I right?
- Does this mean the model is bad? No, the model is very, VERY good.
It outperforms the vast majority of open-source models, which is fantastic. DeepSeek relied heavily on 8-bit floating-point numbers (FP8) during training, which sacrifices some precision to save memory and boost performance. Additionally, they employed multi-token prediction and innovative GPU clustering/connectivity techniques. These are clever and practical engineering choices that undoubtedly contributed to their success.
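To put the hardware point in numbers, here is a back-of-envelope sketch. Only the parameter counts (671B total, 37B active, 70B dense for comparison) come from the text above. The "2 FLOPs per active parameter per token" rule of thumb, the bytes-per-parameter values, and the function names are illustrative assumptions, and the estimate ignores KV cache, activations, and batching entirely.

```python
# Back-of-envelope arithmetic, not a benchmark.
# Parameter counts are from the article; everything else is an assumption.
TOTAL_PARAMS = 671e9   # total MoE parameters
ACTIVE_PARAMS = 37e9   # parameters active per generated token
DENSE_PARAMS = 70e9    # the 70B dense model used as a comparison point


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return params * bytes_per_param / 1e9


def flops_per_token(active_params: float) -> float:
    """Rule of thumb (assumed): about 2 FLOPs per active parameter per token."""
    return 2 * active_params


# Share of the model that actually does work for a single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# All experts must stay loaded, so memory follows the TOTAL parameter count.
moe_fp8_gb = weight_memory_gb(TOTAL_PARAMS, 1)     # 1 byte per parameter (FP8)
moe_16bit_gb = weight_memory_gb(TOTAL_PARAMS, 2)   # 2 bytes per parameter (16-bit)
dense_16bit_gb = weight_memory_gb(DENSE_PARAMS, 2)

# Per-token compute follows the ACTIVE parameter count.
compute_ratio = flops_per_token(ACTIVE_PARAMS) / flops_per_token(DENSE_PARAMS)

if __name__ == "__main__":
    print(f"active fraction per token: {active_fraction:.1%}")
    print(f"MoE weights at FP8:        {moe_fp8_gb:.0f} GB")
    print(f"MoE weights at 16-bit:     {moe_16bit_gb:.0f} GB")
    print(f"70B dense at 16-bit:       {dense_16bit_gb:.0f} GB")
    print(f"per-token compute vs 70B:  {compute_ratio:.2f}x")
```

Under these assumptions, per-token compute lands in the same ballpark as a 70B dense model, but the weights alone take several times more memory, because every expert has to be resident even though only a fraction is used per token. That is the "ton of hardware" part, and it is also why halving the bytes per parameter with FP8 matters.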
In the end, though, stocks will recover, ideas will spread, models will get better, and progress will march on (hopefully).
Author of article: Alex Yumashev