Part 1: DeepSeek's Disruptive Entry
Over the past few weeks, DeepSeek has shaken up the AI landscape with the release of DeepSeek-R1, an open-weight reasoning model that rivals the performance of OpenAI's o1 and Google's Gemini 2.0 Flash Thinking. This breakthrough has ignited debates on the future of foundational model economics, the viability of open vs. closed-source AI, and the shift toward post-training scaling laws.
DeepSeek's release is significant for three reasons:
- We now have open-weight PhD-level reasoning models.
- AI efficiency is improving, but DeepSeek's inference costs mark an expected milestone, not a paradigm shift in the economics of LLMs.
- Future AI progress will be driven by scaling post-training and inference-time compute, not model size.
This article breaks down how DeepSeek trained its models, what they truly cost, and what this means for the trajectory of AI progress going forward.
Inside DeepSeek
DeepSeek is an AI research lab that has quickly emerged as a formidable competitor to the likes of OpenAI, DeepMind, and Anthropic. Unlike its Western counterparts, DeepSeek was born out of High-Flyer, a Chinese hedge fund known for its AI-driven trading strategies.
High-Flyer, co-founded by Liang Wenfeng—who now serves as DeepSeek's CEO—built its competitive edge by aggressively accumulating GPU clusters to power its proprietary models. But in 2023, recognizing the potential of AI, High-Flyer spun out DeepSeek as an independent research lab with a broader mission: to "unravel the mystery of AGI with curiosity."
Today, DeepSeek is an incredibly well-funded research lab that operates at a significant scale. Combined with High-Flyer, the company is believed to control roughly 50,000 GPUs—a mix of Hopper-generation H100s, H20s, and H800s, along with older A100s—making it one of the largest AI compute clusters outside of U.S. labs. Armed with this infrastructure, DeepSeek has positioned itself as a serious competitor in the race to develop state-of-the-art models.
The Breakthrough: DeepSeek-R1 and R1-Zero Models
On January 20, 2025, DeepSeek made headlines by releasing DeepSeek-R1 and its preliminary version, R1-Zero, under a permissive MIT Open Source license. On key benchmarks, both models rivaled leading closed-source reasoning models like OpenAI's o1, demonstrating PhD-level reasoning capabilities. However, many of the architectural innovations that fueled the market reaction were first introduced in the reasoning models' predecessors: DeepSeek-V2 (released in May 2024) and DeepSeek-V3 (released in December 2024).
To understand the distinctions between all these models: V2 and V3 are base foundational models, like GPT-4 and Claude 3.5 Sonnet. These models were trained on massive text datasets (trillions of tokens) using next-token prediction—meaning they generate text one "word" at a time based on statistical likelihood from their training data. While they undergo some level of "post-training" so that the model's outputs have Q&A structure and formatting, their primary function is to produce fluent, high-quality completions without explicitly reasoning beforehand.
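For intuition, here is a minimal sketch of that next-token loop at inference time, using GPT-2 as a small stand-in for a base model (the mechanics are the same regardless of scale): the model only ever scores candidate next tokens and appends one of them, step after step.

```python
# Next-token prediction sketch: generate text one token at a time, greedily.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in for a base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):                                  # produce 10 more tokens
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # scores for the next token
    next_id = torch.argmax(logits)                   # greedy: pick the most likely one
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tok.decode(ids[0]))
```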

Figure: DeepSeek-V3 performance across key benchmarks compared to leading models
R1 and R1-Zero, on the other hand, are advanced reasoning models. Rather than simply predicting the next word in sequence, these models learned to reason before delivering a final response. They were created by applying additional post-training to a base model (V3 to be exact), to promote reasoning behaviors. As a result, R1 benchmarked in line with state-of-the-art reasoning models, including OpenAI's o1 and Google's Gemini 2.0 Flash Thinking.

Figure: DeepSeek-R1 reasoning capabilities benchmarked against OpenAI's o1 and other leading models
Part 2: Training and Inference Costs
Training Costs
Much of the media coverage surrounding DeepSeek's release hyper-fixated on claims that DeepSeek-V3 was trained for just $6 million, implying a massive cost advantage over U.S. labs. However, this figure is misleading—it only reflects the final training run and ignores the full costs of AI model development.
The fully burdened cost of developing V3—including research, experimentation, data curation, and infrastructure—was likely in the billions. According to Dylan Patel of SemiAnalysis, DeepSeek's total CapEx is estimated at $1.6 billion, with an additional $944 million in operational expenses.

Figure: DeepSeek's total cost of ownership breakdown showing the true scale of investment required
This underscored a fundamental truth: Building frontier models is still an exclusive game, and few organizations have the capital and talent to continuously compete. A small subset of labs will continue to push the boundaries of AI (OpenAI, DeepMind, Anthropic, Meta). However, DeepSeek's emergence signals that Chinese AI labs are proving far more competitive than previously assumed.
Inference Costs
As shown in the benchmarks in Part 1, DeepSeek's raw performance on both V3 and R1 is near-state-of-the-art. Yet, what makes DeepSeek's entry truly disruptive is not just its capabilities, but its radically lower cost structure. While DeepSeek is likely offering these models at cost to increase adoption, it has still dramatically undercut OpenAI on pricing.
- DeepSeek-V3: $0.28 per million output tokens (rising to $1.10 on Feb 8) vs. GPT-4o's $10.00 per million output tokens
- DeepSeek-R1: $2.19 per million output tokens vs. OpenAI's o1 at ~$60 per million—a roughly 27x difference (a quick arithmetic check follows the list)
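For concreteness, those multiples check out from the list prices alone; the snippet below is simple arithmetic and ignores input-token pricing and real-world usage mixes.

```python
# Quick check of the quoted price multiples (USD per million output tokens).
deepseek_v3, gpt_4o = 0.28, 10.00
deepseek_r1, openai_o1 = 2.19, 60.00

print(f"GPT-4o vs. DeepSeek-V3: ~{gpt_4o / deepseek_v3:.0f}x more expensive")    # ~36x
print(f"o1 vs. DeepSeek-R1:     ~{openai_o1 / deepseek_r1:.0f}x more expensive") # ~27x
```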
DeepSeek achieved these cost reductions through key architectural optimizations in its base models, V2 and V3, that improved both training and inference efficiency—most notably DeepSeekMoE and DeepSeekMLA (both explained in more detail in the appendix). While these innovations are meaningful, they are not a fundamental breakthrough that changes the economics of LLMs. Instead, DeepSeek's cost structure represents an expected milestone in an ongoing trend of cost reduction driven by algorithmic progress.
As SemiAnalysis and Anthropic's CEO have noted, algorithmic progress is reducing inference requirements by 4x per year—meaning that each year, models require 4x less compute to achieve the same capability. Put simply, DeepSeek's efficiency gains are impressive, but they follow the expected trajectory of cost reductions rather than signaling a paradigm shift.
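As a rough illustration of what that trend implies if it holds (an assumption-driven projection, not a forecast), consider how quickly the cost of serving a fixed capability would fall:

```python
# If compute for a fixed capability falls ~4x per year, serving costs at
# constant hardware prices would fall roughly in step (illustrative only).
start_cost = 60.0  # $/M output tokens for an o1-class model, from the list above
for years in (1, 2, 3):
    factor = 4 ** years
    print(f"after {years} year(s): ~{factor}x less compute -> ~${start_cost / factor:.2f}/M tokens")
```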

Figure: The dramatic reduction in LLM costs over time, showing DeepSeek's place in the broader trend
Distillation Allegations
Perhaps the most controversial ingredient in DeepSeek-V3's training pipeline is model distillation. Although not explicitly stated in their paper, there is growing consensus that DeepSeek distilled OpenAI's frontier models to accelerate V3's training. Distillation—a technique in which a smaller model (the student) mimics the outputs of a more advanced model (the teacher)—allows for efficient learning from high-quality responses while dramatically cutting training costs.
Reports from WSJ and TechCrunch reinforce this suspicion, noting that in controlled tests, DeepSeek-V3 identified as ChatGPT in 5 out of 8 generations. If true, this would violate OpenAI's Terms of Service and highlight a broader challenge in AI security—how leading labs can prevent proprietary models from being used to bootstrap open-source competitors.
Whether DeepSeek engaged in explicit distillation remains uncertain, but the financial impact could have been immense, potentially saving DeepSeek millions in curating the "14.8 trillion high-quality and diverse" tokens used to train V3.
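To make the alleged technique concrete, here is a hypothetical sketch of output-level distillation as described above: collect a stronger teacher model's answers to a prompt set, then fine-tune the student on those pairs. `teacher_generate` is a placeholder stub rather than a real API call, and nothing here reflects DeepSeek's actual pipeline.

```python
# Hypothetical output-level distillation sketch: build a synthetic SFT dataset
# from a teacher model's completions. The teacher call is a placeholder stub.
import json

def teacher_generate(prompt: str) -> str:
    # Stand-in for a call to a stronger "teacher" model (e.g., via an API).
    return "A carefully worded, high-quality answer to: " + prompt

prompts = [
    "Explain the difference between supervised and reinforcement learning.",
    "Summarize the Chinchilla scaling result in two sentences.",
]

with open("distilled_sft_data.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps({"prompt": p, "response": teacher_generate(p)}) + "\n")

# The student is then fine-tuned on this synthetic dataset with an ordinary
# supervised next-token loss, so it learns to mimic the teacher's outputs.
```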
Part 3: R1, Post-Training, and the New Scaling Paradigm
The First AI Scaling Law: Pre-Training
For years, AI progress was driven by a simple formula: bigger models + more data + more compute = better performance. This was the foundation of the first AI scaling law, where increasing model size and expanding pre-training datasets led to predictable improvements in intelligence.
However, by early 2024, research labs recognized that this formula was reaching its limits. Data scarcity had become an unavoidable bottleneck—labs were running out of high-quality, diverse data to continue scaling models efficiently. DeepMind's Chinchilla paper had already demonstrated that there exists a compute-optimal ratio of model size to training data, meaning that without sufficient high-quality data, simply increasing model size would not yield further benefits.
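A back-of-the-envelope version of that compute-optimal relationship, using the common approximations of C ≈ 6·N·D training FLOPs and roughly 20 training tokens per parameter (rule-of-thumb values, not the paper's exact fit), shows why data becomes the binding constraint as compute grows:

```python
# Chinchilla-style sizing sketch: given a compute budget C (FLOPs), estimate the
# compute-optimal parameter count N and token count D using C ~ 6*N*D and D ~ 20*N.
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e23, 1e24, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"{c:.0e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")
```

Each 10x increase in compute calls for roughly 3x more parameters and roughly 3x more training tokens—and trillions of additional high-quality tokens are precisely what labs were struggling to find.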
With the diminishing returns of brute-force scaling, AI labs turned to a new approach: post-training. Rather than focusing solely on making models bigger, researchers began exploring how to make them think better—instilling reasoning and problem-solving abilities after the initial pre-training phase.
What is Post-Training?
Post-training refers to the set of techniques applied to a model after it has completed its initial pre-training run. Importantly, post-training is not a new concept—it has been pivotal in transforming base, pre-trained models into functional AI chatbots, like ChatGPT.
After pre-training, a model is highly proficient at predicting the next token but lacks the ability to structure responses in a way that feels natural or directly answers questions without trailing off into incoherence. This is where post-training techniques like supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) have traditionally been applied—helping models refine their formatting, coherence, and alignment with human expectations for structured Q&A formats.
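A minimal sketch of the supervised fine-tuning step helps make this concrete. The example below uses GPT-2 as a small stand-in and a single toy Q&A pair; the key detail is that the loss is computed only on the answer tokens (prompt positions are labeled -100, which the cross-entropy loss ignores).

```python
# Minimal SFT sketch: train only on the assistant's answer tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "User: What is the capital of France?\nAssistant:"
answer = " The capital of France is Paris."

prompt_ids = tok(prompt, return_tensors="pt").input_ids
answer_ids = tok(answer, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, answer_ids], dim=-1)

labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100              # ignore the prompt in the loss

loss = model(input_ids, labels=labels).loss          # next-token loss on the answer only
loss.backward()                                      # one step's worth of training signal
```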
At their core, these traditional techniques apply additional training so that the model produces outputs that are more structured and conversational, rather than simply a sequence of the most likely tokens. Now, AI research is expanding post-training beyond merely improving fluency and formatting—instead using it to fundamentally change how models think.
Imitation Learning vs. Reinforcement Learning: A Paradigm Shift
As brilliantly described by Andrej Karpathy, there are two fundamental types of learning in both humans and AI:
- Imitation learning: Learning by copying examples, mirroring the behavior of others, and refining outputs to match known patterns.
- Trial-and-error learning (reinforcement learning): Learning through experimentation, where outcomes (successes and failures) guide iterative adjustments to improve decision-making over time.
Up until the release of o1 and now R1, all LLM training techniques had been fundamentally imitation-based. Pre-training is nothing more than next-token prediction—the model learns from an enormous corpus of text to generate text on its own—while supervised fine-tuning teaches models to mimic human-labeled responses. Even RLHF, despite its name, is not true reinforcement learning, as humans are still in the loop.
Reinforcement learning in its purest form is fundamentally different. Instead of mimicking human examples, RL enables a model to discover strategies on its own through self-play.
The best example of this is AlphaGo, the AI that famously defeated the world champion in Go. Rather than learning from human games, AlphaGo played millions of games against itself, starting with no knowledge beyond the basic rules. Through trial and error, it experimented with different moves and, based on the outcome (win or loss), optimized its strategy.
This distinction—learning by discovering solutions through trial and error vs. imitation—is at the heart of this paradigm shift towards scaling post-training LLMs.
DeepSeek-R1: The First Pure Reinforcement Learning LLM
Reinforcement learning is precisely what makes DeepSeek-R1-Zero so groundbreaking. Unlike previous LLMs, which were trained entirely with forms of imitation learning, R1-Zero was trained with pure RL. Specifically, DeepSeek set up a reinforcement learning system where the model was given math, coding, and logic problems and rewarded for solving them correctly. If the model got an answer right, it reinforced the patterns that led to that success. No predefined examples of reasoning were provided—the model had to figure out how to reason on its own.
More specifically, DeepSeek started with its V3 base model and introduced a simple reinforcement learning system, sketched in code after the list below:
- The model was given a large collection of math, coding, and logic prompts, all of which have objectively checkable solutions (these problems are in verifiable domains).
- For each problem, the model generated multiple solutions (e.g., 16 answers) to encourage exploration and variability. (This is the simplified explanation of the GRPO procedure—Group Relative Policy Optimization.)
- A reward signal scored each response based on correctness (checked against the known answers).
- The highest-scoring responses were used as examples of effective "reasoning" patterns for the LLM to learn from.
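The snippet below is a heavily simplified, REINFORCE-style sketch of that group-relative idea, with GPT-2 and two toy arithmetic prompts standing in for the real setup. It is not DeepSeek's implementation: actual GRPO adds PPO-style clipping and a KL penalty against a reference model, handles padding properly, and runs at vastly larger scale.

```python
# Simplified group-relative RL sketch: sample a group of answers per prompt,
# reward verifiable correctness, and push up the log-probability of
# above-average samples. GPT-2 and the prompts are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(policy.parameters(), lr=1e-5)

tasks = [("Q: What is 7 + 5? A:", "12"), ("Q: What is 9 * 3? A:", "27")]
GROUP = 8  # completions sampled per prompt

for prompt, answer in tasks:
    enc = tok(prompt, return_tensors="pt")
    prompt_len = enc.input_ids.shape[1]

    # 1) Sample a group of candidate solutions to encourage exploration.
    outs = policy.generate(
        **enc, do_sample=True, top_p=0.95, max_new_tokens=8,
        num_return_sequences=GROUP, pad_token_id=tok.eos_token_id,
    )
    texts = [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outs]

    # 2) Rule-based reward: 1 if the verifiable answer appears, else 0.
    rewards = torch.tensor([1.0 if answer in t else 0.0 for t in texts])

    # 3) Group-relative advantage: compare each sample to its own group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # 4) Policy update: weight each sample's log-probability by its advantage
    #    (padding in shorter samples is ignored here for brevity).
    logp = torch.log_softmax(policy(outs).logits[:, :-1, :], dim=-1)
    token_logp = logp.gather(-1, outs[:, 1:].unsqueeze(-1)).squeeze(-1)
    sample_logp = token_logp[:, prompt_len - 1:].sum(dim=-1)
    loss = -(adv * sample_logp).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()
```

Even in this toy form, the essential ingredients are visible: verifiable rewards, group-relative comparison rather than a learned value function, and no human-written reasoning examples anywhere in the loop.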
The result was nothing short of extraordinary. Without explicit examples of reasoning, the model began showcasing reasoning behaviors on its own. Instead of generating a response in one pass, it produced streams of tokens in which it broke down problems, backtracked when it realized an error, reevaluated flawed assumptions, and self-corrected—all without being explicitly trained to imitate this behavior. DeepSeek researchers called these moments "aha moments"—emergent reasoning behaviors that surfaced naturally through reinforcement learning.
Tellingly, the DeepSeek team observed that as training progressed, the model's chain-of-thought responses became much longer—it effectively used more "thinking time" to solve problems. At the earliest RL stages, the model generated much shorter streams of tokens; by later stages, it was generating thousands of tokens before arriving at a final response. Perhaps most impressively, DeepSeek achieved these results using only 800,000 samples, raising the question: what emergent behaviors could arise at even greater scale?

Figure: DeepSeek-R1-Zero's response length increased dramatically during training, showing emergent reasoning behavior
The takeaway: we are still extremely early in scaling this new paradigm. As the industry shifts toward more RL-based systems, the demand for compute is only going up. The next breakthroughs in AI will not come from larger models, but from training models to think, reason, and refine their own understanding at inference time.
What about o1?
OpenAI's o1, released in December 2024, was the first reasoning model of its kind, scaling post-training to unlock reasoning capabilities. While OpenAI has not shared the full details of its training, it is widely believed that o1's training incorporated human-labeled reasoning chains or AI-generated step-by-step solutions—in other words, that o1 was shown examples of reasoning (along with Q&A data and a reward system) to instill those capabilities. This suggests that o1 learned reasoning in a more imitation-based manner, rather than through the pure reinforcement learning approach used by DeepSeek.
Test Time Scaling and The Future of Reasoning Models
DeepSeek's reasoning models revealed something important: AI models do not need to be explicitly taught how to reason. Given enough compute, questions in verifiable domains, and a structured trial-and-error system, they can develop sophisticated reasoning abilities on their own. While verifiable domains may seem limited to math and coding, their potential is far greater than it appears. One could imagine post-training a model by having it generate tweets and rewarding it when they reach 100,000 likes; over time, the model could refine its approach based on what succeeds and fundamentally learn virality. The same approach could apply to robotics, scientific research, and many other fields. I suspect that post-training will become increasingly domain-specific and may ultimately be the paradigm that brings us to AGI.
Appendix
1. DeepSeekMoE: Optimizing Compute with Mixture of Experts
The core idea behind Mixture of Experts (MoE) is to divide a large model into multiple specialized "expert" subnetworks and activate only a subset of them for any given input. Rather than running the entire model for every token (as in dense LLMs such as GPT-3.5), only a few experts are used at a time, reducing the number of active parameters per token and hence lowering computational costs during training and inference.
MoE is not a new concept—it has been widely used in large-scale LLMs, most notably in GPT-4, which is believed to have 16 experts, each containing approximately 110 billion parameters. However, traditional MoE implementations have struggled with inefficient token routing—assigning the right input tokens to the most relevant experts is non-trivial. Poor routing leads to underutilization of certain experts and hence inefficient compute usage or lower model performance. This has been one of the fundamental bottlenecks in making MoE-based architectures practical at scale.
DeepSeek-V3's key innovation in MoE was a more sophisticated gating mechanism—a multi-layer network that optimally distributes workloads across experts. This improved routing allows DeepSeek-V3, which has a total of 671 billion parameters, to compute only 37 billion active parameters per token, dramatically reducing the computational burden.
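The sketch below shows the generic top-k routing idea that underlies any MoE layer; it is illustrative only and does not reproduce DeepSeek's specific gating design, expert configuration, or load-balancing strategy, and the dimensions are toy-sized.

```python
# Minimal MoE layer: a router scores experts per token and only the top-k run.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)  # router: one score per expert
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)      # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = idx[:, k] == e                     # tokens sent to expert e
                if routed.any():
                    out[routed] += weights[routed, k:k + 1] * expert(x[routed])
        return out

tokens = torch.randn(16, 64)          # 16 token embeddings
print(TinyMoE()(tokens).shape)        # each token only activates 2 of 8 experts
```

The efficiency win comes entirely from that routing step: total parameters can keep growing with the number of experts while per-token compute stays tied to the few experts actually selected—exactly the 671-billion-total, 37-billion-active split quoted above.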
2. DeepSeekMLA: Solving the KV Cache Bottleneck
Arguably an even greater breakthrough than MoE, Multi-head Latent Attention (MLA) is a fundamental innovation in the all-important attention mechanism of Transformers.
As a refresher, the attention mechanism allows every token in an input sequence to "attend" to every other token. During inference, this requires key-value (KV) pairs for each token, and the number of cached pairs grows linearly with the length of the context. To avoid recomputing these vectors for every new token, they are stored in memory—the KV cache—for fast retrieval. However, this cache quickly becomes a major memory bottleneck as context windows grow.
DeepSeekMLA solves this issue by introducing a compression step into the KV cache mechanism. Instead of storing full-size key-value vectors, MLA projects them into a lower-dimensional latent space, compressing the information while retaining enough fidelity to project them back when needed. This approach reduces the required KV cache per query by an astonishing 93.3% compared to standard attention, drastically lowering chip memory costs and hence inference costs per token.
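Below is a minimal sketch of that compression idea, with illustrative dimensions that are not DeepSeek's actual configuration (and ignoring details such as how rotary position information is handled): cache one small latent vector per token and re-expand keys and values from it on demand.

```python
# MLA-style KV compression sketch: store a low-dimensional latent per token
# instead of full per-head keys and values. Dimensions are illustrative.
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 16, 64
seq_len = 2048                                             # tokens already in context

W_down = nn.Linear(d_model, d_latent, bias=False)          # compress hidden state
W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False) # re-expand to keys
W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False) # re-expand to values

hidden = torch.randn(seq_len, d_model)
latent_cache = W_down(hidden)          # this small tensor is all that gets cached

# At attention time, keys and values are reconstructed from the latent cache.
k = W_up_k(latent_cache).view(seq_len, n_heads, d_head)
v = W_up_v(latent_cache).view(seq_len, n_heads, d_head)

full_cache = seq_len * 2 * n_heads * d_head                # standard K+V cache size
print(f"cache reduction: {1 - latent_cache.numel() / full_cache:.1%}")
```

With these toy sizes the saving happens to land in the same ballpark as the figure quoted above; the actual gain depends on the chosen latent dimension, and the approach trades a little extra re-expansion compute for a much smaller memory footprint per token.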