3 things from the DeepSeek-R1 paper

1️⃣ Skipping Supervised Fine-Tuning: Better Reasoning, Poorer Readability

⛳ Most LLMs follow three stages: pretraining for language understanding, supervised fine-tuning for task-specific learning, and reinforcement learning to align with human preferences using reward systems.

⛳ DeepSeek-R1-Zero broke this mold by skipping supervised fine-tuning entirely and relying solely on reinforcement learning.

⛳ According to the authors, this allowed the model to independently develop reasoning skills, including the ability to allocate extended “thinking time” and generate thousands of reasoning tokens for solving complex tasks. This unconventional approach significantly boosted performance, even surpassing OpenAI-o1 on benchmarks.

⛳ However, the generated text is notably less readable, an acknowledged limitation of DeepSeek-R1-Zero.
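
To make the RL-only idea concrete, here is a minimal sketch of the mechanism behind the paper's training recipe (GRPO with rule-based rewards): sample several completions for the same prompt, score each with a simple correctness check, and normalize the rewards within the group so no learned value network is needed. The reward rule and example strings below are simplified assumptions, not the paper's exact implementation.

```python
import numpy as np

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 if the final boxed answer matches, else 0.0.
    (The paper also uses a format reward for <think>...</think> tags.)"""
    answer = completion.split("\\boxed{")[-1].rstrip("}").strip()
    return 1.0 if answer == reference_answer else 0.0

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantages: normalize each reward against the group of
    completions sampled for the same prompt (no value network needed)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical example: 4 completions sampled for one math prompt
completions = [
    "... reasoning ... \\boxed{42}",
    "... reasoning ... \\boxed{41}",
    "... reasoning ... \\boxed{42}",
    "... reasoning ... \\boxed{7}",
]
rewards = [rule_based_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)      # [1.0, 0.0, 1.0, 0.0]
print(advantages)   # positive for correct completions, negative for wrong ones
```

Completions with above-average rewards get positive advantages and are reinforced; the policy update itself (clipped policy-gradient step) is omitted here.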

2️⃣ High-Quality Data Remains the Moat
⛳ While DeepSeek-R1-Zero achieved remarkable results with pure reinforcement learning, introducing a small set of carefully curated cold-start data (thousands of high-quality examples) for fine-tuning in DeepSeek-R1 led to notable improvements in readability, language consistency, and reasoning.
⛳ This highlights how even minimal amounts of high-quality data can dramatically enhance the effectiveness of RL-trained models.
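
For intuition, here is a rough illustration of what cold-start supervised fine-tuning data can look like: a long chain of thought paired with a readable summary, rendered into a single training string for ordinary next-token-prediction fine-tuning. The field names and template tokens are illustrative assumptions, not the paper's exact schema.

```python
# Illustrative cold-start example: a long chain of thought followed by a
# human-readable summary, the readability pattern DeepSeek-R1 is fine-tuned
# on before the RL stage. Field names and the template are assumptions.
cold_start_examples = [
    {
        "prompt": "What is 17 * 24?",
        "reasoning": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
        "summary": "17 * 24 = 408.",
    },
    # ... a few thousand curated examples in the real pipeline ...
]

TEMPLATE = "<|user|>{prompt}<|assistant|><think>{reasoning}</think>{summary}"

def to_sft_text(example: dict) -> str:
    """Render one curated example into a single training string for
    standard supervised fine-tuning."""
    return TEMPLATE.format(**example)

for ex in cold_start_examples:
    print(to_sft_text(ex))
```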

3️⃣ Distillation Over Training for Smaller Models
⛳ DeepSeek-R1’s reasoning capabilities were distilled into smaller models ranging from 1.5B to 70B parameters. These distilled models consistently outperformed much larger models like GPT-4o and Claude-3.5-Sonnet on multiple benchmarks.
⛳ This demonstrates that distillation allows smaller models to inherit remarkable reasoning abilities from larger, more powerful models, often outperforming models trained from scratch.
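
A minimal sketch of the distillation recipe the paper describes: the large teacher (DeepSeek-R1) generates reasoning traces, and the smaller student is fine-tuned on them with plain supervised learning (no extra RL stage for the distilled models). The teacher.generate and student.sft_step calls below are placeholders, not a real library API.

```python
# Sequence-level distillation sketch: collect teacher reasoning traces,
# then fine-tune the student to imitate them. Function names are placeholders.

def generate_teacher_traces(teacher, prompts, samples_per_prompt=1):
    """Collect (prompt, teacher_completion) pairs to use as SFT targets."""
    traces = []
    for p in prompts:
        for _ in range(samples_per_prompt):
            traces.append({"prompt": p, "target": teacher.generate(p)})
    return traces

def distill(student, traces, epochs=1):
    """Fine-tune the student on the teacher's reasoning traces.
    student.sft_step is a placeholder for one supervised update."""
    for _ in range(epochs):
        for t in traces:
            student.sft_step(t["prompt"], t["target"])
    return student
```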

Author of article: Durgesh