DeepSeek-R1
Summary
- Input: DeepSeek-V3-Base
  - Trained using GRPO on data that has a question and a verifiable answer (see the GRPO sketch after this outline)
    - Format reward
    - Accuracy reward
  - Output: DeepSeek-R1-Zero
- Input: DeepSeek-V3-Base
  - SFT on ~5k high-quality long-CoT reasoning samples (cold-start data)
  - Output: Deepseek-v3-1 (cold-start SFT checkpoint)
- Input: Deepseek-v3-1
  - Trained using GRPO on data that has a question and a verifiable answer
    - Format reward
    - Accuracy reward
    - Language-consistency reward (CoT should stay in the language of the question)
  - Output: Deepseek-v3-2 (reasoning-oriented RL checkpoint)
- Input: Deepseek-v3-2 + DeepSeek-V3
  - Input: Deepseek-v3-2
    - Sample reasoning data (synthetic data)
    - Rejection sampling: keep only correct responses; for some data, DeepSeek-V3 acts as a generative reward model judging the sampled answer against the ground truth
    - Output: 600k reasoning samples
  - Input: DeepSeek-V3
    - Sample non-reasoning data (reusing parts of the DeepSeek-V3 SFT pipeline)
      - writing
      - factual QA
      - translation
    - Output: 200k non-reasoning samples
  - Output: 800k combined SFT samples
- Input: DeepSeek-V3-Base
  - SFT on the 800k samples (two epochs)
  - RL for all scenarios: GRPO on data that has a question and a verifiable answer, plus general prompts
    - Format reward
    - Accuracy reward
    - Reward signals for human preference (helpfulness + harmlessness)
  - Output: DeepSeek-R1
*Reasoning data = input (question), reasoning path (CoT), and final output (answer)
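The pipeline above leans on GRPO at several stages. A minimal sketch of its group-relative advantage computation (the part that replaces a learned critic) is below; the function and variable names are illustrative, not from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO (Shao et al., 2024):
    for one prompt, sample a group of responses, score each with the
    (rule-based) reward, and normalize within the group instead of
    using a learned value/critic model."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean_r) / std_r for r in rewards]

# Example: 4 sampled responses to the same question, 2 of them correct
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # correct answers get positive advantage
```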
References
- Understanding Reasoning LLMs - by Sebastian Raschka, PhD
- The Illustrated DeepSeek-R1 - by Jay Alammar
- A Visual Guide to Reasoning LLMs - by Maarten Grootendorst
- Sky-T1: Train your own O1 preview model within $450
- Open-R1: a fully open reproduction of DeepSeek-R1
Annotations
« Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. »(3)
« DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. »(3)
« DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline »(3)
« R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. »(4)
« We demonstrate that the reasoning patterns of larger models can be distilled into smaller models »(4)
« we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start »(5)
« To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards »(6)
« Accuracy rewards »(6)
« Format rewards: »(6)
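A minimal sketch of what these two rule-based rewards could look like. The <think>/<answer> tags follow the paper's prompt template, but the exact matching logic and the 0/1 reward values here are assumptions.

```python
import re

THINK_ANSWER = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """Format reward: 1.0 if reasoning and answer are wrapped in
    <think>...</think><answer>...</answer> tags, else 0.0."""
    return 1.0 if THINK_ANSWER.fullmatch(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Accuracy reward: 1.0 if the <answer> block matches the verifiable
    ground truth (string match here; the paper also mentions compilers
    and test cases for code problems)."""
    m = THINK_ANSWER.fullmatch(response.strip())
    if m is None:
        return 0.0
    return 1.0 if m.group(2).strip() == ground_truth.strip() else 0.0

resp = "<think>2 + 2 is 4</think><answer>4</answer>"
print(format_reward(resp), accuracy_reward(resp, "4"))  # 1.0 1.0
```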
« Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. »(7)
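Majority voting here means sampling several final answers per question and keeping the most frequent one; a toy sketch (not the paper's implementation):

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Pick the most frequent final answer among several samples
    for the same question (self-consistency-style majority voting)."""
    return Counter(sampled_answers).most_common(1)[0][0]

print(majority_vote(["4", "4", "5", "4"]))  # "4"
```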
The response length never saturates, so I wonder whether it will keep increasing even for simple questions that don't need long reasoning, like 2 + 2 = ?
« For instance, DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. »(9)
« Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? »(9)
« How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? »(9)
« for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor »(9)
« we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT »(10)
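A rough sketch of that proportion-based reward; the token-level language check is a placeholder predicate, since the paper does not specify how words are classified.

```python
def language_consistency_reward(cot_tokens, is_target_language):
    """Proportion of CoT tokens written in the target (question) language.
    `is_target_language` is a caller-supplied predicate; a real language
    detector would be needed in practice."""
    if not cot_tokens:
        return 0.0
    hits = sum(1 for tok in cot_tokens if is_target_language(tok))
    return hits / len(cot_tokens)

# Toy example: treat pure-ASCII tokens as "English"
is_english = lambda tok: tok.isascii()
print(language_consistency_reward(["First", "compute", "2+2", "然后", "answer"], is_english))  # 0.8
```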
« we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round »(10)
« However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. »(10)
« For each prompt, we sample multiple responses and retain only the correct ones »(10)
« In total, we collect about 600k reasoning related training samples. »(10)
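A minimal sketch of this rejection-sampling step; `generate` and `judge_correct` are stand-ins for the stage-2 checkpoint and the correctness check (rule-based, or DeepSeek-V3 acting as a generative judge), not APIs from the paper.

```python
def rejection_sample(prompt, ground_truth, generate, judge_correct, n_samples=16):
    """Sample several responses for one prompt and keep only those judged
    correct against the ground truth."""
    kept = []
    for _ in range(n_samples):
        response = generate(prompt)  # one sampled CoT + answer
        if judge_correct(response, ground_truth):
            kept.append({"prompt": prompt, "response": response})
    return kept

# Toy usage with stand-in callables
demo = rejection_sample(
    "What is 2 + 2?", "4",
    generate=lambda p: "<think>2 + 2 = 4</think><answer>4</answer>",
    judge_correct=lambda resp, gt: gt in resp,
    n_samples=4,
)
print(len(demo))  # all 4 samples kept in this toy case
```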
« In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning. »(11)
« We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples. »(11)
« Specifically, we train the model using a combination of reward signals and diverse prompt distributions. »(11)
« we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3 »(11)
« leaving the exploration of the RL stage to the broader research community. »(11)
« Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks. »(14)
« First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. »(15)
« Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning. »(15)
« To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search »(15)
« DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. »(16)
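Distillation here is plain SFT on teacher outputs. A sketch of how one such training record might be assembled; the field names and tag format are assumptions, not the paper's released format.

```python
import json

def make_distillation_record(question, teacher_cot, teacher_answer):
    """One SFT example distilled from the teacher (DeepSeek-R1):
    the target text is the teacher's full reasoning trace plus answer."""
    return {
        "prompt": question,
        "target": f"<think>{teacher_cot}</think><answer>{teacher_answer}</answer>",
    }

record = make_distillation_record(
    "What is 12 * 13?",
    "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156",
    "156",
)
print(json.dumps(record, ensure_ascii=False))
```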