DeepSeek-R1
Summary
- Input: DeepSeek-V3-Base
  - Trained using GRPO on data that has a question and a verifiable answer (see the GRPO sketch after this outline)
    - Format reward
    - Accuracy reward
  - Output: DeepSeek-R1-Zero
- Input: DeepSeek-V3-Base
  - SFT on ~5k high-quality long-CoT reasoning samples (cold-start data)
  - Output: Deepseek-v3-1 (cold-start SFT checkpoint)
- Input: Deepseek-v3-1
  - Trained using GRPO on data that has a question and a verifiable answer
    - Format reward
    - Accuracy reward
    - Language-consistency reward (CoT should stay in the language of the question)
  - Output: Deepseek-v3-2 (reasoning-oriented RL checkpoint)
- Input: Deepseek-v3-2 + DeepSeek-V3
  - Input: Deepseek-v3-2
    - Sample reasoning data (synthetic data)
    - Rejection sampling: keep only correct responses; for some data, DeepSeek-V3 acts as a generative reward model judging the sampled answer against the ground truth
    - Output: 600k reasoning samples
  - Input: DeepSeek-V3
    - Sample non-reasoning data (reusing parts of the DeepSeek-V3 SFT pipeline)
      - writing
      - factual QA
      - translation
    - Output: 200k non-reasoning samples
  - Output: 800k combined SFT samples
- Input: DeepSeek-V3-Base
  - SFT on the 800k samples (two epochs)
  - RL for all scenarios: GRPO on data that has a question and a verifiable answer, plus general prompts
    - Format reward
    - Accuracy reward
    - Reward signals for human preference (helpfulness + harmlessness)
  - Output: DeepSeek-R1
*Reasoning data = input (question), reasoning path (CoT), and final output (answer)
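The pipeline above leans on GRPO at several stages. A minimal sketch of its group-relative advantage computation (the part that replaces a learned critic) is below; the function and variable names are illustrative, not from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO (Shao et al., 2024):
    for one prompt, sample a group of responses, score each with the
    (rule-based) reward, and normalize within the group instead of
    using a learned value/critic model."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean_r) / std_r for r in rewards]

# Example: 4 sampled responses to the same question, 2 of them correct
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # correct answers get positive advantage
```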
References
- Understanding Reasoning LLMs - by Sebastian Raschka, PhD
- The Illustrated DeepSeek-R1 - by Jay Alammar
- A Visual Guide to Reasoning LLMs - by Maarten Grootendorst
- Sky-T1: Train your own O1 preview model within $450
- Open-R1: a fully open reproduction of DeepSeek-R1
Annotations
« Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. »(3)
« DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. »(3)
« DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline »(3)
« R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. »(4)
« We demonstrate that the reasoning patterns of larger models can be distilled into smaller models »(4)
« we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start »(5)
« To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards »(6)
« Accuracy rewards »(6)
« Format rewards: »(6)
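A minimal sketch of what these two rule-based rewards could look like. The <think>/<answer> tags follow the paper's prompt template, but the exact matching logic and the 0/1 reward values here are assumptions.

```python
import re

THINK_ANSWER = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """Format reward: 1.0 if reasoning and answer are wrapped in
    <think>...</think><answer>...</answer> tags, else 0.0."""
    return 1.0 if THINK_ANSWER.fullmatch(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Accuracy reward: 1.0 if the <answer> block matches the verifiable
    ground truth (string match here; the paper also mentions compilers
    and test cases for code problems)."""
    m = THINK_ANSWER.fullmatch(response.strip())
    if m is None:
        return 0.0
    return 1.0 if m.group(2).strip() == ground_truth.strip() else 0.0

resp = "<think>2 + 2 is 4</think><answer>4</answer>"
print(format_reward(resp), accuracy_reward(resp, "4"))  # 1.0 1.0
```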
« Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. »(7)
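Majority voting here means sampling several final answers per question and keeping the most frequent one; a toy sketch (not the paper's implementation):

```python
from collections import Counter

def majority_vote(sampled_answers):
    """Pick the most frequent final answer among several samples
    for the same question (self-consistency-style majority voting)."""
    return Counter(sampled_answers).most_common(1)[0][0]

print(majority_vote(["4", "4", "5", "4"]))  # "4"
```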
The response length never saturates, so I wonder whether it will keep increasing even for simple questions that don't need long reasoning, like 2 + 2 = ?
« For instance, DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. »(9)
« Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? »(9)
« How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? »(9)
« for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor »(9)
« we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT »(10)
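A rough sketch of that proportion-based reward; the token-level language check is a placeholder predicate, since the paper does not specify how words are classified.

```python
def language_consistency_reward(cot_tokens, is_target_language):
    """Proportion of CoT tokens written in the target (question) language.
    `is_target_language` is a caller-supplied predicate; a real language
    detector would be needed in practice."""
    if not cot_tokens:
        return 0.0
    hits = sum(1 for tok in cot_tokens if is_target_language(tok))
    return hits / len(cot_tokens)

# Toy example: treat pure-ASCII tokens as "English"
is_english = lambda tok: tok.isascii()
print(language_consistency_reward(["First", "compute", "2+2", "然后", "answer"], is_english))  # 0.8
```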
« we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round »(10)
« However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. »(10)
« For each prompt, we sample multiple responses and retain only the correct ones »(10)
« In total, we collect about 600k reasoning related training samples. »(10)
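A minimal sketch of this rejection-sampling step; `generate` and `judge_correct` are stand-ins for the stage-2 checkpoint and the correctness check (rule-based, or DeepSeek-V3 acting as a generative judge), not APIs from the paper.

```python
def rejection_sample(prompt, ground_truth, generate, judge_correct, n_samples=16):
    """Sample several responses for one prompt and keep only those judged
    correct against the ground truth."""
    kept = []
    for _ in range(n_samples):
        response = generate(prompt)  # one sampled CoT + answer
        if judge_correct(response, ground_truth):
            kept.append({"prompt": prompt, "response": response})
    return kept

# Toy usage with stand-in callables
demo = rejection_sample(
    "What is 2 + 2?", "4",
    generate=lambda p: "<think>2 + 2 = 4</think><answer>4</answer>",
    judge_correct=lambda resp, gt: gt in resp,
    n_samples=4,
)
print(len(demo))  # all 4 samples kept in this toy case
```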
« In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning. »(11)
« We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples. »(11)
« Specifically, we train the model using a combination of reward signals and diverse prompt distributions. »(11)
« we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3 »(11)
« leaving the exploration of the RL stage to the broader research community. »(11)
« Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks. »(14)
« First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. »(15)
« Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning. »(15)
« To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search »(15)
« DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. »(16)
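Distillation here is plain SFT on teacher outputs. A sketch of how one such training record might be assembled; the field names and tag format are assumptions, not the paper's released format.

```python
import json

def make_distillation_record(question, teacher_cot, teacher_answer):
    """One SFT example distilled from the teacher (DeepSeek-R1):
    the target text is the teacher's full reasoning trace plus answer."""
    return {
        "prompt": question,
        "target": f"<think>{teacher_cot}</think><answer>{teacher_answer}</answer>",
    }

record = make_distillation_record(
    "What is 12 * 13?",
    "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156",
    "156",
)
print(json.dumps(record, ensure_ascii=False))
```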