DeepSeek-R1

Summary

  1. Input: DeepSeek-V3-Base

    • Trained with GRPO on data that has a question and a verifiable answer (see the reward and GRPO sketches after this list)
      • Formatting reward
      • Accuracy reward
    • Output: DeepSeek-R1-Zero
  2. Input: DeepSeek-V3-Base

    • SFT on the ~5,000 high-quality long-CoT cold-start reasoning samples
    • Output: DeepSeek-V3-1 (intermediate checkpoint)
  3. Input: DeepSeek-V3-1

    • Trained with GRPO on data that has a question and a verifiable answer
      • Formatting reward
      • Accuracy reward
      • Language-consistency reward (CoT should stay in the question's language)
    • Output: DeepSeek-V3-2 (intermediate checkpoint)
  4. Input: DeepSeek-V3-2 + DeepSeek-V3

    • Input: DeepSeek-V3-2
      • Sample reasoning data (synthetic data)
      • Rejection sampling: keep only correct responses, using rule checks and DeepSeek-V3 as a generative judge of the sampled answer against the ground truth (see the rejection-sampling sketch after this list)
      • Output: 600k reasoning samples
    • Input: DeepSeek-V3 (reusing parts of its SFT pipeline/data)
      • Sample non-reasoning data
        • writing
        • factual QA
        • translation
      • Output: 200k non-reasoning samples
    • Output: 800k samples total (600k reasoning + 200k non-reasoning)
  5. Input: DeepSeek-V3-Base

    • SFT on the 800k samples (two epochs)
    • Then trained with GRPO over diverse prompts, combining reward signals
      • Formatting reward
      • Accuracy reward (verifiable answers for reasoning data)
      • Human-preference reward signals (helpfulness + harmlessness)
    • Output: DeepSeek-R1

*Reasoning data == (input, reasoning path, output)
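
A minimal sketch of how the rule-based rewards used in steps 1, 3, and 5 might be computed. The <think>/<answer> template follows the paper's described format; the exact parsing, the word-set language check, and the unweighted sum are my own assumptions, not the paper's implementation.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning and answer in the expected tags."""
    pattern = r"^<think>.+?</think>\s*<answer>.+?</answer>\s*$"
    return 1.0 if re.match(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, gold_answer: str) -> float:
    """1.0 if the final answer matches the verifiable ground truth
    (string match here; math/code answers need real checkers)."""
    m = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

def language_consistency_reward(response: str, target_lang_words: set[str]) -> float:
    """Proportion of CoT words in the target language (the paper defines the
    reward this way; the word-set lookup stands in for a real language ID)."""
    m = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    words = m.group(1).split() if m else []
    if not words:
        return 0.0
    return sum(w.lower() in target_lang_words for w in words) / len(words)

def total_reward(response: str, gold_answer: str, target_lang_words: set[str]) -> float:
    # Illustrative unweighted sum; the actual weighting is not given in my notes.
    return (format_reward(response)
            + accuracy_reward(response, gold_answer)
            + language_consistency_reward(response, target_lang_words))
```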
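
Since every RL stage above uses GRPO, here is a sketch of its group-relative advantage computation: rewards for a group of sampled responses are normalized by the group mean and standard deviation, so no learned critic is needed. The clipping and KL terms of the full objective are omitted.

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages for one prompt's sampled responses."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1e-8  # avoid division by zero
    return [(r - mean) / std for r in group_rewards]
```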
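
And a rough sketch of the rejection-sampling loop in step 4 that produces the 600k reasoning samples: sample several responses per prompt, keep only those judged correct, and use the survivors as SFT data. The `generate` and `judge` callables are placeholders; per the paper, DeepSeek-V3 acts as a generative judge where answers are not trivially checkable.

```python
from typing import Callable

def rejection_sample(
    prompts: list[dict],                         # each: {"question": ..., "gold": ...}
    generate: Callable[[str, int], list[str]],   # checkpoint being sampled
    judge: Callable[[str, str], bool],           # rule check or model-as-judge
    samples_per_prompt: int = 8,
) -> list[dict]:
    """Retain only responses the judge accepts; the kept
    (question, reasoning, answer) triples become SFT samples."""
    kept = []
    for item in prompts:
        for response in generate(item["question"], samples_per_prompt):
            if judge(response, item["gold"]):
                kept.append({"prompt": item["question"], "response": response})
    return kept
```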

References

  1. Understanding Reasoning LLMs - by Sebastian Raschka, PhD
  2. The Illustrated DeepSeek-R1 - by Jay Alammar
  3. A Visual Guide to Reasoning LLMs - by Maarten Grootendorst
  4. Sky-T1: Train your own O1 preview model within $450
  5. Open-R1: a fully open reproduction of DeepSeek-R1
    1. Open-R1: Update #1

Annotations

Annotation

« Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. »(3)

Annotation

« DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. »(3)

Annotation

« DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline »(3)

Annotation

« R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. »(4)

Annotation

« We demonstrate that the reasoning patterns of larger models can be distilled into smaller models »(4)

Annotation

« we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start »(5)

Annotation

« To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards »(6)

Annotation

« Accuracy rewards »(6)

Annotation

« Format rewards: »(6)

Annotation

« Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. »(7)
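
As a side note on the majority-voting result quoted above, a minimal sketch of answer-level majority voting (self-consistency): sample several completions and return the most frequent final answer. The answer-extraction regex is an assumption.

```python
import re
from collections import Counter

def majority_vote(responses: list[str]) -> str | None:
    """Most common final answer across sampled responses."""
    answers = []
    for r in responses:
        m = re.search(r"<answer>(.*?)</answer>", r, flags=re.DOTALL)
        if m:
            answers.append(m.group(1).strip())
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```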

Annotation

The response length never saturates, so I wonder whether it will keep increasing even for questions that don't need long reasoning, like 2 + 2 = ?

Annotation

« For instance, DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. »(9)

Annotation

« Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? »(9)

Annotation

« How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? »(9)

Annotation

« for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor »(9)

Annotation

« we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT »(10)

Annotation

« we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round »(10)

Annotation

« However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. »(10)

Annotation

« For each prompt, we sample multiple responses and retain only the correct ones »(10)

Annotation

« In total, we collect about 600k reasoning related training samples. »(10)

Annotation

« In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning. »(11)

Annotation

« We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples. »(11)

Annotation

« Specifically, we train the model using a combination of reward signals and diverse prompt distributions. »(11)

Annotation

« we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3 »(11)

Annotation

« leaving the exploration of the RL stage to the broader research community. »(11)

Annotation

« Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks. »(14)

Annotation

« First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. »(15)

Annotation

« Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning. »(15)

Annotation

« To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search »(15)

Annotation

« DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. »(16)

