
DeepSeek's Recipe Evolution


Interestingly, DeepSeek has been using RL with GRPO for reasoning since early 2024, but the early results weren’t as impressive.

So what changed?

  1. They asked the model to generate a chain of thought (CoT) before answering.
  2. They used a two-line rule as the reward instead of a complicated reward model.
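
To make the second point concrete, here is a minimal sketch of what a rule-based reward could look like. It is not DeepSeek's actual rule: the `<think>`/`<answer>` tag format, the `rule_reward` name, and the exact-match check are all assumptions made for illustration. The idea is only that the reward is a cheap deterministic check (did the model reason first, and is the final answer right?) rather than a learned model.

```python
import re

# Assumed (hypothetical) output format: reasoning inside <think>...</think>,
# followed by the final answer inside <answer>...</answer>.
PATTERN = re.compile(
    r"\s*<think>.+?</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def rule_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the completion follows the format and the answer matches, else 0.0."""
    match = PATTERN.fullmatch(completion)
    return float(match is not None and match.group(1).strip() == reference.strip())
```

For example, `rule_reward("<think>2 + 2 is 4</think><answer>4</answer>", "4")` gives full reward, while a completion that skips the reasoning block or gets the answer wrong gets zero. The body of the function really is two lines, which is the point.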

And boom!
