DeepSeek's Recipe Evolution
Interestingly, DeepSeek has been using RL with GRPO for reasoning since early 2024, but the earlier results weren’t as impressive.
So what changed?
- They asked the model to generate a chain of thought (CoT) before answering.
- They used a two-line rule as the reward instead of a complicated reward model.
And boom!
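To make the "two-line rule" idea concrete, here is a minimal sketch of what a rule-based reward could look like. The exact rule isn't spelled out above, so everything here is an assumption for illustration: the `<think>`/`<answer>` tag format, the exact-match comparison, the 0/1 scoring, and the hypothetical function name `rule_reward`.

```python
import re

def rule_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 only if format and answer are both right."""
    # Line 1 (format check, assumed): reasoning inside <think> tags, then the final
    # answer inside <answer> tags, with nothing else around them.
    match = re.fullmatch(r"\s*<think>.+?</think>\s*<answer>(.+?)</answer>\s*", completion, flags=re.DOTALL)
    # Line 2 (accuracy check, assumed): the extracted answer must equal the reference.
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0
```

The point of a rule like this is that it has no learned parameters to train or to exploit: the model only gets credit for reasoning first and landing on the right answer.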