What Changed with DeepSeek's Recipe?
Interestingly, DeepSeek has been using RL with GRPO for reasoning since early 2024, but the results weren’t as impressive.
So what changed?
- They asked the model to generate a chain of thought (CoT) before answering.
- They used a two-line rule as the reward instead of a complicated reward model.
And boom!
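
To make the "two-line rule" concrete, here is a minimal sketch of what such a reward could look like. It is not DeepSeek's actual code: the `<think>` tag format, the exact-match comparison, the 0/1 scoring, and the name `rule_based_reward` are all assumptions for illustration.

```python
import re

# Assumed output format: reasoning inside <think>...</think>, then the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the completion reasons first and then gives the right answer."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        # Rule 1 (format): a non-empty CoT must come before the answer.
        return 0.0
    final_answer = match.group(2).strip()
    # Rule 2 (accuracy): the final answer must match the reference exactly.
    return 1.0 if final_answer == reference_answer.strip() else 0.0


if __name__ == "__main__":
    print(rule_based_reward("<think>2 + 2 is 4</think> 4", "4"))
    print(rule_based_reward("4", "4"))
```

The point of a reward like this is that it needs no learned model: it only checks things that can be verified mechanically, which is why it works best on tasks with a known correct answer, such as math problems.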