OpenCoder Uses 3X Less Data

OpenCoder is an amazing piece of work! It's 100% open and comes with RefineCode, a new 960B-token dataset spanning 600+ programming languages. RefineCode matches StackV2's performance with 3X fewer tokens!

But that's not all. 🧵 https://x.com/sivil_taram/status/1855301760770056246

They also use the new WSD (warmup-stable-decay) learning-rate schedule that I wrote about last year (and which was studied and published independently this year). In the cooldown phase, they also mix in high-quality synthetic data, inspired by the Phi models from my team. All of this truly pushes OpenCoder to new heights.
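
For readers unfamiliar with WSD, here's a minimal sketch of what such a schedule looks like. The step counts and learning rates below are illustrative placeholders, not OpenCoder's actual training configuration:

```python
def wsd_lr(step, max_lr=3e-4, min_lr=3e-5,
           warmup_steps=2_000, stable_steps=90_000, decay_steps=8_000):
    """Warmup-Stable-Decay (WSD) learning-rate schedule.

    All hyperparameters here are made-up placeholders for
    illustration, not OpenCoder's actual values.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return max_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        # Long stable phase: hold the peak learning rate constant.
        return max_lr
    # Cooldown: decay toward min_lr. This is the phase where
    # OpenCoder mixes in high-quality synthetic data.
    progress = (step - warmup_steps - stable_steps) / decay_steps
    return max_lr - (max_lr - min_lr) * min(progress, 1.0)
```

Unlike a cosine schedule, the stable phase means you can extend training or branch off a cooldown at any point without committing to a total step count up front.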

The amazing part is that everything is reproducible with open code and data!

It comes from a team at InfTech in China that you may not have heard of.

In a strange twist of events, Chinese startups are the ones practicing and advancing open science in AI.

They deserve huge kudos.
