
When 'C' Speeds: GPT-2 Training Cut from 14 Hours to 90 Minutes

·45 words·1 min

Training GPT-2 (124M params) in 90 minutes is stunning! For reference, PyTorch + FlashAttention used to take about 14 hours on the same 8×A100 setup with 10B tokens!

I actually didn’t think rewriting everything in C would make much of a difference, since Python/PyTorch aren’t exactly the bottlenecks… https://x.com/karpathy/status/1795484547267834137
