Pretraining Has Gotten 10x Faster in the Past Year!
While digging up some historical numbers, it hit me that LLM training is now ~10x faster than it was at the same time last year, thanks to a pile of improvements: H100 availability, FlashAttention-2, new kernels, torch.compile, CUDA graphs, FP8, and more!
That’s just the past 12 months!!
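To make a couple of those items concrete, here is a minimal sketch (not the setup behind the numbers above, just an illustration) of turning on two of them in stock PyTorch 2.x: fused FlashAttention-style attention via `scaled_dot_product_attention`, and kernel fusion plus CUDA-graph capture via `torch.compile`. The `TinySelfAttention` module is a toy model invented for the example; FP8 usually needs extra tooling (e.g. NVIDIA Transformer Engine) and is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySelfAttention(nn.Module):
    """Toy single-head self-attention block, only here to illustrate the APIs."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Dispatches to a fused attention kernel (FlashAttention-2 backend)
        # on supported GPUs instead of the naive matmul + softmax path.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinySelfAttention(dim=256).to(device)

# torch.compile fuses ops; "reduce-overhead" mode also captures CUDA graphs
# on GPU to cut per-step launch overhead.
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(4, 128, 256, device=device)
# Mixed precision (bf16) is another of the cheap wins listed above.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = compiled_model(x)
print(y.shape)
```

Each of these is roughly a one-line change on top of an existing training loop, which is a big part of why the speedups stacked up so quickly.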