No Excuses: Triple Your LLM Training Speed with SwiGLU, ALiBi & μP
If you are doing LLM training runs (>1B parameters), you ought to do these 3 things:
- Use SwiGLU
- Use ALiBi
- Use µP
Why? Your training will be almost 3X faster!
You can do 3 runs for the price of 1.
You can go for a much bigger model or train longer.
There is no excuse. (Rough sketches of all three techniques are below.)
1/3
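To make the three bullets concrete, here is a minimal SwiGLU feed-forward block as it might look in PyTorch. This is an illustrative sketch, not the exact module from the paper; the dimension names (`d_model`, `d_ff`) are placeholders, and `d_ff` is typically set to roughly 2/3 of the usual 4x expansion so the parameter count matches a plain GELU MLP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit (Shazeer, 2020)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) elementwise-multiplied with up(x), then projected down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```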
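ALiBi replaces positional embeddings with a per-head linear penalty added to the attention logits. The sketch below builds that bias tensor under the standard recipe for a power-of-two head count; it is a simplified illustration, and the causal mask is still applied separately as usual.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build an ALiBi attention bias (Press et al., 2021).

    Returns a (n_heads, seq_len, seq_len) tensor to add to attention logits
    before softmax; no positional embeddings are needed.
    """
    # Head-specific slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    # (the standard recipe when n_heads is a power of two)
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    # Relative offset (key position j) - (query position i)
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]            # (seq_len, seq_len)
    return slopes[:, None, None] * rel[None]     # (n_heads, seq_len, seq_len)

# Usage sketch: logits = q @ k.transpose(-1, -2) / d_head**0.5 + alibi_bias(H, T)
```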
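µP (Maximal Update Parametrization) is about making hyperparameters, especially the learning rate, transfer from a small proxy model to the full-width model. The sketch below only illustrates one piece of the recipe, width-scaled Adam learning rates for matrix-like weights; the function name, the `base_width`/`width` arguments, and the "embed" name check are all assumptions for illustration. A real setup would also scale initializations and the output-logit multiplier (for example with the `mup` package).

```python
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float, width: int, base_width: int):
    """Rough µP-style Adam parameter groups (Yang & Hu, 2021), LR part only.

    Assumption: hidden matrix-like weights get lr * base_width / width,
    while vector-like params (biases, norms) and embeddings keep the base lr.
    """
    matrix_like, vector_like = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            matrix_like.append(p)
        else:
            vector_like.append(p)
    return [
        {"params": matrix_like, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]

# Usage sketch: tune base_lr on a narrow (base_width) model, then reuse it at
# the target width via these groups. That transfer is what µP buys you.
# optimizer = torch.optim.AdamW(mup_param_groups(model, 2e-3, width=4096, base_width=256))
```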
It is just sad to see that many of the latest multi-million-dollar runs still use old architectures and training recipes. It might be because there hasn't been much clarity on how much difference these arch+training upgrades make. Fortunately, this beautifully executed paper has a good study: https://arxiv.org/abs/2309.11568
While I loved the ablation studies and the paper is very nicely done, performance at the 3B model size seems to have stayed more or less in the same ballpark. There is also a serious issue across all open-source datasets with how many code tokens they contain and when those tokens are used during training.
Still a lot to learn!