No Excuses: Triple Your LLM Training Speed with SwiGLU, ALiBi & μP
If you are doing LLM training runs (>1B parameters), you ought to do these 3 things:
- Use SwiGLU
- Use ALiBi
- Use µP
Why? Your training will be almost 3X faster!
You can do 3 runs for the price of 1.
You can go for a much bigger model or train longer.
There is no excuse. (Rough sketches of all three techniques are below.)
1/3
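To make the three bullets concrete, here is a minimal SwiGLU feed-forward block as it might look in PyTorch. This is an illustrative sketch, not the exact module from the paper; the dimension names (`d_model`, `d_ff`) are placeholders, and `d_ff` is typically set to roughly 2/3 of the usual 4x expansion so the parameter count matches a plain GELU MLP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit (Shazeer, 2020)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) elementwise-multiplied with up(x), then projected down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```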
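ALiBi replaces positional embeddings with a per-head linear penalty added to the attention logits. The sketch below builds that bias tensor under the standard recipe for a power-of-two head count; it is a simplified illustration, and the causal mask is still applied separately as usual.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Build an ALiBi attention bias (Press et al., 2021).

    Returns a (n_heads, seq_len, seq_len) tensor to add to attention logits
    before softmax; no positional embeddings are needed.
    """
    # Head-specific slopes: geometric sequence 2^(-8/n), 2^(-16/n), ...
    # (the standard recipe when n_heads is a power of two)
    slopes = torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])
    # Relative offset (key position j) - (query position i)
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]            # (seq_len, seq_len)
    return slopes[:, None, None] * rel[None]     # (n_heads, seq_len, seq_len)

# Usage sketch: logits = q @ k.transpose(-1, -2) / d_head**0.5 + alibi_bias(H, T)
```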
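µP (Maximal Update Parametrization) is about making hyperparameters, especially the learning rate, transfer from a small proxy model to the full-width model. The sketch below only illustrates one piece of the recipe, width-scaled Adam learning rates for matrix-like weights; the function name, the `base_width`/`width` arguments, and the "embed" name check are all assumptions for illustration. A real setup would also scale initializations and the output-logit multiplier (for example with the `mup` package).

```python
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_lr: float, width: int, base_width: int):
    """Rough µP-style Adam parameter groups (Yang & Hu, 2021), LR part only.

    Assumption: hidden matrix-like weights get lr * base_width / width,
    while vector-like params (biases, norms) and embeddings keep the base lr.
    """
    matrix_like, vector_like = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            matrix_like.append(p)
        else:
            vector_like.append(p)
    return [
        {"params": matrix_like, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]

# Usage sketch: tune base_lr on a narrow (base_width) model, then reuse it at
# the target width via these groups. That transfer is what µP buys you.
# optimizer = torch.optim.AdamW(mup_param_groups(model, 2e-3, width=4096, base_width=256))
```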
It is just sad to see that many of the latest multi-million-dollar runs still use old architectures and training recipes. It might be because there hasn't been much clarity on how much difference these arch+training upgrades make. Fortunately, this beautifully executed paper has a good study: https://arxiv.org/abs/2309.11568
While I loved the ablation studies and the paper is very nicely done, performance at the 3B model size seems to have stayed more or less in the same ballpark. There is also a serious issue across all open-source datasets with how many code tokens they contain and when those tokens are used during training.
Still a lot to learn!