Bias Cut: Tailoring Transformers for Faster Training

15 February 2023

Two new tricks I need to try out:

Omit the biases in the QKV projections and in the LayerNorms. They slow things down and don’t add much to quality.
Add a LayerNorm on the queries and keys (QK) to allow a higher learning rate (faster training!). Source: https://x.com/PiotrPadlewski/status/1625188301123751936
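
A minimal sketch of both tricks in a single attention head, written in plain NumPy. The function names (`layer_norm`, `attention`), the parameter names (`gain_q`, `gain_k`), the toy shapes, and the choice to normalize each query and key over the head dimension are all assumptions for illustration; none of them come from the linked thread.

```python
import numpy as np


def layer_norm(x, gain, eps=1e-6):
    """LayerNorm over the last axis with a learned gain and NO bias term."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(x, w_q, w_k, w_v, gain_q, gain_k):
    """Single-head self-attention with bias-free projections and QK LayerNorm."""
    # Trick 1: the Q, K, V projections are plain matmuls, with no bias added.
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v
    # Trick 2: normalize queries and keys before the dot product
    # (assumed placement: per token, over the head dimension).
    q = layer_norm(q, gain_q)
    k = layer_norm(k, gain_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


# Toy sizes, chosen only for the example.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
gain_q = np.ones(d_head)
gain_k = np.ones(d_head)

out = attention(x, w_q, w_k, w_v, gain_q, gain_k)
```

In this sketch, multiplying `w_q` by a large constant leaves `out` essentially unchanged, because the LayerNorm rescales the queries before they reach the softmax. That keeps the attention logits from growing with the weights, which is a plausible reason the trick tolerates a higher learning rate.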