Bias Cut: Tailoring Transformers for Faster Training
Two new tricks I need to try out:
- Omit biases in the QKV projections and the LayerNorms. They slow things down and don’t add much to quality.
- Add layer norm on Q and K (before the dot product) to allow for a higher LR (faster training!). https://x.com/PiotrPadlewski/status/1625188301123751936
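
To make both tricks concrete before trying them for real, here is a minimal plain-Python sketch of a single attention logit. It is not taken from the linked thread: the helper names (`layer_norm`, `linear`, `attention_logit`), the toy dimension, and the deliberately oversized weights are all my own assumptions for illustration. As I understand it, QK layer norm helps because it caps how large the attention logits can get, which is the kind of blow-up a high learning rate tends to cause.

```python
import math
import random


def layer_norm(x, gain=None, eps=1e-6):
    """LayerNorm over one vector, with an optional gain and no bias term."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    if gain is None:
        return normed
    return [g * v for g, v in zip(gain, normed)]


def linear(x, weight):
    """Bias-free projection; `weight` is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def attention_logit(x_q, x_k, w_q, w_k, qk_norm=True):
    """Scaled dot-product logit between one query token and one key token."""
    q = linear(x_q, w_q)
    k = linear(x_k, w_k)
    if qk_norm:
        # Normalise queries and keys before the dot product.
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


random.seed(0)
d = 8  # toy head dimension (assumption)

# Oversized weights stand in for projections that have drifted during training.
w_q = [[random.gauss(0, 10) for _ in range(d)] for _ in range(d)]
w_k = [[random.gauss(0, 10) for _ in range(d)] for _ in range(d)]
x_q = [random.gauss(0, 1) for _ in range(d)]
x_k = [random.gauss(0, 1) for _ in range(d)]

raw = attention_logit(x_q, x_k, w_q, w_k, qk_norm=False)
normed = attention_logit(x_q, x_k, w_q, w_k, qk_norm=True)
print(f"logit without QK norm: {raw:.2f}")
print(f"logit with QK norm:    {normed:.2f}")
```

With no gain, each normalised vector has length at most sqrt(d), so the scaled logit can never exceed sqrt(d) in magnitude no matter how large the projection weights grow. A learned gain would loosen that bound, so treat this as intuition rather than a guarantee. In a real framework the bias part is usually just a constructor flag on the linear and norm layers.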