
Bias Cut: Tailoring Transformers for Faster Training


Two new tricks I need to try out:

  1. Omit the bias terms in the QKV projections and in the LayerNorms. They slow things down and add little to quality.

  2. Apply LayerNorm to the queries and keys (QK) before the attention dot product. This allows a higher learning rate (faster training!). https://x.com/PiotrPadlewski/status/1625188301123751936
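
To make both tricks concrete, here is a minimal sketch of single-head self-attention in plain Python (no dependencies, so it is slow and only meant for reading). All names (`layer_norm`, `linear_no_bias`, `attention_logits`, `attention`) are my own and hypothetical, not taken from any library or from the linked thread. It assumes the LayerNorm keeps a learned scale but drops the bias, and that QK normalization is applied per token right after the bias-free Q and K projections.

```python
import math


def layer_norm(x, scale, eps=1e-6):
    """LayerNorm over one vector: learned scale, but no bias term (trick 1)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [s * (v - mean) * inv_std for v, s in zip(x, scale)]


def linear_no_bias(x, w):
    """Project x (length d_in) with w (d_in rows, d_out columns); no bias (trick 1)."""
    d_out = len(w[0])
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(d_out)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_logits(tokens, wq, wk, q_scale, k_scale):
    """Scaled dot-product logits with LayerNorm on queries and keys (trick 2)."""
    d_head = len(wq[0])
    qs = [layer_norm(linear_no_bias(t, wq), q_scale) for t in tokens]
    ks = [layer_norm(linear_no_bias(t, wk), k_scale) for t in tokens]
    return [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in ks]
        for q in qs
    ]


def attention(tokens, wq, wk, wv, q_scale, k_scale):
    """Single-head self-attention using the normalized logits above."""
    logits = attention_logits(tokens, wq, wk, q_scale, k_scale)
    vs = [linear_no_bias(t, wv) for t in tokens]  # values are not normalized
    d_out = len(vs[0])
    out = []
    for row in logits:
        weights = softmax(row)
        out.append([sum(w * v[j] for w, v in zip(weights, vs)) for j in range(d_out)])
    return out
```

One reading of why this helps, stated as my understanding rather than a result from this note: with unit scales, each normalized query and key has norm at most sqrt(d_head), so every logit stays within ±sqrt(d_head) no matter how large the activations get. That bound on the softmax inputs is a plausible reason a higher learning rate stays stable.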
