Learning Rate Warmup: The Hot New Stabilizer
“learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization” (@Arxiv_Daily: https://x.com/Arxiv_Daily/status/1448198932949921801)
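To make the technique concrete, here is a minimal sketch of a linear warmup schedule, the most common form: the learning rate ramps from near zero up to the base rate over a fixed number of steps, then holds. The function name, parameter names, and default values are illustrative assumptions, not taken from the quoted tweet.

```python
def warmup_lr(step, base_lr=0.1, warmup_steps=1000):
    """Linear learning rate warmup (illustrative sketch).

    Ramps the learning rate linearly from base_lr / warmup_steps at
    step 0 up to base_lr at step warmup_steps - 1, then holds it
    constant. Names and defaults here are assumptions for illustration.
    """
    if step < warmup_steps:
        # Linear ramp: fraction of base_lr proportional to progress.
        return base_lr * (step + 1) / warmup_steps
    # After warmup, hold the base learning rate.
    return base_lr
```

In practice this would be called once per optimizer step, e.g. `warmup_lr(0)` gives a tiny rate of 0.0001, `warmup_lr(999)` reaches the full 0.1, and any later step stays at 0.1; warmup is also commonly combined with a decay schedule (cosine, step, or inverse square root) after the ramp.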