Skip the Warm-up: Transformers with Inner Layer Norm
If layer normalization is placed inside the residual blocks (the Pre-LN arrangement), the learning-rate warm-up stage is not required for training Transformers: http://proceedings.mlr.press/v119/xiong20b/xiong20b.pdf
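To make "inside the residual blocks" concrete, here is a minimal sketch in plain Python that contrasts the two placements. It assumes a layer norm without learned gain or bias and uses a hypothetical `toy_sublayer` as a stand-in for attention or the feed-forward network. The function names are illustrative and not taken from the paper.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-LN: normalize after the residual addition, LN(x + F(x))."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])


def pre_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-LN: normalize inside the residual branch, x + F(LN(x))."""
    fx = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [0.5 * v for v in x]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print("post-LN:", post_ln_block(x, toy_sublayer))
    print("pre-LN: ", pre_ln_block(x, toy_sublayer))
```

In the Pre-LN form the skip connection carries `x` through unchanged, so the normalization only touches the input of the sublayer. In the Post-LN form every block output, including the skip path, passes through the normalization.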