
LLM Outliers: Not So Emergent After All!


One surprising phenomenon in LLMs has been the large activation outliers in certain dimensions: they make quantization hard, yet they were believed to be correlated with emergent abilities.

Well, this paper points out that they might in fact be artifacts of training choices! 1/2

https://pbs.twimg.com/media/GCMIvisasAA2IFi.jpg

Meanwhile, if you are a practitioner, the takeaway message seems to be: use bf16, weight_decay=0.1, dropout=0, and grad_clip=1 if you want your model to be more quantization-friendly. 2/2

https://arxiv.org/abs/2305.19268
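
For concreteness, here is a minimal PyTorch sketch of what those settings might look like in a training loop. The toy model, learning rate, and loss below are placeholders for illustration, not from the paper; only the bf16 autocast, weight_decay=0.1, dropout=0, and grad-clip=1 choices reflect the takeaway above.

```python
import torch
import torch.nn as nn

# Placeholder model: any transformer LM would be configured the same way.
# Dropout is simply not used anywhere (dropout = 0).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)

# weight_decay = 0.1 as suggested; the learning rate here is arbitrary.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def training_step(inputs, targets):
    # bf16: run the forward pass in bfloat16 via autocast.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    # grad_clip = 1: clip the global gradient norm to 1.0 before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Example usage with random data.
x = torch.randn(8, 1024, device=device)
y = torch.randn(8, 1024, device=device)
print(training_step(x, y))
```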
