Double or Nothing: Scaling Models and Tokens Equally

“we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.” https://x.com/papers_daily/status/1517077669833318406
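The equal-scaling rule in the quote can be illustrated with a small sketch. It assumes training compute is roughly proportional to the product of model size and training tokens; that proportionality is an assumption added here, not something the quote states. Under it, a compute budget that grows by a factor k is spent by multiplying both quantities by the square root of k. The function name `scale_equally` and the starting values are hypothetical, chosen only for illustration.

```python
import math


def scale_equally(params, tokens, compute_multiplier):
    """Scale model size and token count by the same factor.

    Assumption (not from the quoted text): training compute is
    proportional to params * tokens, so growing compute by a factor k
    means growing each of the two by sqrt(k).
    """
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


# Hypothetical starting point: 1e9 parameters trained on 2e10 tokens.
# A 4x compute budget doubles both, matching "for every doubling of
# model size the number of training tokens should also be doubled".
new_params, new_tokens = scale_equally(1e9, 2e10, 4.0)
print(new_params, new_tokens)
```

Read the other way around, doubling the model size while following the rule also doubles the tokens, which under the same assumption costs about four times the compute.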
