Double or Nothing: Scaling Models and Tokens Equally
“We find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.” https://x.com/papers_daily/status/1517077669833318406
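
To make the arithmetic behind the quote concrete, here is a minimal sketch of what "scaled equally" implies for a compute budget. It assumes the common rule of thumb that training compute is roughly 6 × parameters × tokens, which the quote itself does not state. The function names and the baseline numbers (1B parameters, 20B tokens) are hypothetical, chosen only for illustration. Under that assumption, doubling both model size and token count costs about four times the compute, so each quantity grows with the square root of the budget.

```python
import math

# Assumption (not from the quote): training compute C ~= 6 * N * D FLOPs,
# where N is the parameter count and D is the number of training tokens.
FLOPS_PER_PARAM_TOKEN = 6


def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs under the C ~= 6 * N * D rule of thumb."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def scale_equally(n_params: float, n_tokens: float, compute_multiplier: float):
    """Grow model size and token count by the same factor.

    If both are multiplied by f, compute grows by f**2, so a budget that is
    `compute_multiplier` times larger allows f = sqrt(compute_multiplier).
    """
    factor = math.sqrt(compute_multiplier)
    return n_params * factor, n_tokens * factor


# Hypothetical baseline, for illustration only.
base_n, base_d = 1e9, 20e9

# A 4x compute budget: the model and the dataset each double.
new_n, new_d = scale_equally(base_n, base_d, 4.0)

print(f"params: {base_n:.2e} -> {new_n:.2e}")
print(f"tokens: {base_d:.2e} -> {new_d:.2e}")
print(f"compute ratio: {train_flops(new_n, new_d) / train_flops(base_n, base_d):.1f}x")
```

The tokens-per-parameter ratio stays fixed under this rule, which is the practical reading of "double or nothing": spending extra compute on a bigger model alone, without matching data, departs from the equal-scaling recipe in the quote.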