Chinchilla's 2% Solution: Compressing Text with Tiny Models
It occurs to me that the Chinchilla scaling law can also be interpreted as a compute-optimal neural compression law.
That is, it can be restated as:
To compress K bytes of text (under some optimal lossy criterion), a model capacity of roughly K/50 bytes is required.
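A minimal back-of-envelope sketch of where that ratio comes from, assuming Chinchilla's ~20 training tokens per parameter, ~4 bytes of English text per token, and 2-byte (bf16) parameters; all three constants are rough assumptions, not from the original note:

```python
# Rough arithmetic behind the ~2% figure. Assumptions (not exact):
#   - Chinchilla-optimal ratio of ~20 training tokens per parameter
#   - ~4 bytes of raw text per token (typical English BPE average)
#   - 2 bytes per parameter (bf16/fp16 weights)
TOKENS_PER_PARAM = 20
BYTES_PER_TOKEN = 4
BYTES_PER_PARAM = 2

def optimal_model_bytes(text_bytes: float) -> float:
    """Model capacity (in bytes) that Chinchilla deems compute-optimal for this much text."""
    tokens = text_bytes / BYTES_PER_TOKEN
    params = tokens / TOKENS_PER_PARAM
    return params * BYTES_PER_PARAM

K = 1e9  # 1 GB of text
model = optimal_model_bytes(K)
print(f"{model / K:.1%} of the text size")  # ~2.5%, i.e. roughly K/40
```

Depending on the byte-per-token and byte-per-parameter assumptions, this lands somewhere around K/40 to K/50, consistent with the ~2% in the title.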
I find the above form more…
Better numbers are at https://bellard.org/nncp/
For enwik9, gzip achieves only 68% compression (size reduction), while Transformer-based compressors achieve about 88%; both are lossless.
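To make the percentages concrete, here is a small sketch of how "X% compression" is computed as size reduction, using Python's standard gzip module. The file path is a placeholder, and the whole file is read into memory, so this is illustrative only:

```python
# Sketch: measure gzip's size reduction on a file.
# A 68% reduction means the compressed output is 32% of the original size.
import gzip

def gzip_reduction(path: str, level: int = 9) -> float:
    """Fraction of the original size removed by gzip at the given compression level."""
    with open(path, "rb") as f:
        raw = f.read()
    compressed = gzip.compress(raw, compresslevel=level)
    return 1 - len(compressed) / len(raw)

print(f"{gzip_reduction('enwik9'):.0%} reduction")  # placeholder path
```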