Chinchilla's 2% Solution: Compressing Text with Tiny Models
It occurs to me that the Chinchilla scaling law can also be interpreted as a compute-optimal neural compression law.
That is, it can be restated as:
To compress K bytes of text (under some optimal lossy criterion), a model capacity of roughly K/50 bytes is required.
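A minimal back-of-envelope sketch of where that ratio comes from, assuming Chinchilla's ~20 training tokens per parameter, ~4 bytes of English text per token, and 2-byte (bf16) parameters; all three constants are rough assumptions, not from the original note:

```python
# Rough arithmetic behind the ~2% figure. Assumptions (not exact):
#   - Chinchilla-optimal ratio of ~20 training tokens per parameter
#   - ~4 bytes of raw text per token (typical English BPE average)
#   - 2 bytes per parameter (bf16/fp16 weights)
TOKENS_PER_PARAM = 20
BYTES_PER_TOKEN = 4
BYTES_PER_PARAM = 2

def optimal_model_bytes(text_bytes: float) -> float:
    """Model capacity (in bytes) that Chinchilla deems compute-optimal for this much text."""
    tokens = text_bytes / BYTES_PER_TOKEN
    params = tokens / TOKENS_PER_PARAM
    return params * BYTES_PER_PARAM

K = 1e9  # 1 GB of text
model = optimal_model_bytes(K)
print(f"{model / K:.1%} of the text size")  # ~2.5%, i.e. roughly K/40
```

Depending on the byte-per-token and byte-per-parameter assumptions, this lands somewhere around K/40 to K/50, consistent with the ~2% in the title.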
I find the above form more…
Better numbers are at https://bellard.org/nncp/
For enwik9, gzip achieves only 68% compression (size reduction), while Transformer-based compressors achieve about 88%; both are lossless.
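To make the percentages concrete, here is a small sketch of how "X% compression" is computed as size reduction, using Python's standard gzip module. The file path is a placeholder, and the whole file is read into memory, so this is illustrative only:

```python
# Sketch: measure gzip's size reduction on a file.
# A 68% reduction means the compressed output is 32% of the original size.
import gzip

def gzip_reduction(path: str, level: int = 9) -> float:
    """Fraction of the original size removed by gzip at the given compression level."""
    with open(path, "rb") as f:
        raw = f.read()
    compressed = gzip.compress(raw, compresslevel=level)
    return 1 - len(compressed) / len(raw)

print(f"{gzip_reduction('enwik9'):.0%} reduction")  # placeholder path
```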