Chinchilla's 2% Solution: Compressing Text with Tiny Models
·67 words·1 min
It occurs to me that the Chinchilla scaling law can also be interpreted as a compute-optimal neural compression law.
That is, it can be re-stated as:
To compress K bytes of text (by a certain optimal lossy criterion), a model capacity of K/50 bytes is required.
I find the above form more…
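The K/50 figure can be sanity-checked with back-of-the-envelope arithmetic. The constants below are assumptions, not from the post: roughly 5 bytes of English text per BPE token, Chinchilla's ~20 training tokens per parameter, and 2 bytes per fp16 parameter.

```python
BYTES_PER_TOKEN = 5    # assumption: typical BPE tokenization of English text
TOKENS_PER_PARAM = 20  # Chinchilla compute-optimal tokens-per-parameter ratio
BYTES_PER_PARAM = 2    # assumption: fp16 weights

def optimal_model_bytes(text_bytes: float) -> float:
    """Model size in bytes implied by Chinchilla for a given corpus size."""
    tokens = text_bytes / BYTES_PER_TOKEN
    params = tokens / TOKENS_PER_PARAM
    return params * BYTES_PER_PARAM

K = 1e9  # 1 GB of text
print(K / optimal_model_bytes(K))  # ratio of text bytes to model bytes -> 50.0
```

With these assumed constants the ratio comes out to exactly 50; different tokenizers or weight precisions would shift it somewhat.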
Better numbers are at https://bellard.org/nncp/
On enwik9, gzip achieves only 68% compression while transformers achieve 88%, both lossless.
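Taking those two percentages at face value, the implied compressed sizes on enwik9 (10^9 bytes) can be computed directly; the precise figures are on Bellard's page.

```python
ENWIK9 = 10**9  # enwik9 is the first 10^9 bytes of English Wikipedia

# Compression percentage -> implied compressed output size.
for name, pct in [("gzip", 0.68), ("transformer", 0.88)]:
    compressed = ENWIK9 * (1 - pct)
    print(f"{name}: {compressed / 1e6:.0f} MB")
# gzip: 320 MB, transformer: 120 MB
```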