Zipf Happens: Transformers Love Uneven Data
Interesting paper: it argues that transformers work so much better because they operate on “Zipfian” data, where a few items are very frequent and most are rare. Emergent behavior and in-context learning do not appear if the data lacks this property (for example, i.i.d. data).
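
To get a feel for what “Zipfian” means here, below is a minimal sketch in plain Python. It is not the paper's setup: the helper names `zipf_probs` and `sample_counts`, the exponent `alpha=1.0`, and the sizes are all assumptions chosen for illustration, and a uniform distribution is used only as one stand-in for data that lacks the skew.

```python
import random
from collections import Counter


def zipf_probs(n_items, alpha=1.0):
    """Return probabilities p(k) proportional to 1 / k**alpha for ranks k = 1..n_items.

    alpha is a hypothetical knob: alpha = 0 gives a uniform distribution,
    larger alpha gives a more skewed one.
    """
    weights = [1.0 / (k ** alpha) for k in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_counts(probs, n_samples, seed=0):
    """Draw n_samples item indices according to probs and count occurrences."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return Counter(draws)


if __name__ == "__main__":
    n_items, n_samples = 1000, 100_000

    zipf_counts = sample_counts(zipf_probs(n_items), n_samples)
    uniform_counts = sample_counts(zipf_probs(n_items, alpha=0.0), n_samples)

    # Share of all draws taken by the 10 most frequent items in each sample.
    for name, counts in [("zipfian", zipf_counts), ("uniform", uniform_counts)]:
        top10 = sum(c for _, c in counts.most_common(10))
        print(f"{name}: top-10 items cover {top10 / n_samples:.1%} of draws")
```

Running it shows the practical difference: under the Zipfian weights a handful of items accounts for a large share of the draws while most items are seen rarely, whereas under the uniform weights every item shows up about equally often.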