
The Transformer Paradox: Overtrained but Not Overfit


There is something about the transformer architecture whereby even heavy overtraining doesn’t cause significant overfitting.

No other architecture seems to share this property, and most ML folks had assumed it simply wasn’t possible.
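To make the claim concrete, here is a minimal sketch of how one might probe it empirically: train a small transformer far past the point where training loss plateaus and watch whether a train/validation gap opens up. Everything here (PyTorch, the toy sequence-reversal task, the `TinyTransformer` model, and all hyperparameters) is an illustrative assumption, not the setup used in the linked paper.

```python
# Hypothetical sketch: overtrain a tiny transformer on a toy task and
# track the gap between training loss and held-out loss.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, D_MODEL = 32, 16, 64

def make_batch(n, device):
    # Toy task: predict the reversed input sequence.
    x = torch.randint(0, VOCAB, (n, SEQ_LEN), device=device)
    return x, x.flip(dims=[1])

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, x):
        h = self.encoder(self.embed(x) + self.pos)
        return self.head(h)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyTransformer().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Small fixed training set (so overtraining is easy) plus a held-out set.
train_x, train_y = make_batch(256, device)
val_x, val_y = make_batch(1024, device)

for step in range(20_000):  # deliberately far more steps than needed
    model.train()
    logits = model(train_x)
    loss = loss_fn(logits.reshape(-1, VOCAB), train_y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 1000 == 0:
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(val_x).reshape(-1, VOCAB), val_y.reshape(-1))
        # If the claim holds, the train/val gap stays small even this late.
        print(f"step {step:6d}  train {loss.item():.4f}  val {val_loss.item():.4f}")
```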

Source: https://arxiv.org/abs/2410.01201

https://pbs.twimg.com/media/GY_0vkZbAAEMPbO.png
