
Little LSTM Outsmarts Big Transformer in Flip-Flop Task


A very interesting and simple scenario where a small LSTM works perfectly, yet a Transformer roughly 20x larger fails!

Exposing Attention Glitches with Flip-Flop Language Modeling https://arxiv.org/abs/2306.00946
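For readers unfamiliar with the task, here is a minimal sketch (my own illustration, not the authors' released code) of how a flip-flop sequence can be generated: tokens come in instruction–bit pairs ("write", "read", "ignore"), and on a "read" the correct bit is the most recently written one. The function name and probabilities below are hypothetical, chosen just to make the idea concrete.

```python
import random

def generate_flip_flop_sequence(length=16, p_write=0.25, p_read=0.25, seed=None):
    """Sketch of a flip-flop language sample: instruction/bit token pairs."""
    rng = random.Random(seed)
    tokens = ["write", rng.choice("01")]  # start with a write so "read" is well-defined
    memory = tokens[-1]                   # the flip-flop's stored bit
    for _ in range(length - 1):
        r = rng.random()
        if r < p_write:
            bit = rng.choice("01")
            tokens += ["write", bit]
            memory = bit                  # update the stored bit
        elif r < p_write + p_read:
            tokens += ["read", memory]    # target: last written bit
        else:
            tokens += ["ignore", rng.choice("01")]  # distractor bit, state unchanged
    return tokens

if __name__ == "__main__":
    print(" ".join(generate_flip_flop_sequence(length=10, seed=0)))
```

The model's job is to predict the bit following each "read"; the paper reports that Transformers make sporadic errors on exactly these positions ("attention glitches"), while LSTMs do not.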
