Little LSTM Outsmarts Big Transformer in Flip-Flop Task
A very interesting, simple scenario in which a small LSTM solves the task essentially perfectly while a transformer roughly 20× its size still fails!
Exposing Attention Glitches with Flip-Flop Language Modeling https://arxiv.org/abs/2306.00946
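For context, here is a minimal sketch of how flip-flop sequences can be generated. This is only my reading of the task (write/read/ignore instructions paired with bits, where a read must echo the most recently written bit); the symbols, sequence length, and probabilities below are illustrative assumptions, not the paper's exact configuration.

```python
import random

# Minimal flip-flop sequence generator (illustrative; instruction symbols
# and length/probabilities are assumptions, not the paper's exact setup).
# Each sequence is a series of (instruction, bit) pairs:
#   w b -> write bit b into the single register
#   r b -> read: b must equal the most recently written bit
#   i b -> ignore: b is a random distractor bit
def generate_flip_flop(seq_pairs=16, p_read=0.25, p_ignore=0.5, seed=None):
    rng = random.Random(seed)
    tokens = []
    register = str(rng.randint(0, 1))
    tokens += ["w", register]          # start with a write so reads are well-defined
    for _ in range(seq_pairs - 1):
        u = rng.random()
        if u < p_read:
            tokens += ["r", register]  # read must echo the stored bit
        elif u < p_read + p_ignore:
            tokens += ["i", str(rng.randint(0, 1))]  # distractor bit
        else:
            register = str(rng.randint(0, 1))
            tokens += ["w", register]  # overwrite the register
    return tokens

if __name__ == "__main__":
    print(" ".join(generate_flip_flop(seq_pairs=8, seed=0)))
```

The language-modeling objective is then next-token prediction on such sequences; the only hard predictions are the bits following an "r", which is where the paper reports transformers making sporadic errors.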