Old but Gold: Basic Transformers Still Best at Scaling

The basic Transformer is still the best architecture when it comes to scaling (compared to dynamic convolutions, MLP-Mixer, Performer, Switch Transformer, and a few other variants):

https://arxiv.org/abs/2207.10551
