Old but Gold: Basic Transformers Still Best at Scaling
The basic transformer is still the best architecture when it comes to scaling (compared to dynamic convolutions, MLP-Mixer, Performer, Switch Transformer, and a few other variants):