Paying Attention to Sample Efficiency
·46 words·1 min
While many architectures can match the performance of the Transformer, the right question to ask is at what sample efficiency they do so.
Attention+backprop is the most sample-efficient setting we have discovered so far, but my guess is that we are orders of magnitude away from what's possible. https://x.com/jxmnop/status/1784696357892063565
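One way to make "sample efficiency" concrete: instead of asking what accuracy a learner eventually reaches, ask how many training examples it needs to hit a target accuracy. A minimal toy sketch of that measurement, using two stand-in learners on a synthetic 1-D task (the learners, task, and target are illustrative assumptions, not actual Transformer comparisons):

```python
import random

def make_data(n, seed=0):
    """Toy 1-D binary task: label is 1 iff x > 0."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 1 if x > 0 else 0) for x in xs]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

def samples_to_reach(fit, target=0.95, budgets=(4, 8, 16, 32, 64, 128)):
    """Smallest training-set size at which the fitted model reaches
    `target` accuracy on a held-out set; None if no budget suffices."""
    test = make_data(500, seed=1)
    for n in budgets:
        predict = fit(make_data(n, seed=2))
        if accuracy(predict, test) >= target:
            return n
    return None

# Learner A: threshold at the midpoint between the two class means.
def fit_threshold(train):
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2 if pos and neg else 0.0
    return lambda x: 1 if x > t else 0

# Learner B: 1-nearest-neighbour memorisation of the training set.
def fit_nn(train):
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

print(samples_to_reach(fit_threshold), samples_to_reach(fit_nn))
```

The point is the axis of comparison: two learners can both reach the target, yet differ sharply in how many samples they burn getting there. That is the lens the post suggests applying to Transformer alternatives.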