Paying Attention to Sample Efficiency

While many architectures can match the Transformer's performance, the right question to ask is at what sample efficiency they do so.

Attention+backprop is the most sample-efficient setting we have discovered so far, but my guess is that we are orders of magnitude away from what's possible. https://x.com/jxmnop/status/1784696357892063565
