Paying Attention to Sample Efficiency

While many architectures can match the Transformer's performance, the right question to ask is at what sample efficiency they do so.

Attention+backprop is the most sample-efficient setting we have discovered so far, but my guess is that we are orders of magnitude away from what's possible. https://x.com/jxmnop/status/1784696357892063565
