Sample Efficiency

While many architectures can match the performance of the Transformer, the right question to ask is at what sample efficiency they do so.

Attention+backprop is the most sample-efficient setting we have discovered so far, but my guess is that we are still orders of magnitude away from what’s possible. https://x.com/jxmnop/status/1784696357892063565
