Encoder-Only Models: Expressive but Attention-Hungry

So encoder-only next-token predictors are perfectly feasible, but they are not practical: because attention is bidirectional, every newly generated token changes the representations of all earlier tokens, so attention over the whole sequence has to be recomputed at each step (see the sketch below).
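
To make that cost concrete, here is a minimal sketch, not from the post or the linked thread, of greedy generation with a single bidirectional attention layer. The names `toy_bidirectional_attention` and `generate`, and the toy weight-tying trick for looking up the next token's embedding, are all assumptions for illustration only.

```python
import numpy as np

def toy_bidirectional_attention(X):
    """Single-head self-attention with no causal mask: every position
    attends to every other, so each state depends on the full sequence."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                      # (n, n) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ X                                 # (n, d) contextual states

def generate(prompt_states, vocab_proj, steps=3):
    """Toy greedy decoding with an encoder-only model (hypothetical setup)."""
    X = prompt_states
    for _ in range(steps):
        H = toy_bidirectional_attention(X)             # O(n^2) work redone every step
        logits = H[-1] @ vocab_proj                    # predict next token from last state
        next_embedding = vocab_proj[:, logits.argmax()]  # toy tied-embedding lookup
        # Appending a token changes the contextual states of *all* earlier
        # positions under bidirectional attention, so nothing can be cached.
        X = np.vstack([X, next_embedding])
    return X

rng = np.random.default_rng(0)
out = generate(rng.normal(size=(5, 8)), rng.normal(size=(8, 100)))
```

Each step redoes the full O(n^2) attention over the prefix, so generating n tokens costs O(n^3) in total, whereas a causal decoder with a KV cache only computes the new token's query row at each step.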

However, encoder-only models are way more expressive because in an input sequence (x_1, …, x_n), any token x_i can… https://x.com/Kangwook_Lee/status/1842020800620040549
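
The quoted tweet is truncated (the full argument is at the link), but the distinction it sets up is just the attention mask. As a hedged illustration of my own, not taken from the thread: an encoder lets any x_i attend to any x_j, including future positions, while a causal decoder masks out j > i.

```python
import numpy as np

n = 4
encoder_mask = np.ones((n, n), dtype=bool)            # x_i may attend to any x_j
decoder_mask = np.tril(np.ones((n, n), dtype=bool))   # x_i may attend only to j <= i
print(encoder_mask.astype(int))
print(decoder_mask.astype(int))
```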