Encoder-Only Models: Expressive but Attention-Hungry
Encoder-only next-token predictors are perfectly feasible, but they are not practical: because every token attends bidirectionally, appending a new token changes every earlier token's representation, so attention over the entire sequence must be recomputed at every generation step (no KV caching). A concrete sketch of this is given below.
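To make the caching point concrete, here is a minimal sketch using a toy single-head attention in plain NumPy (the projection matrices, dimensions, and sequence length are illustrative assumptions, not any particular model's weights). Under a causal mask, a token's output depends only on earlier tokens, so appending a new token leaves the prefix's outputs, and hence its cached keys/values, untouched; under bidirectional attention, every output changes, forcing a full recompute per step.

```python
# Toy single-head self-attention, comparing causal vs bidirectional masking.
# Assumptions for illustration only: d=8, random weights, sequence length 5.
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attention(x, causal):
    """Single-head self-attention over x of shape (n, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    if causal:
        n = len(x)
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.standard_normal((5, d))                        # a 5-token prefix
x_plus = np.vstack([x, rng.standard_normal((1, d))])   # append one new token

for causal in (True, False):
    old, new = attention(x, causal), attention(x_plus, causal)
    unchanged = np.allclose(old, new[:5])
    kind = "causal (decoder-only)" if causal else "bidirectional (encoder-only)"
    print(f"{kind}: prefix outputs unchanged after appending a token? {unchanged}")

# causal:        True  -> past keys/values can be cached; each generation step
#                         only computes attention for the newly added token.
# bidirectional: False -> every token now attends to the new token, so the full
#                         n x n attention must be recomputed at every step.
```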
However, encoder-only models are more expressive: given an input sequence (x_1, …, x_n), any token x_i can attend to every other token x_j, whereas a decoder-only model restricts x_i to attending only to x_1, …, x_i. https://x.com/Kangwook_Lee/status/1842020800620040549
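As a tiny illustration of that receptive-field difference (same toy setting as above; the sequence length n = 5 is an arbitrary choice), the two masks below show which positions each token can attend to in the two regimes: row i of the causal mask is nonzero only up to column i, while the bidirectional mask is all ones.

```python
# Attention masks only: row i marks the positions token x_i may attend to.
import numpy as np

n = 5
causal = np.tril(np.ones((n, n), dtype=int))    # decoder-only: x_i sees x_1..x_i
bidirectional = np.ones((n, n), dtype=int)      # encoder-only: x_i sees x_1..x_n

print("causal mask:\n", causal)
print("bidirectional mask:\n", bidirectional)
```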