Encoder-Only Models: Expressive but Attention-Hungry

So encoder-only next-token predictors are perfectly feasible, but they are not practical: because attention is bidirectional, every newly generated token changes the representations of all earlier tokens, so attention over the whole sequence has to be recomputed at each step (see the sketch below).
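
To make that cost concrete, here is a minimal sketch, not from the post or the linked thread, of greedy generation with a single bidirectional attention layer. The names `toy_bidirectional_attention` and `generate`, and the toy weight-tying trick for looking up the next token's embedding, are all assumptions for illustration only.

```python
import numpy as np

def toy_bidirectional_attention(X):
    """Single-head self-attention with no causal mask: every position
    attends to every other, so each state depends on the full sequence."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                      # (n, n) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ X                                 # (n, d) contextual states

def generate(prompt_states, vocab_proj, steps=3):
    """Toy greedy decoding with an encoder-only model (hypothetical setup)."""
    X = prompt_states
    for _ in range(steps):
        H = toy_bidirectional_attention(X)             # O(n^2) work redone every step
        logits = H[-1] @ vocab_proj                    # predict next token from last state
        next_embedding = vocab_proj[:, logits.argmax()]  # toy tied-embedding lookup
        # Appending a token changes the contextual states of *all* earlier
        # positions under bidirectional attention, so nothing can be cached.
        X = np.vstack([X, next_embedding])
    return X

rng = np.random.default_rng(0)
out = generate(rng.normal(size=(5, 8)), rng.normal(size=(8, 100)))
```

Each step redoes the full O(n^2) attention over the prefix, so generating n tokens costs O(n^3) in total, whereas a causal decoder with a KV cache only computes the new token's query row at each step.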

However, encoder-only models are way more expressive because in an input sequence (x_1, …, x_n), any token x_i can… https://x.com/Kangwook_Lee/status/1842020800620040549
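
The quoted tweet is truncated (the full argument is at the link), but the distinction it sets up is just the attention mask. As a hedged illustration of my own, not taken from the thread: an encoder lets any x_i attend to any x_j, including future positions, while a causal decoder masks out j > i.

```python
import numpy as np

n = 4
encoder_mask = np.ones((n, n), dtype=bool)            # x_i may attend to any x_j
decoder_mask = np.tril(np.ones((n, n), dtype=bool))   # x_i may attend only to j <= i
print(encoder_mask.astype(int))
print(decoder_mask.astype(int))
```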