
Twice the FLOPs, Half the Sense: The 405B Dense Model


Releasing a 405B model is quite impressive, but I am still perplexed by the choice of a dense architecture. It costs at least ~2x more training FLOPs while still performing a bit lower. The only reason I can find in the paper is “better stability”, but many labs now already have stable MoE…
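For a rough feel of where a ~2x figure could come from, here is a back-of-the-envelope sketch using the common 6 × N × D rule of thumb for training compute (N = parameters active per token, D = training tokens). The token budget and the MoE active-parameter count below are assumptions picked for illustration, not numbers from the paper, and the comparison against an MoE is my reading of the claim.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * active_params * tokens


# Assumed token budget, for illustration only (not taken from the paper).
TOKENS = 15e12

# Dense model: all 405B parameters are active for every token.
DENSE_PARAMS = 405e9

# Hypothetical MoE: only a subset of parameters is active per token.
# 200B active parameters is an assumption chosen to illustrate the ratio.
MOE_ACTIVE_PARAMS = 200e9

dense = training_flops(DENSE_PARAMS, TOKENS)
moe = training_flops(MOE_ACTIVE_PARAMS, TOKENS)
ratio = dense / moe

print(f"dense: {dense:.2e} FLOPs, MoE: {moe:.2e} FLOPs, ratio: {ratio:.2f}x")
```

Under this approximation the token budget cancels out, so the ratio depends only on active parameters per token. The estimate ignores routing overhead, communication costs, and any difference in the number of tokens each architecture needs to reach the same quality.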