How the Transformer Sausage Was Made
The Transformer paper offers a rare glimpse into how the great sausage is made. It lists 8 authors, all credited as equal contributors: 4 researchers and 4 engineers. In an unusual footnote, the authors also spell out their individual contributions. 1/n
It’s an amazing little machine, and it makes me wonder about the tremendous amount of experimentation required to go from ideas to SOTA. Before reading the next tweet, guess how many people generated the core ideas and how many spent their time implementing and experimenting.
Amusingly, the core ideas of replacing RNNs with attention, multi-head attention, scaled dot-product attention, and positional representations came from just 2 members (Jakob and Noam, both engineers). The vast majority of everyone's effort seems to have gone into experimenting to make these ideas actually work and reach SOTA.
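For readers who haven't seen it, here is a minimal NumPy sketch of the scaled dot-product attention idea named above, purely for illustration; it is not the authors' original code, and the toy shapes and variable names are my own.

```python
# Minimal sketch of scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
# Illustrative only; not the authors' implementation.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

# Toy example: 4 tokens with 8-dimensional keys and values
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)
```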