
Byte Feeding Frenzy: Spiky Training with Token Overdose


Quite interestingly, if you feed a model bytes directly, you end up using roughly 5X more tokens. While this does in theory let you train a model on any modality, I found training was rather flaky, with too many loss spikes. Need to dust off that code 🧑‍💻. https://x.com/JeremyNguyenPhD/status/1758332850922045932
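
For a rough sense of the overhead, here is a minimal sketch comparing raw byte length against a subword BPE tokenizer. It assumes `tiktoken` with its GPT-2 encoding is available; the exact ratio depends on the tokenizer and the text, but for English prose it usually lands around 4-5x.

```python
# Rough comparison of sequence length: raw bytes vs. a subword BPE tokenizer.
# Assumes `tiktoken` is installed (pip install tiktoken); the ratio varies
# with the tokenizer and the text.
import tiktoken

text = (
    "Byte-level models consume one position per byte, so the same text "
    "stretches into a much longer sequence than with a subword tokenizer."
)

byte_tokens = list(text.encode("utf-8"))                  # one "token" per byte
bpe_tokens = tiktoken.get_encoding("gpt2").encode(text)   # subword BPE tokens

print(f"bytes: {len(byte_tokens)}  BPE tokens: {len(bpe_tokens)}  "
      f"ratio: {len(byte_tokens) / len(bpe_tokens):.1f}x")
```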
