Byte Feeding Frenzy: Spiky Training with Token Overdose
·48 words·1 min
Quite interestingly, if you feed the model bytes directly, you end up using ~5x more tokens than with a subword tokenizer. While this does in theory let you train a model on any modality, I'd found training was rather flaky, with too many spikes. Need to dust off that code 🧑‍💻. https://x.com/JeremyNguyenPhD/status/1758332850922045932
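
As a rough sanity check on the ~5x figure, here's a minimal sketch comparing how many tokens a byte-level model sees versus a BPE tokenizer. It assumes the `tiktoken` package and its `cl100k_base` vocabulary purely for illustration; the original experiment may have used a different tokenizer.

```python
import tiktoken

text = "Byte-level models read raw UTF-8 bytes instead of subword tokens."

# Byte-level "tokenization": one token per UTF-8 byte.
byte_tokens = list(text.encode("utf-8"))

# Subword tokenization with a common BPE vocabulary (assumed for illustration).
bpe = tiktoken.get_encoding("cl100k_base")
bpe_tokens = bpe.encode(text)

print(f"bytes: {len(byte_tokens)} tokens, BPE: {len(bpe_tokens)} tokens, "
      f"ratio ~{len(byte_tokens) / len(bpe_tokens):.1f}x")
```

For typical English text the ratio lands in the 4-5x range, which is where the token (and compute) blow-up comes from.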