
Byte Feeding Frenzy: Spiky Training with Token Overdose


Quite interestingly, if you feed a model bytes directly, you end up using roughly 5X more tokens. While this does in theory let you train a model on any modality, I found training was rather flaky, with too many loss spikes. Need to dust off that code 🧑‍💻. https://x.com/JeremyNguyenPhD/status/1758332850922045932
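
For a rough sense of the overhead, here is a minimal sketch comparing raw byte length against a subword BPE tokenizer. It assumes `tiktoken` with its GPT-2 encoding is available; the exact ratio depends on the tokenizer and the text, but for English prose it usually lands around 4-5x.

```python
# Rough comparison of sequence length: raw bytes vs. a subword BPE tokenizer.
# Assumes `tiktoken` is installed (pip install tiktoken); the ratio varies
# with the tokenizer and the text.
import tiktoken

text = (
    "Byte-level models consume one position per byte, so the same text "
    "stretches into a much longer sequence than with a subword tokenizer."
)

byte_tokens = list(text.encode("utf-8"))                  # one "token" per byte
bpe_tokens = tiktoken.get_encoding("gpt2").encode(text)   # subword BPE tokens

print(f"bytes: {len(byte_tokens)}  BPE tokens: {len(bpe_tokens)}  "
      f"ratio: {len(byte_tokens) / len(bpe_tokens):.1f}x")
```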
