Inference: When Neural Nets Go on a Gradient-Free Diet
·90 words·1 min
Inference is different from training because you don’t need to worry about gradients, optimizer states, saved activations, etc. For example, you can compute just one layer while the next one loads in parallel, so you never need more than two layers’ worth of accelerator RAM. https://x.com/teortaxesTex/status/1781894128424386953
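A minimal sketch of that overlap, with hypothetical `load_layer`/`apply_layer` stand-ins for disk reads and the forward pass: layer *i* is computed while layer *i+1* is prefetched on a background thread, so at most two layers are resident at once.

```python
# Sketch: compute the current layer while prefetching the next one.
# load_layer / apply_layer are placeholders, not a real framework API.
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4

def load_layer(i):
    # Stand-in for reading one layer's weights from disk.
    return i + 1

def apply_layer(weights, x):
    # Stand-in for the layer's forward pass.
    return x * weights

def streamed_forward(x):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_layer, 0)              # prefetch layer 0
        for i in range(NUM_LAYERS):
            weights = future.result()                    # wait for current layer
            if i + 1 < NUM_LAYERS:
                future = pool.submit(load_layer, i + 1)  # overlap the next load
            x = apply_layer(weights, x)                  # compute while loading
    return x

print(streamed_forward(1))  # 1*1*2*3*4 = 24
```

If loading a layer takes about as long as computing one, the loads hide almost entirely behind compute.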
A lot of people are still shocked when they find out that frameworks like llama.cpp can run 70B models even on machines with a mere 4 GB of RAM, because they don’t understand the above (plus a little magical thing called memory-mapped files) 😇.
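The mmap trick in a nutshell: mapping a file reserves address space, not physical RAM, and the OS pages in only the bytes you actually touch (evicting them again under memory pressure). A toy illustration using Python's standard `mmap` module on a throwaway file:

```python
# Mapping a file costs address space, not RAM; only touched pages fault in.
import mmap
import os
import tempfile

# A sparse 1 MiB file standing in for a large weights file.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.truncate(1 << 20)  # extend to 1 MiB without writing data

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Reading one slice pulls in only the pages it spans, not the whole file.
    chunk = mm[4096:4100]
    mm.close()
os.remove(path)
print(len(chunk))  # 4
```

llama.cpp uses the same mechanism to map the model file read-only, so a "70B model" mostly lives on disk and in the page cache rather than in resident memory.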