Cache Me If You Can: Giving Transformers a Memory
This idea looks very cool: extend a transformer with a large cache that stores data at inference time, with no weight changes. One can then feed the transformer a series of new facts, which get cached and used in subsequent inference. Memory is a key missing piece in current architectures. https://x.com/ChrSzegedy/status/1503906876416798722
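The tweet doesn't spell out a mechanism, but here is a minimal sketch of what such an inference-time memory could look like: an external store of (key, value) embedding pairs that is written to without any gradient update and read by nearest-neighbour attention. The class name, the retrieval scheme, and all parameters below are illustrative assumptions, not Szegedy's design.

```python
import numpy as np

class InferenceMemory:
    """External cache written at inference time; model weights stay frozen.
    (Hypothetical sketch -- the tweet does not specify an implementation.)"""

    def __init__(self, dim: int):
        self.keys = np.empty((0, dim))    # cached key embeddings
        self.values = np.empty((0, dim))  # cached value embeddings

    def write(self, keys: np.ndarray, values: np.ndarray) -> None:
        # Append new facts (e.g. embeddings of fresh text). The memory
        # grows, but no weights change -- this is the "no weight change"
        # property from the tweet.
        self.keys = np.vstack([self.keys, keys])
        self.values = np.vstack([self.values, values])

    def read(self, query: np.ndarray, k: int = 4) -> np.ndarray:
        # Soft-attend over the top-k most similar cached keys and return
        # a softmax-weighted mix of their values.
        scores = self.keys @ query
        top = np.argsort(scores)[-k:]
        weights = np.exp(scores[top] - scores[top].max())
        weights /= weights.sum()
        return weights @ self.values[top]

# Usage: cache a few "facts" now, retrieve them at a later inference step.
rng = np.random.default_rng(0)
mem = InferenceMemory(dim=16)
mem.write(rng.normal(size=(32, 16)), rng.normal(size=(32, 16)))
context = mem.read(rng.normal(size=16), k=4)  # would be fed to the frozen model
```

In a real system the query would come from the model's hidden state and the retrieved vector would be mixed back into an attention layer; the sketch only shows the write-then-read cycle that makes facts available to subsequent inference.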