Large language models cache the key and value vectors produced at every self-attention layer for every generated token. This cache is what makes decode efficient: each new token can attend back to the keys and values of all earlier tokens without recomputing them. As the prompt length increases, the cache grows linearly with sequence length and the number of layers, so memory pressure rises and throughput drops.

Paged Attention was introduced to manage this cache in fixed-size blocks with an indirection table. Instead of reserving a single contiguous region per sequence, we pack per-sequence blocks into a shared pool and follow pointers at runtime. This reduces internal and external fragmentation, enables reuse, and supports large effective batch sizes in practice. The vLLM blog does a great job of explaining Paged Attention and its implementation; here I want to focus on how we can model Paged Attention and reason about the block size trade-off.
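To make the indirection concrete, here is a minimal sketch of the bookkeeping involved: a shared pool of fixed-size physical blocks and a per-sequence block table that maps logical token positions to physical slots. The names (`BlockAllocator`, `Sequence`, `block_size`) and the structure are my own illustration, not the vLLM implementation.

```python
# Illustrative sketch of paged KV-cache bookkeeping (hypothetical names,
# not the vLLM code): fixed-size blocks from a shared pool, plus a
# per-sequence block table used as the indirection layer.

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared pool of block ids

    def allocate(self) -> int:
        return self.free_blocks.pop()  # any free physical block will do

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Per-sequence block table: logical block index -> physical block id."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new block only when the last one is full, so internal
        # fragmentation is at most block_size - 1 slots per sequence.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_pos: int) -> tuple[int, int]:
        # Indirection: follow the block table to find where a token's KV lives.
        bs = self.allocator.block_size
        return self.block_table[token_pos // bs], token_pos % bs


allocator = BlockAllocator(num_blocks=8, block_size=16)
seq = Sequence(allocator)
for _ in range(40):           # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)        # e.g. [7, 6, 5]: physical blocks need not be contiguous
print(seq.physical_slot(20))  # token 20 lives in block_table[1] at offset 4
```

Even this toy version shows the trade-off we will model: smaller blocks waste fewer slots in the partially filled last block of each sequence, but mean more block-table entries and more pointer chasing per attention call.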