LLMs are built for different tasks, so they have varying capabilities and efficiencies for a given input query. Some models are trained to answer simple questions while staying cost efficient, while others are trained to solve complex IMO-level problems where cost is not a consideration. In serving, however, queries ranging from simple to complex all need to be answered, and using a single model for every one of them is not only cost inefficient but also hurts the quality of the generated output.

Large language models cache the key and value vectors produced at every self-attention layer for every generated token. This cache is what makes decoding efficient: each new token can attend back to the keys and values of all earlier tokens without recomputing them. As the prompt length increases, the cache grows linearly with sequence length and layer count, so memory pressure rises and throughput drops.

PagedAttention was introduced to manage this cache in fixed-size blocks with an indirection table. Instead of reserving a single contiguous region per sequence, we pack per-sequence blocks into a shared pool and follow pointers at runtime. This reduces internal and external fragmentation, enables reuse, and supports large effective batch sizes in practice. The vLLM blog does a great job of explaining PagedAttention and its implementation; here I want to focus on how we can model paged attention and reason about the block size trade-off.
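To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of how the KV cache scales. The layer count, KV head count, head dimension, and dtype width below are illustrative assumptions, not the configuration of any particular model.

```python
def kv_cache_bytes(seq_len: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one sequence: a K and a V vector per token,
    per layer, per KV head, each of size head_dim."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

if __name__ == "__main__":
    for n in (1_024, 8_192, 32_768):
        print(f"{n:>6} tokens -> {kv_cache_bytes(n) / 2**20:,.0f} MiB")
```

With these assumed numbers each token costs 128 KiB of cache, so a single 32K-token sequence already consumes about 4 GiB, and the cost is strictly linear in both sequence length and layers.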
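And here is a minimal sketch of the indirection described above: a shared pool of fixed-size physical blocks plus a per-sequence block table that maps logical block indices to physical block ids. The names (`BlockPool`, `block_table`, `slot`) are my own shorthand for the idea, not vLLM's actual data structures.

```python
class BlockPool:
    """Shared pool of fixed-size KV blocks, identified by physical id."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size          # tokens stored per block
        self.free = list(range(num_blocks))   # free physical block ids

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """Per-sequence block table: logical block index -> physical block id."""

    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.num_tokens = 0
        self.block_table: list[int] = []

    def append_token(self) -> None:
        # Grab a new physical block only when the current one is full,
        # so a sequence never reserves more than one partially used block.
        if self.num_tokens % self.pool.block_size == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def slot(self, token_idx: int) -> tuple[int, int]:
        """Physical (block_id, offset) where this token's K/V vectors live."""
        logical = token_idx // self.pool.block_size
        return self.block_table[logical], token_idx % self.pool.block_size
```

The block size is the knob the rest of this post cares about: larger blocks mean fewer pointer lookups but more wasted space in the last, partially filled block of each sequence, while smaller blocks waste less memory but inflate the block table and the indirection overhead.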