In late 2022, running a state-of-the-art language model cost around $20 per million tokens. By 2025, the same capability costs roughly $0.40. Fifty times cheaper in three years. If you've used AI APIs like OpenAI or Gemini recently and been surprised by the price, this is the story behind that number, and it wasn't driven by faster hardware alone.
To understand why, it helps to separate two things that often get conflated: training and inference. Training is the one-time process of building a model from data: expensive, done once, absorbed as a capital cost. Inference is what happens every time a user sends a message: the model generates a response token by token. It happens billions of times a day across every AI product, which means inference cost is the tax on every interaction. Making inference cheaper determines whether AI can be in every app or only the premium-priced ones.
Here's a concrete example of why inference is expensive. Take the sentence "The cat sat on the mat." That's 6 tokens. At inference time, when the model generates the next token, it needs to compute attention between every token and every other token. That's 6 x 6 = 36 comparisons. Seems fine. Now scale that to a 128K-token document: 128,000 x 128,000 = over 16 billion comparisons per layer. That's O(n²) scaling, and it's the core reason inference costs explode with longer contexts.
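To see the quadratic term directly, here's a minimal NumPy sketch of standard attention (my own illustration; the 8-dimensional embeddings are arbitrary) that materializes the full score matrix and counts its entries:

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard attention: materializes the full n x n score matrix."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (n, n): the O(n^2) term
    scores -= scores.max(axis=-1, keepdims=True)  # softmax numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, scores.size               # output plus comparison count

rng = np.random.default_rng(0)
n, d = 6, 8                                       # "The cat sat on the mat" -> 6 tokens
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out, comparisons = naive_attention(q, k, v)
print(comparisons)  # 36 pairwise scores for 6 tokens
```

Swap `n = 6` for `n = 128_000` and `scores` alone becomes a ~16-billion-entry matrix per layer, which is the whole problem.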
The improvements that drove costs down came in six distinct layers, each solving a different piece of the problem: the attention mechanism at the model's core, the GPU kernel that computes it, the memory system that stores the working state, the generation algorithm that produces tokens, the position encoding that enables long contexts, and the architecture underneath it all. No single technique explains the full reduction. Stacked together, they do.
Notably, many of these improvements are not entirely new inventions. They are novel applications of well-established algorithmic ideas: PagedAttention borrows directly from OS virtual memory (and was published at SOSP, an operating systems conference, not an ML venue). FlashAttention applies IO-aware tiling, a technique standard in high-performance computing for decades, to the specific pattern of fused softmax-matmul attention. Speculative decoding echoes speculative execution in CPUs: predict ahead, verify, roll back if wrong. The novelty lies in recognizing which classical ideas map onto which LLM bottlenecks, and engineering them to work at GPU scale.
If you want the foundation (how transformers and attention work from first principles), Transformers: Simple Yet Wildly Versatile covers that. This post picks up where that one leaves off, at the point where the engineering problem shifts from "make it work" to "make it affordable at scale."
The Attention Tax: O(n²) in Two Places
The first cost problem starts inside the attention mechanism itself. Standard Multi-Head Attention (MHA) is quadratic in two places. Compute is O(n²) because every token attends to every other token: our 6-token sentence needs 36 score computations per layer, which is manageable, but at 128K tokens that's roughly 16 billion per layer, and a large model might have 80+ layers. Memory is the second place: materializing the attention matrix is also O(n²), and the KV cache, while linear in sequence length, stores a separate key-value pair for every head at every layer, making it the dominant memory consumer during generation.
The first generation of fixes targeted the KV cache specifically, by reducing how many distinct K/V pairs you actually need to store.
Multi-Query Attention (MQA) (Shazeer, 2019) went furthest: all query heads share a single K/V pair. Cache size drops dramatically, but quality degrades, especially on tasks requiring precise, head-specific retrieval. MQA was widely adopted early on, appearing in models like Falcon and early PaLM, but it was quickly superseded because the quality tradeoff was too steep for most use cases.
Grouped Query Attention (GQA) (Ainslie et al., 2023) landed as the practical middle ground. GQA divides the query heads into G groups. Each group shares a single set of key-value vectors instead of having its own. Back to our sentence, "The cat sat on the mat." With MHA and 32 heads, the model stores 32 separate key-value pairs per layer, one per head. GQA with 8 groups stores just 8, with 4 query heads sharing each KV pair. The attention math is identical; only the storage changes. That's a 4x cache reduction with near-identical quality. GQA with G=1 is MQA; with G equal to the head count, it's full MHA. Every major production model converged on GQA: Meta first adopted it for Llama 2 (July 2023), kept it in Llama 3, and Mistral 7B, Gemma 2/3, Qwen3, and DeepSeek R1 all ship with it.
Multi-Head Latent Attention (MLA) is DeepSeek's contribution, introduced in DeepSeek-V2 and carried into V3. Instead of reducing the number of K/V pairs, MLA compresses the K/V tensors into a low-dimensional latent vector using a down-projection matrix, then recovers the full K/V at query time via up-projection. Only the tiny latent vector is stored in the cache. The results are striking: 93.3% reduction in KV cache size compared to their previous 67B dense model, with benchmark accuracy that exceeds standard MHA, not a tradeoff, but a strict improvement.
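A toy sketch of the MLA caching idea (the dimensions here are invented for illustration; DeepSeek's actual design also handles RoPE compatibility and per-head structure): only the low-dimensional latent is stored, and K/V are reconstructed when attention runs.

```python
import numpy as np

d_model, d_latent, n_tokens = 4096, 512, 6
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.01  # compression
W_up_k = rng.standard_normal((d_latent, d_model)) * 0.01  # recover K
W_up_v = rng.standard_normal((d_latent, d_model)) * 0.01  # recover V

h = rng.standard_normal((n_tokens, d_model))  # hidden states for 6 tokens
c = h @ W_down                                # (6, 512): only this is cached
k, v = c @ W_up_k, c @ W_up_v                 # reconstructed at attention time
print(c.size / (k.size + v.size))             # 0.0625: the cached latent is ~94% smaller
```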
To make the memory savings concrete, let's return to our 6-token sentence. With MHA using 32 heads, the model stores 32 separate key-value pairs per layer. MQA collapses that to just 1. GQA with 8 groups stores 8. MLA stores a single compressed latent vector that's even smaller than MQA's single pair, because the compression captures the essential information in fewer dimensions. The trend is clear: each generation finds ways to preserve attention quality while storing dramatically less.
| Mechanism | KV Storage | Cache Reduction | Quality vs. MHA | In Production |
|---|---|---|---|---|
| MHA | H K/V per layer | Baseline | Baseline | GPT-2 era models |
| MQA | 1 K/V per layer | Large | Degraded | Falcon, early PaLM |
| GQA | G groups per layer | Moderate | Near-identical | Llama 2/3, Mistral, Gemma 2/3 |
| MLA | 1 latent vector | 93.3% | Better | DeepSeek-V2/V3 |
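The cache arithmetic behind the table can be sketched in a few lines (head count, head dimension, and layer count are hypothetical, loosely Llama-sized; 2 bytes per element assumes FP16):

```python
# KV cache per token = 2 tensors (K and V) * kv_heads * head_dim * bytes * layers
def kv_bytes_per_token(kv_heads, head_dim=128, dtype_bytes=2, layers=32):
    return 2 * kv_heads * head_dim * dtype_bytes * layers

mha = kv_bytes_per_token(kv_heads=32)  # one K/V pair per query head
gqa = kv_bytes_per_token(kv_heads=8)   # 4 query heads share each K/V pair
mqa = kv_bytes_per_token(kv_heads=1)   # all query heads share one pair
print(mha // gqa)  # 4: GQA's cache reduction at these (made-up) dimensions
```

At these numbers a single 128K-token request costs 128,000 × 512 KB ≈ 64 GB of KV cache under MHA, versus 16 GB under GQA, which is why the column above matters so much in production.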
Flash Attention: The Enemy Is IO, Not Compute
Even with GQA reducing cache size, computing attention on a GPU is I/O bound, not compute bound. To understand why, you need to know about two types of memory on a GPU. HBM (High Bandwidth Memory) is the large but slower off-chip memory, think of it like your computer's hard drive. It can hold a lot, but accessing it takes time. SRAM (Static Random-Access Memory) is the tiny but very fast on-chip memory, think of it like your CPU's L1 cache. It's blazing fast but extremely limited in size. When attention computes the full N x N matrix, it doesn't fit in SRAM, so the GPU has to continuously read and write to HBM. Most of the available TFLOPS (Tera Floating Point Operations Per Second, the measure of a GPU's raw compute power) go to waste waiting on memory transfers.
FlashAttention (Tri Dao, 2022) reframed this as an IO problem and solved it by tiling. Instead of materializing the full attention matrix in HBM, it breaks the computation into SRAM-sized blocks and fuses the softmax and matmul steps so they happen entirely in fast on-chip memory. The algorithmic complexity is still O(n²), the math is identical to standard attention. Only the memory access pattern changes. The result: 2-4x speedup with no accuracy loss. FlashAttention proved the bottleneck was never the math, it was where the data was read from and written to.
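Here's a small NumPy sketch of the tiling idea, assuming nothing about real GPU kernels: scores are computed one K/V block at a time with an online softmax, so the full n x n matrix never exists, yet the result matches standard attention.

```python
import numpy as np

def tiled_attention(q, k, v, block=2):
    """Attention computed block-by-block with an online softmax.
    Plain NumPy stands in for SRAM-sized tiles; queries are kept whole
    for simplicity, and only K/V are streamed."""
    n, d = q.shape
    out = np.zeros_like(v, dtype=float)
    row_max = np.full(n, -np.inf)  # running max per query row
    row_sum = np.zeros(n)          # running softmax denominator
    for start in range(0, n, block):
        kb, vb = k[start:start+block], v[start:start+block]
        s = q @ kb.T / np.sqrt(d)                 # (n, block) tile of scores
        new_max = np.maximum(row_max, s.max(axis=1))
        scale = np.exp(row_max - new_max)         # rescale earlier partial sums
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

def reference_attention(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
print(np.allclose(tiled_attention(q, k, v), reference_attention(q, k, v)))  # True
```

The rescaling trick is the heart of it: each new tile may raise the running max, so previously accumulated sums are multiplied down before the new block is folded in. Same math, different memory traffic.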
KV Cache: The Bottleneck That Keeps Growing
GQA and MLA reduce what you need to cache per request. PagedAttention solves a separate, equally painful problem: how that cache is allocated across requests.
Before PagedAttention (Berkeley, 2023, the engine inside vLLM), KV cache memory was reserved contiguously at the maximum sequence length per request. A request using 10K tokens out of a reserved 128K window still locked all 128K slots of memory, and serving 1,000 concurrent requests meant 1,000 such reservations. The result: 60-80% of KV cache memory wasted across typical production workloads.
PagedAttention, borrowed directly from operating system virtual memory, breaks the cache into small fixed-size blocks that can be stored non-contiguously across GPU memory. Different requests sharing the same system prompt can reuse cached blocks. The practical outcome: under 4% memory waste and 2-4x throughput improvement, rising to 24x against naive baselines like HuggingFace Transformers on LLaMA-13B (3.5x over production-grade serving systems like TGI).
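A toy allocator illustrates the difference (block and pool sizes are made up; real vLLM also handles freeing, prefix sharing, and copy-on-write):

```python
BLOCK_SIZE = 16  # tokens per block (hypothetical; vLLM's default differs)

class BlockPool:
    """Global pool of fixed-size KV cache blocks, handed out on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def alloc(self):
        return self.free.pop()

class Request:
    """Holds a block table: logical token positions -> physical blocks."""
    def __init__(self, pool):
        self.pool, self.block_table, self.tokens = pool, [], 0
    def append_token(self):
        if self.tokens % BLOCK_SIZE == 0:     # current block full: grab one more
            self.block_table.append(self.pool.alloc())
        self.tokens += 1

pool = BlockPool(num_blocks=1024)
req = Request(pool)
for _ in range(100):                          # generate 100 tokens
    req.append_token()
print(len(req.block_table))  # 7 blocks (112 slots) held, not a max-length reservation
```

The request holds memory proportional to what it has actually generated, which is the whole trick: waste is bounded by one partially filled block per request instead of by the worst-case context length.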
Sliding Window Attention takes a different approach to bounding cache size: instead of attending to all past tokens, each layer attends to a fixed window (e.g., 4,096 tokens). Think of reading a long essay while only being able to look back at the last few pages: context from earlier in the document still influences the model through earlier layers, but the direct connection weakens with distance. Information from many pages back becomes increasingly indirect, captured only through the compressed representations passed between layers rather than through direct token-to-token attention. Mistral 7B shipped with this in production. It puts a hard ceiling on KV cache growth, which matters in memory-constrained serving environments.
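The window itself is just a tighter mask over the usual causal attention pattern; a minimal sketch:

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend to key j: causal (j <= i) and within window."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

m = sliding_window_mask(n=8, window=3)
print(m.sum(axis=1))  # [1 2 3 3 3 3 3 3]: each row sees at most the 3 latest tokens
```

Because no token ever attends further back than `window` positions, KV entries older than that can be evicted, which is what caps the cache.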
Mixture of Experts: Only Activate What You Need
All the techniques above optimize how attention and memory are handled. Mixture of Experts (MoE) takes a fundamentally different angle: instead of making every parameter cheaper to run, just don't run most of them.
MoE models split the model's feedforward layers into many "expert" sub-networks. For each token, a learned routing layer selects only a small subset of active experts out of potentially dozens or hundreds. The model has many total parameters, but only uses a fraction of them per token, keeping compute roughly constant even as total parameter count scales up.
The numbers make this concrete. DeepSeek-V3 uses 256 experts with 8 active per token. Mixtral 8x7B uses 8 experts with 2 active. An MoE model like Mixtral 8x7B has 46B total parameters but costs about the same per token as a 13B dense model, because only roughly 13B parameters are active at any time. You get the knowledge capacity of a much larger model at the inference cost of a much smaller one.
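A minimal sketch of top-k routing (dimensions, expert count, and the linear "experts" are all toy choices; real MoE layers use gated MLPs plus load-balancing losses):

```python
import numpy as np

def moe_layer(x, experts, router_w, k=2):
    """Top-k MoE routing for one token: run only k experts and mix
    their outputs by renormalized router scores."""
    logits = x @ router_w                 # (num_experts,) routing scores
    top = np.argsort(logits)[-k:]         # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                  # softmax over the selected experts only
    return sum(g * experts[i](x) for g, i in zip(gates, top)), top

rng = np.random.default_rng(0)
d, num_experts = 16, 8                    # Mixtral-style: 8 experts, 2 active
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(num_experts)]
experts = [lambda x, w=w: x @ w for w in weights]  # toy linear "experts"
router_w = rng.standard_normal((d, num_experts)) * 0.1

x = rng.standard_normal(d)
y, active = moe_layer(x, experts, router_w, k=2)
print(len(active))  # 2 of 8 experts ran for this token
```

Six of the eight expert matrices never touch this token, which is exactly where the compute savings come from.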
MoE helps both sides of the cost equation. During training, fewer FLOPs per token means you can train larger models on the same compute budget. During inference, fewer active parameters means faster forward passes and lower memory bandwidth requirements. This is why MoE has become a default architecture choice for frontier models: it's one of the few techniques that improves both training efficiency and serving cost simultaneously.
Speculative Decoding: Parallelize the Verification
The previous layers addressed memory, how much you're caching and how efficiently it's organized. But there's a separate bottleneck in how tokens are generated. Autoregressive generation is inherently serial: each token requires a full forward pass through the model. Speculative decoding breaks this constraint.
Think of it like a junior assistant and a senior expert reviewing a document together. The junior assistant (a small, fast model) drafts several likely next words in sequence. The expert (the large LLM) then reads all of them at once and marks which ones they'd approve. Since verification is parallel, the expert can check all N candidates in one pass: you get the output of multiple expert decisions at roughly the cost of one. The catch: the junior assistant has to be good enough that the expert approves most of the suggestions.
That's the core mechanism. A small, fast "draft" model generates N tokens speculatively in one shot. The large target model then verifies all N tokens in a single parallel forward pass, because checking whether the model would have generated each token is batchable in a way that generation isn't. If k tokens are accepted, you've produced k tokens at the cost of roughly one large-model forward pass.
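The draft-then-verify loop can be sketched with toy "models" over integer tokens (the real version verifies all positions in one batched forward pass and uses a probabilistic acceptance rule; here verification is greedy and sequential for clarity):

```python
def speculative_step(draft_next, target_next, prefix, n_draft=4):
    """One round of greedy speculative decoding: the draft proposes n_draft
    tokens, the target keeps the longest agreeing prefix and supplies its
    own token at the first disagreement (or one bonus token if all agree).
    Model calls are toy functions: token sequence -> next token."""
    proposal, ctx = [], list(prefix)
    for _ in range(n_draft):              # cheap serial drafting
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposal:                    # in practice: one parallel pass
        correct = target_next(ctx)
        if t == correct:
            accepted.append(t); ctx.append(t)
        else:
            accepted.append(correct)      # target's token replaces the miss
            break
    else:
        accepted.append(target_next(ctx)) # all accepted: one bonus token
    return accepted

# Toy models: the target always emits last token + 1; the draft agrees here.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) % 5 else 0
out = speculative_step(draft, target, prefix=[10])
print(out)  # [11, 12, 13, 14, 15]: five tokens from one round of verification
```

When the draft agrees, one verification round yields n_draft + 1 tokens; when it misses early, you fall back to roughly one token per pass, which is why the acceptance rate drives everything.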
Acceptance rate drives everything. For predictable outputs, like code, structured data, and instruction-following, acceptance is high and speedups are large. For open-ended creative generation, acceptance drops and the gains shrink.
Recent work like EAGLE-3 (NeurIPS 2025) shows how far speculative decoding can go when the draft model is trained for direct token prediction rather than feature matching: reported results are roughly 3-6.5x speedups over vanilla autoregressive decoding and about 1.4x over EAGLE-2. In practice, all speculative decoding methods see diminishing returns at high batch sizes, where the target model becomes compute-bound rather than memory-bound, so this is primarily a low-latency optimization, not a high-throughput one.
RoPE: How Context Windows Got So Big
All of the above optimizes serving long contexts. But how did models get those long contexts in the first place? The answer is largely Rotary Position Embedding (RoPE) (Su et al., 2021).
Original transformers used fixed sinusoidal absolute position embeddings, baking the position into the token representation directly. The problem: those embeddings are fixed at training time and don't generalize beyond the training sequence length. A model trained on 8K tokens has no useful positional signal for token 9,000.
RoPE encodes position by rotating the vectors used in attention by an amount that depends on where each token sits in the sequence. The key insight: when computing attention between two tokens, only the difference in their rotations matters, which encodes relative distance, not absolute position. A model trained on sequences up to 8K tokens learns that "token A is 500 positions before token B." That relationship holds at 128K tokens too.
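A small NumPy sketch makes the relative-position property concrete (the 8-dimensional vectors and the positions are arbitrary): rotating q and k by their absolute positions, the attention score comes out identical whenever the offset between them is the same.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (500 positions), very different absolute positions:
near = rope(q, 700) @ rope(k, 200)
far = rope(q, 90_500) @ rope(k, 90_000)
print(np.isclose(near, far))  # True: the score depends only on the offset
```

Composing two rotations gives a rotation by the angle difference, so the dot product between rotated q and k depends only on the position gap, which is what makes the learned relationships portable to longer sequences.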
The practical payoff is length generalization. Because RoPE captures relative distance, a model trained on 8K tokens can be extended to 128K, or even 1M, through techniques like YaRN (which adjusts the rotational frequencies at inference time) and NTK-aware scaling (which shifts the embedding base to avoid frequency aliasing at long range). This is how Llama 3.1 reaches 128K: the architecture is RoPE, extended by rescaling its base frequency and continuing training on longer sequences.
From a serving perspective, RoPE matters because it's the reason the KV cache scaling problem keeps getting harder. Every 2x increase in context length doubles the cache requirements, and context windows have grown from 4K to 1M in under three years.
Closing Thoughts
The production stack today is a composition of several of these techniques, not a choice between them. GQA in the architecture, FlashAttention-3 if you're on H100, PagedAttention in the serving layer, MoE for parameter efficiency, RoPE + YaRN for context extension, speculative decoding for latency-sensitive workloads. These compose, and each contributes its layer of efficiency independently. I covered the quantization side of this equation in Making LLMs Efficient in Production; add quantization on top and you have the full production picture.
Several of these techniques are not inference-only either. FlashAttention is used in training, where it reduces memory and speeds up the forward pass. GQA reduces training memory. MoE reduces training FLOPs per token. The efficiency gains compound in both directions, making models cheaper to build and cheaper to run.
The rate of improvement has been faster than most predicted, and the techniques driving it have mostly been algorithmic, not hardware-driven. That suggests the ceiling isn't hardware. It's how many more classical ideas we haven't yet tried at scale.
All of the above still takes the transformer as a given and optimizes around it. There's a more radical set of questions worth exploring: what if you replace attention itself? State space models, hybrid architectures, and research like Log-Linear Attention are pushing in exactly that direction. That's what I'll cover next.
Sources: MQA paper (Shazeer, 2019), GQA paper (Ainslie et al., 2023), FlashAttention paper (Tri Dao, 2022), PagedAttention / vLLM, DeepSeek-V2 MLA, EAGLE-3 (NeurIPS 2025), RoPE paper (Su et al., 2021).
Further reading: AI But Simple Issue #85 (MHA/GQA/KV cache visuals) and Issue #86 (RoPE deep-dive) for thorough illustrated breakdowns of these mechanisms.