AI Engineering · Jul 2, 2026 · 13 min read

Prefill vs decode: the real economics of LLM inference

Generating a token doesn't cost you compute — it costs you memory bandwidth. Once you understand that one sentence, KV cache, batching, GQA, and speculative decoding stop being tricks and become the obvious consequences of the hardware.

Aman Dhyani
Co-founder, SERP Axis

Two phases with opposite bottlenecks

An LLM request runs in two very different regimes, and conflating them is why most cost and latency intuitions are wrong.

Prefill processes your entire prompt in a single forward pass. All prompt tokens go through the network in parallel, producing large matrix-matrix multiplications with high arithmetic intensity — this phase is compute-bound and it's what your GPU's TFLOPs are for. Prefill also writes the attention keys and values for every prompt token into the KV cache, and it determines your time-to-first-token.

Decode then generates the output one token at a time. Each step feeds the single most-recent token through the whole network to produce exactly one next token, then repeats. That autoregressive, one-token-at-a-time shape is the source of nearly all the cost — and it is not compute-bound.
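To put rough numbers on the two regimes, here is a minimal sketch that estimates the arithmetic intensity (FLOPs per byte moved) of a single linear layer when it sees a whole prompt versus a single token. The 4096-wide layer and the 2048-token prompt are illustrative assumptions, not figures from any particular model, and the byte count ignores caching effects.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for multiplying a [tokens x d_in] activation by a [d_in x d_out] weight."""
    flops = 2 * tokens * d_in * d_out               # one multiply and one add per weight per token
    elements_moved = (
        d_in * d_out                                # read the weight matrix once
        + tokens * d_in                             # read the input activations
        + tokens * d_out                            # write the output activations
    )
    return flops / (elements_moved * bytes_per_element)


# Assumed sizes: a 4096 x 4096 projection in fp16.
prefill_intensity = arithmetic_intensity(tokens=2048, d_in=4096, d_out=4096)
decode_intensity = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte")
print(f"decode:  {decode_intensity:.2f} FLOPs/byte")
```

Under these assumptions prefill does roughly a thousand times more math per byte than decode, which is the gap the rest of the article builds on.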

Decode is memory-bandwidth bound, not compute bound

Here is the fact almost everyone gets wrong. To generate one token, the GPU must read every weight in the model out of high-bandwidth memory (HBM) and into the compute units — for a dense model, all of them, every single token. But it's only doing math for one token position, so each weight is used for a tiny matrix-vector multiply and then discarded.

That ratio — a lot of bytes moved, very little math per byte — is called low arithmetic intensity, and it means the GPU's arithmetic units sit mostly idle waiting for weights to arrive. Decode speed is therefore set by memory bandwidth, not FLOPs. A rough but useful floor: minimum time per token ≈ (bytes of model weights) ÷ (HBM bandwidth). A ~14 GB (7B params, fp16) model on a GPU with ~2 TB/s of bandwidth can't beat roughly 7 ms/token no matter how many TFLOPs it has, because that's just the time to stream the weights once.
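The floor above is simple enough to compute directly. The sketch below reproduces the 7B fp16 example; the function name is made up for illustration, and the estimate deliberately ignores KV-cache reads and kernel overheads, so real latency will be higher.

```python
def min_ms_per_token(num_params: float, bytes_per_param: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on decode latency: the time to stream every weight from HBM once."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / hbm_bytes_per_s * 1e3


# 7B parameters in fp16 (2 bytes each) on a GPU with ~2 TB/s of HBM bandwidth.
floor_ms = min_ms_per_token(num_params=7e9, bytes_per_param=2, hbm_bytes_per_s=2e12)
print(f"{floor_ms:.1f} ms/token, so at most {1e3 / floor_ms:.0f} tokens/s for a single stream")
# prints: 7.0 ms/token, so at most 143 tokens/s for a single stream
```

Halving `bytes_per_param` (for example with 8-bit weights) halves the floor, which is why weight quantization speeds up decode even though it does nothing for FLOPs.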

The counterintuitive consequence

Buying a GPU with more raw compute barely helps single-stream decode. Bandwidth and how well you amortize it across concurrent requests are what move the needle. This is why 'it's slow' and 'it's expensive' almost always trace back to memory, not math.

Why batching is the whole game

If one token forces you to read all the weights anyway, then reading them to serve one request is enormously wasteful — you could serve dozens of requests from that same weight read. That's batching: process many sequences' decode step together. The weights are streamed once and reused across the whole batch, turning wasteful matrix-vector products back into efficient matrix-matrix products and raising arithmetic intensity toward the compute-bound regime.

So throughput (total tokens/second across all users) climbs almost linearly with batch size until you either become compute-bound again or — far more commonly — run out of memory for the KV cache. This is why the same GPU that feels sluggish for one user serves a hundred concurrently at a fraction of the per-token cost: you're amortizing a fixed, bandwidth-limited weight read across the batch.
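A simple roofline-style model makes the "almost linearly, until compute-bound" shape concrete. This is a sketch under stated assumptions: a 7B fp16 model, 2 TB/s of bandwidth, a hypothetical 400 TFLOP/s of peak compute, about 2 FLOPs per parameter per token, and no KV-cache traffic at all (which in practice is what usually stops the climb first).

```python
def decode_step_seconds(batch: int, num_params: float, bytes_per_param: float,
                        hbm_bytes_per_s: float, peak_flops: float) -> float:
    """One decode step takes as long as the slower of: streaming weights, or doing the math."""
    memory_time = num_params * bytes_per_param / hbm_bytes_per_s   # independent of batch size
    compute_time = 2 * num_params * batch / peak_flops             # grows with batch size
    return max(memory_time, compute_time)


def tokens_per_second(batch: int, **hardware) -> float:
    """Aggregate throughput: every step emits one token per sequence in the batch."""
    return batch / decode_step_seconds(batch, **hardware)


# Assumed model and hardware numbers, for illustration only.
SETUP = dict(num_params=7e9, bytes_per_param=2, hbm_bytes_per_s=2e12, peak_flops=400e12)

for batch in (1, 8, 64, 256, 512):
    print(f"batch {batch:>3}: {tokens_per_second(batch, **SETUP):>8.0f} tokens/s")
```

With these numbers throughput scales linearly up to a batch of about 200 (the ratio of peak FLOPs to bandwidth) and is flat beyond it, because the step has become compute-bound.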

Static batching (wait for a fixed group, run them all to completion) wastes this, because requests finish at different lengths and finished slots sit idle. Continuous batching (a.k.a. in-flight or iteration-level scheduling) schedules at the granularity of a single decode step, evicting finished sequences and admitting new ones every iteration, so the batch stays full. It's typically the single biggest serving-throughput win available.
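The difference between the two schedulers shows up even in a toy simulation that only counts decode steps. The sketch below is not how any real serving engine is implemented; it assumes every request is already queued, ignores prefill, and uses made-up output lengths.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each fixed group runs until its longest member finishes; finished slots sit idle."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Before every decode step, refill any free slot from the waiting queue."""
    waiting = deque(output_lengths)
    running = []                      # tokens still to generate, per active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                    # one decode step for the whole batch
        running = [left - 1 for left in running if left > 1]   # drop sequences that just finished
    return steps


# Hypothetical workload: two long answers mixed in with short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print("static:    ", static_batching_steps(lengths, batch_size=4), "steps")      # 200 steps
print("continuous:", continuous_batching_steps(lengths, batch_size=4), "steps")  # 110 steps
```

Both schedulers produce the same 260 tokens, but the continuous one needs far fewer weight reads to do it, because short requests no longer hold a slot hostage until the longest one in their group completes.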

The KV cache is your real capacity limit

Attention lets each new token attend to every previous token. To avoid recomputing the keys and values for the whole history on every step, they're cached — the KV cache. It grows by one token's worth of K and V, per layer, on every decode step, and it must live in GPU memory for the life of the request.

Its size is exactly computable, and the number surprises people:

  • KV cache scales linearly with both context length and batch size. At long context it can exceed the size of the model weights themselves — and it, not the weights, is usually what caps how many concurrent users a GPU can hold.
  • This is a major reason long-context requests are expensive, and part of why providers meter input tokens: a 100K-token prompt is a large, long-lived memory reservation, not just more compute.
KV cache size — per request, then batched.
kv_bytes = 2 (K and V)
         × num_layers
         × num_kv_heads        # << the GQA lever (see below)
         × head_dim
         × sequence_length
         × bytes_per_element   # 2 for fp16/bf16
         × batch_size

Worked example (70B-class model, GQA with 8 KV heads,
head_dim 128, 80 layers, fp16):
  per token  = 2 × 80 × 8 × 128 × 2 bytes  ≈ 320 KB
  8K context = 8192 × 320 KB               ≈ 2.5 GB  per request
  batch of 32                              ≈ 80 GB   just for KV cache
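
The formula and worked example above translate directly into a few lines of Python. The helper names are made up for illustration, and the 40 GiB KV budget in the last step is an assumption about how much memory is left after weights, not a property of any specific GPU.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_element: int = 2, batch_size: int = 1) -> int:
    """KV cache size: K and V, for every layer, KV head, head dimension and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element * batch_size


def max_concurrent_requests(kv_budget_bytes: int, **per_request) -> int:
    """How many requests of the given shape fit in the memory left over for KV cache."""
    return kv_budget_bytes // kv_cache_bytes(**per_request)


KIB, GIB = 2**10, 2**30
MODEL = dict(num_layers=80, num_kv_heads=8, head_dim=128)    # the 70B-class example, fp16

per_token = kv_cache_bytes(seq_len=1, **MODEL)
per_request = kv_cache_bytes(seq_len=8192, **MODEL)
batch_of_32 = kv_cache_bytes(seq_len=8192, batch_size=32, **MODEL)

print(per_token // KIB, "KiB per token")          # 320 KiB
print(per_request / GIB, "GiB per 8K request")    # 2.5 GiB
print(batch_of_32 / GIB, "GiB for a batch of 32") # 80.0 GiB

# Assumed budget: 40 GiB free for KV cache after loading the weights.
print(max_concurrent_requests(40 * GIB, seq_len=8192, **MODEL), "concurrent 8K requests")  # 16
```

Note that the "KB" and "GB" in the worked example are binary units (KiB, GiB); the code makes that explicit.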

GQA, PagedAttention, and quantized KV — attacking the cache

Because the KV cache is the binding constraint, the highest-leverage optimizations shrink or pack it.

  • Grouped-Query Attention (GQA): let several query heads share one key/value head, cutting num_kv_heads (and thus the whole KV cache) by that ratio. Multi-Query Attention is the extreme (one KV head). This is why modern models ship with GQA — in the example above, going from 64 KV heads to 8 shrinks the cache 8× with negligible quality loss, directly buying you 8× the batch or context.
  • PagedAttention (vLLM): store the KV cache in fixed-size blocks that need not be contiguous, and map them with a per-sequence block table, much as an operating system pages virtual memory. This removes most of the fragmentation and over-reservation that otherwise waste a large fraction of KV memory, so you can pack far more concurrent sequences. It also enables copy-on-write prefix sharing (many requests with the same system prompt share those KV blocks).
  • KV-cache quantization: store K/V in fp8 or int8 instead of fp16 to roughly halve the cache, trading a little accuracy for capacity.
  • Prefix caching / prompt reuse: if thousands of requests share a long system prompt, compute its KV once and reuse it, turning a repeated prefill into a lookup.
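
To make the paging and prefix-sharing ideas in the list above more tangible, here is a toy block allocator. It is a sketch of the concept only and does not mirror vLLM's actual classes or API: the names are hypothetical, it tracks block ids rather than real tensors, and it sidesteps copy-on-write by only sharing prefix blocks that are completely full.

```python
class BlockAllocator:
    """Toy paged KV allocator: fixed-size blocks, reference-counted so they can be shared."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache is full")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1          # another sequence now points at this block
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:      # last user gone: the block is reusable
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """Per-request block table: logical token positions map to physical blocks."""

    def __init__(self, allocator: BlockAllocator, shared_prefix_blocks=()):
        self.allocator = allocator
        self.block_table = [allocator.share(b) for b in shared_prefix_blocks]
        self.num_tokens = len(self.block_table) * allocator.block_size

    def append_token(self) -> None:
        if self.num_tokens == len(self.block_table) * self.allocator.block_size:
            self.block_table.append(self.allocator.allocate())   # only grow when the last block is full
        self.num_tokens += 1

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []


alloc = BlockAllocator(num_blocks=8, block_size=16)
system_prompt = Sequence(alloc)
for _ in range(32):                        # a 32-token prompt fills exactly two blocks
    system_prompt.append_token()

request_a = Sequence(alloc, system_prompt.block_table)
request_b = Sequence(alloc, system_prompt.block_table)
for _ in range(5):
    request_a.append_token()
    request_b.append_token()

blocks_in_use = 8 - len(alloc.free)
print(blocks_in_use, "blocks in use")      # 4: two shared prefix blocks plus one private block each
```

Without sharing, the two requests would need three blocks each. With it, the prompt's KV is stored once, and memory is reserved one block at a time as each sequence actually grows.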

Speculative decoding: buying tokens with the compute you're wasting

Return to the key asymmetry: decode leaves compute idle because it's bandwidth-bound, so verifying several tokens in one pass is nearly free, since it spends arithmetic that was going to waste anyway. Speculative decoding exploits exactly this. A small, fast 'draft' model proposes the next k tokens cheaply. The big 'target' model then verifies all k in a single parallel forward pass (the same shape as prefill) and accepts the longest prefix consistent with what it would itself have produced.

The elegant part is that it's exact, not approximate. A rejection-sampling correction guarantees the accepted output is drawn from precisely the target model's distribution — the answer is identical in distribution to plain decoding, just produced in fewer expensive passes. When the draft is often right you get several tokens per target pass, commonly a 2–3× speedup, with zero quality loss. It's free tokens carved out of the compute the memory-bound decode phase was leaving on the table.
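The rejection-sampling correction is short enough to sketch. The code below implements the standard accept/reject rule over toy probability lists rather than real model outputs: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on the first rejection a replacement is drawn from the normalized positive part of p − q. The function names and the three-token vocabulary are illustrative assumptions.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the tokens kept from one speculative step.

    draft_probs[i] and target_probs[i] are distributions over the vocabulary at
    position i; target_probs has one extra entry used for the bonus token.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                       # target agrees often enough: keep it
            continue
        # First rejection: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pt - qt) for pt, qt in zip(p, q)]
        accepted.append(rng.choices(range(len(p)), weights=residual)[0])
        return accepted
    # Every drafted token was accepted: the same target pass yields one more token.
    bonus = target_probs[len(draft_tokens)]
    accepted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return accepted


def first_token_frequencies(p, q, trials, seed=0):
    """Empirical distribution of the first emitted token when drafting from q."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(trials):
        drafted = rng.choices(range(len(q)), weights=q)[0]
        counts[verify_draft([drafted], [q], [p, p], rng)[0]] += 1
    return [c / trials for c in counts]


TARGET = [0.6, 0.3, 0.1]     # toy target distribution
DRAFT = [0.2, 0.5, 0.3]      # a deliberately poor draft distribution
freqs = first_token_frequencies(TARGET, DRAFT, trials=20000)
print([round(f, 2) for f in freqs])   # should land close to TARGET despite the poor draft
```

Even with a draft that disagrees badly with the target, the emitted tokens follow the target distribution. A worse draft only lowers the acceptance rate, and with it the speedup.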

Notice every optimization here — batching, GQA, PagedAttention, speculative decoding — is a direct response to the same root fact: decode is memory-bound and the KV cache is scarce. Once you hold that model, you can predict which trick will help before you benchmark it.

Measure the two things that actually differ: TTFT and TPOT

Because the phases have different bottlenecks, a single 'latency' number hides what's going on. Track two:

| Metric | Governed by | Improve it with |
| --- | --- | --- |
| TTFT (time to first token) | Prefill (compute-bound); prompt length | Chunked prefill, prefix caching, smaller/faster model, shorter prompts |
| TPOT / ITL (time per output token) | Decode (memory-bandwidth-bound) | GQA, quantization, speculative decoding, higher-bandwidth hardware |
| Throughput (tokens/s, all users) | Batch efficiency | Continuous batching, PagedAttention, bigger batch |
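
Both per-request metrics fall out of token arrival timestamps. Here is a minimal sketch, assuming you can iterate over a streamed response token by token; `measure_stream` accepts any iterable, so it is not tied to a particular client library, and both function names are made up for illustration.

```python
import time
from typing import Iterable, List, Tuple


def latency_metrics(request_sent_at: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean TPOT) in seconds from a send time and per-token arrival times."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0     # undefined for a single token; report 0
    return ttft, tpot


def measure_stream(stream: Iterable[str]) -> Tuple[float, float]:
    """Time any token iterator, for example a streaming API response."""
    start = time.perf_counter()
    arrivals = []
    for _ in stream:
        arrivals.append(time.perf_counter())
    return latency_metrics(start, arrivals)


# Synthetic timestamps: first token after 500 ms, then one token every 20 ms.
ttft, tpot = latency_metrics(0.0, [0.50, 0.52, 0.54, 0.56])
print(f"TTFT {ttft * 1e3:.0f} ms, TPOT {tpot * 1e3:.0f} ms")
```

Averaging hides stalls, so in practice it is worth keeping the individual gaps and looking at their tail as well as the mean.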

The throughput–latency tradeoff you have to choose

There's an inherent tradeoff baked into batching: increasing batch size raises aggregate throughput (cheaper per token) but can raise each request's TPOT, because every decode step must also read each sequence's KV cache and do more arithmetic, so the step itself takes longer as the batch grows. A chat product optimizes TTFT and TPOT for responsiveness; a bulk-extraction pipeline optimizes throughput and tolerates latency. Same model, opposite tuning, and you can only make that call correctly once you know which phase you're paying for.
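
A bandwidth-only model is enough to see the tradeoff and to pick a batch size for a latency target. This sketch assumes each decode step streams the weights once plus every active sequence's KV cache, and ignores compute entirely; the 14 GB of weights, 0.3 GB of KV per sequence, and 2 TB/s of bandwidth are illustrative assumptions.

```python
def step_seconds_with_kv(batch: int, weight_bytes: float, kv_bytes_per_seq: float,
                         hbm_bytes_per_s: float) -> float:
    """Per-step time (which is each request's TPOT) when memory traffic is the only cost."""
    return (weight_bytes + batch * kv_bytes_per_seq) / hbm_bytes_per_s


def max_batch_within_tpot(tpot_budget_s: float, **setup) -> int:
    """Largest batch whose per-token latency still fits the budget (0 if none does)."""
    batch = 0
    while step_seconds_with_kv(batch + 1, **setup) <= tpot_budget_s:
        batch += 1
    return batch


# Assumed numbers, for illustration only.
SERVING = dict(weight_bytes=14e9, kv_bytes_per_seq=0.3e9, hbm_bytes_per_s=2e12)

for batch in (1, 32, 128):
    tpot = step_seconds_with_kv(batch, **SERVING)
    print(f"batch {batch:>3}: TPOT {tpot * 1e3:5.1f} ms, throughput {batch / tpot:6.0f} tokens/s")

chat_batch = max_batch_within_tpot(0.030, **SERVING)   # e.g. a 30 ms/token responsiveness target
print("largest batch within 30 ms/token:", chat_batch)
```

A chat deployment would cap the batch near that limit, while a bulk pipeline would keep growing it until KV memory runs out, accepting the slower per-request stream in exchange for cheaper tokens.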

What this changes about building on LLMs

The expensive resource in production LLM serving is not FLOPs — it's memory bandwidth during decode and memory capacity for the KV cache. So the cost levers are: pick GQA models, serve behind a continuous-batching engine with PagedAttention, quantize the KV cache, cache shared prefixes, cap output length, and route easy requests to smaller models so the big one's scarce cache is spent only where it's needed. None of that is a hack — it's just what falls out of taking the hardware seriously. That's the difference between an AI feature that's economical at scale and one whose bill quietly makes the product unviable.
