LLM Inference: Foundations, System Architecture, and Popular Engines
Apr 2026
This post summarizes the foundation and current status of LLM inference, covering the core functionalities, system architecture, and popular inference engines (e.g., vLLM, SGLang).
Inference refers to generating responses given an input prompt, which corresponds to the decoding process.
Autoregressive decoding is the default decoding method for LLMs. It iteratively generates the next token based on
the input and all previously generated tokens, i.e., \(y_t \sim p_\theta(y_t \mid X, y_1, \dots, y_{t-1})\),
where \(X\) is the input prompt, \(y_{1:t-1}\) are the previously generated tokens, and \(\theta\) denotes the model parameters.
The full response \(y = (y_1, \dots, y_T)\) is produced by repeating this process until an end-of-sequence token is generated.
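A minimal sketch of this loop, where the `prefill` and `decode_step` callbacks are illustrative placeholders standing in for a real engine's forward passes:
// Minimal autoregressive generation loop (illustrative sketch).
#include <cstdint>
#include <functional>
#include <vector>

using TokenId = int32_t;

// prefill: parallel pass over the prompt, returns the first sampled token.
// decode_step: single-token pass reusing the KV cache, returns the next token.
std::vector<TokenId> generate(
    const std::function<TokenId(const std::vector<TokenId>&)>& prefill,
    const std::function<TokenId(TokenId)>& decode_step,
    const std::vector<TokenId>& prompt, size_t max_new_tokens, TokenId eos) {
    std::vector<TokenId> output;
    TokenId next = prefill(prompt);          // prefill phase: y_1 ~ p(. | X)
    while (output.size() < max_new_tokens) {
        output.push_back(next);
        if (next == eos) break;              // end-of-sequence terminates decoding
        next = decode_step(next);            // decode phase: y_t ~ p(. | X, y_1..y_{t-1})
    }
    return output;
}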
The autoregressive decoding process consists of two phases: a prefill phase and a decode phase.
During prefill, the model processes the entire input prompt in parallel to build the initial key-value (KV) cache.
During decode, the model generates tokens one at a time and updates the KV cache.
The prefill phase is compute-intensive (large matrix multiplications over the full prompt), while the decode phase is memory-bandwidth-intensive
(each step only computes a single token but must load the full model weights and KV cache from memory).
The prefill phase processes the entire input prompt in a forward pass to produce the initial KV cache and the first output token \(y_1 \sim p_\theta(\cdot \mid X)\).
Prefill processes all \(S\) prompt tokens simultaneously, whereas decode processes tokens one at a time.
Prefill computation (Forward pass)
Below we illustrate the forward pass for one layer of the vanilla transformer architecture
(multi-head attention with standard FFN).
Tokenization: the raw input text \(X_s\) is split into token IDs by the model's tokenizer (e.g., BPE, SentencePiece), giving an integer tensor \(X\) of shape \(B \times S\), where \(B\) is the batch size and \(S\) is the sequence length. This step runs on the CPU.
Embedding lookup (first layer only): token IDs are mapped to dense vectors via the embedding matrix \(W_e \in \mathbb{R}^{|V| \times d}\), producing \(E \in \mathbb{R}^{B \times S \times d}\), where \(d\) is the embedding dimension. Positional information is also added at this stage (e.g., via RoPE).
QKV projection: for each attention head \(i \in \{1, \dots, h\}\), three linear projections compute queries, keys, and values:
$$Q_i = E \, W_{Qi}, \quad K_i = E \, W_{Ki}, \quad V_i = E \, W_{Vi}$$
where \(W_{Qi}, W_{Ki}, W_{Vi} \in \mathbb{R}^{d \times d_h}\) and \(d_h = d / h\). Each projection is a GEMM of shape \((B \cdot S \times d) \cdot (d \times d_h)\).
Attention computation: for each head \(i\), the attention scores and output are:
$$A_i = Q_i K_i^\top / \sqrt{d_h}, \quad O_i = \text{softmax}(A_i + M) \cdot V_i$$
where \(M\) is the causal mask ensuring each token only attends to previous positions. The \(A_i\) matrix has shape \(S \times S\).
Multi-head concatenation and output projection: the per-head outputs are concatenated and projected:
$$O_a = [O_1, O_2, \dots, O_h] \, W_o$$
where \(W_o \in \mathbb{R}^{d \times d}\).
Layer normalization and FFN: after residual connection and layer normalization (producing \(O_l\)), the feed-forward network applies:
$$\text{FFN}(O_l) = \text{ReLU}(O_l \, W_1 + b_1) \, W_2 + b_2$$
where \(W_1 \in \mathbb{R}^{d \times d_{ff}}\), \(W_2 \in \mathbb{R}^{d_{ff} \times d}\), and \(d_{ff} = 4d\).
KV cache storage: the computed \(K_i\) and \(V_i\) tensors for all heads are stored in the KV cache for use during the decode phase. This is repeated for each of the \(L\) layers.
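As a rough per-layer cost model (counting only the GEMMs above and taking \(d_{ff} = 4d\); softmax, normalization, and elementwise work are ignored):
$$\text{FLOPs}_{\text{layer}} \approx \underbrace{6BSd^2}_{\text{QKV proj.}} + \underbrace{4BS^2d}_{QK^\top \text{ and } \cdot V} + \underbrace{2BSd^2}_{W_o} + \underbrace{16BSd^2}_{\text{FFN}} = 24BSd^2 + 4BS^2d.$$
The quadratic attention term overtakes the linear weight-GEMM terms only when \(S \gtrsim 6d\); for typical prompt lengths the weight GEMMs dominate, which is why prefill saturates Tensor Cores.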
These steps require the following unit operations, which are implemented as GPU kernels:
GEMM (General Matrix Multiply): QKV projections, output projection (\(W_o\)), and both FFN layers (\(W_1, W_2\)) are all GEMMs.
During prefill, these operate on full \(B \times S\) token batches, handled by Tensor Cores.
Common GEMM kernel implementations include:
cuBLAS: NVIDIA's standard BLAS library; provides highly tuned FP16/BF16/FP8 GEMM kernels with automatic tiling (large matrix multiplies are divided into sub-matrix tiles that are computed in a streaming fashion).
CUTLASS: a template-based CUDA library for writing custom GEMM kernels. Enables fused epilogues (e.g., GEMM + bias + activation in one kernel) and fine-grained control over tiling, pipelining, and memory access patterns.
Triton kernels: used in vLLM and SGLang for custom fused GEMMs (e.g., GEMM + quantization dequantization). Easier to write than CUTLASS but with less low-level control.
TensorRT-LLM custom kernels: pre-compiled, architecture-specific GEMM kernels with aggressive fusion (e.g., QKV projection as a single fused GEMM).
FlashAttention: the attention computation is fused into a single FlashAttention kernel. The key idea is to divide Q, K, and V into tiles, compute each tile in a streaming fashion, and accumulate the result. It never materializes the full attention matrix, which saves memory (the full algorithm is described in the FlashAttention section below).
Elementwise / reduction kernels: layer normalization, residual additions, activation functions (ReLU, SiLU), and RoPE positional encoding are lightweight elementwise or reduction operations.
These are often fused together (e.g., bias + residual + LayerNorm in one kernel) to minimize HBM round-trips.
Embedding lookup: a simple gather operation that indexes into the embedding table. Negligible cost relative to GEMMs.
KV cache write: copying the computed \(K_i, V_i\) tensors into the paged KV cache buffer.
FlashAttention
Standard attention computes \(O = \text{softmax}(QK^\top / \sqrt{d_h}) \cdot V\) by materializing the full \(S \times S\) attention matrix in HBM.
FlashAttention (Dao et al., 2022) avoids this by tiling the computation in SRAM and computing softmax in a streaming fashion using the online softmax trick.
The algorithm proceeds as follows (for a single head, dropping the head subscript \(i\) for clarity):
Partition into blocks: \(Q, K, V\) all have shape \(S \times d_h\). Partition them along the sequence dimension:
\(Q\) into \(T_r = \lceil S / B_r \rceil\) blocks each of size \(B_r \times d_h\), and \(K, V\) into \(T_c = \lceil S / B_c \rceil\) blocks each of size \(B_c \times d_h\).
Here \(B_r, B_c\) are chosen so that one Q block, one K block, one V block, and the partial output (\(B_r \times d_h\)) all fit in SRAM simultaneously.
Outer loop over Q blocks: for each Q block \(Q_j\) (\(j = 1, \dots, T_r\)), load \(Q_j\) into SRAM and initialize \(O_j = 0\), \(\ell_j = 0\), \(m_j = -\infty\). Then, for each K, V block pair \(K_k, V_k\) (\(k = 1, \dots, T_c\)):
Compute scores: \(S_{jk} = Q_j K_k^\top / \sqrt{d_h}\) (shape \(B_r \times B_c\), kept in SRAM).
Update the running max and exponentiate: \(m_j^{\text{new}} = \max(m_j, \text{rowmax}(S_{jk}))\), \(\tilde{P}_{jk} = \exp(S_{jk} - m_j^{\text{new}})\).
Rescale previous accumulator and sum: \(O_j \leftarrow O_j \cdot \exp(m_j - m_j^{\text{new}}) + \tilde{P}_{jk} \, V_k\), and similarly \(\ell_j \leftarrow \ell_j \cdot \exp(m_j - m_j^{\text{new}}) + \text{rowsum}(\tilde{P}_{jk})\).
Update: \(m_j \leftarrow m_j^{\text{new}}\).
After the inner loop, normalize: \(O_j \leftarrow O_j / \ell_j\) and write \(O_j\) back to HBM.
Cost analysis: Let \(M\) be the SRAM size (in elements). Block sizes are chosen as \(B_r \approx B_c \approx M / d_h\) (up to constant factors) so that the \(B_r \times d_h\) and \(B_c \times d_h\) tiles fit in SRAM simultaneously.
Standard attention HBM access: \(O(S \cdot d_h + S^2)\). The \(Q, K, V\) reads cost \(O(S \cdot d_h)\), but the \(S \times S\) attention matrix must be written to HBM after \(QK^\top\), read back for softmax, written again, and read back for multiplication with \(V\) — each time \(O(S^2)\).
FlashAttention HBM access: \(O(S^2 d_h^2 / M)\). There are \(T_r = S/B_r\) outer iterations. In each, all \(T_c = S/B_c\) blocks of \(K, V\) are loaded (each \(B_c \times d_h\)), totaling \(T_r \times T_c \times B_c \times d_h = S^2 d_h / B_r\). Since \(B_r \approx M / d_h\), this gives \(O(S^2 d_h^2 / M)\). For typical values (\(d_h = 128\), \(M \approx 100\text{K}\)), the reduction over standard attention is significant.
Extra memory: standard attention requires \(O(S^2)\) for the attention matrix. FlashAttention only needs \(O(S)\) for the per-row running statistics \(m_j, \ell_j\).
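Plugging in the typical values above (\(S = 4096\), \(d_h = 128\), \(M \approx 10^5\)), and counting the two writes and two reads of the \(S \times S\) matrix for standard attention:
$$\text{standard} \approx 4S^2 \approx 6.7 \times 10^7 \ \text{elements}, \qquad \text{FlashAttention} \approx \frac{S^2 d_h^2}{M} \approx 2.7 \times 10^6 \ \text{elements},$$
roughly a 25x reduction in HBM traffic per head.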
Figure: Prefill kernel execution. The HBM bar (top) stores model weights, intermediate activations, and KV cache. Dashed arrows show kernels reading from HBM;
the write arrow shows the KV write kernel storing K, V back to HBM for the decode phase.
Prefill optimizations
Chunked prefill (Batching): for very long prompts, chunked prefill splits the prompt into chunks (e.g., 512 or 1024 tokens) and interleaves prefill chunks with decode steps,
reducing time-to-first-token (TTFT) for co-scheduled requests.
Prefix caching (KV cache): when many requests share a common system prompt, the prefill KV cache for the shared prefix is computed once and reused.
CUDA graph: normally, each kernel launch requires a CPU-side API call that incurs microseconds of overhead — across hundreds of kernels per forward pass, this adds up.
CUDA graphs address this by recording a sequence of kernel launches (with their arguments and dependencies) into a graph during a warm-up run,
then replaying the entire graph in a single launch. This eliminates per-kernel CPU launch overhead and enables the GPU driver to optimize scheduling across the full sequence.
The graph is static, so it works best for decode, where tensor shapes are fixed from step to step.
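A minimal capture-and-replay sketch using the CUDA runtime API; `enqueue_step` is a placeholder for the engine's real per-step kernel sequence, and the instantiate call uses the CUDA 12 signature:
// CUDA graph capture and replay (sketch).
#include <cuda_runtime.h>

void run_decode_with_graph(void (*enqueue_step)(cudaStream_t), int num_steps) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up run under capture: kernel launches are recorded, not executed.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    enqueue_step(stream);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then replay: one CPU launch per decode step instead
    // of hundreds of individual kernel launches.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, /*flags=*/0);
    for (int step = 0; step < num_steps; ++step)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}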
Decode computation (Forward pass)
At decode step \(t\), the input is a single token embedding \(e_t \in \mathbb{R}^{1 \times d}\). For each transformer layer:
QKV projection: compute the query, key, and value for the new token:
$$q_t = e_t \, W_{Qi}, \quad k_t = e_t \, W_{Ki}, \quad v_t = e_t \, W_{Vi}$$
Each is a matrix-vector product, which is a GEMV — far less compute than the prefill GEMM.
KV cache append: append \(k_t, v_t\) to the cached keys and values:
\(K \leftarrow [K_{\text{cache}}; k_t]\), \(V \leftarrow [V_{\text{cache}}; v_t]\).
Attention: compute attention of the single query against all cached keys:
$$o_t = \text{softmax}\!\left(\frac{q_t \, K^\top}{\sqrt{d_h}}\right) V$$
Output projection: \(o_t \leftarrow o_t \, W_o\).
Residual + LayerNorm + FFN: same operations as prefill but on a single token.
Kernels and GPU execution
The decode step uses almost the same kernel types as prefill, with the following special ones:
GEMV kernels (QKV, output projection, FFN): replace the large GEMMs of prefill. Since the batch dimension is 1,
these are memory-bandwidth-bound — the entire weight matrix is loaded from HBM for a single output vector.
Sampling kernel: applies temperature, top-k/top-p filtering, and samples from the distribution. Lightweight.
Figure: Decode kernel execution sequence. Unlike prefill, most kernels are GEMVs (memory-bandwidth-bound) since only one token is processed per step. The attention kernel must read the full KV cache, with cost growing linearly in sequence length.
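To make the imbalance concrete: with \(d = 4096\) and FP16 weights, one \(d \times d\) GEMV reads \(2d^2 \approx 33\) MB from HBM to perform \(2d^2 \approx 3.4 \times 10^7\) FLOPs:
$$\text{arithmetic intensity} = \frac{2d^2 \ \text{FLOPs}}{2d^2 \ \text{bytes}} = 1 \ \text{FLOP/byte},$$
while a GPU with roughly 1 PFLOP/s of FP16 compute and 3 TB/s of bandwidth needs on the order of 300 FLOPs/byte to be compute-bound, so decode GEMVs run at a small fraction of peak compute.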
Decoding strategies and optimization
Popular decoding strategies include greedy decoding, beam search, and top-k/top-p sampling.
Below are some decoding optimization strategies:
Batching: the most effective optimization. Since decode is memory-bandwidth-bound, batching multiple requests together amortizes the weight loading cost — the same weights serve \(B\) tokens instead of 1, increasing arithmetic intensity from ~1 to ~\(B\) FLOPs/byte. Continuous batching allows new requests to join mid-generation.
FlashDecoding: standard decode attention parallelizes across batch and heads but processes the KV cache sequence dimension serially. FlashDecoding (Dao et al., 2023) adds parallelism along the sequence dimension by splitting the KV cache into chunks, computing partial attention per chunk, and reducing across chunks. This significantly improves GPU utilization for long sequences.
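The cross-chunk reduction reuses the online-softmax merge: each chunk produces a running max, a running sum, and an unnormalized output, and pairs of partials combine as in the sketch below (illustrative host-side code, not FlashDecoding's actual kernel):
// Merging partial attention results from two KV-cache chunks (sketch).
#include <algorithm>
#include <cmath>
#include <vector>

// For one query: m = max score seen, l = sum of exp(score - m),
// acc[k] = sum_j exp(s_j - m) * V[j][k] (unnormalized output).
struct Partial {
    float m, l;
    std::vector<float> acc;  // length d_h
};

Partial merge(const Partial& a, const Partial& b) {
    Partial out;
    out.m = std::max(a.m, b.m);
    float ca = std::exp(a.m - out.m);  // rescale factor for chunk a
    float cb = std::exp(b.m - out.m);  // rescale factor for chunk b
    out.l = a.l * ca + b.l * cb;
    out.acc.resize(a.acc.size());
    for (size_t k = 0; k < a.acc.size(); ++k)
        out.acc[k] = a.acc[k] * ca + b.acc[k] * cb;
    return out;
}
// After merging all chunks, the final output is out.acc[k] / out.l.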
KV cache management
The basics of KV cache sizing are covered in the transformer post.
Here we focus on some advanced topics.
PagedAttention
PagedAttention (Kwon et al., 2023) manages KV cache memory in fixed-size blocks (pages), similar to virtual memory in operating systems.
Without PagedAttention, each request pre-allocates a contiguous memory region for the maximum possible sequence length, wasting memory on unused positions.
PagedAttention solves this with:
Non-contiguous storage: KV cache is stored in fixed-size pages (e.g., 16 tokens per page). Pages are allocated on demand as the sequence grows, so wasted memory is bounded by at most one partially filled page per request.
Page table indirection: a per-request page table maps logical token positions to physical page locations. The attention kernel uses this table to gather the correct KV blocks.
Memory sharing: requests that share a common prefix (e.g., same system prompt) can share physical pages via copy-on-write, reducing memory usage for parallel sampling and beam search.
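A sketch of the page-table indirection (the 16-token page size and flat layout are illustrative):
// Per-request page table: logical token position -> physical KV page (sketch).
#include <cstddef>
#include <utility>
#include <vector>

constexpr size_t kPageSize = 16;  // tokens per page (illustrative)

struct PageTable {
    std::vector<int> physical_pages;  // logical page index -> physical page id

    // Called by the allocator when the sequence grows past its last page.
    void append_page(int physical_page) { physical_pages.push_back(physical_page); }

    // Used by the attention kernel to gather the KV block for a token.
    std::pair<int, size_t> lookup(size_t token_pos) const {
        return {physical_pages[token_pos / kPageSize], token_pos % kPageSize};
    }
};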
Advanced KV cache techniques
KV cache quantization: compressing cached K, V tensors to lower precision (e.g., FP8, INT4) after they are computed.
Since KV cache is only read during decode attention (not used in GEMMs on Tensor Cores), lower precision has less impact on quality.
This can halve or quarter the cache memory, enabling longer sequences or larger batch sizes (Hooper et al., 2024).
Token filtering: not all cached tokens are equally important for future attention.
Methods like H2O (Heavy-Hitter Oracle, Zhang et al., 2023) identify and retain only the most attended tokens,
evicting low-importance tokens to keep cache size bounded.
Batching
Batching multiple decode requests amortizes the weight-loading cost, increasing decoding throughput.
The batch size is the main knob for trading off memory (KV cache for more concurrent requests) against compute utilization.
Static batching
The naive approach: group requests into a fixed-size batch, pad all sequences to the same length, and wait for every sequence in the batch to finish before admitting new ones. Padding wastes compute, and finished sequences sit idle until the longest one completes.
Continuous batching
Continuous batching (Orca, Yu et al., OSDI 2022) allows the scheduler to add new requests and remove completed requests at every decode step, rather than waiting for the entire batch to finish.
The procedure:
At each decode iteration, the scheduler checks for completed requests (generated EOS token or hit max length) and removes them from the batch.
If there is free GPU memory (for KV cache) and the batch is below capacity, new requests from the waiting queue are admitted.
New requests go through prefill (possibly chunked) while existing requests continue decoding — both are processed in the same GPU batch.
The combined batch is processed in a single forward pass, with the scheduler tracking each request's state independently.
This eliminates the "convoy effect" where short requests are blocked by long ones, significantly improving throughput and latency under variable-length workloads.
Continuous batching is now standard in all major inference engines (vLLM, SGLang, TGI, TensorRT-LLM).
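A sketch of one scheduler iteration under these rules (the Request fields and Engine methods are illustrative placeholders, not any particular engine's API):
// One continuous-batching iteration (sketch).
#include <deque>
#include <vector>

struct Request { bool finished = false; /* prompt, generated tokens, KV pages ... */ };

template <typename Engine>
void schedule_step(Engine& engine, std::vector<Request*>& running,
                   std::deque<Request*>& waiting) {
    // 1. Evict completed requests (EOS generated or max length reached).
    std::erase_if(running, [](const Request* r) { return r->finished; });

    // 2. Admit waiting requests while KV-cache memory and batch slots allow.
    while (!waiting.empty() && running.size() < engine.max_batch_size() &&
           engine.has_free_kv_pages(*waiting.front())) {
        running.push_back(waiting.front());  // enters prefill (possibly chunked)
        waiting.pop_front();
    }

    // 3. One fused forward pass over the mixed batch: prefill chunks for new
    //    requests, single-token decode steps for the rest.
    engine.forward_step(running);
}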
Advanced batching topics
Scheduling policies: the scheduler must decide which requests to admit and in what order. Common policies include FCFS (first-come-first-served),
shortest-job-first (prioritize short prompts), and preemption (pause a long-running request to admit an urgent one, swapping its KV cache to CPU).
Batch size vs latency trade-off: larger batches improve throughput but increase per-request latency (more tokens compete for the same GPU).
The scheduler must balance throughput (requests/second) against latency SLOs (time-to-first-token, inter-token latency).
Weight-only quantization
Quantizes weights to low precision while keeping activations in FP16/BF16. Since decode is memory-bandwidth-bound,
reducing weight size by 2-4x directly improves throughput by reducing HBM reads.
The dequantization (INT4 → FP16) is fused into the GEMV kernel so that computation still runs at higher precision.
GPTQ (Frantar et al., 2022): post-training quantization using approximate second-order information (Hessian) to minimize quantization error layer by layer. Supports INT4/INT3.
AWQ (Lin et al., 2023): identifies salient weight channels (those multiplied by large activation values) and protects them with per-channel scaling before quantization. Simpler than GPTQ with comparable quality.
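Weight-only INT4 formats typically store packed 4-bit integers plus per-group scales and zero points; a scalar sketch of the dequantization the fused GEMV kernel performs inline (the packing and group size here are illustrative):
// Per-group INT4 dequantization, as fused into a W4A16 GEMV (sketch).
#include <cstdint>

constexpr int kGroupSize = 128;  // weights sharing one scale/zero (illustrative)

// w = scale * (q - zero); q4 packs two 4-bit weights per byte.
inline float dequant_int4(const uint8_t* q4, const float* scales,
                          const float* zeros, int idx) {
    uint8_t byte = q4[idx / 2];
    int q = (idx % 2 == 0) ? (byte & 0xF) : (byte >> 4);
    int g = idx / kGroupSize;
    return scales[g] * (float(q) - zeros[g]);
}
// In the fused kernel this runs inside the GEMV inner loop: INT4 weights
// stream from HBM (4x less traffic than FP16) while the multiply-accumulate
// still happens in FP16/FP32 registers.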
Weight-activation quantization
Quantizes both weights and activations, enabling the use of integer/FP8 Tensor Cores for the actual matrix multiply (not just reduced memory).
This improves both bandwidth and compute throughput but requires careful handling of activation outliers.
SmoothQuant (Xiao et al., 2022): migrates the quantization difficulty from activations to weights via a mathematically equivalent per-channel scaling transformation. Enables INT8 W8A8 quantization.
FP8: FP8 (E4M3 for forward, E5M2 for gradients) is natively supported on H100/B200 Tensor Cores.
The Transformer Engine applies per-tensor or per-block dynamic scaling to keep values within FP8 range.
FP8 is increasingly the default quantization for inference on Hopper+ GPUs, offering ~2x throughput over FP16 with minimal quality loss.
Advanced quantization topics
FP4 and sub-4-bit: Blackwell GPUs (B200) support native FP4 with microscaling (MXFP4), achieving ~4x compression over FP16. QuIP# and AQLM push to 2-bit with vector quantization and lattice codebooks.
Mixed precision: not all layers are equally sensitive to quantization. Techniques like SpQR and OWQ use higher precision for sensitive layers/channels and lower precision elsewhere, optimizing the quality-compression Pareto frontier.
Quantization-aware training (QAT): trains the model with simulated quantization noise, producing models that are more robust to low precision. More expensive than post-training quantization but achieves better quality at very low bit-widths.
Speculative decoding (Leviathan et al., 2023; Chen et al., 2023)
reduces the number of expensive target model forward passes by using a faster draft model to propose candidate tokens,
then verifying them in parallel. It maintains the exact output distribution of the target model.
Procedure
Draft: the draft model (smaller and faster) autoregressively generates \(\gamma\) candidate tokens \(\tilde{y}_1, \dots, \tilde{y}_\gamma\). This is cheap since the draft model is small.
Verify: the target model runs a single forward pass on all \(\gamma\) candidate tokens in parallel (like a prefill), producing the target model's probability distributions \(p_1, \dots, p_\gamma\) at each position.
Accept/reject: for each position \(i = 1, \dots, \gamma\) sequentially:
Let \(q_i\) be the draft distribution and \(p_i\) the target distribution at position \(i\); accept \(\tilde{y}_i\) with probability \(\min(1, \, p_i(\tilde{y}_i) / q_i(\tilde{y}_i))\).
If accepted, keep \(\tilde{y}_i\) and move to position \(i+1\).
If rejected, resample from an adjusted distribution \(\text{norm}(\max(0, \, p_i - q_i))\) and stop (discard remaining candidates).
Bonus token: if all \(\gamma\) candidates are accepted, sample one additional token from the target model's distribution at position \(\gamma + 1\) (since we already have its logits).
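A sketch of this verification loop over explicit distributions (each p[i], q[i] is a vocabulary-sized probability vector; p has \(\gamma + 1\) entries so the bonus token can be drawn):
// Speculative decoding accept/reject (sketch of the Leviathan et al. rule).
#include <algorithm>
#include <random>
#include <vector>

std::vector<int> verify(const std::vector<std::vector<float>>& p,  // target, gamma+1 rows
                        const std::vector<std::vector<float>>& q,  // draft, gamma rows
                        const std::vector<int>& draft,             // gamma candidate tokens
                        std::mt19937& rng) {
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    std::vector<int> accepted;
    for (size_t i = 0; i < draft.size(); ++i) {
        int t = draft[i];
        if (unif(rng) < std::min(1.0f, p[i][t] / q[i][t])) {
            accepted.push_back(t);  // keep the draft token
        } else {
            // Rejected: resample from norm(max(0, p - q)) and stop.
            std::vector<float> adj(p[i].size());
            for (size_t v = 0; v < adj.size(); ++v)
                adj[v] = std::max(0.0f, p[i][v] - q[i][v]);
            std::discrete_distribution<int> dist(adj.begin(), adj.end());
            accepted.push_back(dist(rng));
            return accepted;
        }
    }
    // All gamma candidates accepted: bonus token from the target's last position.
    std::discrete_distribution<int> dist(p[draft.size()].begin(), p[draft.size()].end());
    accepted.push_back(dist(rng));
    return accepted;
}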
In the best case, one target model forward pass produces \(\gamma + 1\) tokens instead of 1. The expected number of accepted tokens per step is roughly \(\gamma \cdot \alpha\),
where \(\alpha\) is the acceptance rate (which depends on how well the draft model approximates the target).
The speedup is then approximately \(\frac{\gamma \alpha + 1}{\gamma c + 1}\), where \(c\) is the cost of one draft-model pass relative to one target-model pass.
In standard decode: each step loads the full target model weights from HBM just to produce one token, leaving the GPU's compute units largely idle.
Speculative decoding exploits this imbalance. The verification step processes all \(\gamma\) draft tokens in a single forward pass, since the extra arithmetic fits into the otherwise-idle Tensor Cores.
Meanwhile, the draft model's passes are cheap because it is orders of magnitude smaller (e.g., 100M vs 70B parameters), so its weight-loading and compute overhead are negligible relative to a single target model pass.
Advanced speculative decoding topics
Tree-based verification: instead of a single draft sequence, generate a tree of candidates (e.g., Medusa tree, SpecInfer).
The target model verifies the entire tree in one forward pass using tree attention masks, accepting the longest valid path.
This increases the expected tokens per verification step.
Adaptive speculation length: the optimal \(\gamma\) depends on the acceptance rate, which varies across the sequence (easy vs hard tokens).
Adaptive methods increase \(\gamma\) when the draft model is performing well and decrease it when rejections are frequent.
Speculative decoding in serving systems: in a batched serving setting, speculative decoding interacts with the scheduler —
rejected tokens waste batch slots and KV cache memory. Systems like SpecInfer and Sequoia co-optimize the speculation tree structure with the serving scheduler.
Lossless vs lossy: standard speculative decoding is lossless (exact same distribution as the target model).
Lossy variants relax this guarantee for higher acceptance rates, e.g., by using a higher temperature for the rejection threshold or truncating the adjusted distribution.
As shown in the figure below, an inference engine has a two-level architecture split across CPU and GPU:
on the CPU side, an API frontend receives requests and a scheduler decides when and how to batch them;
on the GPU side, the model executor runs the actual computation on streaming multiprocessors (SMs/Tensor Cores),
while HBM stores model weights, KV cache pages, and activations.
The KV cache physically resides in GPU HBM, but its logical management (page table, allocation, eviction decisions) is handled by the CPU-side scheduler.
Figure: LLM inference engine architecture. The CPU handles request scheduling; the GPU contains both the compute units (model executor) and HBM where model weights, KV cache pages, and activations reside.
API frontend exposes an HTTP/gRPC endpoint (often OpenAI-compatible) that accepts generation requests with parameters such as max tokens, temperature, and stop sequences.
It handles tokenization, validates inputs, and manages streaming responses (Server-Sent Events for token-by-token delivery).
The frontend also enforces rate limits and routes requests into the scheduler's waiting queue.
Scheduler decides which requests to batch together and when to admit new requests.
This is where the batching strategies discussed earlier are implemented, such as continuous batching and chunked prefill.
It also does preemption, which swaps out low-priority requests' KV cache to CPU memory.
A key scheduling decision is how to organize prefill and decoding.
Co-located (interleaved) prefill and decode.
In this approach, the scheduler mixes prefill and decode requests in the same batch.
For example, when existing decode requests are waiting for KV cache loads, the scheduler can admit a new prefill request to fill idle compute units.
Disaggregated (separated) prefill and decode.
Here, prefill and decode run on different GPUs or even different machines.
A prefill-dedicated node processes the prompt and streams the resulting KV cache to a decode-dedicated node, which then generates tokens autoregressively.
Separation eliminates interference: prefill nodes can run large batches for maximum compute throughput, while decode nodes maintain stable, low-latency token generation without being disrupted by bursty prefill work.
In practice, the choice depends on scale and latency requirements.
Co-located scheduling is simpler, works well on a single node, and can optimize GPU utilization.
Disaggregation suits larger deployments with strict latency SLOs: it isolates decode from bursty prefill work, at the cost of transferring KV caches between nodes.
Model executor runs the actual transformer forward pass on the GPU. It manages kernel dispatch, attention backends, and parallelism strategies.
It also handles quantization and speculative decoding.
Kernels: GEMM kernels, FlashAttention (fused, tiling-based), or FlashInfer (an inference-oriented kernel library built on FlashAttention ideas, with support for paged KV caches, quantization, and speculative decoding).
Parallelism: for models too large for a single GPU, the executor implements tensor parallelism (splitting weight matrices across GPUs within a node) and pipeline parallelism (splitting layers across nodes).
CUDA graphs: to eliminate CPU-side kernel launch overhead during decode (where each step is very fast), the executor captures a sequence of GPU operations into a CUDA graph and replays it, reducing per-step latency.
GEMM Kernel
GEMM kernels are tiled so that sub-matrices of A and B are staged in on-chip shared memory and reused, minimizing HBM traffic.
// Simplified tiled GEMM kernel (CUDA pseudocode)
// C[M,N] = A[M,K] * B[K,N], tile sizes: BM, BN, BK
// Assumes BM == BN == BK (the block is BN x BM threads) and that M, N, K are
// divisible by the tile sizes; bounds checks are omitted for clarity.
constexpr int BM = 32, BN = 32, BK = 32;

__global__ void gemm_tiled(const float *A, const float *B, float *C,
                           int M, int N, int K) {
    __shared__ float As[BM][BK], Bs[BK][BN];
    int row = blockIdx.y * BM + threadIdx.y;
    int col = blockIdx.x * BN + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < K; t += BK) {
        // Cooperative load: each thread loads one element of A-tile and B-tile
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        // Compute partial dot product from shared memory
        for (int k = 0; k < BK; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
// Grid: dim3((N+BN-1)/BN, (M+BM-1)/BM), Block: dim3(BN, BM)
// Real implementations use register tiling (TM x TN per thread),
// vectorized loads, and Tensor Core MMA instructions.
FlashAttention Kernel
FlashAttention kernel is an example of kernel fusion: the matmul, softmax, and second matmul are fused into a single kernel that keeps intermediate results in SRAM.
// Simplified FlashAttention forward kernel (CUDA pseudocode)
// Q, K, V: [S, d] in HBM; O: [S, d] output
// Br, Bc: tile sizes chosen so tiles fit in SRAM; d fixed at compile time.
// load_tile is a cooperative HBM->SRAM tile copy (definition omitted), and the
// per-row loops are written sequentially for clarity -- a real kernel
// distributes them across the block's threads.
constexpr int Br = 32, Bc = 32, d = 64;

__global__ void flash_attn_fwd(const float *Q, const float *K, const float *V,
                               float *O, int S, float scale) {
    // Each thread block handles one Q-block (Br rows of output)
    int q_block = blockIdx.x; // which Q tile
    __shared__ float Qs[Br][d], Ks[Bc][d], Vs[Bc][d];
    __shared__ float scores[Br][Bc];
    // Per-row running statistics (in registers/shared memory)
    float m[Br], l[Br], acc[Br][d];
    for (int i = 0; i < Br; i++) {
        m[i] = -INFINITY; // running max
        l[i] = 0.0f;      // running sum of exp
        for (int k = 0; k < d; k++) acc[i][k] = 0.0f; // output accumulator
    }
    // Load Q block into SRAM (stays resident for all inner iterations)
    load_tile(Q, Qs, q_block * Br, Br, d);
    // Inner loop: iterate over all K, V blocks
    for (int kv = 0; kv < S; kv += Bc) {
        load_tile(K, Ks, kv, Bc, d);
        load_tile(V, Vs, kv, Bc, d);
        __syncthreads();
        // 1. Compute attention scores: S_ij = Qs @ Ks^T * scale
        for (int i = 0; i < Br; i++)
            for (int j = 0; j < Bc; j++) {
                scores[i][j] = 0;
                for (int k = 0; k < d; k++)
                    scores[i][j] += Qs[i][k] * Ks[j][k];
                scores[i][j] *= scale;
            }
        // 2. Online softmax update (per row)
        for (int i = 0; i < Br; i++) {
            float m_new = m[i];
            for (int j = 0; j < Bc; j++)
                m_new = max(m_new, scores[i][j]);
            float correction = exp(m[i] - m_new);
            float l_new = l[i] * correction;
            // Rescale previous accumulator
            for (int k = 0; k < d; k++)
                acc[i][k] *= correction;
            // Accumulate current tile
            for (int j = 0; j < Bc; j++) {
                float p = exp(scores[i][j] - m_new);
                l_new += p;
                for (int k = 0; k < d; k++)
                    acc[i][k] += p * Vs[j][k];
            }
            m[i] = m_new;
            l[i] = l_new;
        }
        __syncthreads();
    }
    // 3. Normalize and write back to HBM
    for (int i = 0; i < Br; i++)
        for (int k = 0; k < d; k++)
            O[(q_block * Br + i) * d + k] = acc[i][k] / l[i];
}
// Key: the S×S attention matrix never exists in HBM.
// HBM I/O: O(S^2 * d^2 / M) vs O(S^2) for standard attention.
KV cache memory manager: this component manages the GPU memory devoted to KV caches, implementing the techniques from the KV cache management section,
such as PagedAttention, prefix caching, KV quantization, and eviction/preemption.
GPU HBM: the physical GPU memory, which stores three main data structures: model weights (static, loaded once), KV cache pages (dynamic, grow/shrink per request), and activations.
Important inference metrics include: throughput, time to first token (TTFT), inter-token latency (ITL), end-to-end latency, and time per output token (TPOT).
Attention-FFN disaggregation
A recent refinement is to separate the attention and feedforward (FFN) layers within each phase, assigning them to different GPU groups.
This idea is motivated by the observation that attention and FFN have fundamentally different resource profiles:
Attention layers are memory-capacity-bound: they must store and access the KV cache, whose size grows linearly with sequence length and batch size. During decode, the attention kernel performs a GEMV against the full KV cache — it needs large HBM capacity but relatively little compute.
FFN layers are compute-bound: they consist of large GEMMs (or GEMVs during decode) against weight matrices that are fixed in size regardless of sequence length. They need high FLOP throughput but have a predictable, static memory footprint (just the weights).
When attention and FFN are co-located on the same GPU, they compete for HBM: the KV cache consumes memory that could otherwise allow larger FFN batches, and vice versa.
Separating them allows each GPU pool to be independently optimized:
Attention GPUs are provisioned with large HBM capacity (or use KV cache offloading to CPU/SSD) to hold long contexts. They can run at smaller batch sizes since their bottleneck is memory capacity and bandwidth, not compute.
FFN GPUs are provisioned for maximum compute throughput. Since they do not store KV caches, their HBM is entirely available for model weights and activations, allowing larger batch sizes and better arithmetic intensity.
In this architecture, a forward pass through one transformer layer requires a network hop: the attention GPU computes the attention output, sends the resulting hidden states to an FFN GPU, which computes the FFN output and sends the result back (or to the attention GPU for the next layer).
This introduces inter-GPU communication overhead, so the approach is most beneficial when:
The KV cache is large (long contexts, large batches), making memory pressure the primary bottleneck.
The model uses architectures like Mixture-of-Experts (MoE), where FFN layers have disproportionately large parameter counts and benefit most from dedicated compute.
High-bandwidth interconnects (NVLink, InfiniBand) are available to minimize the communication penalty.
Here, we discuss how kernel computations are actually implemented and executed on the GPU, using GEMM as a running example.
The implementation spans multiple layers, from high-level language (e.g., Python) down to GPU machine code.
Layer 1: GPU kernels in CUDA C++
The actual code that runs on GPU streaming multiprocessors (SMs) is written in CUDA C++ (for NVIDIA GPUs) — a C++ dialect with GPU-specific extensions such as __global__ (marks a function as a kernel launchable from the CPU),
threadIdx / blockIdx (built-in variables identifying each thread), and __shared__ (declares on-chip shared memory).
The code is compiled by nvcc (NVIDIA's compiler) into PTX (a virtual ISA / intermediate representation) and then into SASS (the actual machine code for the target GPU architecture, e.g., SM_90 for Hopper).
The compilation chain is analogous to C++ → LLVM IR → x86 assembly.
Layer 2: Kernel libraries (cuBLAS, CUTLASS, FlashInfer)
cuBLAS: NVIDIA's closed-source BLAS library. Kernels are hand-tuned CUDA C++ and sometimes hand-written SASS assembly.
The host code calls cublasGemmEx(), which selects the best kernel for the given matrix shapes, data types, and GPU architecture.
CUTLASS: NVIDIA's open-source C++ template library.
GEMM is expressed as composable templates parameterized by tile sizes, warp-level operations, pipeline stages, and Tensor Core mma instructions.
A production GEMM kernel in CUTLASS is typically 500–2000 lines of CUDA C++, handling shared-memory tiling, double buffering, and software pipelining.
FlashInfer / FlashAttention: specialized attention kernels (which fuse GEMM-like operations with softmax) written in CUDA C++ with explicit shared-memory management, as shown in the tiled GEMM and FlashAttention pseudocode above.
Layer 3: Kernel DSLs — Triton and Pallas
Writing optimized CUDA C++ kernels is labor-intensive. Kernel DSLs provide a higher-level alternative:
Triton (used by PyTorch's torch.compile): kernels are written in a Python-like syntax with @triton.jit.
Despite looking like Python, this code is not interpreted — the decorator compiles it through Triton IR → LLVM IR → PTX → SASS at JIT time.
Triton programs operate on blocks of data (e.g., tl.load, tl.dot, tl.store), and the compiler handles tiling, shared-memory allocation, and Tensor Core utilization automatically.
Pallas (JAX): a kernel language for both GPUs and TPUs, compiled via XLA or Mosaic to device-specific code.
Layer 4: Framework dispatch (Python → C++ → GPU)
When a user calls torch.matmul(A, B) in Python, the execution path is:
Python → C++: the call crosses into PyTorch's C++ core via pybind11, reaching at::matmul.
Backend dispatch: PyTorch's dispatcher selects the CUDA backend, invoking at::native::mm_cuda.
Library call: the C++ implementation calls cuBLAS (cublasGemmEx) or a Triton-compiled kernel.
Kernel launch: cuBLAS enqueues a pre-compiled SASS kernel onto the GPU's command queue via the CUDA runtime.
The CPU returns immediately — kernel launches are asynchronous.
GPU execution: the GPU hardware schedules thread blocks across SMs. Thousands of threads execute in parallel, reading from and writing to HBM.
Python never touches the GPU directly. It is an orchestrator that decides what to compute; the actual GPU work is always native SASS machine code.
CPU–GPU interaction model
The host (CPU) and device (GPU) communicate through the CUDA runtime and driver:
Memory allocation: cudaMalloc allocates regions in GPU HBM.
Data transfer: cudaMemcpy triggers DMA transfers over PCIe or NVLink between CPU RAM and GPU HBM. Inference engines minimize these transfers by keeping tensors on-GPU.
Kernel launch: kernel<<<grid, block>>>(args) places a kernel on a CUDA stream (a command queue). The CPU does not wait — it can enqueue further work or handle scheduling.
Synchronization: cudaDeviceSynchronize or CUDA events let the CPU wait for GPU completion when needed (e.g., before reading results back).
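A minimal host-side round trip through these four primitives (compiled with nvcc; the sizes and kernel are arbitrary):
// CPU-GPU interaction: allocate, copy, launch, synchronize (sketch).
#include <cuda_runtime.h>

__global__ void scale_kernel(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev;
    cudaMalloc(&dev, n * sizeof(float));                    // allocate in HBM
    cudaMemcpy(dev, host, n * sizeof(float),
               cudaMemcpyHostToDevice);                     // DMA transfer to GPU

    scale_kernel<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);   // async enqueue; CPU
                                                            // returns immediately
    cudaDeviceSynchronize();                                // wait for completion
    cudaMemcpy(host, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);                     // read results back

    cudaFree(dev);
    delete[] host;
    return 0;
}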
In an inference engine, the CPU-side scheduler (described above) continuously enqueues kernels onto CUDA streams while the GPU executes them, achieving overlap between scheduling decisions and GPU computation.
TPU note: TPUs do not use CUDA. Instead, XLA compiles computation graphs (from JAX or TensorFlow) through HLO IR into TPU-specific machine code.
Users do not write TPU kernels directly — XLA handles this, or they use Pallas for custom kernels.
vLLM is one of the most widely adopted inference engines.
Its core contribution is PagedAttention, which enables efficient KV cache management via the virtual memory abstraction described above.
The scheduler implements continuous batching with chunked prefill, and the executor supports FlashAttention-2, FlashInfer, and xFormers backends.
vLLM provides an OpenAI-compatible API server and supports a wide range of quantization formats (GPTQ, AWQ, FP8, INT8).
It also supports tensor, pipeline, and expert parallelism for multi-GPU serving, making it a strong default choice for production deployments.
SGLang focuses on efficient execution of complex LLM programs involving multiple generation calls, branching, and control flow.
Its key innovation is RadixAttention, which uses a radix tree to automatically reuse KV cache across requests sharing common prefixes — this goes beyond vLLM's prefix caching by supporting arbitrary prefix patterns, not just the system prompt.
SGLang also features a frontend DSL for expressing structured generation patterns (e.g., few-shot chains, constrained decoding) and a high-performance runtime built on FlashInfer as its primary attention backend.
TensorRT-LLM is NVIDIA's inference engine, built on top of the TensorRT compiler.
Unlike vLLM and SGLang which dispatch generic PyTorch kernels, TensorRT-LLM compiles the entire model graph into fused CUDA kernels — merging operations like LayerNorm + QKV projection into a single kernel to minimize memory round-trips.
It supports FP8, INT8, and INT4 quantization natively, and achieves the highest raw throughput on NVIDIA GPUs.
The trade-off is a mandatory compilation step and a narrower set of supported models.
llama.cpp takes a fundamentally different approach: it is designed for running LLMs on commodity hardware (CPUs, laptops, phones) rather than GPU clusters.
It uses the GGUF quantization format with many granularities (Q4_0, Q5_1, Q8_0, k-quants) to fit large models into limited memory.
The codebase is pure C/C++ with optional acceleration via Metal (Apple), CUDA, Vulkan, and SYCL, making it the most portable engine.
While it does not match GPU-based engines in throughput, it democratizes LLM access for local and edge deployment.
Table: Feature comparison of major LLM inference engines.
Multi-head Latent Attention (MLA), introduced in DeepSeek-V2, compresses the KV cache by projecting keys and values into a low-rank latent space.
Instead of caching per-head K and V tensors of dimension \(d_h\), MLA caches a single compressed latent vector \(c_t \in \mathbb{R}^{d_c}\) where \(d_c \ll n_h \cdot d_h\).
During attention, the full keys and values are reconstructed on-the-fly via learned up-projection matrices: \(K = c_t W^{UK}\), \(V = c_t W^{UV}\).
This dramatically reduces KV cache memory (e.g., from 2 × \(n_h d_h\) to just \(d_c\) per token per layer), enabling much larger batch sizes.
Figure: MLA inference. Left: standard MHA caches full K, V per head. Right: MLA caches only the compressed latent c_t;
absorbed projections (W_Q·W_UK⊤) let attention run directly in latent space without decompressing K/V.
Decoupled RoPE components (q_R, k_R) are concatenated for positional awareness.
(Based on DeepSeek-V2, Figure 3.)
Kernel-level challenges.
Absorbed projection: the naive approach — decompress K and V, then run standard attention — adds a large GEMM before every attention call and negates the memory savings (the decompressed KV is just as large).
The efficient approach absorbs the up-projection into the attention computation: instead of computing \(QK^T = Q(c W^{UK})^T\), we compute \((QW^{UK\top})c^T\),
i.e., project Q down into the latent space first, then compute attention over the compressed cache.
This requires a fused kernel that (1) applies the absorbed Q projection, (2) computes attention scores against the compressed latent cache, and (3) applies the absorbed V up-projection to the output — all in a single pass to avoid materializing intermediate tensors.
Decoupled RoPE: MLA separates positional encoding from the latent compression. The query and key each have a RoPE component (\(q^R_t, k^R_t\)) computed from a separate projection head, which is concatenated with the latent-space components for the attention score.
The attention score becomes \(\text{softmax}\bigl(\frac{[q^C_t; q^R_t] \cdot [k^C_i; k^R_i]^T}{\sqrt{d}}\bigr)\),
where \(q^C, k^C\) come from the absorbed latent path and \(q^R, k^R\) carry RoPE.
The kernel must handle this heterogeneous concatenation, applying RoPE only to the positional subspace while leaving the latent subspace position-independent.
Paging over latent tokens: the memory manager's page table now maps to compressed latent vectors rather than full KV pairs.
Each page stores a fixed number of latent vectors (\(d_c\) floats per token per layer, rather than \(2 \cdot n_h \cdot d_h\)).
The attention kernel's page-table lookup must be aware of this different layout.
Engine-level changes.
The model executor must register the absorbed projection matrices and dispatch the fused MLA kernel in place of standard MHA/GQA kernels.
FlashInfer (used by SGLang) added native MLA support; vLLM similarly introduced MLA-specific attention backends.
The KV cache memory manager allocates pages sized for latent vectors. The memory savings change the batch-size/memory trade-off:
with 8–16× smaller per-token cache, the scheduler can admit significantly more concurrent requests, shifting the bottleneck from memory to compute.
The scheduler may need to account for the additional compute cost of the absorbed projections — MLA trades memory for compute, so the decode step is slightly more compute-heavy than standard GQA, affecting the prefill/decode balance.
Some recent architectures introduce residual connections within the attention mechanism itself, beyond the standard post-attention residual.
For example, Diff Transformer computes attention as the difference of two softmax attention maps:
\(\text{DiffAttn}(X) = \bigl(\text{softmax}(Q_1 K_1^T / \sqrt{d_h}) - \lambda \cdot \text{softmax}(Q_2 K_2^T / \sqrt{d_h})\bigr) V\),
where \(\lambda\) is a learnable scalar. This cancels out attention noise and sharpens the model's focus on relevant context.
Other variants include gated attention residuals (e.g., \(\text{Attn}(X) + \alpha \cdot X\) with a learnable gate \(\alpha\)) that blend the attention output with the input before the FFN.
Figure: Differential attention. The input is projected into two independent Q/K pairs sharing the same V.
Two softmax attention maps are computed and subtracted (scaled by learnable λ) to cancel attention noise, producing sharper attention patterns.
The heatmaps illustrate how diffuse background attention (A₂) is removed from the noisy map (A₁) to yield a clean signal.
(Based on Differential Transformer, Figure 1.)
Kernel-level challenges.
Dual softmax in Diff Transformer: the kernel must compute two independent softmax-attention maps (\(Q_1K_1^T\) and \(Q_2K_2^T\)) and subtract them before multiplying by V.
A naive implementation doubles the attention compute. An efficient fused kernel can share the K/V memory loads — since V is shared and K is loaded once per tile, the two softmax streams can be computed in the same tiling pass,
storing two running softmax accumulators per tile instead of one. This is a non-trivial modification to the FlashAttention tiling algorithm.
Double KV heads: Diff Transformer uses split heads — each logical attention head is split into two sub-heads with separate Q and K projections but shared V.
The KV cache must store keys for both sub-heads, and the kernel's head-indexing logic must map logical heads to the correct sub-head KV slots.
Learnable \(\lambda\): the subtraction weight \(\lambda\) is per-layer and per-head. The kernel needs access to this small parameter during the attention computation — typically passed as a constant buffer.
Engine-level changes.
The model executor must dispatch the modified attention kernel. The model config parser needs to recognize the architecture variant and select the appropriate kernel backend.
The KV cache manager must account for the different head structure — Diff Transformer's effective number of KV heads differs from standard GQA, affecting page sizing and memory allocation.
MoE models (e.g., Mixtral, DeepSeek-V3, Qwen-MoE) replace the dense FFN in each transformer layer with a set of expert FFN sub-networks, activating only a subset (top-\(k\)) per token.
A gating network (router) computes token-expert affinities: \(G(x) = \text{TopK}(\text{softmax}(x W_g), k)\), selecting \(k\) experts out of \(E\) total.
This allows the model to scale parameters without proportionally scaling per-token compute — e.g., a model with 8 experts and top-2 routing uses only 2/8 of the FFN parameters per token.
Kernel-level challenges.
Token routing and permutation: after the router selects experts, tokens must be reordered so that all tokens assigned to the same expert are grouped together for a batched GEMM.
This permute-compute-unpermute pattern is unique to MoE and not present in dense models.
Efficient implementations fuse the permutation with the expert GEMMs using grouped/batched GEMM kernels (e.g., CUTLASS grouped GEMM), where each "group" is one expert processing its assigned tokens.
Load imbalance: expert popularity is data-dependent — some experts may receive many tokens while others receive few or none.
This creates GEMM shapes that vary wildly across experts within the same layer, causing GPU underutilization.
Padding all experts to the max assignment wastes compute; dynamically shaped GEMMs add launch overhead.
Optimized MoE kernels use expert-parallel grouped GEMM with dynamic shapes, or pre-pad to common bucket sizes.
Fused gate + top-k: the gating computation (linear projection → softmax → top-k selection → index computation) is a small but latency-critical step that should be fused into a single kernel to avoid multiple memory round-trips for the routing indices.
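A sketch of the permutation step: a counting sort of token indices by expert id, producing the contiguous per-expert segments and offsets that a grouped GEMM consumes (the layout is illustrative):
// Token permutation for MoE grouped GEMM (sketch).
#include <vector>

// expert_of_token[t] = expert chosen for token t (one top-k slot).
// Outputs: order = token indices grouped by expert; offsets[e] = start of
// expert e's segment, so expert e processes order[offsets[e] .. offsets[e+1]).
void permute_by_expert(const std::vector<int>& expert_of_token, int num_experts,
                       std::vector<int>& order, std::vector<int>& offsets) {
    std::vector<int> counts(num_experts, 0);
    for (int e : expert_of_token) counts[e]++;          // 1. histogram

    offsets.assign(num_experts + 1, 0);                 // 2. exclusive prefix sum
    for (int e = 0; e < num_experts; ++e)
        offsets[e + 1] = offsets[e] + counts[e];

    order.resize(expert_of_token.size());               // 3. scatter
    std::vector<int> cursor(offsets.begin(), offsets.end() - 1);
    for (int t = 0; t < (int)expert_of_token.size(); ++t)
        order[cursor[expert_of_token[t]]++] = t;
    // After the expert GEMMs, the inverse permutation restores token order.
}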
Engine-level changes.
Expert parallelism (EP): for large MoE models, different experts are placed on different GPUs.
This requires an all-to-all communication primitive: after routing, each GPU sends tokens to the GPU hosting the selected expert, and receives results back.
This is fundamentally different from tensor parallelism (which splits every layer's weights) — EP only communicates routed tokens, and the communication pattern is data-dependent (varies per batch).
The scheduler and executor must orchestrate this all-to-all within each layer's forward pass.
Memory management: MoE models have large total parameter counts but sparse activation. The weight memory layout must enable fast expert selection — typically, all expert weights for a layer are stored contiguously with an expert-index lookup.
For EP, each GPU only stores its assigned expert weights, but must handle dynamic token assignments.
Scheduler awareness: MoE decode steps are typically faster than dense models of equivalent quality (fewer active parameters), but prefill may be slower due to the routing overhead. Some schedulers adjust the prefill/decode split accordingly.
The scheduler may also consider expert load balance across a batch — batches with highly skewed routing slow down due to load imbalance.
Quantization: MoE models' large parameter count makes quantization even more critical for fitting in GPU memory.
However, per-expert quantization calibration is important — different experts see different data distributions, so a global quantization scale may degrade quality.
Engines like vLLM and SGLang support per-expert FP8/INT8 scales.
In practice, adding support for a new architecture in an engine like vLLM or SGLang typically involves:
Model definition: implement the model's forward pass in the engine's model registry, mapping new layers (MLA attention, MoE FFN, diff attention) to the appropriate kernel calls.
Custom kernels: write or adapt fused GPU kernels for the non-standard operations — this is usually the most performance-critical step. Libraries like FlashInfer and CUTLASS provide building blocks (e.g., grouped GEMM, customizable attention templates) that reduce the effort.
KV cache layout: adjust the page allocator and page-table logic if the architecture changes what is cached (latent vectors instead of raw KV, different head counts, etc.).
Config and dispatch: the engine's model loader must parse the architecture config (e.g., num_experts, kv_lora_rank, num_sub_heads) and route to the correct code paths throughout the stack.
Parallelism strategy: MoE models add expert parallelism as a new dimension. The engine must support hybrid TP+EP (or DP+EP) with the required all-to-all communication, which is architecturally different from the all-reduce used in tensor parallelism.
The rapid pace of architectural innovation means inference engines must be designed with modularization and extensibility.
The key insight is that the interface between the scheduler, executor, and memory manager should be abstract enough to accommodate
different cache formats and compute patterns, while still enabling the fused, architecture-specific kernels
that are essential for performance.