
LLM Inference: Foundations, System Architecture, and Popular Engines

Apr 2026


This post summarizes the foundation and current status of LLM inference, covering the core functionalities, system architecture, and popular inference engines (e.g., vLLM, SGLang).


Autoregressive Decoding

Inference refers to generating responses given an input prompt, which corresponds to the decoding process. Autoregressive decoding is the default decoding method for LLMs. It iteratively generates the next token based on the input and all previously generated tokens, i.e., \(y_t \sim p_\theta(y_t \mid X, y_1, \dots, y_{t-1})\), where \(X\) is the input prompt, \(y_{1:t-1}\) are the previously generated tokens, and \(\theta\) denotes the model parameters. The full response \(y = (y_1, \dots, y_T)\) is produced by repeating this process until an end-of-sequence token is generated.

The autoregressive decoding process consists of two phases: a prefill phase and a decode phase. During prefill, the model processes the entire input prompt in parallel to build the initial key-value (KV) cache. During decode, the model generates tokens one at a time and updates the KV cache. The prefill phase is compute-intensive (large matrix multiplications over the full prompt), while the decode phase is memory-bandwidth-intensive (each step computes only a single token but must load the full model weights and KV cache from memory).

Prefill

The prefill phase processes the entire input prompt in a single forward pass to produce the initial KV cache and the first output token \(y_1 \sim p_\theta(y_1 \mid X)\). Prefill processes all \(S\) prompt tokens simultaneously, whereas the decode phase processes one token per step.

Prefill computation (Forward pass)

Below we illustrate the forward pass for one layer of the vanilla transformer architecture (multi-head attention with standard FFN).

  1. Tokenization: the raw input text \(X_s\) is split into a sequence of token IDs \(X \in \{1, \dots, |V|\}^{B \times S}\) using the model's tokenizer (e.g., BPE, SentencePiece), where \(B\) is the batch size, \(S\) is the sequence length, and \(|V|\) is the vocabulary size. This step runs on the CPU.
  2. Embedding lookup (first layer only): token IDs are mapped to dense vectors via the embedding matrix \(W_e \in \mathbb{R}^{|V| \times d}\), producing \(E \in \mathbb{R}^{B \times S \times d}\), where \(d\) is the embedding dimension. Positional information is also added at this stage (e.g., via RoPE).
  3. QKV projection: for each attention head \(i \in \{1, \dots, h\}\), three linear projections compute queries, keys, and values: $$Q_i = E \, W_{Qi}, \quad K_i = E \, W_{Ki}, \quad V_i = E \, W_{Vi}$$ where \(W_{Qi}, W_{Ki}, W_{Vi} \in \mathbb{R}^{d \times d_h}\) and \(d_h = d / h\). Each projection is a GEMM of shape \((B \cdot S \times d) \cdot (d \times d_h)\).
  4. Attention computation: for each head \(i\), the attention scores and output are: $$A_i = Q_i K_i^\top / \sqrt{d_h}, \quad O_i = \text{softmax}(A_i + M) \cdot V_i$$ where \(M\) is the causal mask ensuring each token only attends to previous positions. The \(A_i\) matrix has shape \(S \times S\).
  5. Multi-head concatenation and output projection: the per-head outputs are concatenated and projected: $$O_a = [O_1, O_2, \dots, O_h] \, W_o$$ where \(W_o \in \mathbb{R}^{d \times d}\).
  6. Layer normalization and FFN: after residual connection and layer normalization (producing \(O_l\)), the feed-forward network applies: $$\text{FFN}(O_l) = \text{ReLU}(O_l \, W_1 + b_1) \, W_2 + b_2$$ where \(W_1 \in \mathbb{R}^{d \times d_{ff}}\), \(W_2 \in \mathbb{R}^{d_{ff} \times d}\), and \(d_{ff} = 4d\).
  7. KV cache storage: the computed \(K_i\) and \(V_i\) tensors for all heads are stored in the KV cache for use during the decode phase. This is repeated for each of the \(L\) layers.
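To make the cost concrete, a rough per-layer FLOP count for prefill under the setup above (multi-head attention, \(d_{ff} = 4d\), counting a multiply-add as two FLOPs; exact numbers vary by architecture) is:

$$\underbrace{6BSd^2}_{\text{QKV proj}} + \underbrace{4BS^2d}_{\text{attention}} + \underbrace{2BSd^2}_{\text{output proj}} + \underbrace{16BSd^2}_{\text{FFN}} \;=\; 24BSd^2 + 4BS^2d$$

Because all \(S\) prompt tokens are processed at once, these are large GEMMs and the phase is compute-bound; the \(4BS^2d\) term is also why prefill cost grows quadratically with prompt length.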

The above steps require the following unit operations, which are implemented as GPU kernels:

FlashAttention

Standard attention computes \(O = \text{softmax}(QK^\top / \sqrt{d_h}) \cdot V\) by materializing the full \(S \times S\) attention matrix in HBM. FlashAttention (Dao et al., 2022) avoids this by tiling the computation in SRAM and computing softmax in a streaming fashion using the online softmax trick. The algorithm proceeds as follows (for a single head, dropping the head subscript \(i\) for clarity):

  1. Partition into blocks: \(Q, K, V\) all have shape \(S \times d_h\). Partition them along the sequence dimension: \(Q\) into \(T_r = \lceil S / B_r \rceil\) blocks each of size \(B_r \times d_h\), and \(K, V\) into \(T_c = \lceil S / B_c \rceil\) blocks each of size \(B_c \times d_h\). Here \(B_r, B_c\) are chosen so that one Q block, one K block, one V block, and the partial output (\(B_r \times d_h\)) all fit in SRAM simultaneously.
  2. Outer loop over Q blocks: for each Q block \(Q_j\) (\(j = 1, \dots, T_r\)):
    • Initialize running statistics: max vector \(m_j = -\infty\) (\(B_r \times 1\)), sum vector \(\ell_j = 0\) (\(B_r \times 1\)), output accumulator \(O_j = 0\) (\(B_r \times d_h\)).
  3. Inner loop over K, V blocks: for each K, V block pair \((K_k, V_k)\) (\(k = 1, \dots, T_c\)):
    1. Load \(Q_j, K_k, V_k\) from HBM to SRAM.
    2. Compute tile attention scores: \(A_{jk} = Q_j K_k^\top / \sqrt{d_h}\) (\(B_r \times B_c\)), entirely in SRAM.
    3. Compute the tile-local row max: \(\tilde{m}_{jk} = \text{rowmax}(A_{jk})\) \((B_r \times 1)\).
    4. Update running max: \(m_j^{\text{new}} = \max(m_j, \tilde{m}_{jk})\).
    5. Compute rescaled exponentials: \(\tilde{P}_{jk} = \exp(A_{jk} - m_j^{\text{new}})\).
    6. Rescale the previous accumulator and sum: \(O_j \leftarrow O_j \cdot \exp(m_j - m_j^{\text{new}}) + \tilde{P}_{jk} \, V_k\), and \(\ell_j \leftarrow \ell_j \cdot \exp(m_j - m_j^{\text{new}}) + \text{rowsum}(\tilde{P}_{jk})\).
    7. Update: \(m_j \leftarrow m_j^{\text{new}}\).
  4. Normalize: \(O_j \leftarrow O_j / \ell_j\) and write \(O_j\) back to HBM.

Cost analysis: Let \(M\) be the SRAM size (in elements). The block sizes are chosen so that one Q tile, one K tile, one V tile, and the \(B_r \times B_c\) score tile fit in SRAM simultaneously; the original paper uses \(B_c \approx M / (4 d_h)\) and \(B_r = \min(M / (4 d_h), d_h)\).
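The payoff is HBM traffic. With the loop order above, each Q block stays resident in SRAM while all of K and V are streamed through it, so K and V are re-read once per Q block. Following the FlashAttention analysis:

$$\text{HBM accesses} \approx T_r \cdot 2 S d_h = \frac{2 S^2 d_h}{B_r} = \Theta\!\left(\frac{S^2 d_h^2}{M}\right) \quad \text{vs} \quad \Theta(S^2 + S d_h) \ \text{for standard attention.}$$

As an illustrative plug-in (assumed numbers): with \(d_h = 64\) and \(M \approx 48\text{K}\) FP32 elements (192 KB of SRAM), the asymptotic ratio \(M / d_h^2 \approx 12\), i.e., roughly an order of magnitude fewer HBM accesses for an identical result.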

[Figure: prefill kernel execution sequence (per layer) — CPU tokenization, PCIe transfer, then QKV GEMM, FlashAttention, output projection, fused residual + LayerNorm, FFN GEMMs (compute-bound or SRAM-tiled), and a KV-write kernel storing K, V to the HBM cache; repeated for all L layers.]
Figure: Prefill kernel execution. The HBM bar (top) stores model weights, intermediate activations, and KV cache. Dashed arrows show kernels reading from HBM; the write arrow shows the KV write kernel storing K, V back to HBM for the decode phase.

Prefill optimizations

Decoding

Decode computation

At decode step \(t\), the input is a single token embedding \(e_t \in \mathbb{R}^{1 \times d}\). For each transformer layer:

  1. QKV projection: compute the query, key, and value for the new token: $$q_t = e_t \, W_{Qi}, \quad k_t = e_t \, W_{Ki}, \quad v_t = e_t \, W_{Vi}$$ Each is a matrix-vector product, which is a GEMV — far less compute than the prefill GEMM.
  2. KV cache append: append \(k_t, v_t\) to the cached keys and values: \(K \leftarrow [K_{\text{cache}}; k_t]\), \(V \leftarrow [V_{\text{cache}}; v_t]\).
  3. Attention: compute attention of the single query against all cached keys: $$o_t = \text{softmax}\!\left(\frac{q_t \, K^\top}{\sqrt{d_h}}\right) V$$
  4. Output projection: \(o_t \leftarrow o_t \, W_o\).
  5. Residual + LayerNorm + FFN: same operations as prefill but on a single token.
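To make the memory-bandwidth argument concrete, compare arithmetic to bytes moved for one decode step at batch size 1 with FP16 weights (a rough, illustrative estimate): each layer touches roughly \(12d^2\) weight parameters (QKV, output projection, and the two FFN matrices) plus a KV cache of length \(t\), giving

$$\text{arithmetic intensity} \approx \frac{24d^2 + 4td \ \text{FLOPs}}{(12d^2 + 2td) \times 2 \ \text{bytes}} \approx 1 \ \text{FLOP/byte},$$

orders of magnitude below the hundreds of FLOPs per byte that modern data-center GPUs need to be compute-bound. Batching \(B\) requests reuses the same weight reads and raises the intensity to roughly \(B\) FLOPs/byte, which is why batching (discussed under Batching Strategies below) is the main lever for decode throughput.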

Kernels and GPU execution

The decode step uses almost the same kernel types as prefill, with the following special ones:

[Figure: decode kernel execution sequence (per step, per layer) — QKV GEMV, KV-cache append, attention of q_t against the full KV cache, output projection, fused residual + LayerNorm, FFN GEMVs (memory-bandwidth-bound), and sampling of y_t at the last layer.]
Figure: Decode kernel execution sequence. Unlike prefill, most kernels are GEMVs (memory-bandwidth-bound) since only one token is processed per step. The attention kernel must read the full KV cache, with cost growing linearly in sequence length.

Decoding strategies and optimization

Popular decoding strategies include greedy decoding, beam search, and top-k / top-p sampling. Below are some decoding optimization strategies:


Inference Techniques

In this section, we discuss a few key inference techniques: KV cache management, batching strategies, quantization, and speculative decoding.

KV Cache and Memory Management

The basics of KV cache sizing are covered in the transformer post. Here we focus on some advanced topics.

PagedAttention

PagedAttention (Kwon et al., 2023) manages KV cache memory in fixed-size blocks (pages), similar to virtual memory in operating systems. Without PagedAttention, each request pre-allocates a contiguous memory region for the maximum possible sequence length, wasting memory on unused positions. PagedAttention solves this with:
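In essence, each request keeps a block table mapping logical KV blocks to physical blocks, which are allocated on demand from a shared free pool. A minimal sketch of this bookkeeping (illustrative C++; the block size, names, and error handling are assumptions, not vLLM's actual implementation):

// Simplified PagedAttention block-table bookkeeping (illustrative C++, not vLLM's API)
#include <stdexcept>
#include <vector>

constexpr int BLOCK_SIZE = 16;  // tokens per KV cache block (page), assumed value

struct BlockAllocator {
    std::vector<int> free_blocks;  // IDs of free physical blocks in HBM
    explicit BlockAllocator(int num_blocks) {
        for (int i = num_blocks - 1; i >= 0; --i) free_blocks.push_back(i);
    }
    int allocate() {
        if (free_blocks.empty()) throw std::runtime_error("out of KV blocks: preempt or queue");
        int b = free_blocks.back(); free_blocks.pop_back(); return b;
    }
    void release(int b) { free_blocks.push_back(b); }
};

struct Sequence {
    int length = 0;                // tokens written so far
    std::vector<int> block_table;  // logical block index -> physical block ID
};

// Called once per generated token: grab a new physical block only when the
// sequence crosses a block boundary (no pre-allocation to the max length).
void append_token(Sequence &seq, BlockAllocator &alloc) {
    if (seq.length % BLOCK_SIZE == 0)
        seq.block_table.push_back(alloc.allocate());
    ++seq.length;
    // The attention kernel later resolves a token position p to
    // (block_table[p / BLOCK_SIZE], p % BLOCK_SIZE) to locate its K, V in HBM.
}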

Advanced KV cache techniques

Batching Strategies

Batching multiple decoding requests amortizes the weight-loading cost across requests, increasing decoding throughput. The batch size trades off memory (a larger aggregate KV cache) against compute utilization.

Static batching

The naive approach: group requests into a fixed-size batch, pad all sequences to the same length, and wait for all sequences to finish.

Continuous batching

Continuous batching (Orca, Yu et al., OSDI 2022) allows the scheduler to add new requests and remove completed requests at every decode step, rather than waiting for the entire batch to finish. The procedure:

  1. At each decode iteration, the scheduler checks for completed requests (generated EOS token or hit max length) and removes them from the batch.
  2. If there is free GPU memory (for KV cache) and the batch is below capacity, new requests from the waiting queue are admitted.
  3. New requests go through prefill (possibly chunked) while existing requests continue decoding — both are processed in the same GPU batch.
  4. The combined batch is processed in a single forward pass, with the scheduler tracking each request's state independently.

This eliminates the "convoy effect" where short requests are blocked by long ones, significantly improving throughput and latency under variable-length workloads. Continuous batching is now standard in all major inference engines (vLLM, SGLang, TGI, TensorRT-LLM).
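Schematically, the scheduler's per-iteration loop looks like the following (illustrative C++, not vLLM's or SGLang's actual scheduler; the Request fields and the two stubbed methods are assumptions):

// Simplified continuous-batching scheduler step (illustrative sketch)
#include <algorithm>
#include <deque>
#include <vector>

struct Request { int id; bool finished = false; };

struct Scheduler {
    std::deque<Request> waiting;   // requests not yet admitted
    std::vector<Request> running;  // requests currently in the GPU batch
    int max_batch_size = 64;

    bool has_free_kv_memory() const { return true; }      // stub: would query the KV block allocator
    void run_model_step(std::vector<Request> &batch) {}    // stub: one fused prefill+decode forward pass

    // Called once per decode iteration.
    void step() {
        // 1. Retire finished requests (EOS or max length) and free their KV blocks.
        running.erase(std::remove_if(running.begin(), running.end(),
                                     [](const Request &r) { return r.finished; }),
                      running.end());
        // 2. Admit waiting requests while KV memory and batch slots are available.
        while (!waiting.empty() && (int)running.size() < max_batch_size && has_free_kv_memory()) {
            running.push_back(waiting.front());
            waiting.pop_front();
        }
        // 3. One forward pass over the mixed batch: newly admitted requests run (chunked)
        //    prefill, existing requests each decode one token.
        run_model_step(running);
    }
};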

Advanced batching topics

Quantization

Weight-only quantization

Quantizes weights to low precision while keeping activations in FP16/BF16. Since decode is memory-bandwidth-bound, reducing weight size by 2-4x directly improves throughput by reducing HBM reads. The dequantization (INT4 → FP16) is fused into the GEMV kernel so that computation still runs at higher precision.
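A sketch of this idea in the style of the kernels shown later (illustrative CUDA pseudocode, not a real GPTQ/AWQ kernel; the 4-bit packing and per-row scales are simplified assumptions, and activations are kept in FP32 for brevity):

// Simplified weight-only INT4 GEMV with fused dequantization (illustrative CUDA pseudocode)
// y[N] = W[N,K] * x[K]; W stored as packed 4-bit ints with one scale per row
__global__ void gemv_int4_dequant(const unsigned char *W_packed, const float *scales,
                                  const float *x, float *y, int N, int K) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output row (simplified)
    if (row >= N) return;

    float scale = scales[row];
    float acc = 0.0f;
    for (int k = 0; k < K; k += 2) {
        // Two 4-bit weights packed per byte; dequantized in registers, never written to HBM.
        unsigned char packed = W_packed[(row * K + k) / 2];
        float w0 = (float)((int)(packed & 0xF) - 8) * scale;  // symmetric 4-bit range [-8, 7]
        float w1 = (float)((int)(packed >> 4) - 8) * scale;
        acc += w0 * x[k] + w1 * x[k + 1];
    }
    y[row] = acc;
    // Per row, K/2 bytes of weights are read instead of 2*K bytes of FP16 -> ~4x less HBM
    // traffic, which is where the decode speedup comes from since the GEMV is bandwidth-bound.
}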

Weight-activation quantization

Quantizes both weights and activations, enabling the use of integer/FP8 Tensor Cores for the actual matrix multiply (not just reduced memory). This improves both bandwidth and compute throughput but requires careful handling of activation outliers.

Advanced quantization topics

Speculative Decoding

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) reduces the number of expensive target model forward passes by using a faster draft model to propose candidate tokens, then verifying them in parallel. It maintains the exact output distribution of the target model.

Procedure

  1. Draft: the draft model (smaller and faster) autoregressively generates \(\gamma\) candidate tokens \(\tilde{y}_1, \dots, \tilde{y}_\gamma\). This is cheap since the draft model is small.
  2. Verify: the target model runs a single forward pass on all \(\gamma\) candidate tokens in parallel (like a prefill), producing the target model's probability distributions \(p_1, \dots, p_\gamma\) at each position.
  3. Accept/reject: for each position \(i = 1, \dots, \gamma\) sequentially:
  4. Bonus token: if all \(\gamma\) candidates are accepted, sample one additional token from the target model's distribution at position \(\gamma + 1\) (since we already have its logits).
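The accept/reject test in step 3 is the standard speculative sampling rule from Leviathan et al. (2023): with \(q_i\) the draft distribution and \(p_i\) the target distribution at position \(i\),

$$\text{accept } \tilde{y}_i \ \text{with probability} \ \min\!\left(1, \frac{p_i(\tilde{y}_i)}{q_i(\tilde{y}_i)}\right); \quad \text{on rejection, sample } y_i \sim \frac{\max(0, \, p_i - q_i)}{\sum_v \max(0, \, p_i(v) - q_i(v))} \ \text{and discard the remaining drafts.}$$

This acceptance-plus-residual-resampling scheme is what guarantees the output distribution exactly matches the target model's.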

In the best case, one target-model forward pass produces \(\gamma + 1\) tokens instead of 1. The expected number of accepted tokens per step is roughly \(\gamma \cdot \alpha\), where \(\alpha\) is the acceptance rate (how well the draft model approximates the target). The speedup is approximately \(\frac{\gamma \alpha + 1}{1 + c}\), where \(c\) is the combined cost of the \(\gamma\) draft passes relative to one target-model pass.
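For illustration (assumed values): with \(\gamma = 4\), an acceptance rate \(\alpha = 0.7\), and the four draft passes together costing \(c = 0.2\) of one target pass,

$$\text{speedup} \approx \frac{\gamma \alpha + 1}{1 + c} = \frac{4 \times 0.7 + 1}{1.2} \approx 3.2\times.$$

The gain shrinks when the draft model aligns poorly with the target (low \(\alpha\)) or when the serving batch is already large enough that the target GPU is no longer under-utilized.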

In standard decoding, each step loads the full target-model weights from HBM just to produce one token, leaving the GPU's compute units largely idle. Speculative decoding exploits this imbalance: the verification step processes all \(\gamma\) draft tokens in a single forward pass at roughly the cost of a single-token step, since the extra arithmetic fits into the otherwise-idle Tensor Cores. Meanwhile, the draft model's passes are cheap because it is orders of magnitude smaller (e.g., 100M vs 70B parameters), so its weight-loading and compute overhead are negligible relative to a single target-model pass.

Advanced speculative decoding topics


Inference Engines

System Architecture Overview

As shown in the figure below, an inference engine has a two-level architecture split across CPU and GPU: on the CPU side, an API frontend receives requests and a scheduler decides when and how to batch them; on the GPU side, the model executor runs the actual computation on streaming multiprocessors (SMs/Tensor Cores), while HBM stores model weights, KV cache pages, and activations. The KV cache physically resides in GPU HBM, but its logical management (page table, allocation, eviction decisions) is handled by the CPU-side scheduler.

[Figure: inference-engine system architecture — CPU side: API frontend (OpenAI-compatible API, gRPC, streaming), tokenizer, and scheduler (request queue, priority, preemption, continuous batching, prefill/decode split, chunked prefill, KV-cache bookkeeping); GPU side: model executor on SMs/Tensor Cores (FlashAttention, quantized kernels, speculative decoding, tensor/pipeline parallelism, CUDA graphs, kernel fusion) and HBM holding model weights, KV-cache pages (PagedAttention, prefix caching, KV quantization, eviction, CPU-side page table), and activations; CPU and GPU connected over PCIe/NVLink.]
Figure: LLM inference engine architecture. The CPU handles request scheduling; the GPU contains both the compute units (model executor) and HBM where model weights, KV cache pages, and activations reside.

API frontend exposes an HTTP/gRPC endpoint (often OpenAI-compatible) that accepts generation requests with parameters such as max tokens, temperature, and stop sequences. It handles tokenization, validates inputs, and manages streaming responses (Server-Sent Events for token-by-token delivery). The frontend also enforces rate limits and routes requests into the scheduler's waiting queue.

Scheduler decides which requests to batch together and when to admit new requests. This is where the batching strategies discussed earlier are implemented, such as continuous batching and chunked prefill. It also handles preemption, swapping low-priority requests' KV cache out to CPU memory when GPU memory runs short. A key scheduling decision is how to organize prefill and decode work: co-located on the same GPUs or disaggregated across separate pools.

In practice, the choice depends on scale and latency requirements. Co-located scheduling is simpler, works well on a single node, and keeps GPU utilization high by mixing prefill and decode work in the same batch. Disaggregation pays off in larger deployments: running prefill and decode on separate GPU pools removes interference between the two phases, giving more predictable TTFT and inter-token latency at the cost of transferring the KV cache between pools.

Model executor runs the actual transformer forward pass on the GPU. It manages kernel dispatch, attention backends, and parallelism strategies. It also handles quantization and speculative decoding.

GEMM Kernel

GEMM kernels are also tiled: each thread block stages tiles of A and B in shared memory and reuses them across many multiply-accumulates, so the full matrices do not have to be re-read from HBM for every output element.

// Simplified tiled GEMM kernel (CUDA pseudocode)
// C[M,N] = A[M,K] * B[K,N], tile sizes: BM, BN, BK
// (assumes BM == BN == BK and M, N, K divisible by the tile size; bounds checks omitted)
__global__ void gemm_tiled(float *A, float *B, float *C, int M, int N, int K) {
    __shared__ float As[BM][BK], Bs[BK][BN];

    int row = blockIdx.y * BM + threadIdx.y;
    int col = blockIdx.x * BN + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += BK) {
        // Cooperative load: each thread loads one element of A-tile and B-tile
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Compute partial dot product from shared memory
        for (int k = 0; k < BK; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
// Grid: dim3((N+BN-1)/BN, (M+BM-1)/BM), Block: dim3(BN, BM)
// Real implementations use register tiling (TM x TN per thread),
// vectorized loads, and Tensor Core MMA instructions.

FlashAttention Kernel

FlashAttention kernel is an example of kernel fusion: the matmul, softmax, and second matmul are fused into a single kernel that keeps intermediate results in SRAM.

// Simplified FlashAttention forward kernel (CUDA pseudocode)
// Q, K, V: [S, d] in HBM; O: [S, d] output
// Br, Bc: tile sizes chosen so tiles fit in SRAM (Br, Bc, d treated as compile-time constants here)
__global__ void flash_attn_fwd(float *Q, float *K, float *V, float *O,
                               int S, int d, float scale) {
    // Each thread block handles one Q-block (Br rows of output)
    int q_block = blockIdx.x;  // which Q tile
    __shared__ float Qs[Br][d], Ks[Bc][d], Vs[Bc][d];
    __shared__ float scores[Br][Bc];

    // Per-row running statistics (in registers/shared mem), initialized to m = -inf, l = 0, acc = 0
    float m[Br], l[Br], acc[Br][d];
    for (int i = 0; i < Br; i++) {
        m[i] = -INFINITY;                              // running max
        l[i] = 0.0f;                                   // running sum of exp
        for (int k = 0; k < d; k++) acc[i][k] = 0.0f;  // output accumulator
    }

    // Load Q block into SRAM (stays resident for all inner iterations)
    load_tile(Q, Qs, q_block * Br, Br, d);

    // Inner loop: iterate over all K, V blocks
    for (int kv = 0; kv < S; kv += Bc) {
        load_tile(K, Ks, kv, Bc, d);
        load_tile(V, Vs, kv, Bc, d);
        __syncthreads();

        // 1. Compute attention scores: S_ij = Qs @ Ks^T * scale
        for (int i = 0; i < Br; i++)
            for (int j = 0; j < Bc; j++) {
                scores[i][j] = 0;
                for (int k = 0; k < d; k++)
                    scores[i][j] += Qs[i][k] * Ks[j][k];
                scores[i][j] *= scale;
            }

        // 2. Online softmax update (per row)
        for (int i = 0; i < Br; i++) {
            float m_new = m[i];
            for (int j = 0; j < Bc; j++)
                m_new = max(m_new, scores[i][j]);

            float correction = exp(m[i] - m_new);
            float l_new = l[i] * correction;

            // Rescale previous accumulator
            for (int k = 0; k < d; k++)
                acc[i][k] *= correction;

            // Accumulate current tile
            for (int j = 0; j < Bc; j++) {
                float p = exp(scores[i][j] - m_new);
                l_new += p;
                for (int k = 0; k < d; k++)
                    acc[i][k] += p * Vs[j][k];
            }
            m[i] = m_new;
            l[i] = l_new;
        }
        __syncthreads();
    }

    // 3. Normalize and write back to HBM
    for (int i = 0; i < Br; i++)
        for (int k = 0; k < d; k++)
            O[(q_block * Br + i) * d + k] = acc[i][k] / l[i];
}
// Key: the S×S attention matrix never exists in HBM.
// HBM I/O: O(S^2 * d^2 / M) vs O(S^2 + S*d) for standard attention.

KV cache memory manager: This component manages the GPU memory devoted to KV caches, implementing the techniques from the KV cache management section, such as PagedAttention, prefix caching, KV quantization, and eviction/preemption.

GPU HBM: the physical GPU memory, which stores three main data structures: model weights (static, loaded once), KV cache pages (dynamic, grow/shrink per request), and activations.

Important inference metrics include: throughput, time to first token (TTFT), inter-token latency (ITL), end-to-end latency, and time per output token (TPOT).
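These metrics are related by a commonly used first-order approximation (ignoring queueing and scheduling overhead):

$$\text{end-to-end latency} \approx \text{TTFT} + (N_{\text{out}} - 1) \times \text{TPOT}, \qquad \text{throughput} \approx \frac{\text{batch size}}{\text{TPOT}},$$

where \(N_{\text{out}}\) is the number of generated tokens. TTFT is dominated by the compute-bound prefill, while TPOT/ITL are dominated by the bandwidth-bound decode step, which is why the two phases are optimized, and sometimes scheduled, separately.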

Attention-FFN Disaggregation

A recent refinement is to separate the attention and feedforward (FFN) layers within each phase, assigning them to different GPU groups. This idea is motivated by the observation that attention and FFN have fundamentally different resource profiles:

When attention and FFN are co-located on the same GPU, they compete for HBM: the KV cache consumes memory that could otherwise allow larger FFN batches, and vice versa. Separating them allows each GPU pool to be independently optimized:

In this architecture, a forward pass through one transformer layer requires a network hop: the attention GPU computes the attention output, sends the resulting hidden states to an FFN GPU, which computes the FFN output and sends the result back (or to the attention GPU for the next layer). This introduces inter-GPU communication overhead, so the approach is most beneficial when:

GPU Kernel Implementation Stack

Here, we discuss how kernel computations are actually implemented and executed on the GPU, using GEMM as a running example. The implementation spans multiple layers, from high-level language (e.g., Python) down to GPU machine code.

Layer 1: GPU kernels in CUDA C++

The actual code that runs on GPU streaming multiprocessors (SMs) is written in CUDA C++ (for NVIDIA GPUs) — a C++ dialect with GPU-specific extensions such as __global__ (marks a function as a kernel launchable from the CPU), threadIdx / blockIdx (built-in variables identifying each thread), and __shared__ (declares on-chip shared memory). The code is compiled by nvcc (NVIDIA's compiler) into PTX (a virtual ISA / intermediate representation) and then into SASS (the actual machine code for the target GPU architecture, e.g., SM_90 for Hopper). The compilation chain is analogous to C++ → LLVM IR → x86 assembly.
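For concreteness, a minimal kernel using these extensions, together with the corresponding compilation commands (illustrative; the architecture flag depends on the target GPU):

// Minimal CUDA C++ kernel illustrating the extensions above
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}
// Host-side launch: vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
// Compilation: nvcc -arch=sm_90 -ptx add.cu    -> PTX (virtual ISA)
//              nvcc -arch=sm_90 -cubin add.cu  -> cubin containing SASS for Hopper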

Layer 2: Optimized kernel libraries (cuBLAS, CUTLASS, FlashInfer)

Layer 3: Kernel DSLs — Triton and Pallas

Writing optimized CUDA C++ kernels is labor-intensive. Kernel DSLs provide a higher-level alternative:

Layer 4: Framework dispatch (Python → C++ → GPU)

When a user calls torch.matmul(A, B) in Python, the execution path is:

  1. Python → C++: the call crosses into PyTorch's C++ core via pybind11, reaching at::matmul.
  2. Backend dispatch: PyTorch's dispatcher selects the CUDA backend, invoking at::native::mm_cuda.
  3. Library call: the C++ implementation calls cuBLAS (cublasGemmEx) or a Triton-compiled kernel.
  4. Kernel launch: cuBLAS enqueues a pre-compiled SASS kernel onto the GPU's command queue via the CUDA runtime. The CPU returns immediately — kernel launches are asynchronous.
  5. GPU execution: the GPU hardware schedules thread blocks across SMs. Thousands of threads execute in parallel, reading from and writing to HBM.

Python never touches the GPU directly. It is an orchestrator that decides what to compute; the actual GPU work is always native SASS machine code.

CPU–GPU interaction model

The host (CPU) and device (GPU) communicate through the CUDA runtime and driver:

In an inference engine, the CPU-side scheduler (described above) continuously enqueues kernels onto CUDA streams while the GPU executes them, achieving overlap between scheduling decisions and GPU computation.
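A minimal host-side sketch of this asynchronous pattern (illustrative; error handling omitted, and my_kernel stands in for any __global__ function):

// Asynchronous enqueue on a CUDA stream (illustrative host code)
void run_step(const float *h_in, float *h_out, float *d_in, float *d_out,
              size_t n, cudaStream_t stream) {
    size_t bytes = n * sizeof(float);
    // Each call below enqueues work and returns to the CPU immediately.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    my_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, (int)n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
    // The CPU is free here, e.g., to run scheduling for the next batch.
    cudaStreamSynchronize(stream);  // block only when the result is actually needed
}
// h_in / h_out should be pinned (cudaHostAlloc) for the copies to be truly asynchronous.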

TPU note: TPUs do not use CUDA. Instead, XLA compiles computation graphs (from JAX or TensorFlow) through HLO IR into TPU-specific machine code. Users do not write TPU kernels directly — XLA handles this, or they use Pallas for custom kernels.

Popular Engines Comparison

vLLM is one of the most widely adopted inference engines. Its core contribution is PagedAttention, which enables efficient KV cache management via the virtual memory abstraction described above. The scheduler implements continuous batching with chunked prefill, and the executor supports FlashAttention-2, FlashInfer, and xFormers backends. vLLM provides an OpenAI-compatible API server and supports a wide range of quantization formats (GPTQ, AWQ, FP8, INT8). It also supports tensor, pipeline, and expert parallelism for multi-GPU serving, making it a strong default choice for production deployments.

SGLang focuses on efficient execution of complex LLM programs involving multiple generation calls, branching, and control flow. Its key innovation is RadixAttention, which uses a radix tree to automatically reuse KV cache across requests sharing common prefixes — this goes beyond vLLM's prefix caching by supporting arbitrary prefix patterns, not just the system prompt. SGLang also features a frontend DSL for expressing structured generation patterns (e.g., few-shot chains, constrained decoding) and a high-performance runtime built on FlashInfer as its primary attention backend.

TensorRT-LLM is NVIDIA's inference engine, built on top of the TensorRT compiler. Unlike vLLM and SGLang which dispatch generic PyTorch kernels, TensorRT-LLM compiles the entire model graph into fused CUDA kernels — merging operations like LayerNorm + QKV projection into a single kernel to minimize memory round-trips. It supports FP8, INT8, and INT4 quantization natively, and achieves the highest raw throughput on NVIDIA GPUs. The trade-off is a mandatory compilation step and a narrower set of supported models.

llama.cpp takes a fundamentally different approach: it is designed for running LLMs on commodity hardware (CPUs, laptops, phones) rather than GPU clusters. It uses the GGUF quantization format with many granularities (Q4_0, Q5_1, Q8_0, k-quants) to fit large models into limited memory. The codebase is pure C/C++ with optional acceleration via Metal (Apple), CUDA, Vulkan, and SYCL, making it the most portable engine. While it does not match GPU-based engines in throughput, it democratizes LLM access for local and edge deployment.

Table: Feature comparison of major LLM inference engines.
Feature | vLLM | SGLang | TensorRT-LLM | llama.cpp
--- | --- | --- | --- | ---
KV cache | PagedAttention | RadixAttention (radix-tree prefix reuse) | Paged KV cache (custom CUDA impl) | Contiguous cache
Batching | Continuous + chunked prefill | Continuous + chunked prefill | In-flight batching | Static / single request
Quantization | GPTQ, AWQ, FP8, INT8, bitsandbytes | GPTQ, AWQ, FP8, INT4 (via FlashInfer) | FP8, INT8, INT4, W4A16 (native TRT) | GGUF: Q4_0, Q5_1, Q8_0, k-quants
Attention backend | FlashAttention-2, FlashInfer, xFormers | FlashInfer (primary backend) | Custom fused MHA (TRT compiler) | Custom C/C++ (+ Metal, CUDA optional)
Speculative decoding | Draft model, ngram, Eagle | Eagle, draft model | Draft model, Medusa | Draft model
Parallelism | TP, PP, EP | TP, DP, EP | TP, PP, EP | Single GPU / CPU
Hardware | NVIDIA, AMD, TPU | NVIDIA (primary) | NVIDIA only | CPU, Metal, CUDA, Vulkan, SYCL
Best for | General-purpose serving at scale | Multi-call LLM programs, prefix-heavy workloads | Max throughput on NVIDIA GPUs | Local / edge deployment, laptop inference

Supporting Advanced Model Architectures

Hidden Attention (e.g., DeepSeek MLA)

Multi-head Latent Attention (MLA), introduced in DeepSeek-V2, compresses the KV cache by projecting keys and values into a low-rank latent space. Instead of caching per-head K and V tensors of dimension \(d_h\), MLA caches a single compressed latent vector \(c_t \in \mathbb{R}^{d_c}\) where \(d_c \ll n_h \cdot d_h\). During attention, the full keys and values are reconstructed on-the-fly via learned up-projection matrices: \(K = c_t W^{UK}\), \(V = c_t W^{UV}\). This dramatically reduces KV cache memory (e.g., from 2 × \(n_h d_h\) to just \(d_c\) per token per layer), enabling much larger batch sizes.

[Figure: MLA, standard vs absorbed inference — left: standard MHA/GQA materializes full K, V and caches 2·n_h·d_h values per token; right: MLA compresses h_t through W_DKV into a latent c_t (only d_c cached), absorbs W_UK into the query projection and W_UV into the output projection so attention runs in latent space without decompressing K/V, and concatenates decoupled RoPE components q_R, k_R for positional information.]
Figure: MLA inference. Left: standard MHA caches full K, V per head. Right: MLA caches only the compressed latent c_t; absorbed projections (W_Q·W_UK⊤) let attention run directly in latent space without decompressing K/V. Decoupled RoPE components (q_R, k_R) are concatenated for positional awareness. (Based on DeepSeek-V2, Figure 3.)
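To put rough numbers on the saving (illustrative dimensions, not DeepSeek-V2's exact configuration): with \(n_h = 32\) heads of dimension \(d_h = 128\), standard MHA caches

$$2 \times n_h \times d_h = 2 \times 32 \times 128 = 8192 \ \text{values per token per layer},$$

while MLA with latent dimension \(d_c = 512\) caches only 512 values (plus a small decoupled RoPE key), roughly a \(16\times\) reduction that translates directly into more concurrent sequences or longer contexts per GPU.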

Kernel-level challenges.

Engine-level changes.

Attention with Residual Connections

Some recent architectures introduce residual connections within the attention mechanism itself, beyond the standard post-attention residual. For example, Diff Transformer computes attention as the difference of two softmax attention maps: \(\text{DiffAttn}(X) = \bigl(\text{softmax}(Q_1 K_1^T) - \lambda \cdot \text{softmax}(Q_2 K_2^T)\bigr) V\), where \(\lambda\) is a learnable scalar. This cancels out attention noise and sharpens the model's focus on relevant context. Other variants include gated attention residuals (e.g., \(\text{Attn}(X) + \alpha \cdot X\) with a learnable gate \(\alpha\)) that blend the attention output with the input before the FFN.

[Figure: differential attention — DiffAttn(X) = (softmax(Q₁K₁⊤/√d) − λ·softmax(Q₂K₂⊤/√d))·V, with two Q/K projections sharing one V; A₂ acts as a subtracted "background" noise map.]
Figure: Differential attention. The input is projected into two independent Q/K pairs sharing the same V. Two softmax attention maps are computed and subtracted (scaled by learnable λ) to cancel attention noise, producing sharper attention patterns. The heatmaps illustrate how diffuse background attention (A₂) is removed from the noisy map (A₁) to yield a clean signal. (Based on Differential Transformer, Figure 1.)
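Because both attention maps share the same \(V\), the subtraction distributes over the value multiplication, so differential attention (as written above) can be served with two standard FlashAttention calls whose outputs are combined elementwise:

$$\text{DiffAttn}(X) = \text{softmax}\!\left(\tfrac{Q_1 K_1^\top}{\sqrt{d}}\right) V \;-\; \lambda \, \text{softmax}\!\left(\tfrac{Q_2 K_2^\top}{\sqrt{d}}\right) V = O^{(1)} - \lambda \, O^{(2)}.$$

This reuses existing attention kernels unchanged, at the cost of reading the shared \(V\) cache twice; a dedicated fused kernel can avoid the duplicate reads, which is one of the kernel-level considerations.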

Kernel-level challenges.

Engine-level changes.

Mixture-of-Experts (MoE)

MoE models (e.g., Mixtral, DeepSeek-V3, Qwen-MoE) replace the dense FFN in each transformer layer with a set of expert FFN sub-networks, activating only a subset (top-\(k\)) per token. A gating network (router) computes token-expert affinities: \(G(x) = \text{TopK}(\text{softmax}(x W_g), k)\), selecting \(k\) experts out of \(E\) total. This allows the model to scale parameters without proportionally scaling per-token compute — e.g., a model with 8 experts and top-2 routing uses only 2/8 of the FFN parameters per token.
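As a concrete (illustrative) count using the FFN shape from earlier (\(d_{ff} = 4d\), two weight matrices per expert): each expert holds \(2 \times d \times 4d = 8d^2\) parameters, so an 8-expert layer stores \(64d^2\) FFN parameters, but a top-2 router activates only

$$k \cdot 8d^2 = 16d^2 \quad (k = 2, \ E = 8),$$

i.e., 8× the dense-FFN parameter count at roughly 2× the dense-FFN compute per token, plus a negligible \(d \times E\) router GEMV.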

Kernel-level challenges.

Engine-level changes.

In practice, adding support for a new architecture in an engine like vLLM or SGLang typically involves:

  1. Model definition: implement the model's forward pass in the engine's model registry, mapping new layers (MLA attention, MoE FFN, diff attention) to the appropriate kernel calls.
  2. Custom kernels: write or adapt fused GPU kernels for the non-standard operations — this is usually the most performance-critical step. Libraries like FlashInfer and CUTLASS provide building blocks (e.g., grouped GEMM, customizable attention templates) that reduce the effort.
  3. KV cache layout: adjust the page allocator and page-table logic if the architecture changes what is cached (latent vectors instead of raw KV, different head counts, etc.).
  4. Config and dispatch: the engine's model loader must parse the architecture config (e.g., num_experts, kv_lora_rank, num_sub_heads) and route to the correct code paths throughout the stack.
  5. Parallelism strategy: MoE models add expert parallelism as a new dimension. The engine must support hybrid TP+EP (or DP+EP) with the required all-to-all communication, which is architecturally different from the all-reduce used in tensor parallelism.

The rapid pace of architectural innovation means inference engines must be designed with modularization and extensibility. The key insight is that the interface between the scheduler, executor, and memory manager should be abstract enough to accommodate different cache formats and compute patterns, while still enabling the fused, architecture-specific kernels that are essential for performance.