
Transformers and LLMs

Jan. 2025


This post covers the foundations and advanced research topics of transformers and large language models. Transformers were first introduced for machine translation and demonstrated better performance than RNNs and LSTMs, which were the dominant models for sequential data at the time. The self-attention mechanism allows transformers to capture long-term dependencies in the input. The model depth is pre-specified and does not grow with the input length, so transformers do not suffer from the gradient vanishing/exploding problems of RNNs and LSTMs. Later, the transformer encoder was used for unsupervised pre-training, serving as a foundation model for many downstream NLP classification tasks. Transformers scale easily to large models by adding more layers and more attention heads, and their architecture naturally supports parallelism and distributed computing, which makes it practical to train very large models on large unlabeled datasets. Such high-capacity models generalize better and have become the dominant models in many NLP tasks. Transformers were also generalized to images and other domains with domain-specific tokenization methods, making them suitable for handling different types of data (multi-modal). Another line of work trains the transformer decoder as the foundation model for generative tasks. The training task, next-token prediction, is more difficult than the masked language modeling (MLM) used to pre-train encoders and thus yields models with stronger generation capabilities. This forms the foundation of today's LLMs.

Table of Contents

Transformers

As shown in the figure, the transformer model has an encoder-decoder structure that is composed of multiple transformer blocks, where each block has self-attention, layer normalization, and fully connected layers. In NLP tasks, the model takes as input a batch of text sequences, \(X_s \in \mathbb{N}^{B \times S_x}\), where \(B\) is the batch size and \(S_x\) is the maximum sequence length. The input is first tokenized into a sequence of tokens denoted as \(X \in \{0,1\}^{B \times S \times |V|}\), where \(S\) is the maximum tokenized sequence length and \(|V|\) is the vocabulary size. Note that \(S\) may not always be equal to \(S_x\) during tokenization, as each input word may be divided into multiple tokens. Each token is represented as a one-hot vector of size \(|V|\). Before feeding the input to the model, the token representations first need to be transformed into a continuous space to support gradient computation. This step is called embedding; it is done by multiplying \(X\) with a learnable embedding matrix \(W_e \in \mathbb{R}^{|V| \times d}\), where \(d\) is the embedding dimension. The output embedding is denoted as \(E \in \mathbb{R}^{B \times S \times d}\). The figure below shows the architecture of the transformer model. Below, we break down the transformer block.
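
To make the tokenization-to-embedding step concrete, here is a minimal NumPy sketch (all sizes and variable names are toy assumptions). It builds the one-hot tensor \(X\) and multiplies it by \(W_e\), which is equivalent to a simple embedding lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, V, d = 2, 5, 100, 16                    # toy batch size, token length, vocab size, embedding dim

token_ids = rng.integers(0, V, size=(B, S))   # tokenized input as integer ids
X = np.eye(V)[token_ids]                      # one-hot tokens, shape (B, S, |V|)
W_e = rng.normal(size=(V, d)) * 0.02          # learnable embedding matrix
E = X @ W_e                                   # embeddings, shape (B, S, d)
assert np.allclose(E, W_e[token_ids])         # the matmul is just an embedding lookup
```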

Multi-head attention

Attention was originally introduced in the context of seq2seq models for machine translation [2]. Given an input \(E\) of size \(B \times S \times d\), the original additive attention first computes the attention weights \(\alpha = \text{softmax}(f(E)) \in \mathbb{R}^{B \times S}\), then calculates the output as \(O = \sum_{j=1}^{S} \alpha_{:,j}\odot E_{:,j,:} \in \mathbb{R}^{B \times d}\). This attention introduces a set of parameters in the linear layer \(f\). In transformers, the dot-product attention is used, which does not introduce additional parameters. It defines a query matrix \(Q\), a key matrix \(K\), and a value matrix \(V\). The attention weights are calculated by \(\alpha = \text{softmax}(QK^T/\sqrt{d_h})\), which computes the similarity between keys and queries. Then, the output is a weighted sum \(O = \alpha V\). The attention weights control the importance of each value in the output. They help the model focus on the most relevant information, filtering out noise and irrelevant information. As such, the model can handle long inputs and capture global long-term dependencies.
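
As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention (function names are illustrative; masking and batching are omitted).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract the row max for numerical stability
    return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Q, K, V: (S, d_h). Returns the attention-weighted sum of the values, shape (S, d_h)."""
    d_h = Q.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d_h))        # (S, S) attention weights
    return alpha @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(6, 8))                # self-attention: queries, keys, values share the input
print(dot_product_attention(Q, K, V).shape)        # (6, 8)
```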

Multi-head scaled self-attention. As shown in the figure above, the attention mechanism takes the same input for calculating queries, keys, and values. This is called self-attention. In addition, to encourage the model to learn different hidden correlations of the input, transformers use multi-head attention, where in each head, the input is first transformed into a unique subspace and then dot-product attention is computed in each subspace. Suppose we have \(h\) attention heads; the hidden dimension of each head is \(d_h = d/h\). The input is first transformed into \(Q_i, K_i, V_i \in \mathbb{R}^{B \times S \times d_h}\) for each head \(i\): \(Q_i = E W_{Qi}\), \(K_i = E W_{Ki}\), \(V_i = E W_{Vi}\), where \(W_{Qi}, W_{Ki}, W_{Vi} \in \mathbb{R}^{d \times d_h}\) are learnable parameters. The output of each head is calculated as \(O_i = \text{softmax}(Q_i K_i^T/\sqrt{d_h}) V_i\). The final output of the multi-head attention is the concatenation of all heads, \(O_a = [O_1, O_2, \ldots, O_h] W_o\), where \(W_o \in \mathbb{R}^{d \times d}\) is a learnable parameter. The dot product is scaled by \(1/\sqrt{d_h}\) to avoid large values entering the softmax function. We notice that the output dimension is the same as the input dimension, and varying the number of attention heads does not change the total number of model parameters.
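
The sketch below extends this to multi-head self-attention for a single sequence (toy sizes, no masking or dropout; initialization scales are arbitrary). Note that the output has the same shape as the input.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
S, d, h = 8, 32, 4
d_h = d // h                                                   # hidden dimension per head
E = rng.normal(size=(S, d))                                    # one sequence of embeddings

W_Q, W_K, W_V = (rng.normal(size=(h, d, d_h)) * 0.1 for _ in range(3))
W_o = rng.normal(size=(d, d)) * 0.1                            # output projection

heads = []
for i in range(h):
    Q_i, K_i, V_i = E @ W_Q[i], E @ W_K[i], E @ W_V[i]         # (S, d_h) each
    heads.append(softmax(Q_i @ K_i.T / np.sqrt(d_h)) @ V_i)    # scaled dot-product attention per head
O_a = np.concatenate(heads, axis=-1) @ W_o                     # (S, d)
assert O_a.shape == E.shape                                    # output keeps the input dimension
```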

Layer norm and residual connection

The multi-head attention output \(O_a\) is then added to the input and passed through layer normalization. The sum \(O_a + E\) is a residual connection, which was first introduced in ResNet for image classification. The residual connection makes it easy for a layer to learn an identity mapping and also makes training easier (avoiding degradation and gradient vanishing), especially for deep networks. Layer normalization computes its statistics over the feature dimension separately for each position in the sequence, so it is not affected much by padding. In comparison, batch normalization computes statistics for each feature across the batch, which is skewed when the inputs are heavily padded.
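
The sketch below (toy sizes, learnable scale/shift omitted) shows the difference in the axes over which the two normalizations compute their statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, d = 2, 4, 8
O_a = rng.normal(size=(B, S, d))

# Layer norm: statistics are computed per position over the feature dimension,
# so padded positions do not contaminate the statistics of other positions.
mu = O_a.mean(axis=-1, keepdims=True)
var = O_a.var(axis=-1, keepdims=True)
layer_normed = (O_a - mu) / np.sqrt(var + 1e-5)

# Batch norm (for comparison): statistics are computed per feature across the batch
# (and positions), so heavy padding shifts the statistics.
mu_b = O_a.mean(axis=(0, 1), keepdims=True)
var_b = O_a.var(axis=(0, 1), keepdims=True)
batch_normed = (O_a - mu_b) / np.sqrt(var_b + 1e-5)
```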

Demonstration of the difference between layer norm (top) and batch norm (bottom).

Linear layers

The output of layer normalization is then passed through two linear layers. $$ \text{FFN}(O_l) = \text{ReLU}(O_l W_1 + b_1) W_2 + b_2. $$ The dimensions of the weight matrices are \(W_1 \in \mathbb{R}^{d \times d_{ff}}\) and \(W_2 \in \mathbb{R}^{d_{ff} \times d}\), where \(d_{ff} = 4d\) is pre-specified. The output is again of the same dimension as the input and is passed to the next transformer block. The main hyper-parameters of the model are the number of layers and the number of attention heads.
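
A minimal NumPy sketch of this position-wise feedforward network (toy sizes, no batching):

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 8, 32
d_ff = 4 * d                                            # pre-specified hidden width
O_l = rng.normal(size=(S, d))

W_1, b_1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)

ffn_out = np.maximum(O_l @ W_1 + b_1, 0) @ W_2 + b_2    # ReLU(O_l W_1 + b_1) W_2 + b_2
assert ffn_out.shape == O_l.shape                       # output keeps the input dimension
```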

Decoder

As shown in the figure, the decoder has three differences from the encoder: 1) its self-attention is masked (causal), so each position can only attend to itself and earlier positions; 2) each block has an additional cross-attention layer whose queries come from the decoder and whose keys and values come from the encoder output; and 3) the decoder output is passed through a final linear layer and softmax to predict the next token.

More recent transformer architectures

BERT: Bidirectional Encoder Representations from Transformers

The BERT model is a classic variant of the original transformer model. It was widely used for learning hidden representations of input text and demonstrated better performance than earlier language models such as RNNs. The key technique points of this model are as follows:

BERT is good for classification tasks as it can learn hidden correlations in the input, and the [CLS] token provides a holistic representation of the entire input. There are some variants of the BERT model with changes in the embedding layer or tokenization, such as RoBERTa. Two common model sizes are BERT-base (12 layers, 110M parameters) and BERT-large (24 layers, 340M parameters). This line of models cannot perform generative tasks well because the encoder can access future tokens during training and the models are not large enough.

GPT: Generative Pre-trained Transformer

GPT models are another line of variants of the original transformer model. The key differences between GPT and BERT are: 1) GPT uses the transformer decoder while BERT uses the encoder; 2) GPT is trained with next-token prediction (autoregressive language modeling) while BERT mainly relies on MLM. The GPT training task is more difficult, but once the model is well trained, it has better generative capabilities than BERT models. GPT models also typically have many more parameters than BERT models, so that they have enough capacity for this harder training task. GPT models are the foundation of modern LLMs. The most influential papers in this line are GPT-1, which proposes the GPT model; GPT-2, which demonstrates that scaling up the model significantly improves performance; and GPT-3, which first proposes test-time prompting techniques such as few-shot (in-context) prompting. The key technique points of this model are as follows:

MoE: Mixture of Experts

MoE is a recently emerging model structure that shows better performance than vanilla GPT models. The idea stems from the observation that different parts of a large model can specialize for certain data or applications. Motivated by this observation, MoE combines multiple smaller expert models and tests whether they can beat a single large model. Exploration started with mixing the fully-connected layers in CNNs or RNNs and was later extended to the entire attention block. The most recent success of MoE is the DeepSeek-V3 model, which shows better performance than SOTA large models at a much lower training cost.

Early MoEs.

Mistral MoEs. The MoE models from Mistral work well on certain tasks, especially the Mixtral-8x7B model. The model uses efficient attention mechanisms (grouped-query attention and sliding-window attention), which we will introduce later when talking about different attention variations. Its MoE follows a sparse gating mechanism without adding noise: \(y = \sum_i G(x)_i E_i(x)\), \(G(x) = \text{softmax}(\text{TopK}(xW_g))\), where TopK keeps the \(K\) largest logits and sets the rest to \(-\infty\). Each expert \(E_i(x)\) is a SwiGLU (Swish-Gated Linear Unit) feedforward network, and the single FFN in each transformer block is replaced by this MoE layer. Basically, they still mix the final linear layers in a transformer block.
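
A minimal NumPy sketch of this top-K gating for a single token (toy sizes; the experts are plain ReLU FFNs standing in for SwiGLU, and only the selected experts are evaluated):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2
x = rng.normal(size=(d,))                          # FFN input of one token
W_g = rng.normal(size=(d, n_experts)) * 0.1        # router (gate) weights

# Toy experts: two-layer ReLU FFNs standing in for SwiGLU experts.
experts = [(rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1)
           for _ in range(n_experts)]

logits = x @ W_g                                   # (n_experts,) router logits
top = np.argsort(logits)[-top_k:]                  # indices of the top-K experts
gates = softmax(logits[top])                       # softmax over the kept logits only

# Only the selected experts are evaluated; the rest contribute nothing.
y = sum(g * (np.maximum(x @ W1, 0) @ W2)
        for g, (W1, W2) in zip(gates, (experts[i] for i in top)))
```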

Demonstration of the MoE mechanism in the Mixtral model. Router is the gate. [2]

DeepSeek MoEs. DeepSeek's latest model, DeepSeek-V3, also uses an MoE structure for its feedforward network. Its key idea is to use finer-grained experts and to isolate some experts as shared ones. Given the FFN input (attention output) of a certain token \(u \in \mathbb{R}^{d}\), it sets up a total of \(N_s+N_r\) FFNs (shared and routed, respectively). The output is \(u + \sum_{i=1}^{N_s} \text{FFN}_i^{(s)}(u) + \sum_{i=1}^{N_r} g_i\, \text{FFN}_i^{(r)}(u)\). The gating network \(g\) is similar to the top-K sparse gating introduced above. The main difference is that, instead of computing \(xW_g\), DeepSeek computes the affinity scores \(\text{Sigmoid}(u^T e_i)\), where \(e_i\) is the centroid vector of the \(i\)-th routed expert; the scores of the selected experts are then normalized to produce the gates. The sigmoid is newly added in DeepSeek-V3 and may help with numerical stability.
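
A minimal NumPy sketch of this shared-plus-routed design for one token (toy sizes; plain ReLU FFNs stand in for the real experts, and DeepSeek-V3's load-balancing bias is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N_s, N_r, top_k = 16, 1, 8, 2
u = rng.normal(size=(d,))                                 # FFN input (attention output) of one token

def make_ffn():
    return (rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1)

def ffn(u, W1, W2):
    return np.maximum(u @ W1, 0) @ W2

shared = [make_ffn() for _ in range(N_s)]                 # shared experts: always applied
routed = [make_ffn() for _ in range(N_r)]                 # routed experts: sparsely applied
centroids = rng.normal(size=(N_r, d)) * 0.1               # e_i, one centroid per routed expert

scores = 1.0 / (1.0 + np.exp(-centroids @ u))             # Sigmoid(u^T e_i) affinity scores
top = np.argsort(scores)[-top_k:]                         # keep the top-K routed experts
g = scores[top] / scores[top].sum()                       # normalize the kept scores into gates

h = (u
     + sum(ffn(u, *shared[i]) for i in range(N_s))
     + sum(g_j * ffn(u, *routed[i]) for g_j, i in zip(g, top)))
```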

Demonstration of the DeepSeek-V3 model structure [3]. The MLA will be introduced later.

Transformer component optimizations

This part discusses optimizations over the vanilla transformer that have become standard in SOTA models. We will focus on the tokenizer, positional embedding, and attention. Note that MoE can also be viewed as an optimization of the FFN in the transformer block.

Tokenizer

Positional embedding

Attention

Below, we introduce some techniques to improve the efficiency of computing attention during training and testing.

Training

Here, we go over the widely used pre-training and fine-tuning methods in GPT models. We will also demonstrate the backward pass of the transformer model. Data is the first important factor in both pre-training and fine-tuning. Early works aim to collect high-quality data from the wild, often training another model to filter out low-quality samples. However, data is like "fossil fuel": less and less fresh, untouched data can be found. More recent work explores how to use LLMs to generate data for continued training.

Pretraining

As discussed above, the pretraining objective function is the next-token prediction loss. The model predicts each token based on all previous tokens; during training, the ground-truth previous tokens are used as context (teacher forcing). The objective function is written as: \(\sum_{i=2}^{T} \text{CE}(f(\mathbf{x}_{1}, \ldots, \mathbf{x}_{i-1}; \Theta), \mathbf{x}_i)\), where \(\text{CE}\) represents the cross-entropy loss. In other words, generating every token is equivalent to conducting a \(|V|\)-class classification. MoE models typically add an expert balance loss to prevent routing collapse.
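
A minimal NumPy sketch of this loss for one sequence (toy sizes; in practice the logits come from the model's forward pass and the loss is averaged over a batch):

```python
import numpy as np

def next_token_loss(logits, tokens):
    """logits: (T, V) outputs at positions 1..T; tokens: (T,) ground-truth token ids.
    Position i's logits are used to predict token i+1 (teacher forcing)."""
    T, V = logits.shape
    pred, target = logits[:-1], tokens[1:]                 # shift: predict x_2..x_T from their prefixes
    pred = pred - pred.max(axis=1, keepdims=True)          # numerical stability
    log_probs = pred - np.log(np.exp(pred).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(T - 1), target].mean()     # average |V|-class cross-entropy

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))                          # toy: T = 6 positions, |V| = 10
tokens = rng.integers(0, 10, size=6)
print(next_token_loss(logits, tokens))
```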

Fine-tuning

Backward pass

Similar to the analysis in the MLP post, we compute the backward pass of a simple transformer model to demonstrate the process. Consider a decoder-only transformer model with one transformer block. The input embedding is \(E \in \mathbb{R}^{S \times d}\). Suppose the model has \(h\) attention heads and the hidden dimension of each head is \(d_h = d/h\). Note that this section uses the convention \(W_{Qi}, W_{Ki}, W_{Vi} \in \mathbb{R}^{d_h \times d}\) (with a transpose in the forward pass) for derivation convenience, which is the transpose of the convention used in the architecture section. The forward pass of the model is: $$ Q_i = E W_{Qi}^T,\quad K_i = E W_{Ki}^T,\quad V_i = E W_{Vi}^T, $$ $$ O_i = \text{softmax}\!\left(\frac{Q_iK_i^T}{\sqrt{d_h}} + M\right) V_i, $$ $$ O_a = [O_1, O_2, \ldots, O_h] W_o, $$ $$ O_l = \text{LayerNorm}(O_a + E), $$ $$ \text{FFN}_{\text{out}} = \text{ReLU}(O_lW_1^T+b_1)W_2^T+b_2, $$ $$ O = \text{FFN}_{\text{out}} W_{\text{cls}}^T,\quad P = \text{softmax}(O_{S,:}), $$ where \(M\) is the additive causal mask (\(0\) on allowed positions and \(-\infty\) on masked positions), \(W_{\text{cls}} \in \mathbb{R}^{|V| \times d}\) is the classification head, and \(P \in \mathbb{R}^{|V|}\). The final loss for predicting the next token is the cross-entropy loss: \(L = -\log P_j\), where \(j\) is the index of the true next token.

Now, we compute the backward pass. First, we compute the backward pass of the classification head and the feedforward layers, which is essentially the same as in the MLP model.

Then, we compute the gradient for the attention layer. Note that the attention operation itself does not introduce any parameters, so we mainly compute the gradients for the layer normalization layer and the linear transformations of each attention head. Note that we can ignore the residual connection when computing these gradients within the block, but it will contribute additively to the gradient for the input embedding.

Here, we first need to handle the layer normalization, which normalizes the input along the last dimension (feature). $$ \mu_i = \frac{1}{d}\sum_{j=1}^{d} (O_a)_{ij},\quad \sigma_i^2 = \frac{1}{d} \sum_{j=1}^{d}\left((O_a)_{ij} - \mu_i\right)^2, $$ $$ (\hat{O}_a)_{ij} = \frac{(O_a)_{ij} - \mu_i}{\sqrt{\sigma_i^2+\epsilon}},\quad (O_l)_{ij} = \gamma_j (\hat{O}_a)_{ij} + \beta_j. $$ Let \(g_{ij} \equiv \frac{\partial L}{\partial (\hat{O}_a)_{ij}} = \frac{\partial L}{\partial (O_l)_{ij}}\,\gamma_j\). The gradients are as follows: $$ \frac{\partial L}{\partial \gamma_j} = \sum_{i=1}^{S} \frac{\partial L}{\partial (O_l)_{ij}}\,(\hat{O}_a)_{ij},\quad \frac{\partial L}{\partial \beta_j} = \sum_{i=1}^{S} \frac{\partial L}{\partial (O_l)_{ij}}, $$ $$ \frac{\partial L}{\partial (O_a)_{ij}} = \frac{1}{\sqrt{\sigma_i^2+\epsilon}}\left( g_{ij} - \frac{1}{d}\sum_{k=1}^{d} g_{ik} - (\hat{O}_a)_{ij}\,\frac{1}{d}\sum_{k=1}^{d} g_{ik}\,(\hat{O}_a)_{ik} \right). $$
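
A minimal NumPy sketch of this layer-norm backward pass (function and variable names are illustrative):

```python
import numpy as np

def layernorm_backward(dOl, Oa, gamma, eps=1e-5):
    """dOl: gradient w.r.t. the layer-norm output, shape (S, d).
    Returns gradients w.r.t. the input Oa and the affine parameters gamma, beta."""
    S, d = Oa.shape
    mu = Oa.mean(axis=1, keepdims=True)
    var = Oa.var(axis=1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (Oa - mu) * inv_std                            # normalized activations

    dgamma = (dOl * x_hat).sum(axis=0)                     # accumulate over sequence positions
    dbeta = dOl.sum(axis=0)

    g = dOl * gamma                                        # g_ij = dL/d(x_hat)_ij
    # Per-row backward through the normalization (mean and variance depend on every feature).
    dOa = inv_std * (g
                     - g.mean(axis=1, keepdims=True)
                     - x_hat * (g * x_hat).mean(axis=1, keepdims=True))
    return dOa, dgamma, dbeta

# Example usage with S = 4 positions and d = 8 features.
rng = np.random.default_rng(0)
dOa, dgamma, dbeta = layernorm_backward(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), np.ones(8))
```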

With the gradient for the layer normalization input, \(\frac{\partial L}{\partial O_a}\), we can compute the gradients for the linear transformations in each attention head. Recall that \(O_a = [O_1, O_2, \ldots, O_h] W_o\), so $$ \frac{\partial L}{\partial W_o} = [O_1, O_2, \ldots, O_h]^T \frac{\partial L}{\partial O_a},\quad \frac{\partial L}{\partial [O_1, O_2, \ldots, O_h]} = \frac{\partial L}{\partial O_a}\, W_o^T, $$ and \(\frac{\partial L}{\partial O_i}\) is the \(i\)-th block of \(d_h\) columns of the latter.

Let \(A_i = Q_i K_i^T/\sqrt{d_h} + M\) and \(S_i = \text{softmax}(A_i)\), so that \(O_i = S_i V_i\). For \(V_i\) of the \(i\)-th attention head, $$ \frac{\partial L}{\partial V_i} = S_i^T \frac{\partial L}{\partial O_i},\quad \frac{\partial L}{\partial S_i} = \frac{\partial L}{\partial O_i}\, V_i^T. $$ Backpropagating through the row-wise softmax, $$ \left(\frac{\partial L}{\partial A_i}\right)_{jk} = (S_i)_{jk}\left( \left(\frac{\partial L}{\partial S_i}\right)_{jk} - \sum_{m} \left(\frac{\partial L}{\partial S_i}\right)_{jm} (S_i)_{jm} \right). $$ For \(Q_i\) of the \(i\)-th attention head, \(\frac{\partial L}{\partial Q_i} = \frac{1}{\sqrt{d_h}}\,\frac{\partial L}{\partial A_i}\, K_i\). For \(K_i\) of the \(i\)-th attention head, \(\frac{\partial L}{\partial K_i} = \frac{1}{\sqrt{d_h}}\left(\frac{\partial L}{\partial A_i}\right)^T Q_i\). Finally, since \(Q_i = E W_{Qi}^T\) (and similarly for \(K_i, V_i\)), the weight gradients are \(\frac{\partial L}{\partial W_{Qi}} = \left(\frac{\partial L}{\partial Q_i}\right)^T E\), \(\frac{\partial L}{\partial W_{Ki}} = \left(\frac{\partial L}{\partial K_i}\right)^T E\), and \(\frac{\partial L}{\partial W_{Vi}} = \left(\frac{\partial L}{\partial V_i}\right)^T E\).
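
A minimal NumPy sketch of the per-head attention backward pass described above (names are illustrative; the softmax is recomputed here for clarity rather than cached from the forward pass):

```python
import numpy as np

def attention_backward(dO, Q, K, V, M):
    """Backward of O = softmax(Q K^T / sqrt(d_h) + M) V for one head.
    dO: gradient w.r.t. O, shape (S, d_h). Returns gradients w.r.t. Q, K, V."""
    d_h = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d_h) + M
    A = A - A.max(axis=-1, keepdims=True)
    S_ = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)     # softmax probabilities S_i

    dV = S_.T @ dO                                             # from O = S V
    dS = dO @ V.T
    # Row-wise softmax backward: dA = S * (dS - sum(dS * S, axis=-1))
    dA = S_ * (dS - (dS * S_).sum(axis=-1, keepdims=True))
    dQ = dA @ K / np.sqrt(d_h)
    dK = dA.T @ Q / np.sqrt(d_h)
    return dQ, dK, dV
```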

Efficient training

Parallelism training

There are four major parallelism mechanisms: data parallelism, model parallelism, pipeline parallelism, and tensor parallelism. Data parallelism splits the data and assigns different chunks to different devices; it copies the full model to all devices and synchronizes the gradients calculated on each data chunk. This approach cannot deal with large models that do not fit on a single device. Model parallelism splits the model and assigns different parts to different devices. Vanilla model parallelism cannot fully utilize the computation power of each device because the earlier/later layers need to wait when computing the later/earlier layers. We can see that there is a trade-off between memory and speed in data and model parallelism. Pipeline parallelism splits the model and data into different parts (and the training into different stages) and assigns different stages to different devices. There are different pipeline parallelism strategies, all of which try to strike a balance in the memory–speed trade-off. My earlier post on MLP has a more detailed discussion of these three parallelism mechanisms.

Tensor parallelism works at a lower level: it divides the model weights into different parts and assigns different parts to different devices. The MLP post discusses the forward/backward pass under tensor parallelism for the MLP model, which is also an essential component of transformers. Here, we discuss tensor parallelism for the attention module in the transformer model.
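
As a concrete sketch, Megatron-style tensor parallelism splits the attention heads across devices: each device holds the query/key/value projections of its own heads and the matching rows of \(W_o\), computes a partial output, and an all-reduce sums the partial results. Below is a minimal single-process NumPy simulation (toy sizes; "devices" are just function calls here).

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
S, d, h, n_dev = 8, 32, 4, 2
d_h = d // h
E = rng.normal(size=(S, d))

W_Q, W_K, W_V = (rng.normal(size=(h, d, d_h)) * 0.1 for _ in range(3))
W_o = rng.normal(size=(d, d)) * 0.1                 # rows i*d_h:(i+1)*d_h belong to head i

def device_forward(head_ids):
    """One 'device' computes attention for its heads and its slice of the output projection."""
    partial = np.zeros((S, d))
    for i in head_ids:
        O_i = softmax(E @ W_Q[i] @ (E @ W_K[i]).T / np.sqrt(d_h)) @ (E @ W_V[i])
        partial += O_i @ W_o[i * d_h:(i + 1) * d_h]  # this head's rows of W_o
    return partial

# Each device produces a partial result; an all-reduce (here: a plain sum) gives the full output.
O_a = sum(device_forward(range(k, h, n_dev)) for k in range(n_dev))

# Sanity check against the unsharded computation.
full = np.concatenate([softmax(E @ W_Q[i] @ (E @ W_K[i]).T / np.sqrt(d_h)) @ (E @ W_V[i])
                       for i in range(h)], axis=-1) @ W_o
assert np.allclose(O_a, full)
```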

LoRA

LoRA is an efficient fine-tuning method that learns a task-specific weight update \(\Delta\Phi\). \(\Delta\Phi\) adds a weight \(\Delta W = BA\) to each weight \(W \in \mathbb{R}^{d\times k}\), where \(B \in \mathbb{R}^{d\times r}\) and \(A \in \mathbb{R}^{r\times k}\) have a low rank \(r \ll \min(d, k)\). When fine-tuning, the model only needs to learn the low-rank matrices \(B\) and \(A\) under a given objective function. Follow-up works proposed Q-LoRA, which combines quantization with LoRA, and LongLoRA, which combines LoRA with RoPE and position interpolation.
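
A minimal NumPy sketch of a LoRA-augmented linear layer (toy sizes; the \(\alpha/r\) scaling factor and the zero/small-random initialization follow common practice and are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8                        # W is d x k; LoRA rank r << min(d, k)
alpha = 16                                 # scaling hyper-parameter (illustrative value)

W = rng.normal(size=(d, k)) * 0.02         # frozen pre-trained weight
B = np.zeros((d, r))                       # low-rank factor, initialized to zero
A = rng.normal(size=(r, k)) * 0.01         # low-rank factor, small random init

def lora_forward(x):
    """x: (batch, d). Frozen path plus low-rank update; only A and B are trained."""
    return x @ W + (alpha / r) * (x @ B) @ A

x = rng.normal(size=(2, d))
print(lora_forward(x).shape)               # (2, k); at init the update is zero since B = 0
```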

Inference

In LLMs, inference refers to the process of generating responses based on the input (in probabilistic models, inference means computing posterior distributions of hidden variables). Recent research has proposed a number of prompting techniques. Below, we briefly discuss some of the most representative methods. For a complete list, please refer to the Prompting Guide.

Besides the representative methods introduced above, which use human-discovered heuristics for prompt design, there is another line of work that aims to automatically generate prompts. Such methods are also used for LLM red-teaming, which generates adversarial prompts targeting different attack goals. See this post for more details.

Efficient Inference

This part introduces three widely used techniques for efficient inference in LLMs: KV cache, quantization, and efficient decoding.

KV cache is used during autoregressive generation; it stores the key and value for the previously generated tokens. Recall that the attention weights are computed as \(qK^T\), where \(q \in \mathbb{R}^{1\times d}\) is the query for the current token and \(K \in \mathbb{R}^{t \times d}\) contains the keys for the current and previous tokens. The KV cache stores the key and value for the previous tokens, so that they only need to be calculated once. $$ q_t = E_tW_{Q},\ k_t = E_tW_{K},\ v_t = E_tW_{V},\ K = [k_{1:t-1};\,k_t],\ V = [v_{1:t-1};\,v_t],\ o = \text{softmax}(q_t K^T)\,V,$$ where \(o \in \mathbb{R}^{1 \times d}\). When implementing the KV cache, we can use a dictionary to store \(K\) and \(V\) and update it in the forward function of the model class.
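
A minimal single-head NumPy sketch of a KV cache (toy sizes; a real implementation keeps one cache per layer and per head, typically updated inside the model's forward function):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_Q, W_K, W_V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

cache = {"K": np.zeros((0, d)), "V": np.zeros((0, d))}   # grows by one row per generated token

def attend_with_cache(e_t):
    """e_t: (d,) embedding of the current token. Only its own q, k, v are computed;
    keys and values of earlier tokens are read from the cache."""
    q_t, k_t, v_t = e_t @ W_Q, e_t @ W_K, e_t @ W_V
    cache["K"] = np.vstack([cache["K"], k_t])            # K = [k_{1:t-1}; k_t]
    cache["V"] = np.vstack([cache["V"], v_t])
    alpha = softmax(q_t @ cache["K"].T / np.sqrt(d))     # attention over all cached positions
    return alpha @ cache["V"]                            # (d,)

for t in range(5):                                       # simulate five decoding steps
    o = attend_with_cache(rng.normal(size=(d,)))
```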

Quantization is the process of reducing the precision of the model weights and activations, which can reduce the memory and computation cost. The default precision is 32-bit floating point, which can be reduced to 16-bit floating point or even 8-bit integer. Quantization: \(q = \text{round}(x/s) + z\), where \(s = \frac{x_{\max}-x_{\min}}{q_{\max}-q_{\min}}\) and \(z = q_{\min} - \text{round}(x_{\min}/s)\). Dequantization: \(x = s(q-z)\). Quantization can be done during training (quantization-aware training) or after training (post-training quantization).
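
A minimal NumPy sketch of this asymmetric quantization and dequantization to 8-bit unsigned integers (per-tensor scale and zero point, following the formulas above):

```python
import numpy as np

def quantize(x, n_bits=8):
    """Asymmetric per-tensor quantization to unsigned n-bit integers."""
    q_min, q_max = 0, 2 ** n_bits - 1
    s = (x.max() - x.min()) / (q_max - q_min)                 # scale
    z = q_min - np.round(x.min() / s)                         # zero point
    q = np.clip(np.round(x / s) + z, q_min, q_max).astype(np.uint8)
    return q, s, z

def dequantize(q, s, z):
    return s * (q.astype(np.float32) - z)

x = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, s, z = quantize(x)
print(np.abs(x - dequantize(q, s, z)).max())                  # small reconstruction error
```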

The most common efficient decoding method is speculative decoding. When conducting inference for a target model, it uses a small draft model to generate candidate next tokens and uses the target model to verify the candidates. The process is called drafting and verification. Drafting with a small model is faster than generating directly from the large model, and the candidates can be verified in parallel. This survey has a comprehensive summary of different drafting and verification techniques. Setting a length limit for the generated tokens is also a common practice to reduce the computation cost. Recent work proposes diffusion LLMs, which generate text through an iterative denoising (diffusion) process instead of autoregressive decoding and have been shown to be very efficient.
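
A toy NumPy sketch of the drafting-and-verification loop (everything here is a stand-in: the "models" are random lookup tables, decoding is greedy, and acceptance is a simplified argmax check rather than the full rejection-sampling rule; in practice the target model scores all draft positions in one batched forward pass):

```python
import numpy as np

V = 50                                             # toy vocabulary size

def toy_model(seed):
    """A stand-in 'language model': maps a sequence to next-token logits from a fixed table."""
    W = np.random.default_rng(seed).normal(size=(V, V))
    return lambda seq: W[seq[-1]]                  # toy assumption: logits depend on the last token only

draft_model, target_model = toy_model(1), toy_model(2)

def speculative_step(seq, k=4):
    # Drafting: the small model greedily proposes k candidate tokens.
    draft, proposals = list(seq), []
    for _ in range(k):
        nxt = int(np.argmax(draft_model(draft)))
        proposals.append(nxt)
        draft.append(nxt)
    # Verification: accept proposals until the first disagreement with the target model,
    # then keep the target model's own token at that position (greedy acceptance rule).
    accepted, ctx = [], list(seq)
    for p in proposals:
        target_tok = int(np.argmax(target_model(ctx)))
        accepted.append(target_tok)                # equals p whenever the draft token is accepted
        ctx.append(target_tok)
        if target_tok != p:
            break
    return seq + accepted

print(speculative_step([3]))
```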




@article{guo2024mlbasis,
  title   = {Transformers: basis and advanced topics},
  author  = {Guo, Wenbo},
  journal = {henrygwb.github.io},
  year    = {2024},
  url     = {https://henrygwb.github.io/posts/ml_basis.htm}
}