Transformers and LLMs

Jan. 2025


This post covers the foundations and advanced research topics of transformers and large language models. Transformers were first introduced for machine translation and demonstrated better performance than RNNs and LSTMs, which were the dominant models for sequential data. The self-attention mechanism allows transformers to capture long-range input dependencies. The model depth is pre-specified and does not change with the input length, so transformers do not suffer from the gradient vanishing/exploding problems of RNNs and LSTMs. Later, the transformer encoder was used for unsupervised pre-training, serving as a foundation model for many downstream NLP classification tasks. Transformers scale easily to large models by adding more layers and more attention heads, and the architecture naturally supports parallelism and distributed computing, which makes it practical to train very large models on large unlabeled datasets. Such high-capacity models generalize better and have become dominant in many NLP tasks. Transformers were also generalized to images and other domains with domain-specific tokenization methods, making it possible to use the same model for different types of data (multi-modal). Another line of work trains the transformer decoder as the foundation model for generative tasks. This training task is more difficult than the masked language modeling (MLM) used to pre-train encoders and thus yields models with stronger generation capabilities. It is the foundation of today's LLMs.

Table of Contents

Transformers

As shown in the figure, the transformer model has an encoder-decoder structure composed of multiple transformer blocks, where each block has self-attention, layer normalization, and fully connected layers. In NLP tasks, the model takes as input a batch of text sequences, \(X_s \in B \times S_x\), where \(B\) is the batch size and \(S_x\) is the maximum sequence length. The input is first tokenized into a sequence of tokens denoted as \(X \in B \times S \times |V|\), where \(S\) is the maximum token sequence length and \(|V|\) is the vocabulary size. Note that \(S\) may not equal \(S_x\), as each input word may be split into multiple tokens. Each token is represented as a one-hot vector of size \(|V|\). Before being fed to the model, the token representations are first transformed into a continuous space to support gradient computation. This step is called embedding; it is done by multiplying \(X\) with a learnable embedding matrix \(W_e \in |V| \times d\), where \(d\) is the embedding dimension. The output embedding is denoted as \(E \in B \times S \times d\). The figure below shows the architecture of the transformer model. Below, we break down the transformer block.
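To make the embedding step concrete, the toy sketch below (hypothetical sizes and a random stand-in for the tokenizer output) shows that multiplying the one-hot matrix \(X\) by \(W_e\) is equivalent to a simple embedding lookup:

```python
import torch
import torch.nn as nn

B, S, vocab_size, d = 2, 5, 1000, 16               # batch, sequence length, |V|, embedding dim

token_ids = torch.randint(0, vocab_size, (B, S))    # stand-in for tokenizer output

# Explicit one-hot formulation: X (B x S x |V|) times W_e (|V| x d)
X = nn.functional.one_hot(token_ids, vocab_size).float()
W_e = torch.randn(vocab_size, d)
E_onehot = X @ W_e                                  # (B, S, d)

# Equivalent (and far cheaper) lookup with nn.Embedding sharing the same weights
emb = nn.Embedding(vocab_size, d)
emb.weight.data = W_e
E_lookup = emb(token_ids)                           # (B, S, d)

assert torch.allclose(E_onehot, E_lookup, atol=1e-6)
```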

Multi-head attention

Attention was originally introduced in the context of seq2seq models for machine translation [2]. Given an input \(E\) of size \(B \times S \times d\), the original additive attention first computes the attention weights \( \alpha = \text{softmax}(f(E)) \in \mathcal{R}^{B \times S }\) and then calculates the output as \(O = \sum_j \alpha_{:,j} E_{:,j,:} \in \mathcal{R}^{B \times d}\). This attention introduces a set of parameters in the linear layer \(f\). Transformers instead use dot-product attention, which does not introduce additional parameters. It defines query, key, and value matrices \(Q, K, V\). The attention weights are calculated as \( \alpha = \text{softmax}(QK^T)\), which measures the similarity between queries and keys. The output is then the weighted sum \(O = \alpha V\). The attention weights control the importance of each value in the output, helping the model focus on the most relevant information and filter out noise. As such, the model can handle long inputs and capture global, long-range dependencies.
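A minimal sketch of the dot-product attention described above, with shapes following this section's notation:

```python
import torch

def dot_product_attention(Q, K, V):
    """Q, K, V: (B, S, d). Returns the weighted sum of values, shape (B, S, d)."""
    alpha = torch.softmax(Q @ K.transpose(-2, -1), dim=-1)   # (B, S, S) attention weights
    return alpha @ V                                          # weighted sum of values
```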

Multi-head scaled self-attention. As shown in the figure above, the attention mechanism uses the same input to calculate queries, keys, and values; this is called self-attention. To encourage the model to learn different hidden correlations in the input, transformers use multi-head attention: each head first projects the input into its own subspace and then computes dot-product attention in that subspace. Suppose we have \(h\) attention heads; the hidden dimension of each head is \(d_h = d/h\). The input is transformed into \(Q_i, K_i, V_i \in B \times S \times d_h\) for each head \(i\): \(Q_i = E W_{Qi}, K_i = E W_{Ki}, V_i = E W_{Vi} \), where \(W_{Qi}, W_{Ki}, W_{Vi} \in d \times d_h\) are learnable parameters. The output of each head is \(O_i = \text{softmax}(Q_i K_i^T) V_i\). The final output of the multi-head attention is the concatenation of all heads, \(O_a = [O_1, O_2, ..., O_h] W_o\), where \(W_o \in d \times d\) is a learnable parameter. To avoid large values in the softmax, the dot product is typically scaled by \(1/\sqrt{d_h}\). Note that the output dimension equals the input dimension, and changing the number of attention heads does not change the number of model parameters.
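A hedged sketch of multi-head scaled self-attention following the formulas above (no masking or dropout); the class and parameter names are my own:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d, h):
        super().__init__()
        assert d % h == 0
        self.h, self.d_h = h, d // h
        # One fused projection per role; equivalent to h separate (d x d_h) matrices
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.W_O = nn.Linear(d, d, bias=False)

    def forward(self, E):                                    # E: (B, S, d)
        B, S, d = E.shape
        def split(x):                                        # (B, S, d) -> (B, h, S, d_h)
            return x.view(B, S, self.h, self.d_h).transpose(1, 2)
        Q, K, V = split(self.W_Q(E)), split(self.W_K(E)), split(self.W_V(E))
        scores = Q @ K.transpose(-2, -1) / self.d_h ** 0.5   # scaled dot products, (B, h, S, S)
        O = torch.softmax(scores, dim=-1) @ V                # (B, h, S, d_h)
        O = O.transpose(1, 2).reshape(B, S, d)               # concatenate heads
        return self.W_O(O)                                   # (B, S, d), same shape as the input
```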

Layer norm and residual connection

The output of the attention is then added to the input and passed through layer normalization. The \(O_a+E\) term is called a residual connection, which was first introduced in ResNet for image classification. The residual connection allows the model to learn identity mappings and makes training easier (avoiding degradation and vanishing gradients), especially for deep networks. Layer normalization is computed per position over the feature dimension, so it is largely unaffected by padding. In comparison, batch normalization computes statistics across the batch for each feature, which is distorted when inputs are heavily padded; the short example after the figure below illustrates the difference in normalization axes.

Demonstration of the difference between layer norm (top) and batch norm (bottom).
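To make the normalization axes concrete, the toy example below (my own illustration) shows that LayerNorm computes statistics over the feature dimension of each token, while BatchNorm1d computes them over the batch and sequence dimensions for each feature:

```python
import torch
import torch.nn as nn

B, S, d = 4, 6, 8
x = torch.randn(B, S, d)

ln = nn.LayerNorm(d)           # per-token statistics over the d features
y_ln = ln(x)                   # (B, S, d); each (b, s) slice has ~zero mean and unit variance

bn = nn.BatchNorm1d(d)         # per-feature statistics over the batch and sequence dims
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)   # BatchNorm1d expects (B, d, S)
```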
Linear layers
The output of the layer normalization is then passed through two linear layers: $$ \text{FFN}(O_l) = \text{ReLU}(O_l W_1 + b_1) W_2 + b_2, $$ where \(W_1 \in d \times d_{ff}\) and \(W_2 \in d_{ff} \times d\), with \(d_{ff} = 4d\) pre-specified. The output again has the same dimension as the input and is passed to the next transformer block. Besides the embedding dimension, the main architectural hyper-parameters are the number of layers and the number of attention heads.
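A short sketch of this position-wise feed-forward sub-layer with \(d_{ff} = 4d\):

```python
import torch.nn as nn

def ffn(d):
    # Applied independently at every position; expands to 4d, then projects back to d.
    return nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
```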
Decoder
As shown in the figure, the decoder has three differences from the encoder: 1) its self-attention is masked (causal), so each position can only attend to earlier positions; 2) each block has an additional cross-attention layer whose queries come from the decoder and whose keys and values come from the encoder output; 3) the decoder input is the target sequence shifted right, and its final output goes through a linear layer and softmax to predict the next token.

More recent transformer architectures

BERT: Bidirectional Encoder Representations from Transformers

The BERT model is a classical variant of the original transformer model. It was widely used for learning hidden representations of input text and demonstrated better performance than earlier language models such as RNNs. The key technique points of this model are as follows:

The BERT model is well suited for classification tasks, as it can learn hidden correlations in the input and the [CLS] token provides a holistic representation of the entire input. There are variants of BERT with changes to the embedding layer or tokenization, such as RoBERTa. Two common model choices are BERT-base (12 layers, 108M parameters) and BERT-large (24 layers, 334M parameters). This line of models cannot perform generative tasks, because the model can attend to future tokens during training (so it is not trained for left-to-right generation) and its capacity is relatively small.

GPT: Generative Pre-trained Transformer

GPT models are another line of variants of the original transformer model. The key differences between GPT and BERT are: 1) GPT uses the transformer decoder while BERT uses the encoder; 2) GPT is trained to predict the next token, while BERT mainly relies on MLM. The GPT training task is more difficult, but once the model is well trained, it has better generative capabilities than BERT models. GPT models also typically have far more parameters than BERT models, so that they have enough capacity for their harder training task. GPT models are the foundation of modern LLMs. The most classical papers in this line are GPT-1, which proposes the GPT model; GPT-2, which demonstrates that scaling up the model significantly improves performance; and GPT-3, which first proposes test-time prompting techniques. The key technique points of this model are as follows:

MoE: Mixture of Experts

MoE is a recently emerging model structure that shows better performance than vanilla GPT models. The idea stems from the observation that different parts of a large model can specialize in certain data or applications. Motivated by this observation, MoE was proposed to combine multiple smaller expert models and test whether they can beat a single large model. The exploration ranges from mixing the fully connected layers of CNNs or RNNs to mixing entire attention blocks. The most recent success of MoE is the DeepSeek-V3 model, which outperforms SOTA large models at a much lower training cost.

Early MoEs.

Mixtral MoEs. The MoE models from Mistral work well on certain tasks, especially the Mixtral 8x7B model. The model uses sparse attention mechanisms (grouped-query attention and sliding-window attention), which we will introduce later when discussing attention variants. Its MoE follows sparse gating without added noise: \(\sum_i G(x)_i E_i(x)\), where \(G(x) = \text{softmax}(\text{TopK}(xW_g))\). Each expert \(E_i(x)\) is a SwiGLU (Swish-gated linear unit) feed-forward network, i.e., the FFN in each transformer block is replaced with SwiGLU. In other words, they still mix the final linear layers in a transformer block (a code sketch follows the figure below).

Demonstration of the MoE mechanism in the Mixtral model. Router is the gate. [2]
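Below is a hedged sketch of this top-K sparse gating with SwiGLU-style experts. It reflects my reading of the formulas above, not Mixtral's actual implementation; the class and parameter names are made up.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d, n_experts=8, k=2, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d
        self.k = k
        self.gate = nn.Linear(d, n_experts, bias=False)      # router: x -> x W_g
        # SwiGLU-style experts: (SiLU(x W1) * (x W3)) W2
        self.w1 = nn.ModuleList(nn.Linear(d, d_ff, bias=False) for _ in range(n_experts))
        self.w3 = nn.ModuleList(nn.Linear(d, d_ff, bias=False) for _ in range(n_experts))
        self.w2 = nn.ModuleList(nn.Linear(d_ff, d, bias=False) for _ in range(n_experts))

    def expert(self, i, x):
        return self.w2[i](torch.nn.functional.silu(self.w1[i](x)) * self.w3[i](x))

    def forward(self, x):                                    # x: (tokens, d)
        logits = self.gate(x)                                # (tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)           # softmax over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in topk_idx[:, slot].unique():             # send each token to its chosen experts
                mask = topk_idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.expert(int(e), x[mask])
        return out
```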

DeepSeek MoEs. DeepSeek's latest model, DeepSeek-V3, also uses an MoE structure for its feed-forward network. Its key idea is to use finer-grained experts and to isolate some experts as shared ones. Given the FFN input (attention output) of a certain token \(u \in \mathcal{R}^{p}\), it sets up a total of \(N_s+N_r\) FFN experts (\(N_s\) shared and \(N_r\) routed). The output is then \(u + \sum_{i=1}^{N_s} F_i^{(s)}(u) + \sum_{i=1}^{N_r} g_i F_{i}^{(r)}(u)\). The gating network \(g\) is similar to the top-K sparse gating introduced above; the only difference is that instead of computing \(xW_g\), it computes \(\text{Sigmoid}(u^T e_i)\), where \(e_i\) is the centroid vector of expert \(i\). The sigmoid function is newly added in DeepSeek-V3; it may help with numerical stability. A code sketch follows the figure below.

Demonstration of the DeepSeek-V3 model structure [3]. The MLA will be introduced later.
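To contrast with the Mixtral-style gate above, here is a hedged, self-contained sketch of the shared-plus-routed structure: shared experts are always active, routed experts are selected by sigmoid affinities with learned centroids, and the selected gates are normalized. The class name, expert FFN form, and normalization details are my own simplifications, not DeepSeek-V3's actual implementation.

```python
import torch
import torch.nn as nn

class SharedRoutedMoE(nn.Module):
    def __init__(self, d, n_shared=1, n_routed=8, k=2, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d
        self.k = k
        ffn = lambda: nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))   # always active
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))   # sparsely activated
        self.centroids = nn.Parameter(torch.randn(n_routed, d))       # e_i in the text

    def forward(self, u):                                   # u: (tokens, d)
        out = u + sum(f(u) for f in self.shared)            # residual + shared experts
        scores = torch.sigmoid(u @ self.centroids.T)        # sigmoid(u^T e_i), (tokens, n_routed)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        g = topk_vals / topk_vals.sum(-1, keepdim=True)     # normalize gates over selected experts
        for slot in range(self.k):
            for e in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == e
                out[mask] = out[mask] + g[mask, slot, None] * self.routed[int(e)](u[mask])
        return out
```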

Transformer component optimizations

This part discusses optimizations over the vanilla transformer that are used in SOTA models. We focus on the tokenizer, positional embedding, and attention. Note that MoE can also be viewed as an optimization of the FFN in the transformer block.

Tokenizer

Positional embedding

Attention

Below, we introduce some techniques to improve the efficiency of computing attention during training and testing.

Training

Here, we go over the widely used pre-training and fine-tuning methods for GPT models. We will also walk through the backward pass of the transformer model. Data is the first important factor in both pre-training and fine-tuning. Early works aim to collect high-quality data from the wild and train another model to filter out low-quality samples. However, data is like "fossil fuel": less and less untouched data remains. More recent work explores how to use LLMs to generate data for continual training.

Pretraining

As discussed above, the pre-training objective is the next-token prediction loss. The model iteratively predicts the next token based on the input and the previous tokens. The objective function is written as \( \sum_{i=2}^{T} \text{CE}(f(\mathbf{X}, \mathbf{x}'_{1}, \ldots, \mathbf{x}'_{i-1}; \Theta), \mathbf{x}_i )\), where \(\text{CE}\) is the cross-entropy loss, \(\mathbf{X}\) is the input context, and \(\mathbf{x}'_{1}, \ldots, \mathbf{x}'_{i-1}\) are the preceding tokens (the ground-truth tokens under teacher forcing during training). In other words, generating each token amounts to a \(|V|\)-class classification. MoE models typically add an expert balance loss to prevent routing collapse.
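A brief sketch of this objective under teacher forcing: shift the token sequence by one position and compute the token-level cross-entropy. Here `model` is a hypothetical callable that returns logits over the vocabulary at every position.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """token_ids: (B, T) integer tensor; model(x) returns logits of shape (B, T, |V|)."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]      # predict token i from tokens < i
    logits = model(inputs)                                     # (B, T-1, |V|)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```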

Fine-tuning

Backward pass

Similar to the analysis in the MLP post, we compute the backward pass of a simple transformer model to demonstrate the process. Consider a decoder-only transformer model with one transformer block and input embedding \(E \in \mathcal{R}^{S \times d}\). Suppose the model has \(h\) attention heads and the hidden dimension of each head is \(d_h = d/h\). The forward pass of the model is: $$ Q_i = E W_{Qi}^T, K_i = E W_{Ki}^T, V_i = E W_{Vi}^T, $$ $$ O_i = \text{softmax}(Q_iK_i^T + M)V_i, $$ $$ O_a = [O_1, O_2, ..., O_h] W_o, $$ $$ O_l = \text{LayerNorm}(O_a + E), $$ $$ O = \text{ReLU}(O_lW_1^T+b_1)W_2^T+b_2, $$ $$ P = \text{softmax}(O_{-1,:}W_{lm}), $$ where \(M\) is the causal mask with \(-\infty\) above the diagonal and \(0\) elsewhere (we omit the \(1/\sqrt{d_h}\) scaling for simplicity), and \(W_{lm} \in \mathcal{R}^{d \times |V|}\) is the output projection (often tied to the embedding matrix). The final loss for predicting the next token is the cross-entropy loss \(-\log P_j\), where \(j\) is the index of the true next token.

Now, we compute the backward pass. First, we compute the backward pass of the feed-forward layers, which is exactly the same as for the MLP model.

Then, we compute the gradients for the attention layer. Note that the dot-product attention itself does not introduce any parameters; we mainly need the gradients for the layer normalization and for the linear transformations of each attention head. We can ignore the residual connection when computing these gradients, but it does contribute to the gradients of the word and positional embeddings.

Here, we first need to handle the layer normalization, which normalizes the input along the last (feature) dimension: $$ \mu_i = \frac{1}{d}\sum_j (O_a)_{ij}, \quad \sigma_i^2 = \frac{1}{d} \sum_j((O_a)_{ij} - \mu_i)^2, $$ $$ \hat{(O_a)}_{ij} = \frac{(O_a)_{ij} - \mu_i}{\sqrt{\sigma_i^2+\epsilon}}, \quad (O_l)_{ij} = \gamma_j \hat{(O_a)}_{ij} +\beta_j.$$ The gradients are as follows.
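Applying the chain rule through \(\hat{(O_a)}\), \(\mu_i\), and \(\sigma_i^2\) (a standard derivation, written here with \(g_{ij} = \frac{\partial L}{\partial (O_l)_{ij}}\)):

$$ \frac{\partial L}{\partial \gamma_j} = \sum_i g_{ij}\,\hat{(O_a)}_{ij}, \qquad \frac{\partial L}{\partial \beta_j} = \sum_i g_{ij}, $$
$$ \frac{\partial L}{\partial (O_a)_{ij}} = \frac{1}{\sqrt{\sigma_i^2+\epsilon}}\left( g_{ij}\gamma_j - \frac{1}{d}\sum_k g_{ik}\gamma_k - \frac{\hat{(O_a)}_{ij}}{d}\sum_k g_{ik}\gamma_k\,\hat{(O_a)}_{ik} \right). $$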

With the gradients for the layer normalization \(\frac{\partial L}{\partial (O_a)}\), we can compute the gradients for the linear transformation in each attention head.

Writing \(A_i = \text{softmax}(Q_iK_i^T + M)\) so that \(O_i = A_iV_i\), we compute the gradients for \(V_i\), \(Q_i\), and \(K_i\) of the \(i\)-th attention head.
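Under these definitions (treating the mask \(M\) as a constant and ignoring the \(1/\sqrt{d_h}\) scaling), a standard application of the chain rule through the row-wise softmax gives:

$$ \frac{\partial L}{\partial V_i} = A_i^T \frac{\partial L}{\partial O_i}, \qquad \frac{\partial L}{\partial A_i} = \frac{\partial L}{\partial O_i} V_i^T, $$
$$ \frac{\partial L}{\partial (S_i)_{ab}} = (A_i)_{ab}\left( \frac{\partial L}{\partial (A_i)_{ab}} - \sum_c \frac{\partial L}{\partial (A_i)_{ac}} (A_i)_{ac} \right), \quad \text{where } S_i = Q_iK_i^T + M, $$
$$ \frac{\partial L}{\partial Q_i} = \frac{\partial L}{\partial S_i} K_i, \qquad \frac{\partial L}{\partial K_i} = \left(\frac{\partial L}{\partial S_i}\right)^{T} Q_i. $$

Finally, since \(Q_i = E W_{Qi}^T\), the weight gradients are \(\frac{\partial L}{\partial W_{Qi}} = \left(\frac{\partial L}{\partial Q_i}\right)^{T} E\), and similarly for \(W_{Ki}\) and \(W_{Vi}\).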

Efficient training

Parallelism training

There are four major parallelism mechanisms: data parallelism, model parallelism, pipeline parallelism, and tensor parallelism. Data parallelism splits the data and assigns different chunks to different devices; it copies the full model to every device and synchronizes the gradients computed on each data chunk. This approach cannot handle models too large to fit on a single device. Model parallelism splits the model and assigns different parts to different devices. Vanilla model parallelism cannot fully utilize the compute of each device, because devices holding earlier layers sit idle while later layers compute, and vice versa. Data and model parallelism thus expose a trade-off between memory and speed. Pipeline parallelism splits both the model and the data into parts (dividing training into stages) and assigns different stages to different devices; the various pipeline strategies all try to balance this memory-speed trade-off. My earlier post on MLPs has a more detailed discussion of these three mechanisms.

Tensor parallelism works at a lower level: it divides the model weights into different parts and assigns them to different devices. The MLP post discusses the forward/backward pass under tensor parallelism for the MLP model, which is also an essential component of transformers. Here, we discuss tensor parallelism for the attention module in the transformer model.
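As a single-process illustration (my own, Megatron-style reading rather than any particular library's API), the sketch below splits the attention heads across shards: each shard owns the corresponding columns of the Q/K/V projections and the matching rows of \(W_o\), so the per-shard partial outputs only need a sum (an all-reduce across devices in practice) to recover the full attention output.

```python
import torch

torch.manual_seed(0)
B, S, d, h = 2, 4, 16, 4
d_h, n_shards = d // h, 2
heads_per_shard = h // n_shards

E = torch.randn(B, S, d)
W_Q, W_K, W_V, W_O = (torch.randn(d, d) for _ in range(4))

def shard_attention(shard):
    # Columns of W_Q/W_K/W_V for this shard's heads; matching rows of W_O.
    cols = slice(shard * heads_per_shard * d_h, (shard + 1) * heads_per_shard * d_h)
    Q, K, V = E @ W_Q[:, cols], E @ W_K[:, cols], E @ W_V[:, cols]   # (B, S, heads_per_shard*d_h)
    def split(x):                                                    # -> (B, heads_per_shard, S, d_h)
        return x.view(B, S, heads_per_shard, d_h).transpose(1, 2)
    A = torch.softmax(split(Q) @ split(K).transpose(-2, -1) / d_h ** 0.5, dim=-1)
    O = (A @ split(V)).transpose(1, 2).reshape(B, S, heads_per_shard * d_h)
    return O @ W_O[cols, :]                                          # partial output, (B, S, d)

# In real tensor parallelism this sum is an all-reduce across devices.
out_parallel = sum(shard_attention(s) for s in range(n_shards))
```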

LoRA

LoRA is an efficient fine-tuning method that learns a task-specific weight update \(\Delta\Phi(\Theta)\). It adds \(\Delta W = BA\) to each weight \(W \in \mathcal{R}^{d\times k}\), where \(B \in \mathcal{R}^{d\times r}\) and \(A \in \mathcal{R}^{r\times k}\) have a low rank \(r \ll \min(d, k)\). When fine-tuning, the model only needs to learn the low-rank matrices \(B\) and \(A\) under a given objective function. Follow-up works include QLoRA, which combines quantization with LoRA, and LongLoRA, which combines LoRA with RoPE and position interpolation.
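A hedged sketch of a LoRA-adapted linear layer: the pretrained weight is frozen and only the low-rank factors \(B\) and \(A\) are trained. The class name and scaling convention are my own, not the official peft API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the pretrained weight W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # low-rank factors, r << min(d_in, d_out)
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init so delta W = BA starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```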

Inference

In LLMs, inference refers to the process of generating responses based on the input (in probabilistic models, inference instead means computing posterior distributions of hidden variables). Recent research has proposed a number of prompting techniques. Below, we briefly discuss some of the most representative methods. For a complete list, please refer to the Prompting Guide.

Besides the representative methods introduced above, which rely on human-designed heuristics for prompt design, another line of work aims to automatically generate prompts. Such methods are also used for LLM red-teaming, which generates adversarial prompts targeting different attack goals. See this post for more details.

Efficient Inference

This part introduces three widely used techniques for efficient inference in LLMs: KV cache, quantization, and efficient decoding.

The KV cache is used during autoregressive generation; it stores the keys and values of previously generated tokens. Recall that the attention weights are computed as \(q_tK^T\), where \(q_t \in \mathcal{R}^{1\times d}\) is the query of the current token and \(K \in \mathcal{R}^{t \times d}\) holds the keys of the current and previous tokens. The KV cache stores the keys and values of the previous tokens so that they only need to be computed once: $$ q_t = E_tW_{Q}, \quad k_t = E_tW_{K}, \quad v_t = E_tW_{V}, \quad K = [k_{1:t-1};k_t], \quad V = [v_{1:t-1};v_t], \quad o = \text{softmax}(q_tK^T)V,$$ where \(o \in \mathcal{R}^{1 \times d}\). When implementing the KV cache, we can use a dictionary to store \(K\) and \(V\) and update it in the forward function of the model class.
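A minimal sketch of a per-layer KV cache for a single attention head, following the dictionary-based idea above; the function signature and shapes are my own simplification.

```python
import torch

def cached_attention(E_t, W_Q, W_K, W_V, cache, layer):
    """E_t: (1, d) embedding of the current token; cache[layer] holds (K, V) for previous tokens."""
    q_t, k_t, v_t = E_t @ W_Q, E_t @ W_K, E_t @ W_V                 # each (1, d)
    if layer in cache:
        K_prev, V_prev = cache[layer]
        K, V = torch.cat([K_prev, k_t]), torch.cat([V_prev, v_t])   # grow to (t, d)
    else:
        K, V = k_t, v_t
    cache[layer] = (K, V)                                           # keys/values computed only once per token
    alpha = torch.softmax(q_t @ K.T / K.shape[-1] ** 0.5, dim=-1)   # (1, t) attention weights
    return alpha @ V                                                # (1, d)
```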

Quantization reduces the precision of model weights and activations, which lowers memory and computation costs. The default precision is 32-bit floating point, which can be reduced to 16-bit floating point or even 8-bit integers. Quantization: \( q = \text{round}(\frac{x}{s})+z\), with \(s=\frac{x_{max}-x_{min}}{q_{max}-q_{min}}\) and \(z=q_{min} - \text{round}(\frac{x_{min}}{s})\); dequantization: \(x \approx s(q-z)\). Quantization can be done during training (quantization-aware training) or after training (post-training quantization).
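A short sketch of asymmetric post-training quantization to 8-bit unsigned integers, following the formulas above.

```python
import torch

def quantize(x, q_min=0, q_max=255):
    s = (x.max() - x.min()) / (q_max - q_min)             # scale
    z = int(round(q_min - (x.min() / s).item()))          # integer zero point
    q = torch.clamp(torch.round(x / s) + z, q_min, q_max).to(torch.uint8)
    return q, s, z

def dequantize(q, s, z):
    return s * (q.float() - z)                            # approximate reconstruction of x

x = torch.randn(4, 4)
q, s, z = quantize(x)
x_hat = dequantize(q, s, z)                               # close to x up to rounding error
```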

The most common efficient decoding method is speculative decoding. When running inference with a target model, it uses a small draft model to propose candidate next tokens and the target model to verify them; the process is called drafting and verification. Drafting with a small model is faster than generating directly from the large model, and verification can be done in parallel. This survey has a comprehensive summary of different drafting and verification techniques. Setting a length limit for the generated tokens is also a common practice to reduce computation cost. Recent work proposes diffusion LLMs, which replace token-by-token autoregressive decoding with an iterative denoising (diffusion) process and have shown promising efficiency.
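Below is a simplified, greedy-verification sketch of speculative decoding; real implementations use rejection sampling to preserve the target model's output distribution. The `draft_model` and `target_model` callables are hypothetical stand-ins that map a token-id tensor to logits at every position.

```python
import torch

def speculative_step(target_model, draft_model, tokens, k=4):
    """tokens: (1, t). Returns tokens extended by at least one verified token (greedy variant)."""
    # Drafting: the small model proposes k tokens autoregressively.
    draft = tokens
    for _ in range(k):
        next_id = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_id], dim=-1)

    # Verification: one forward pass of the large model scores all drafted positions in parallel.
    target_pred = target_model(draft)[:, -k - 1:-1].argmax(-1)   # target's choice at each drafted slot
    proposed = draft[:, -k:]
    match = (target_pred == proposed)[0]

    n_accept = int(match.long().cumprod(0).sum())       # longest accepted prefix of the draft
    accepted = draft[:, : tokens.shape[1] + n_accept]
    if n_accept < k:                                    # replace the first mismatch with the target's token
        accepted = torch.cat([accepted, target_pred[:, n_accept:n_accept + 1]], dim=-1)
    return accepted
```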




@article{guo2024mlbasis,
  title   = {Transformers: basis and advanced topics},
  author  = {Guo, Wenbo},
  journal = {henrygwb.github.io},
  year    = {2024},
  url     = {https://henrygwb.github.io/posts/ml_basis.htm}
}