Transformers and LLMs

Jan. 2025


This post covers the foundations and advanced research topics of transformers and large language models. Transformers were first introduced for machine translation and demonstrated better performance than RNNs and LSTMs, which were the dominant models for sequential data. The self-attention mechanism allows transformers to capture long-range input dependencies. The model depth is pre-specified and does not change with the input length, so transformers do not suffer from the gradient vanishing/exploding problems of RNNs and LSTMs. Later, the transformer encoder was used for unsupervised pre-training, serving as a foundation model for many downstream NLP classification tasks. Transformers scale easily to large models by adding more layers and more attention heads, and the architecture naturally supports parallelism and distributed computing, which makes it practical to train very large models on large unlabeled datasets. Such high-capacity models generalize better and have become dominant in many NLP tasks. Transformers were also generalized to images and other domains with domain-specific tokenization methods, making it possible to use the same model for different types of data (multi-modal). Another line of work trains the transformer decoder as the foundation model for generative tasks. This training task is more difficult than the masked language modeling (MLM) used to pre-train encoders and thus yields models with stronger generation capabilities. It is the foundation of today's LLMs.

Table of Contents

Transformers

As shown in the figure, the transformer model has an encoder-decoder structure composed of multiple transformer blocks, where each block has self-attention, layer normalization, and fully connected layers. In NLP tasks, the model takes as input a batch of text sequences, \(X_s \in B \times S_x\), where \(B\) is the batch size and \(S_x\) is the maximum sequence length. The input is first tokenized into a sequence of tokens denoted as \(X \in B \times S \times |V|\), where \(S\) is the maximum token sequence length and \(|V|\) is the vocabulary size. Note that \(S\) may not equal \(S_x\), as each input word may be split into multiple tokens. Each token is represented as a one-hot vector of size \(|V|\). Before being fed to the model, the token representations are first transformed into a continuous space to support gradient computation. This step is called embedding; it is done by multiplying \(X\) with a learnable embedding matrix \(W_e \in |V| \times d\), where \(d\) is the embedding dimension. The output embedding is denoted as \(E \in B \times S \times d\). The figure below shows the architecture of the transformer model. Below, we break down the transformer block.
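To make the embedding step concrete, the toy sketch below (hypothetical sizes and a random stand-in for the tokenizer output) shows that multiplying the one-hot matrix \(X\) by \(W_e\) is equivalent to a simple embedding lookup:

```python
import torch
import torch.nn as nn

B, S, vocab_size, d = 2, 5, 1000, 16               # batch, sequence length, |V|, embedding dim

token_ids = torch.randint(0, vocab_size, (B, S))    # stand-in for tokenizer output

# Explicit one-hot formulation: X (B x S x |V|) times W_e (|V| x d)
X = nn.functional.one_hot(token_ids, vocab_size).float()
W_e = torch.randn(vocab_size, d)
E_onehot = X @ W_e                                  # (B, S, d)

# Equivalent (and far cheaper) lookup with nn.Embedding sharing the same weights
emb = nn.Embedding(vocab_size, d)
emb.weight.data = W_e
E_lookup = emb(token_ids)                           # (B, S, d)

assert torch.allclose(E_onehot, E_lookup, atol=1e-6)
```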

Multi-head attention

Attention was originally introduced in the context of seq2seq models for machine translation [2]. Given an input \(E\) of size \(B \times S \times d\), the original additive attention first computes the attention weights \( \alpha = \text{softmax}(f(E)) \in \mathcal{R}^{B \times S }\) and then calculates the output as \(O = \sum_j \alpha_{:,j} E_{:,j,:} \in \mathcal{R}^{B \times d}\). This attention introduces a set of parameters in the linear layer \(f\). Transformers instead use dot-product attention, which does not introduce additional parameters. It defines query, key, and value matrices \(Q, K, V\). The attention weights are calculated as \( \alpha = \text{softmax}(QK^T)\), which measures the similarity between queries and keys. The output is then the weighted sum \(O = \alpha V\). The attention weights control the importance of each value in the output, helping the model focus on the most relevant information and filter out noise. As such, the model can handle long inputs and capture global, long-range dependencies.
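A minimal sketch of the dot-product attention described above, with shapes following this section's notation:

```python
import torch

def dot_product_attention(Q, K, V):
    """Q, K, V: (B, S, d). Returns the weighted sum of values, shape (B, S, d)."""
    alpha = torch.softmax(Q @ K.transpose(-2, -1), dim=-1)   # (B, S, S) attention weights
    return alpha @ V                                          # weighted sum of values
```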

Multi-head scaled self-attention. As shown in the figure above, the attention mechanism uses the same input to calculate queries, keys, and values; this is called self-attention. To encourage the model to learn different hidden correlations in the input, transformers use multi-head attention: each head first projects the input into its own subspace and then computes dot-product attention in that subspace. Suppose we have \(h\) attention heads; the hidden dimension of each head is \(d_h = d/h\). The input is transformed into \(Q_i, K_i, V_i \in B \times S \times d_h\) for each head \(i\): \(Q_i = E W_{Qi}, K_i = E W_{Ki}, V_i = E W_{Vi} \), where \(W_{Qi}, W_{Ki}, W_{Vi} \in d \times d_h\) are learnable parameters. The output of each head is \(O_i = \text{softmax}(Q_i K_i^T) V_i\). The final output of the multi-head attention is the concatenation of all heads, \(O_a = [O_1, O_2, ..., O_h] W_o\), where \(W_o \in d \times d\) is a learnable parameter. To avoid large values in the softmax, the dot product is typically scaled by \(1/\sqrt{d_h}\). Note that the output dimension equals the input dimension, and changing the number of attention heads does not change the number of model parameters.
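A hedged sketch of multi-head scaled self-attention following the formulas above (no masking or dropout); the class and parameter names are my own:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d, h):
        super().__init__()
        assert d % h == 0
        self.h, self.d_h = h, d // h
        # One fused projection per role; equivalent to h separate (d x d_h) matrices
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.W_O = nn.Linear(d, d, bias=False)

    def forward(self, E):                                    # E: (B, S, d)
        B, S, d = E.shape
        def split(x):                                        # (B, S, d) -> (B, h, S, d_h)
            return x.view(B, S, self.h, self.d_h).transpose(1, 2)
        Q, K, V = split(self.W_Q(E)), split(self.W_K(E)), split(self.W_V(E))
        scores = Q @ K.transpose(-2, -1) / self.d_h ** 0.5   # scaled dot products, (B, h, S, S)
        O = torch.softmax(scores, dim=-1) @ V                # (B, h, S, d_h)
        O = O.transpose(1, 2).reshape(B, S, d)               # concatenate heads
        return self.W_O(O)                                   # (B, S, d), same shape as the input
```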

Layer norm and residual connection

The output of the attention is then added to the input and passed through layer normalization. The \(O_a+E\) term is called a residual connection, which was first introduced in ResNet for image classification. The residual connection allows the model to learn identity mappings and makes training easier (avoiding degradation and vanishing gradients), especially for deep networks. Layer normalization is computed per position over the feature dimension, so it is largely unaffected by padding. In comparison, batch normalization computes statistics across the batch for each feature, which is distorted when inputs are heavily padded; the short example after the figure below illustrates the difference in normalization axes.

Demonstration of the difference between layer norm (top) and batch norm (bottom).
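To make the normalization axes concrete, the toy example below (my own illustration) shows that LayerNorm computes statistics over the feature dimension of each token, while BatchNorm1d computes them over the batch and sequence dimensions for each feature:

```python
import torch
import torch.nn as nn

B, S, d = 4, 6, 8
x = torch.randn(B, S, d)

ln = nn.LayerNorm(d)           # per-token statistics over the d features
y_ln = ln(x)                   # (B, S, d); each (b, s) slice has ~zero mean and unit variance

bn = nn.BatchNorm1d(d)         # per-feature statistics over the batch and sequence dims
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)   # BatchNorm1d expects (B, d, S)
```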
Linear layers
The output of the layer normalization is then passed through two linear layers: $$ \text{FFN}(O_l) = \text{ReLU}(O_l W_1 + b_1) W_2 + b_2, $$ where \(W_1 \in d \times d_{ff}\) and \(W_2 \in d_{ff} \times d\), with \(d_{ff} = 4d\) pre-specified. The output again has the same dimension as the input and is passed to the next transformer block. Besides the embedding dimension, the main architectural hyper-parameters are the number of layers and the number of attention heads.
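A short sketch of this position-wise feed-forward sub-layer with \(d_{ff} = 4d\):

```python
import torch.nn as nn

def ffn(d):
    # Applied independently at every position; expands to 4d, then projects back to d.
    return nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
```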
Decoder
As shown in the figure, the decoder has three differences from the encoder: 1) its self-attention is masked (causal), so each position can only attend to earlier positions; 2) each block has an additional cross-attention layer whose queries come from the decoder and whose keys and values come from the encoder output; 3) the decoder input is the target sequence shifted right, and its final output goes through a linear layer and softmax to predict the next token.

More recent transformer architectures

BERT: Bidirectional Encoder Representations from Transformers

The BERT model is a classical variant of the original transformer model. It was widely used for learning hidden representations of input text and demonstrated better performance than earlier language models such as RNNs. The key technique points of this model are as follows:

The BERT model is well suited for classification tasks, as it can learn hidden correlations in the input and the [CLS] token provides a holistic representation of the entire input. There are variants of BERT with changes to the embedding layer or tokenization, such as RoBERTa. Two common model choices are BERT-base (12 layers, 108M parameters) and BERT-large (24 layers, 334M parameters). This line of models cannot perform generative tasks, because the model can attend to future tokens during training (so it is not trained for left-to-right generation) and its capacity is relatively small.

GPT: Generative Pre-trained Transformer

GPT models are another line of variants of the original transformer model. The key differences between GPT and BERT are: 1) GPT uses the transformer decoder while BERT uses the encoder; 2) GPT is trained to predict the next token, while BERT mainly relies on MLM. The GPT training task is more difficult, but once the model is well trained, it has better generative capabilities than BERT models. GPT models also typically have far more parameters than BERT models, so that they have enough capacity for their harder training task. GPT models are the foundation of modern LLMs. The most classical papers in this line are GPT-1, which proposes the GPT model; GPT-2, which demonstrates that scaling up the model significantly improves performance; and GPT-3, which first proposes test-time prompting techniques. The key technique points of this model are as follows:

MoE: Mixture of Experts

MoE is a recently emerging model structure that shows better performance than vanilla GPT models. The idea stems from the observation that different parts of a large model can specialize in certain data or applications. Motivated by this observation, MoE was proposed to combine multiple smaller expert models and test whether they can beat a single large model. The exploration ranges from mixing the fully connected layers of CNNs or RNNs to mixing entire attention blocks. The most recent success of MoE is the DeepSeek-V3 model, which outperforms SOTA large models at a much lower training cost.

Early MoEs.

Mixtral MoEs. The MoE models from Mistral work well on certain tasks, especially the Mixtral 8x7B model. The model uses sparse attention mechanisms (grouped-query attention and sliding-window attention), which we will introduce later when discussing attention variants. Its MoE follows sparse gating without added noise: \(\sum_i G(x)_i E_i(x)\), where \(G(x) = \text{softmax}(\text{TopK}(xW_g))\). Each expert \(E_i(x)\) is a SwiGLU (Swish-gated linear unit) feed-forward network, i.e., the FFN in each transformer block is replaced with SwiGLU. In other words, they still mix the final linear layers in a transformer block (a code sketch follows the figure below).

Demonstration of the MoE mechanism in the Mixtral model. Router is the gate. [2]
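Below is a hedged sketch of this top-K sparse gating with SwiGLU-style experts. It reflects my reading of the formulas above, not Mixtral's actual implementation; the class and parameter names are made up.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d, n_experts=8, k=2, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d
        self.k = k
        self.gate = nn.Linear(d, n_experts, bias=False)      # router: x -> x W_g
        # SwiGLU-style experts: (SiLU(x W1) * (x W3)) W2
        self.w1 = nn.ModuleList(nn.Linear(d, d_ff, bias=False) for _ in range(n_experts))
        self.w3 = nn.ModuleList(nn.Linear(d, d_ff, bias=False) for _ in range(n_experts))
        self.w2 = nn.ModuleList(nn.Linear(d_ff, d, bias=False) for _ in range(n_experts))

    def expert(self, i, x):
        return self.w2[i](torch.nn.functional.silu(self.w1[i](x)) * self.w3[i](x))

    def forward(self, x):                                    # x: (tokens, d)
        logits = self.gate(x)                                # (tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)           # softmax over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in topk_idx[:, slot].unique():             # send each token to its chosen experts
                mask = topk_idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.expert(int(e), x[mask])
        return out
```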

DeepSeek MoEs. DeepSeek's latest model, DeepSeek-V3, also uses an MoE structure for its feed-forward network. Its key idea is to use finer-grained experts and to isolate some experts as shared ones. Given the FFN input (attention output) of a certain token \(u \in \mathcal{R}^{p}\), it sets up a total of \(N_s+N_r\) FFN experts (\(N_s\) shared and \(N_r\) routed). The output is then \(u + \sum_{i=1}^{N_s} F_i^{(s)}(u) + \sum_{i=1}^{N_r} g_i F_{i}^{(r)}(u)\). The gating network \(g\) is similar to the top-K sparse gating introduced above; the only difference is that instead of computing \(xW_g\), it computes \(\text{Sigmoid}(u^T e_i)\), where \(e_i\) is the centroid vector of expert \(i\). The sigmoid function is newly added in DeepSeek-V3; it may help with numerical stability. A code sketch follows the figure below.

Demonstration of the DeepSeek-V3 model structure [3]. The MLA will be introduced later.
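To contrast with the Mixtral-style gate above, here is a hedged, self-contained sketch of the shared-plus-routed structure: shared experts are always active, routed experts are selected by sigmoid affinities with learned centroids, and the selected gates are normalized. The class name, expert FFN form, and normalization details are my own simplifications, not DeepSeek-V3's actual implementation.

```python
import torch
import torch.nn as nn

class SharedRoutedMoE(nn.Module):
    def __init__(self, d, n_shared=1, n_routed=8, k=2, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d
        self.k = k
        ffn = lambda: nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))   # always active
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))   # sparsely activated
        self.centroids = nn.Parameter(torch.randn(n_routed, d))       # e_i in the text

    def forward(self, u):                                   # u: (tokens, d)
        out = u + sum(f(u) for f in self.shared)            # residual + shared experts
        scores = torch.sigmoid(u @ self.centroids.T)        # sigmoid(u^T e_i), (tokens, n_routed)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        g = topk_vals / topk_vals.sum(-1, keepdim=True)     # normalize gates over selected experts
        for slot in range(self.k):
            for e in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == e
                out[mask] = out[mask] + g[mask, slot, None] * self.routed[int(e)](u[mask])
        return out
```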

Transformer component optimizations

This part discusses optimizations over the vanilla transformer that are used in SOTA models. We focus on the tokenizer, positional embedding, and attention. Note that MoE can also be viewed as an optimization of the FFN in the transformer block.

Tokenizer

Positional embedding

Attention

Below, we introduce some techniques to improve the efficiency of computing attention during training and testing.

Training

Here, we go over the widely used pre-training and fine-tuning methods for GPT models. We will also walk through the backward pass of the transformer model. Data is the first important factor in both pre-training and fine-tuning. Early works aim to collect high-quality data from the wild and train another model to filter out low-quality samples. However, data is like "fossil fuel": less and less untouched data remains. More recent work explores how to use LLMs to generate data for continual training.

Pretraining

As discussed above, the pre-training objective is the next-token prediction loss. The model iteratively predicts the next token based on the input and the previous tokens. The objective function is written as \( \sum_{i=2}^{T} \text{CE}(f(\mathbf{X}, \mathbf{x}'_{1}, \ldots, \mathbf{x}'_{i-1}; \Theta), \mathbf{x}_i )\), where \(\text{CE}\) is the cross-entropy loss, \(\mathbf{X}\) is the input context, and \(\mathbf{x}'_{1}, \ldots, \mathbf{x}'_{i-1}\) are the preceding tokens (the ground-truth tokens under teacher forcing during training). In other words, generating each token amounts to a \(|V|\)-class classification. MoE models typically add an expert balance loss to prevent routing collapse.
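A brief sketch of this objective under teacher forcing: shift the token sequence by one position and compute the token-level cross-entropy. Here `model` is a hypothetical callable that returns logits over the vocabulary at every position.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """token_ids: (B, T) integer tensor; model(x) returns logits of shape (B, T, |V|)."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]      # predict token i from tokens < i
    logits = model(inputs)                                     # (B, T-1, |V|)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```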

Fine-tuning

Backward pass

Similar to the analysis in the MLP post, we compute the backward pass of a simple transformer model to demonstrate the process. Consider a decoder-only transformer model with one transformer block and input embedding \(E \in \mathcal{R}^{S \times d}\). Suppose the model has \(h\) attention heads and the hidden dimension of each head is \(d_h = d/h\). The forward pass of the model is: $$ Q_i = E W_{Qi}^T, K_i = E W_{Ki}^T, V_i = E W_{Vi}^T, $$ $$ O_i = \text{softmax}(Q_iK_i^T + M)V_i, $$ $$ O_a = [O_1, O_2, ..., O_h] W_o, $$ $$ O_l = \text{LayerNorm}(O_a + E), $$ $$ O = \text{ReLU}(O_lW_1^T+b_1)W_2^T+b_2, $$ $$ P = \text{softmax}(O_{-1,:}W_{lm}), $$ where \(M\) is the causal mask with \(-\infty\) above the diagonal and \(0\) elsewhere (we omit the \(1/\sqrt{d_h}\) scaling for simplicity), and \(W_{lm} \in \mathcal{R}^{d \times |V|}\) is the output projection (often tied to the embedding matrix). The final loss for predicting the next token is the cross-entropy loss \(-\log P_j\), where \(j\) is the index of the true next token.

Now, we compute the backward pass. First, we compute the backward pass of the feed-forward layers, which is exactly the same as for the MLP model.

Then, we compute the gradients for the attention layer. Note that the dot-product attention itself does not introduce any parameters; we mainly need the gradients for the layer normalization and for the linear transformations of each attention head. We can ignore the residual connection when computing these gradients, but it does contribute to the gradients of the word and positional embeddings.

Here, we first need to handle the layer normalization, which normalizes the input along the last (feature) dimension: $$ \mu_i = \frac{1}{d}\sum_j (O_a)_{ij}, \quad \sigma_i^2 = \frac{1}{d} \sum_j((O_a)_{ij} - \mu_i)^2, $$ $$ \hat{(O_a)}_{ij} = \frac{(O_a)_{ij} - \mu_i}{\sqrt{\sigma_i^2+\epsilon}}, \quad (O_l)_{ij} = \gamma_j \hat{(O_a)}_{ij} +\beta_j.$$ The gradients are as follows.
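Applying the chain rule through \(\hat{(O_a)}\), \(\mu_i\), and \(\sigma_i^2\) (a standard derivation, written here with \(g_{ij} = \frac{\partial L}{\partial (O_l)_{ij}}\)):

$$ \frac{\partial L}{\partial \gamma_j} = \sum_i g_{ij}\,\hat{(O_a)}_{ij}, \qquad \frac{\partial L}{\partial \beta_j} = \sum_i g_{ij}, $$
$$ \frac{\partial L}{\partial (O_a)_{ij}} = \frac{1}{\sqrt{\sigma_i^2+\epsilon}}\left( g_{ij}\gamma_j - \frac{1}{d}\sum_k g_{ik}\gamma_k - \frac{\hat{(O_a)}_{ij}}{d}\sum_k g_{ik}\gamma_k\,\hat{(O_a)}_{ik} \right). $$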

With the gradients for the layer normalization \(\frac{\partial L}{\partial (O_a)}\), we can compute the gradients for the linear transformation in each attention head.

Writing \(A_i = \text{softmax}(Q_iK_i^T + M)\) so that \(O_i = A_iV_i\), we compute the gradients for \(V_i\), \(Q_i\), and \(K_i\) of the \(i\)-th attention head.
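Under these definitions (treating the mask \(M\) as a constant and ignoring the \(1/\sqrt{d_h}\) scaling), a standard application of the chain rule through the row-wise softmax gives:

$$ \frac{\partial L}{\partial V_i} = A_i^T \frac{\partial L}{\partial O_i}, \qquad \frac{\partial L}{\partial A_i} = \frac{\partial L}{\partial O_i} V_i^T, $$
$$ \frac{\partial L}{\partial (S_i)_{ab}} = (A_i)_{ab}\left( \frac{\partial L}{\partial (A_i)_{ab}} - \sum_c \frac{\partial L}{\partial (A_i)_{ac}} (A_i)_{ac} \right), \quad \text{where } S_i = Q_iK_i^T + M, $$
$$ \frac{\partial L}{\partial Q_i} = \frac{\partial L}{\partial S_i} K_i, \qquad \frac{\partial L}{\partial K_i} = \left(\frac{\partial L}{\partial S_i}\right)^{T} Q_i. $$

Finally, since \(Q_i = E W_{Qi}^T\), the weight gradients are \(\frac{\partial L}{\partial W_{Qi}} = \left(\frac{\partial L}{\partial Q_i}\right)^{T} E\), and similarly for \(W_{Ki}\) and \(W_{Vi}\).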

Efficient training

Parallelism training

There are four major parallelism mechanisms: data parallelism, model parallelism, pipeline parallelism, and tensor parallelism. Data parallelism splits the data and assigns different chunks to different devices; it copies the full model to every device and synchronizes the gradients computed on each data chunk. This approach cannot handle models too large to fit on a single device. Model parallelism splits the model and assigns different parts to different devices. Vanilla model parallelism cannot fully utilize the compute of each device, because devices holding earlier layers sit idle while later layers compute, and vice versa. Data and model parallelism thus expose a trade-off between memory and speed. Pipeline parallelism splits both the model and the data into parts (dividing training into stages) and assigns different stages to different devices; the various pipeline strategies all try to balance this memory-speed trade-off. My earlier post on MLPs has a more detailed discussion of these three mechanisms.

Tensor parallelism works at a lower level: it divides the model weights into different parts and assigns them to different devices. The MLP post discusses the forward/backward pass under tensor parallelism for the MLP model, which is also an essential component of transformers. Here, we discuss tensor parallelism for the attention module in the transformer model.
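As a single-process illustration (my own, Megatron-style reading rather than any particular library's API), the sketch below splits the attention heads across shards: each shard owns the corresponding columns of the Q/K/V projections and the matching rows of \(W_o\), so the per-shard partial outputs only need a sum (an all-reduce across devices in practice) to recover the full attention output.

```python
import torch

torch.manual_seed(0)
B, S, d, h = 2, 4, 16, 4
d_h, n_shards = d // h, 2
heads_per_shard = h // n_shards

E = torch.randn(B, S, d)
W_Q, W_K, W_V, W_O = (torch.randn(d, d) for _ in range(4))

def shard_attention(shard):
    # Columns of W_Q/W_K/W_V for this shard's heads; matching rows of W_O.
    cols = slice(shard * heads_per_shard * d_h, (shard + 1) * heads_per_shard * d_h)
    Q, K, V = E @ W_Q[:, cols], E @ W_K[:, cols], E @ W_V[:, cols]   # (B, S, heads_per_shard*d_h)
    def split(x):                                                    # -> (B, heads_per_shard, S, d_h)
        return x.view(B, S, heads_per_shard, d_h).transpose(1, 2)
    A = torch.softmax(split(Q) @ split(K).transpose(-2, -1) / d_h ** 0.5, dim=-1)
    O = (A @ split(V)).transpose(1, 2).reshape(B, S, heads_per_shard * d_h)
    return O @ W_O[cols, :]                                          # partial output, (B, S, d)

# In real tensor parallelism this sum is an all-reduce across devices.
out_parallel = sum(shard_attention(s) for s in range(n_shards))
```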

LoRA

LoRA is an efficient fine-tuning method that learns a task-specific weight update \(\Delta\Phi(\Theta)\). It adds \(\Delta W = BA\) to each weight \(W \in \mathcal{R}^{d\times k}\), where \(B \in \mathcal{R}^{d\times r}\) and \(A \in \mathcal{R}^{r\times k}\) have a low rank \(r \ll \min(d, k)\). When fine-tuning, the model only needs to learn the low-rank matrices \(B\) and \(A\) under a given objective function. Follow-up works include QLoRA, which combines quantization with LoRA, and LongLoRA, which combines LoRA with RoPE and position interpolation.
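A hedged sketch of a LoRA-adapted linear layer: the pretrained weight is frozen and only the low-rank factors \(B\) and \(A\) are trained. The class name and scaling convention are my own, not the official peft API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the pretrained weight W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # low-rank factors, r << min(d_in, d_out)
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init so delta W = BA starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```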

Inference

In LLMs, inference refers to the process of generating responses based on the input (in probabilistic models, inference instead means computing posterior distributions of hidden variables). Recent research has proposed a number of prompting techniques. Below, we briefly discuss some of the most representative methods. For a complete list, please refer to the Prompting Guide.

Besides the representative methods introduced above, which rely on human-designed heuristics for prompt design, another line of work aims to automatically generate prompts. Such methods are also used for LLM red-teaming, which generates adversarial prompts targeting different attack goals. See this post for more details.

Efficient Inference

This part introduces three widely used techniques for efficient inference in LLMs: KV cache, quantization, and efficient decoding.

The KV cache is used during autoregressive generation; it stores the keys and values of previously generated tokens. Recall that the attention weights are computed as \(q_tK^T\), where \(q_t \in \mathcal{R}^{1\times d}\) is the query of the current token and \(K \in \mathcal{R}^{t \times d}\) holds the keys of the current and previous tokens. The KV cache stores the keys and values of the previous tokens so that they only need to be computed once: $$ q_t = E_tW_{Q}, \quad k_t = E_tW_{K}, \quad v_t = E_tW_{V}, \quad K = [k_{1:t-1};k_t], \quad V = [v_{1:t-1};v_t], \quad o = \text{softmax}(q_tK^T)V,$$ where \(o \in \mathcal{R}^{1 \times d}\). When implementing the KV cache, we can use a dictionary to store \(K\) and \(V\) and update it in the forward function of the model class.
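A minimal sketch of a per-layer KV cache for a single attention head, following the dictionary-based idea above; the function signature and shapes are my own simplification.

```python
import torch

def cached_attention(E_t, W_Q, W_K, W_V, cache, layer):
    """E_t: (1, d) embedding of the current token; cache[layer] holds (K, V) for previous tokens."""
    q_t, k_t, v_t = E_t @ W_Q, E_t @ W_K, E_t @ W_V                 # each (1, d)
    if layer in cache:
        K_prev, V_prev = cache[layer]
        K, V = torch.cat([K_prev, k_t]), torch.cat([V_prev, v_t])   # grow to (t, d)
    else:
        K, V = k_t, v_t
    cache[layer] = (K, V)                                           # keys/values computed only once per token
    alpha = torch.softmax(q_t @ K.T / K.shape[-1] ** 0.5, dim=-1)   # (1, t) attention weights
    return alpha @ V                                                # (1, d)
```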

Quantization reduces the precision of model weights and activations, which lowers memory and computation costs. The default precision is 32-bit floating point, which can be reduced to 16-bit floating point or even 8-bit integers. Quantization: \( q = \text{round}(\frac{x}{s})+z\), with \(s=\frac{x_{max}-x_{min}}{q_{max}-q_{min}}\) and \(z=q_{min} - \text{round}(\frac{x_{min}}{s})\); dequantization: \(x \approx s(q-z)\). Quantization can be done during training (quantization-aware training) or after training (post-training quantization).
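A short sketch of asymmetric post-training quantization to 8-bit unsigned integers, following the formulas above.

```python
import torch

def quantize(x, q_min=0, q_max=255):
    s = (x.max() - x.min()) / (q_max - q_min)             # scale
    z = int(round(q_min - (x.min() / s).item()))          # integer zero point
    q = torch.clamp(torch.round(x / s) + z, q_min, q_max).to(torch.uint8)
    return q, s, z

def dequantize(q, s, z):
    return s * (q.float() - z)                            # approximate reconstruction of x

x = torch.randn(4, 4)
q, s, z = quantize(x)
x_hat = dequantize(q, s, z)                               # close to x up to rounding error
```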

The most common efficient decoding method is speculative decoding. When running inference with a target model, it uses a small draft model to propose candidate next tokens and the target model to verify them; the process is called drafting and verification. Drafting with a small model is faster than generating directly from the large model, and verification can be done in parallel. This survey has a comprehensive summary of different drafting and verification techniques. Setting a length limit for the generated tokens is also a common practice to reduce computation cost. Recent work proposes diffusion LLMs, which replace token-by-token autoregressive decoding with an iterative denoising (diffusion) process and have shown promising efficiency.
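Below is a simplified, greedy-verification sketch of speculative decoding; real implementations use rejection sampling to preserve the target model's output distribution. The `draft_model` and `target_model` callables are hypothetical stand-ins that map a token-id tensor to logits at every position.

```python
import torch

def speculative_step(target_model, draft_model, tokens, k=4):
    """tokens: (1, t). Returns tokens extended by at least one verified token (greedy variant)."""
    # Drafting: the small model proposes k tokens autoregressively.
    draft = tokens
    for _ in range(k):
        next_id = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_id], dim=-1)

    # Verification: one forward pass of the large model scores all drafted positions in parallel.
    target_pred = target_model(draft)[:, -k - 1:-1].argmax(-1)   # target's choice at each drafted slot
    proposed = draft[:, -k:]
    match = (target_pred == proposed)[0]

    n_accept = int(match.long().cumprod(0).sum())       # longest accepted prefix of the draft
    accepted = draft[:, : tokens.shape[1] + n_accept]
    if n_accept < k:                                    # replace the first mismatch with the target's token
        accepted = torch.cat([accepted, target_pred[:, n_accept:n_accept + 1]], dim=-1)
    return accepted
```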




@article{guo2024mlbasis,
  title   = {Transformers: basis and advanced topics},
  author  = {Guo, Wenbo},
  journal = {henrygwb.github.io},
  year    = {2024},
  url     = {https://henrygwb.github.io/posts/ml_basis.htm}
}