
Reinforcement Learning: Foundation and Modern Algorithms

Mar 2026


This post summarizes the basics of reinforcement learning (RL), following David Silver's RL course. It covers RL modeling (MDPs and Bellman equations), Q-learning, policy gradient methods, bandits (exploration vs. exploitation), and more advanced topics (model-based RL, offline RL, and inverse RL). For the application of RL in LLM training, please refer to this post.

Table of Contents

RL Basics

RL Agents and Environment
Figure: simplified demonstration of RL agents and the environment. At each step the agent takes an action \(a_t\); the environment returns a reward \(r_{t+1}\) and the next state \(s_{t+1}\).

Reinforcement learning addresses sequential decision-making problems arising from interactions between an agent and its environment. Here, an agent refers to an autonomous system capable of understanding the state of its environment and taking different actions. The environment is the setting where the agent performs its tasks. For a self-driving car, the agent is the car itself, while the environment includes the road, traffic, and weather conditions. One task could be to navigate from location A to location B safely and efficiently. As shown in the figure, an RL system revolves around a small set of quantities exchanged between the agent and the environment:

At every time step \(t\), the agent observes the current state \(s_t\) from the environment, selects an action \(a_t\) based on its policy \(\pi\), and receives a reward \(r_{t+1}\) along with the next state \(s_{t+1}\) from the environment. The goal of the agent is to learn a policy that maximizes the cumulative reward over time. The RL process is formally modeled as a Markov Decision Process (MDP).

MDP and Bellman Equations

An MDP is defined as the tuple \( \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle\), where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(P(s' \mid s, a)\) is the transition probability, \(R(s,a)\) is the expected immediate reward, and \(\gamma \in [0, 1)\) is the discount factor.

Under this model, the agent's policy is defined as \(\pi(a\mid s) = P(a_t = a \mid s_t = s)\): mapping from states to actions. The objective of the agent is to find an optimal policy \(\pi^*\) that maximizes the expected cumulative discounted reward. In the following we introduce two ways of measuring the cumulative reward and their relationships (Bellman equations).

Value functions

State value function under policy \(\pi\), measuring the expected return starting from state \(s\) when the agent plays policy \(\pi\): $$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s \right].$$
Action value function under policy \(\pi\), measuring the expected return starting from state \(s\), taking action \(a\), and thereafter following policy \(\pi\): $$Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right].$$

Bellman Equations express the recursive relationship between value functions, connecting the current state value to future states.

The value of a state equals the expected immediate reward plus the discounted value of the next state. \[ \begin{aligned} V^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi(\cdot\mid s_t),\, s_{t+1} \sim P(\cdot\mid s_t,a_t)} \big[ r_{t+1} + \gamma V^\pi(s_{t+1}) \big] \\[4pt] &= \mathbb{E}_{a_t \sim \pi(\cdot\mid s_t)} \big[ Q^\pi(s_t, a_t) \big] \\[4pt] &= \sum_{a} \pi(a\mid s_t) \Big[ R(s_t,a) + \gamma \sum_{s'} P(s' \mid s_t,a)\, V^\pi(s') \Big]. \end{aligned} \]
The value of taking action \(a\) in state \(s\) equals the immediate reward plus the expected discounted value of following the policy thereafter. $$ \begin{aligned} Q^\pi(s,a) &= \mathbb{E}\big[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s,\, a_t = a \big] \\ &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s') \\ &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \end{aligned} $$
V-Q Relationship: The state value is the expected action value under the policy. $$V^\pi(s) = \sum_a \pi(a\mid s) Q^\pi(s,a)$$

There are also Bellman Optimality Equations for the optimal value functions \(V^{*}(s)\) and \(Q^{*}(s,a)\), which replace the expectation over the policy with a maximization over actions, i.e.,

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')$$
$$V^*(s) = \max_a Q^*(s,a), \quad \pi^*(a|s) \in \arg\max_a Q^*(s,a)$$
Maximum Entropy RL and Soft Bellman Equations

Maximum entropy RL modifies the standard RL objective by adding an entropy regularization term. This encourages exploration by preventing the policy from becoming too deterministic or collapsing. The agent aims to maximize not only the expected return but also the entropy of its policy, leading to more robust and exploratory behavior. Under maximum entropy RL, the value functions and Bellman equations are adjusted accordingly.

Soft Value Functions

Soft state value function under policy \(\pi\), incorporating both expected return and policy entropy: $$V_{\text{soft}}^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k (r_{t+k+1} + \alpha H(\pi(\cdot|s_{t+k}))) \mid s_t = s \right]$$ where \(\alpha > 0\) is the temperature parameter and \(H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \log \pi(a|s)\) is the entropy.
Soft action value function under policy \(\pi\): \[ \begin{aligned} Q_{\text{soft}}^\pi(s_t,a_t) &= \mathbb{E}_\pi \Bigg[ r_{t+1}+ \sum_{k=1}^{\infty} \gamma^k \Big( r_{t+k+1} - \alpha \log \pi(a_{t+k}\mid s_{t+k}) \Big) \,\Big|\, s_t = s,\; a_t = a \Bigg] \\[6pt] &= R(s, a) + \mathbb{E}_\pi \Bigg[ \sum_{k=1}^\infty \gamma^k \Big( r_{t+k+1} + \alpha\,H(\pi(\cdot\mid s_{t+k})) \Big) \Bigg] \\ &= R(s, a) + \gamma \mathbb{E}_{\pi} V_{\text{soft}}^\pi(s_{t+1}). \end{aligned} \]

Soft Bellman Equations extend the standard Bellman equations by incorporating the entropy regularization term.

The soft state value equals the expected immediate reward plus the entropy bonus and the discounted soft value of the next state: \[ \begin{aligned} V_{\text{soft}}^\pi(s) &= \mathbb{E}_{a\sim\pi(\cdot|s),\, s'\sim P(\cdot|s,a)} \Big[ r_{t+1} + \alpha H(\pi(\cdot|s)) + \gamma V_{\text{soft}}^\pi(s') \Big] \\ &= \sum_a \pi(a\mid s) \Big[ R(s,a) - \alpha \log \pi(a\mid s) + \gamma \sum_{s'} P(s'\mid s,a)\, V_{\text{soft}}^\pi(s') \Big]. \end{aligned} \]
The soft action value incorporates the entropy of future states but not the current action selection: \[ \begin{aligned} Q_{\text{soft}}^\pi(s,a) &= \mathbb{E}\big[ r_{t+1} + \gamma\, V_{\text{soft}}^\pi(s_{t+1}) \mid s_t = s,\, a_t = a \big] \\ &= R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V_{\text{soft}}^\pi(s') \\ &= R(s,a) + \gamma \sum_{s'} P(s'\mid s,a) \sum_{a'} \pi(a'\mid s') \Big[ Q_{\text{soft}}^\pi(s',a') - \alpha \log \pi(a'\mid s') \Big]. \end{aligned} \]
Soft V-Q Relationship: \[ \begin{aligned} V_{\text{soft}}^\pi(s) &= \mathbb{E}_{a\sim\pi(\cdot|s)} \big[ Q_{\text{soft}}^{\pi}(s,a) - \alpha \log \pi(a|s) \big]\\ &=\sum_a \pi(a|s) \big[ Q_{\text{soft}}^{\pi}(s,a) - \alpha \log \pi(a|s) \big]. \end{aligned} \]
Optimal soft V-Q Relationship: Under the optimal policy, the soft state value and action value functions are related as: $$V_{\text{soft}}^{\pi^*}(s) = \alpha \log \sum_a \exp\left(\frac{Q_{\text{soft}}^{\pi^*}(s,a)}{\alpha}\right)$$
Optimal Soft Policy: the policy that maximizes the soft value function has the form (we will revisit this in the policy gradient section): $$\pi^*(a|s) = \exp\left(\frac{Q_{\text{soft}}^*(s,a) - V_{\text{soft}}^*(s)}{\alpha}\right)$$

The temperature parameter \(\alpha\) controls the trade-off between exploitation and exploration: when \(\alpha \to 0\), the soft policy approaches the deterministic optimal policy; when \(\alpha\) is large, the policy becomes more uniform and exploratory.
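The optimal soft V-Q relationship and the Boltzmann policy above are easy to sanity-check numerically. Below is a minimal sketch (the Q-values are arbitrary, chosen only for illustration) showing that \(V = \alpha \log\sum_a \exp(Q(s,a)/\alpha)\) and \(\pi(a|s) = \exp((Q(s,a)-V(s))/\alpha)\) form a valid distribution that sharpens toward the greedy policy as \(\alpha \to 0\):

```python
import numpy as np

def soft_value_and_policy(q, alpha):
    """Soft value V = alpha * logsumexp(Q / alpha) and the optimal soft
    policy pi(a) = exp((Q(a) - V) / alpha) for a single state."""
    v = alpha * np.log(np.sum(np.exp(q / alpha)))
    pi = np.exp((q - v) / alpha)
    return v, pi

q = np.array([1.0, 2.0, 3.0])        # illustrative Q-values for one state
for alpha in (0.05, 1.0, 100.0):
    v, pi = soft_value_and_policy(q, alpha)
    print(f"alpha={alpha}: V={v:.3f}, pi={np.round(pi, 3)}")
```

Note that \(V_{\text{soft}} \ge \max_a Q\) because of the entropy bonus, and the policy always sums to one by construction.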


Dynamic Programming

Dynamic Programming (DP) methods solve Markov Decision Processes (MDPs) under the assumption that the environment’s dynamics are fully known, i.e., \( \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle \). DP forms a foundational pillar of Reinforcement Learning, providing the theoretical basis for many later algorithms. DP focuses on two core problems:
  • Prediction: computing the value function of a given policy in order to quantify its expected long-term return.
  • Control: searching for an optimal policy, along with its corresponding optimal value function, using only the known model of the environment.

Dynamic Programming methods can be categorized as synchronous or asynchronous based on how value function updates are performed across the state space. In synchronous DP, all states are updated together in each iteration using the value estimates from the previous iteration, resulting in a sequence of full sweeps over the state space. In contrast, asynchronous DP updates states individually or in small subsets, immediately incorporating newly computed values without waiting for a complete sweep. Both approaches rely on the same Bellman equations and converge under standard assumptions, but they differ in how updates are scheduled and in their practical computational efficiency.

Synchronous Dynamic Programming

The following methods are the fundamental building blocks of synchronous dynamic programming, ranging from evaluating a fixed policy to iteratively improving it until the optimal policy is reached.

For example, value iteration sweeps over all states and applies the Bellman optimality backup:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s,a) \big[R(s,a) + \gamma V_k(s')\big].$$

Comparison of DP Methods:
  • Policy Evaluation: computes \(V^\pi\) for a given, fixed policy \(\pi\) using \(V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\); always converges to \(V^\pi\).
  • Policy Iteration: finds the optimal policy \(\pi^*\) by alternating evaluation and greedy improvement (Evaluate \(\rightarrow\) Improve \(\rightarrow\) Evaluate \(\cdots\)); the policy is updated after each evaluation and converges to \(\pi^*\) in a finite number of iterations.
  • Value Iteration: finds \(V^*\) (and hence \(\pi^*\)) using \(V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\); the policy is implicit (greedy with respect to the current value estimate), and \(V_k\) converges to \(V^*\) geometrically.
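To make the comparison concrete, here is a minimal sketch of synchronous policy evaluation and value iteration on a tiny hand-made MDP (the transition and reward numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

# Tiny hand-made MDP: 2 states, 2 actions (numbers are illustrative).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_evaluation(pi, tol=1e-9):
    """Synchronous iterative policy evaluation for a policy pi[s, a]."""
    V = np.zeros(2)
    while True:
        V_new = np.einsum('sa,sa->s', pi, R + gamma * (P @ V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def value_iteration(tol=1e-9):
    """Synchronous value iteration; returns V* and the greedy policy."""
    V = np.zeros(2)
    while True:
        Q = R + gamma * (P @ V)          # one-step lookahead Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, greedy = value_iteration()
print(V_star, greedy)
```

Evaluating the greedy policy extracted from value iteration recovers \(V^*\), which is exactly the Bellman optimality relationship above.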
Asynchronous Dynamic Programming

Synchronous DP methods update all states in each iteration, which can be computationally expensive for large state spaces. Asynchronous DP methods selectively update states, improving computational efficiency while maintaining convergence guarantees. The key insight is that not all states need to be updated with equal frequency.

In-place Dynamic Programming

In-place value iteration keeps a single value array and updates it directly: $$V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V(s')]$$ so each update immediately uses the most recently computed values of other states, rather than a separate copy from the previous sweep.

Prioritized Sweeping

Prioritized sweeping selects which state to update next based on the magnitude of its Bellman error, \(\big|\max_a \sum_{s'} P(s'\mid s,a)[R(s,a) + \gamma V(s')] - V(s)\big|\), typically via a priority queue, so the most impactful updates are performed first and value changes are propagated backward to predecessor states.

Real-time Dynamic Programming

Real-time DP performs backups only on the states the agent actually visits along (real or simulated) trajectories, concentrating computation on the region of the state space relevant to the agent's current behavior.

Asynchronous dynamic programming methods can be viewed as an important conceptual precursor to modern reinforcement learning. By updating value estimates incrementally, selectively, and often based on trajectories rather than full state sweeps, they introduce many of the computational ideas later adopted by model-free RL algorithms that learn from sampled experience.


Q Learning

Q-learning refers to methods that learn the action-value function \(Q(s,a)\) using a parametric function, most commonly a neural network, and then use an \(\epsilon\)-greedy policy for action selection. Below, we first introduce different ways to estimate value functions and then discuss \(\epsilon\)-greedy for control. We work in the model-free setting, meaning the agent does not assume or learn an explicit model of the environment’s transition dynamics or reward function.


Value Estimation

We start with value estimation given that the agent has no access to the environment’s dynamics or reward function beyond sampled experience, and therefore cannot reason directly about the long-term consequences of actions. Value estimation aims to predict long-term reward from a given state (or state–action pair), and different methods mainly differ in how they approximate this expectation.

There are three canonical value estimation strategies, which can be viewed along a bias–variance spectrum:

  1. Monte Carlo (MC): Estimates value by waiting until an episode terminates and using the realized return. MC methods are unbiased, but can be high variance since they depend on complete future trajectories.
  2. Temporal-Difference (TD): Updates value estimates after a single step by bootstrapping from the current estimate of the next state. TD methods typically have lower variance, but can be biased due to relying on the next-state value approximation.
  3. TD(λ) / n-step methods: Interpolate between Monte Carlo and TD by using multi-step returns. These methods smoothly trade off bias and variance, blending full-trajectory returns with bootstrapped estimates.
Why TD is biased but lower‑variance

MC (unbiased, higher variance): In episodic, on‑policy settings, the return \(G_t = \sum_{k=t}^T \gamma^{k-t} r_k\) satisfies \(\mathbb{E}[G_t\mid s_t] = V^\pi(s_t)\), so MC is an unbiased estimator of the true value, but it sums many random rewards, leading to higher variance.

TD(0) (biased, lower variance): The one‑step target \(Y_{TD} = r_t + \gamma\,\hat V(s_{t+1})\) uses a learned \(\hat V\) for bootstrapping. Its bias relative to \(V^\pi(s_t)\) is $$ \underbrace{\mathbb{E}[Y_{TD}\mid s_t] - V^\pi(s_t)}_{\text{bias}}\;=\;\gamma\, \mathbb{E}[\hat V(s_{t+1}) - V^\pi(s_{t+1})]. $$ If \(\hat V\) is biased, the target is biased. Variance is typically lower because it replaces the random future return with the smoother estimate \(\hat V(s_{t+1})\).

Quick Takeaways:

In deep RL, we approximate the value function with a neural network and train it to match chosen targets (MC, n-step, TD(0), or TD(\(\lambda\))). The target you choose sets the bias–variance trade‑off. We will reuse these estimators later inside actor–critic and policy‑gradient methods.

TD updates in practice: estimating \(\hat V\)

Tabular TD(0): Use this when the state space is small enough to keep one table entry per state. Maintain a table for \(V(s)\). For each transition \((s_t, a_t, r_t, s_{t+1})\), update $$ V(s_t) \leftarrow V(s_t) + \alpha\,\big[\underbrace{r_t + \gamma\,V(s_{t+1})}_{\text{TD target}} - V(s_t)\big]. $$ Here \(\hat V(s_{t+1})\) is just the current table entry of \(V(s)\) for \(s=s_{t+1}\).

  1. Initialize: set \(V(s)\) to zero (or a heuristic guess) for every state.
  2. Observe: collect a transition \((s_t, a_t, r_t, s_{t+1})\).
  3. Target: compute \(y_t = r_t + \gamma V(s_{t+1})\).
  4. Update: \(V(s_t) \leftarrow V(s_t) + \alpha (y_t - V(s_t))\).
  5. Repeat: keep looping over data until the table stops changing much.
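The loop above can be sketched on a standard 5-state random-walk example, whose true values under the uniform random policy are \(1/6, 2/6, \ldots, 5/6\) (the episode count and step size below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 non-terminal states 0..4; episodes start in the middle and move left
# or right uniformly at random; exiting right pays reward 1, exiting left 0.
n_states, gamma, alpha = 5, 1.0, 0.05
V = np.zeros(n_states)

for _ in range(5000):
    s = 2
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        if s_next < 0:                    # terminated left: target is r = 0
            V[s] += alpha * (0.0 - V[s])
            break
        if s_next >= n_states:            # terminated right: target is r = 1
            V[s] += alpha * (1.0 - V[s])
            break
        # TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)), with r = 0
        V[s] += alpha * (gamma * V[s_next] - V[s])
        s = s_next

print(np.round(V, 2))    # should approach [1/6, 2/6, 3/6, 4/6, 5/6]
```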

Function approximation (state value): When the state space is too large for tabular TD(0), swap the table with a neural network \(V_\theta(s)\) that outputs the value estimate. The TD idea stays the same, but we now solve a regression problem where the “labels” are bootstrapped targets.

$$ y_t = r_t + \gamma\,V_{\bar\theta}(s_{t+1}),\qquad \mathcal{L}(\theta) = \tfrac{1}{2}\big(V_\theta(s_t) - y_t\big)^2. $$

  1. Initialize nets: pick weights \(\theta\) and copy them into the slow target network \(\bar\theta\).
  2. Sample a batch: draw recent transitions from the online stream or a replay buffer.
  3. Compute targets: use the frozen target network, \(y_t = r_t + \gamma V_{\bar\theta}(s_{t+1})\).
  4. Gradient step: minimize \(\tfrac{1}{2}(V_\theta(s_t) - y_t)^2\), treating \(y_t\) as constant (a semi-gradient update).
  5. Refresh target: every \(K\) steps copy \(\theta\) into \(\bar\theta\) (hard update every \(K\) steps) or slowly blend them (Polyak-average) to keep \(y_t\) stable.
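A minimal sketch of this semi-gradient recipe, using a linear value function with one-hot features (so it reduces to tabular TD) on a toy deterministic 5-state cycle; the dynamics and constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(s):
    """One-hot features; stands in for any differentiable featurizer."""
    x = np.zeros(5)
    x[s] = 1.0
    return x

# Toy deterministic cycle 0 -> 1 -> ... -> 4 -> 0, with reward 1 when
# leaving state 4 (hypothetical dynamics, for illustration only).
theta = np.zeros(5)          # online parameters of V_theta(s) = theta . phi(s)
theta_bar = theta.copy()     # frozen target-network copy
gamma, lr, K = 0.9, 0.1, 50

for step in range(5000):
    s = int(rng.integers(5))
    s_next = (s + 1) % 5
    r = 1.0 if s == 4 else 0.0
    y = r + gamma * (theta_bar @ phi(s_next))    # target from frozen params
    # Semi-gradient step on 1/2 * (V_theta(s) - y)^2, treating y as constant.
    theta -= lr * (theta @ phi(s) - y) * phi(s)
    if step % K == 0:
        theta_bar = theta.copy()                 # hard target refresh
```

Because the features are one-hot, the learned weights converge to the exact values of the cycle, which can be computed in closed form from \(V(4) = 1 + \gamma^5 V(4)\).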

n-step and TD(\(\lambda\)): Use \(y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_{\bar\theta}(s_{t+n})\) to reduce bias from bootstrapping while keeping variance lower than pure MC. TD(\(\lambda\)) blends all \(n\)-step targets; online implementations use eligibility traces, and advantage estimators often use GAE.

  1. Roll out: gather a short trajectory \(s_t, r_t, \ldots, s_{t+n}\).
  2. Build the \(n\)-step target: sum the observed rewards and add \(\gamma^n V_{\bar\theta}(s_{t+n})\).
  3. Plug in: replace the one-step target with \(y_t^{(n)}\) inside the tabular or neural update.
  4. TD(\(\lambda\)): blend every \(n\)-step target using weights \((1-\lambda)\lambda^{n-1}\). In practice keep an eligibility trace per state, update it as \(z(s) \leftarrow \gamma\lambda z(s) + \mathbf{1}[s=s_t]\), compute the TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\), then update all states by \(V(s) \leftarrow V(s) + \alpha\,\delta_t\,z(s)\).
  5. GAE note: Generalized Advantage Estimation mimics the same idea for actor–critic. After computing TD residuals \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\), maintain a running accumulator \(A_t = \delta_t + \gamma\lambda A_{t+1}\) (working backward through the trajectory) and feed these smoothed advantages into the policy update.
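The GAE recursion in step 5 is only a few lines. Here is a sketch; it has the standard properties that \(\lambda = 0\) recovers the one-step TD residual and \(\lambda = 1\) recovers the Monte Carlo return minus the baseline:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: A_t = delta_t + gamma*lam*A_{t+1},
    with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t). `values` has length
    T + 1: it includes the bootstrap value of the final state."""
    T = len(rewards)
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):           # accumulate backward in time
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        acc = delta + gamma * lam * acc
        adv[t] = acc
    return adv

r = np.array([1.0, 0.0, 1.0])              # illustrative trajectory
v = np.array([0.5, 0.4, 0.3, 0.0])
print(gae(r, v, gamma=0.9, lam=0.95))
```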
Non-parametric control

So far we focused on value estimation (learning how good states or actions are). Control methods decide which action to actually take given those estimates. Here “non‑parametric” just means we use simple rules and tables, including \(\varepsilon\)-greedy, SARSA, or tabular Q‑learning, rather than using a learned neural policy network. These methods are the basic building blocks for exploration and policy improvement before we move on to deep, parametric control.

ε-Greedy

With probability \(\varepsilon\), select a uniformly random action (exploration); otherwise select the greedy action \(\arg\max_a Q(s,a)\) (exploitation). Annealing \(\varepsilon\) toward zero (e.g., \(\varepsilon_t = 1/t\)) gives a GLIE schedule (greedy in the limit with infinite exploration), under which tabular control methods converge to the optimal policy under standard conditions.

On-Policy vs Off-Policy Learning

Reinforcement learning algorithms differ in whether they learn only from the same policy they are currently using to act (on‑policy), or whether they can also learn from data generated by a different policy or a fixed dataset (off‑policy).

On-Policy Learning: An RL algorithm is on-policy if it learns the value of the policy that is currently being used to generate data. In other words, the agent runs its current (exploratory) policy to collect trajectories, then directly uses those trajectories to evaluate and improve that same policy.

Off-Policy Learning: An algorithm is off-policy if it learns the value of a target policy (often greedy) while using a different behavior policy to generate experience.
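The distinction is easiest to see in the tabular update rules: Q-learning (off-policy) bootstraps from the greedy action regardless of what the behavior policy does, while SARSA (on-policy) bootstraps from the action actually taken. A minimal sketch, with arbitrary hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, eps = 0.9, 0.1, 0.1

def eps_greedy(Q, s):
    """Behavior policy used to collect data in both cases."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: target bootstraps with max over next actions (greedy).
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: target bootstraps with the action the behavior policy took.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```

The two targets coincide only when the behavior policy happens to pick the greedy next action.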


Policy Gradient Methods

Overview

Policy gradient methods directly optimize a parameterized policy \(\pi_\theta(a|s)\) rather than conducting non-parametric control. This approach offers several advantages: it handles continuous and high-dimensional action spaces naturally, it can represent stochastic policies, and the policy changes smoothly with \(\theta\), which often gives better convergence behavior than greedy value-based updates.

Learning Objective

The goal is to find parameters \(\theta\) that maximize the expected cumulative (long-term) reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right] = \mathbb{E}_{\tau \sim \pi_\theta} [G_0] = \mathbb{E}_{s_0} [ V^{\pi_\theta}(s_0)]$$

where \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) is a trajectory.

Policy Gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A_t^{\pi_\theta} \right]$$

where \(A_t^{\pi_\theta}\) can take several forms (all yielding unbiased gradient estimates with different variance properties): the full return \(G_t\), the return minus a baseline \(G_t - b(s_t)\), the action value \(Q^{\pi_\theta}(s_t, a_t)\), the advantage \(Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\), or the TD error \(r_t + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\).

Derivation of Policy Gradient

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G_0] = \nabla_\theta \int p_\theta(\tau) G_0(\tau) d\tau = \int \nabla_\theta p_\theta(\tau) G_0(\tau) d\tau$$ $$= \int p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} G_0(\tau) d\tau = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) G_0(\tau) d\tau $$ $$= \mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log p_\theta(\tau) \cdot G_0(\tau)]$$ $$\text{where } \log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log P(s_{t+1}|s_t, a_t)$$ $$\Rightarrow \nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \quad \text{(independent of environment model } P(s_{t+1}|s_t, a_t) \text{)}$$ $$\therefore \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right] \quad \text{(by causality: } G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k \text{)}$$
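The score-function estimator derived above can be checked numerically. The sketch below uses a one-step (bandit-style) softmax policy with made-up mean rewards and compares the Monte Carlo estimate \(\mathbb{E}[\nabla_\theta \log \pi_\theta(a)\, r]\) against the analytic gradient of \(J(\theta) = \sum_a \pi_\theta(a)\,\mu_a\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 0.5])        # hypothetical mean reward per action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pg_estimate(theta, n=200_000):
    """Monte Carlo score-function estimate of grad_theta E_{a~pi}[r]."""
    pi = softmax(theta)
    a = rng.choice(len(pi), size=n, p=pi)
    r = mu[a]                          # reward for each sampled action
    glogpi = np.eye(len(pi))[a] - pi   # grad_theta log softmax, per sample
    return (glogpi * r[:, None]).mean(axis=0)

theta = np.zeros(3)
pi = softmax(theta)
analytic = pi * (mu - pi @ mu)        # grad of J = sum_a pi(a) mu(a)
print(np.round(pg_estimate(theta), 3), np.round(analytic, 3))
```

The Monte Carlo average matches the analytic gradient up to sampling noise, which is the whole point of the log-derivative trick: the gradient is estimated from samples without differentiating through the environment.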
Actor-Critic Methods

Actor-Critic methods combine policy gradient (actor) with learned value functions (critic). The critic reduces variance by providing a learned baseline or advantage estimate, while the actor updates the policy using the critic's feedback.

Basic Actor-Critic Architecture

$$\text{Advantage estimate (TD error): } A_t^{\pi_\theta} = \delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$ $$\text{Critic update: } w \leftarrow w + \alpha_w A_t^{\pi_\theta} \nabla_w V_w(s_t)$$ $$\text{Actor update: } \theta \leftarrow \theta + \alpha_\theta A_t^{\pi_\theta} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

A2C (Advantage Actor-Critic)

A2C uses the advantage function estimated via GAE (Generalized Advantage Estimation) for lower-variance updates (see above for the details of GAE).

A3C (Asynchronous Advantage Actor-Critic)

A3C runs multiple actor-learners in parallel, each interacting with its own copy of the environment. Workers asynchronously update shared global parameters, providing decorrelated updates and natural exploration through policy diversity. This improves the training speed and yields larger rollouts.

Trust Region Policy Optimization (TRPO)

TRPO is designed to guarantee (approximately) monotonic improvement of the policy throughout training, which greatly improves learning stability.

TRPO Objective

TRPO maximizes this surrogate objective subject to a KL divergence constraint. The TRPO paper has two major parts: 1) deriving the objective function that ensures monotonicity; 2) applying a series of approximations to make the optimization tractable (mainly replacing the max KL constraint with an average KL constraint).

$$\max_{\theta_{new}} \mathbb{E}_{s,a \sim \pi_\theta} \left[ \frac{\pi_{\theta_{new}}(a|s)}{\pi_\theta(a|s)} A^{\pi_\theta}(s, a) \right]$$ $$\text{s.t. } \mathbb{E}_s [D_{KL}(\pi_\theta(\cdot|s) \| \pi_{\theta_{new}}(\cdot|s))] \leq \delta$$

The constraint ensures the new policy stays close to the old policy, preventing catastrophic updates. TRPO uses conjugate gradient and line search to approximately solve this constrained optimization problem.

Derivation of Monotonic Improvement

The expected return \(\eta\) of a new policy \(\tilde{\pi}\) can be expressed in terms of the advantage over the old policy \(\pi\):

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t) \right]$$

Define \(\rho_{\tilde{\pi}}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi})\) as the (unnormalized) discounted state visitation frequency under \(\tilde{\pi}\). Then the per-time-step expectation can be rewritten as a sum over states:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a)$$

The above requires sampling from \(\tilde{\pi}\), which is unavailable. Define a local approximation \(L_\pi\) using states from \(\pi\):

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a)$$

Note that \(L_\pi(\pi) = \eta(\pi)\) and \(\nabla_\theta L_\pi(\pi_\theta)|_{\theta=\theta_{old}} = \nabla_\theta \eta(\pi_\theta)|_{\theta=\theta_{old}}\) (first-order match).

Kakade & Langford (2002) showed that the difference between \(\eta(\tilde{\pi})\) and \(L_\pi(\tilde{\pi})\) can be bounded:

$$\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - C \cdot D_{KL}^{\max}(\pi, \tilde{\pi})$$ where \(C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}\), \(\epsilon = \max_s |E_{a \sim \tilde{\pi}}[A_\pi(s,a)]|\), and \(D_{KL}^{\max} = \max_s D_{KL}(\pi(\cdot|s) \| \tilde{\pi}(\cdot|s))\).

Define \(M_\pi(\tilde{\pi}) = L_\pi(\tilde{\pi}) - C \cdot D_{KL}^{\max}(\pi, \tilde{\pi})\). Then:

$$\eta(\tilde{\pi}) \geq M_\pi(\tilde{\pi})$$ $$\eta(\pi) = M_\pi(\pi) \quad \text{(since } D_{KL}^{\max}(\pi, \pi) = 0 \text{)}$$ $$\therefore \eta(\tilde{\pi}) - \eta(\pi) \geq M_\pi(\tilde{\pi}) - M_\pi(\pi)$$

Thus, any policy \(\tilde{\pi}\) that improves \(M_\pi\) is guaranteed to improve \(\eta\). This yields a monotonically improving sequence: \(\eta(\pi_0) \leq \eta(\pi_1) \leq \eta(\pi_2) \leq \cdots\)

Proximal Policy Optimization (PPO)

PPO simplifies TRPO by replacing the hard KL constraint with a clipped surrogate objective. This achieves similar stability guarantees while being much simpler to implement and tune.

PPO-Clip Objective

Let \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) be the probability ratio. The clipped objective is:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$$

where \(\epsilon\) is typically 0.1 or 0.2. The clipping works as follows: when \(A_t > 0\), the objective is capped once \(r_t(\theta) > 1+\epsilon\), so there is no incentive to increase the action's probability further; when \(A_t < 0\), the objective is capped once \(r_t(\theta) < 1-\epsilon\), so there is no incentive to decrease it further. The outer \(\min\) makes the objective a pessimistic lower bound on the unclipped surrogate.
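The clipped surrogate is a one-liner; the sketch below evaluates it at a few ratios (arbitrary values) to show where the objective flattens:

```python
import numpy as np

def ppo_clip(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

ratios = np.array([0.5, 1.0, 1.5])
print(ppo_clip(ratios, np.ones(3)))    # A > 0: capped above at 1 + eps
print(ppo_clip(ratios, -np.ones(3)))   # A < 0: unclipped as ratio grows
```

With \(A=1\) the values are \([0.5, 1.0, 1.2]\): the gradient vanishes once the ratio exceeds \(1.2\). With \(A=-1\) they are \([-0.8, -1.0, -1.5]\): increasing the probability of a bad action is penalized without bound, while decreasing it is capped.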

Full PPO Loss

The complete PPO objective typically includes value function and entropy terms:

$$L(\theta) = \mathbb{E}_t \left[ L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$$

where \(L^{VF}\) is the value function MSE loss and \(S\) is an entropy bonus for exploration.

Bandits

Overview

The multi-armed bandit (MAB) problem is a fundamental framework for studying the exploration-exploitation trade-off. The name comes from imagining a gambler facing multiple slot machines ("one-armed bandits"), each with unknown reward distributions. The goal is to maximize cumulative reward by learning which arms are best while not spending too much time on suboptimal arms.

Problem Formulation

A stochastic \(K\)-armed bandit is defined by a set of arms \(\{1, \ldots, K\}\), each with an unknown reward distribution \(\nu_a\) with mean \(\mu_a\); at each round \(t\), the agent pulls one arm \(a_t\) and observes a reward \(r_t \sim \nu_{a_t}\).

Objective: Maximize cumulative reward over \(T\) rounds: \( \max \mathbb{E}\left[\sum_{t=1}^{T} r_t\right]\)

The key relationship between bandits and RL is that bandits can be viewed as a special case of RL with a single state, or equivalently, an MDP in which every episode terminates after one action, so there is no long-term credit assignment across states.

Regret Analysis

The standard measure of bandit algorithm performance is regret: the difference between the reward of always playing the optimal arm and the reward actually obtained.

Cumulative Regret: $$R_T = \sum_{t=1}^{T} (\mu^* - \mu_{a_t}) = T\mu^* - \sum_{t=1}^{T} \mu_{a_t}$$
Expected Regret: $$\mathbb{E}[R_T] = \sum_{a: \mu_a < \mu^*} \Delta_a \cdot \mathbb{E}[N_T(a)]$$ where \(\Delta_a = \mu^* - \mu_a\) is the suboptimality gap and \(N_T(a)\) is the number of times arm \(a\) is pulled.

Key regret bounds: any algorithm must incur \(\Omega(\log T)\) instance-dependent regret (the Lai-Robbins lower bound) and \(\Omega(\sqrt{KT})\) minimax regret; UCB and Thompson sampling both match the \(O(\log T)\) instance-dependent rate.

Key Methods

1. ε-Greedy

The simplest exploration strategy: exploit most of the time, explore randomly with small probability.

Estimated mean: \(\hat{\mu}_a = \frac{1}{N_t(a)} \sum_{s=1}^{t-1} r_s \cdot \mathbf{1}[a_s = a]\)

Properties: simple to implement; with constant \(\varepsilon\), exploration never stops, so regret grows linearly in \(T\); with an appropriately decaying schedule (e.g., \(\varepsilon_t \propto 1/t\)), logarithmic regret is achievable.
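A minimal ε-greedy simulation on a Bernoulli bandit (arm means below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])       # hypothetical true arm means (Bernoulli)
K, T, eps = len(mu), 20_000, 0.1

counts = np.zeros(K)
means = np.zeros(K)

for t in range(T):
    if rng.random() < eps:
        a = int(rng.integers(K))     # explore: uniformly random arm
    else:
        a = int(means.argmax())      # exploit: best empirical mean
    r = float(rng.random() < mu[a])  # Bernoulli reward draw
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]   # incremental mean update

print(counts, np.round(means, 2))
```

Because \(\varepsilon\) is constant, every suboptimal arm keeps being pulled at rate roughly \(\varepsilon/K\), which is exactly why the regret is linear in \(T\).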

2. Upper Confidence Bound (UCB)

The principle of "optimism in the face of uncertainty": construct upper confidence bounds on each arm's mean and select the arm with the highest UCB. This naturally balances exploitation (high estimated mean) with exploration (high uncertainty).

UCB1 Algorithm: $$a_t = \arg\max_a \left[ \hat{\mu}_a + \sqrt{\frac{2 \ln t}{N_t(a)}} \right]$$

The confidence bonus \(\sqrt{\frac{2 \ln t}{N_t(a)}}\) comes from Hoeffding's inequality, ensuring the true mean lies below the UCB with high probability.
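A matching sketch of UCB1 on the same kind of Bernoulli bandit (arm means are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])       # hypothetical true arm means (Bernoulli)
K, T = len(mu), 20_000

counts = np.zeros(K)
means = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                    # initialize: pull each arm once
    else:
        bonus = np.sqrt(2.0 * np.log(t) / counts)
        a = int((means + bonus).argmax())   # optimism in the face of uncertainty
    r = float(rng.random() < mu[a])
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]

print(counts)    # suboptimal arms are pulled only O(log T) times
```

In contrast to constant-ε exploration, the confidence bonus shrinks as an arm is pulled, so pulls of suboptimal arms grow only logarithmically in \(T\).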

3. Thompson Sampling

A Bayesian approach: maintain a posterior distribution over each arm's mean, sample from the posterior, and act greedily with respect to the samples. This implements "probability matching" — arms are selected proportionally to their probability of being optimal.

Thompson Sampling Algorithm:
  1. Maintain posterior \(P(\mu_a | \text{history})\) for each arm
  2. Sample \(\tilde{\mu}_a \sim P(\mu_a | \text{history})\) for each arm
  3. Select \(a_t = \arg\max_a \tilde{\mu}_a\)
  4. Observe reward, update posterior

For Bernoulli bandits with Beta prior:

Prior: \(\mu_a \sim \text{Beta}(\alpha_a, \beta_a)\)
After observing reward \(r \in \{0,1\}\): update \(\alpha_a \leftarrow \alpha_a + r\), \(\beta_a \leftarrow \beta_a + (1-r)\)
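The Beta-Bernoulli updates above make Thompson sampling very short; a sketch with illustrative arm means:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])       # hypothetical true arm means (Bernoulli)
K, T = len(mu), 20_000
a_post = np.ones(K)                  # Beta(1, 1) uniform priors
b_post = np.ones(K)

for t in range(T):
    theta = rng.beta(a_post, b_post)    # one posterior sample per arm
    a = int(theta.argmax())             # act greedily w.r.t. the samples
    r = float(rng.random() < mu[a])
    a_post[a] += r                      # conjugate Beta-Bernoulli update
    b_post[a] += 1.0 - r

posterior_mean = a_post / (a_post + b_post)
print(np.round(posterior_mean, 2))
```

As the posterior of the best arm concentrates, its samples dominate the argmax and suboptimal arms are pulled only rarely, which is the probability-matching behavior described above.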

Properties: empirically among the strongest bandit algorithms; matches the optimal logarithmic instance-dependent regret for Bernoulli bandits; extends naturally to richer reward models through the choice of prior and posterior.

Advanced Topics

Contextual Bandits

Contextual bandits extend the basic setting by introducing context (features) that inform the reward distribution. At each round, the agent observes a context \(x_t\), selects action \(a_t\), and receives reward \(r_t \sim P(r | x_t, a_t)\). For linear contextual bandits where \(\mathbb{E}[r_t | x_t, a_t] = x_t^\top \theta_{a_t}\), the LinUCB algorithm selects: \(a_t = \arg\max_a [ x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t} ]\). Applications include recommendation systems, clinical trials, ad placement, and LLM serving.

Bayesian Optimization

When arm rewards are correlated (e.g., nearby hyperparameters have similar performance), Bayesian optimization uses Gaussian Processes to model the reward function and guide exploration efficiently.

Adversarial Bandits

When rewards are chosen by an adversary rather than drawn from fixed distributions, algorithms like EXP3 (Exponential-weight algorithm for Exploration and Exploitation) achieve \(O(\sqrt{KT \log K})\) regret against any adversary.


Model-based RL

Overview

Model-based RL learns a model of the environment dynamics and uses it for planning, in contrast to model-free methods that learn policies or value functions directly from experience. The key idea is to learn \(\hat{P}(s'|s,a)\) (transition model) and \(\hat{R}(s,a)\) (reward model), then use these to simulate trajectories for policy optimization.

Core Framework

Model-based RL alternates between:
  1. Model learning: Fit \(\hat{P}, \hat{R}\) from collected transitions \(\{(s_t, a_t, r_t, s_{t+1})\}\)
  2. Planning/Policy optimization: Use the model to improve the policy via simulated rollouts
  3. Data collection: Execute the policy in the real environment to gather new data
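The three-step loop can be sketched in Dyna-Q style: a tabular Q-learner that also replays transitions from a learned model (here simply the last observed outcome per state-action pair). The chain environment and all constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, lr, eps = 4, 2, 0.9, 0.1, 0.1
Q = np.zeros((n_s, n_a))
model = {}                            # (s, a) -> last observed (r, s')

def env_step(s, a):
    """Hypothetical deterministic chain: a=1 moves right, a=0 moves left;
    reward 1 for reaching the rightmost state."""
    s2 = min(s + 1, n_s - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == n_s - 1 else 0.0), s2

s = 0
for step in range(2000):
    a = int(rng.integers(n_a)) if rng.random() < eps else int(Q[s].argmax())
    r, s2 = env_step(s, a)                                  # 3) real data collection
    model[(s, a)] = (r, s2)                                 # 1) model learning
    Q[s, a] += lr * (r + gamma * Q[s2].max() - Q[s, a])     # direct RL update
    keys = list(model)
    for _ in range(10):                                     # 2) planning on the model
        ps, pa = keys[int(rng.integers(len(keys)))]
        pr, ps2 = model[(ps, pa)]
        Q[ps, pa] += lr * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = 0 if s2 == n_s - 1 else s2                          # episodic reset
```

The ten planning updates per real step let the value of the goal propagate backward much faster than real experience alone would allow, which is the main appeal of model-based methods.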

Key Challenges

Model bias is the central difficulty: small errors in \(\hat{P}\) compound over long simulated rollouts, and the policy optimizer can exploit these inaccuracies, finding actions that look good in the model but fail in the real environment. Common mitigations include short rollout horizons, model ensembles, and uncertainty-aware planning.


Offline RL

Overview

Offline RL (also called batch RL) learns policies entirely from a fixed dataset of previously collected experience, without any online interaction with the environment. The difference between offline RL and behavior cloning (SFT) is that offline RL aims to learn an optimal policy that maximizes expected return, while behavior cloning simply mimics the behavior policy that generated the data.

Problem Setting

Given: a static dataset \(\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N\) collected by behavior policy \(\pi_\beta\).
Goal: learn a policy \(\pi\) that maximizes expected return, using only \(\mathcal{D}\) (no new environment interactions).

Key Challenge: Distribution Shift

The fundamental challenge is distribution shift between the learned policy \(\pi\) and the behavior policy \(\pi_\beta\). When \(\pi\) selects actions outside the support of \(\mathcal{D}\), the Q-function has no data to learn from and can be arbitrarily wrong. This leads to overestimation of out-of-distribution actions and policy degradation.

Classic Offline RL Methods

Most classic methods counteract distribution shift in one of two ways: policy-constraint methods keep the learned policy close to the behavior policy \(\pi_\beta\) (e.g., BCQ, BEAR, TD3+BC), while conservative value methods penalize the Q-values of out-of-distribution actions so the learned policy avoids them (e.g., CQL).


Inverse RL

Overview

Inverse Reinforcement Learning (IRL) is a line of approaches closely related to offline RL: it explicitly learns the reward function together with the policy. This is useful when rewards are difficult to specify but expert demonstrations are available. The learned reward can then be used for continued training.

Problem Formulation

Given: expert demonstrations \(\mathcal{D} = \{\tau_1, \ldots, \tau_N\}\) generated by an (assumed near-optimal) expert policy, and optionally the environment dynamics. Goal: find a reward function \(R(s, a)\) such that the expert policy is optimal under \(R\).

Challenges

The problem is ill-posed: many reward functions (including the trivial \(R \equiv 0\)) make the expert's behavior optimal, so additional structure or regularization is needed to select a meaningful solution.

Maximum Entropy IRL

Maximum Entropy IRL (Ziebart et al., 2008) resolves the ambiguity by assuming the expert follows a stochastic policy that maximizes entropy while favoring high-reward trajectories. This leads to the Boltzmann distribution over trajectories:

$$p(\tau) \propto \exp\left(\sum_t R(s_t, a_t)\right) = \exp(R(\tau))$$
How do we derive the Boltzmann trajectory distribution?

We seek a trajectory distribution \(p(\tau)\) that is as uncommitted as possible, while still assigning high probability to high-reward trajectories. In MaxEnt IRL, this is formalized as:

  1. Maximize entropy over trajectories: $$\max_{p} \; H(p) = -\sum_{\tau} p(\tau)\log p(\tau).$$
  2. Satisfy normalization: $$\sum_{\tau} p(\tau) = 1.$$
  3. Fix expected trajectory reward: $$\sum_{\tau} p(\tau)\,R(\tau) = \bar{R}.$$ This means we constrain the average reward level and otherwise choose the maximum-entropy distribution.

This is a constrained optimization problem. Using Lagrange multipliers \(\alpha\) (normalization) and \(\lambda\) (reward constraint):

$$ \mathcal{L}(p,\alpha,\lambda)= -\sum_{\tau} p(\tau)\log p(\tau) + \alpha\left(\sum_{\tau} p(\tau)-1\right) + \lambda\left(\sum_{\tau} p(\tau)R(\tau)-\bar{R}\right). $$

Set derivative w.r.t. \(p(\tau)\) to zero:

$$ \frac{\partial \mathcal{L}}{\partial p(\tau)} = -(\log p(\tau)+1)+\alpha+\lambda R(\tau)=0 \Rightarrow \log p(\tau)=c+\lambda R(\tau). $$

Exponentiating both sides gives an exponential-family form (absorbing the multiplier \(\lambda\) into the scale of \(R\), i.e., taking \(\lambda = 1\)):

$$p(\tau)=\frac{1}{Z}\exp\big(R(\tau)\big) \propto \exp\big(R(\tau)\big).$$

Intuitively: among all distributions consistent with a target reward level, choose the highest-entropy one, which yields soft exponential weighting by trajectory reward.

Let the reward be parameterized as \(R_\theta(s,a)\) (any differentiable form). The objective is to maximize the likelihood of expert demonstrations:

$$\max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}[\log p_\theta(\tau)] = \max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}[R_\theta(\tau)] - \log Z_\theta$$ where \(Z_\theta = \int \exp(R_\theta(\tau)) d\tau\) is the partition function.

The gradient compares reward sensitivities under expert trajectories and model trajectories, which involves solving an RL problem:

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}[\nabla_\theta R_\theta(\tau)] - \mathbb{E}_{\tau \sim p_\theta}[\nabla_\theta R_\theta(\tau)]$$ (Match expert-induced and model-induced reward gradients)
How do we derive this gradient?

Start from the log-likelihood objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}}[\log p_\theta(\tau)] = \mathbb{E}_{\tau \sim \mathcal{D}}[R_\theta(\tau)] - \log Z_\theta$$

where \(Z_\theta = \int \exp(R_\theta(\tau))\,d\tau\). Taking the gradient w.r.t. \(\theta\):

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}[\nabla_\theta R_\theta(\tau)] - \nabla_\theta \log Z_\theta$$

The only non-trivial part is \(\nabla_\theta \log Z_\theta\). By the chain rule:

$$\nabla_\theta \log Z_\theta = \frac{\nabla_\theta Z_\theta}{Z_\theta} = \frac{1}{Z_\theta}\int \exp(R_\theta(\tau))\,\nabla_\theta R_\theta(\tau)\,d\tau$$

Recognizing that \(\frac{\exp(R_\theta(\tau))}{Z_\theta} = p_\theta(\tau)\), this becomes:

$$\nabla_\theta \log Z_\theta = \int p_\theta(\tau)\,\nabla_\theta R_\theta(\tau)\,d\tau = \mathbb{E}_{\tau \sim p_\theta}[\nabla_\theta R_\theta(\tau)]$$

Substituting back:

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}[\nabla_\theta R_\theta(\tau)] - \mathbb{E}_{\tau \sim p_\theta}[\nabla_\theta R_\theta(\tau)]$$

Intuition: the first term increases reward along directions that the expert frequently visits; the second term decreases reward in directions that the current model frequently visits. Gradient ascent therefore shapes \(R_\theta\) to score expert trajectories higher than model rollouts, which is a contrastive update that requires sampling from \(p_\theta\) (i.e., running an RL inner loop) at each step.
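For a linear reward \(R_\theta(\tau) = \theta^\top \phi(\tau)\) over a small, fully enumerable trajectory set, this gradient can be run directly; at the optimum the model's feature expectations match the expert's. A minimal sketch (all trajectory features and expert counts below are made up):

```python
import numpy as np

# Toy setting: 4 enumerable trajectories with feature vectors phi(tau),
# and a linear reward R_theta(tau) = theta . phi(tau).
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [0.5, 0.5]])

# "Expert" demonstrations: suppose the expert mostly produces trajectory 2.
expert_counts = np.array([1, 1, 6, 2])
expert_phi = (expert_counts / expert_counts.sum()) @ phi  # E_D[phi]

def grad(theta):
    # MaxEnt IRL log-likelihood gradient for a linear reward:
    # E_D[phi] - E_{p_theta}[phi], with p_theta = softmax(R_theta)
    # over the (here fully enumerable) trajectory set.
    r = phi @ theta
    p = np.exp(r - r.max())
    p /= p.sum()
    return expert_phi - p @ phi

# Plain gradient ascent on theta.
theta = np.zeros(2)
for _ in range(2000):
    theta += 0.5 * grad(theta)

r = phi @ theta
p = np.exp(r - r.max())
p /= p.sum()
model_phi = p @ phi
print(model_phi, expert_phi)  # the two feature expectations match at the optimum
```

No inner RL loop is needed here only because the trajectory set is enumerable; in realistic state spaces the second expectation must be estimated with policy rollouts.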

GAIL: Generative Adversarial Imitation Learning

GAIL (Ho & Ermon, 2016) reframes imitation learning as a GAN problem, bypassing explicit reward recovery. Instead of the two-step process (IRL → RL), GAIL directly learns a policy that matches expert behavior.

Key Insight: Occupancy Measure Matching

Define the occupancy measure as the distribution of state-action pairs visited by a policy:

$$\rho_\pi(s, a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \pi)$$

GAIL matches the occupancy measure of the learned policy to that of the expert, using a discriminator to measure the difference.
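For a tiny MDP the occupancy measure can be computed exactly by solving the discounted flow equation \(d = \mu_0 + \gamma P_\pi^\top d\) for the state visitation \(d(s) = \sum_t \gamma^t P(s_t = s \mid \pi)\), then multiplying by \(\pi(a|s)\). A sketch with made-up transition numbers:

```python
import numpy as np

# Tiny 2-state, 2-action MDP (all numbers assumed for illustration).
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])  # P[s, a, s']
pi = np.array([[0.5, 0.5],                # pi(a|s)
               [0.3, 0.7]])
mu0 = np.array([1.0, 0.0])                # initial state distribution

# State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum('sa,sat->st', pi, P)

# Discounted state visitation: solve d = mu0 + gamma * P_pi^T d.
d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu0)

# Occupancy measure rho(s, a) = pi(a|s) * d(s).
rho = pi * d[:, None]
print(rho)
print(rho.sum())   # total discounted mass: 1 / (1 - gamma) = 10
```

GAIL never forms \(\rho_\pi\) explicitly like this; the discriminator compares samples from the two occupancy measures instead, which is what makes the method scale.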

GAIL Objective

$$\min_\pi \max_D \mathbb{E}_{\pi_E}[\log D(s, a)] + \mathbb{E}_{\pi}[\log(1 - D(s, a))] - \lambda H(\pi)$$

Training Procedure

  1. Sample trajectories from current policy \(\pi_\theta\)
  2. Update discriminator \(D_w\) to classify expert vs. policy samples
  3. Update policy using \(-\log(1 - D_w(s,a))\) as reward signal (via policy gradient)
  4. Repeat until convergence
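The discriminator update (step 2) and the reward signal (step 3) can be sketched with a plain logistic discriminator on state-action features. Everything here is synthetic: expert and policy pairs are drawn from two made-up Gaussian clusters, standing in for real rollouts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (s, a) features: expert pairs cluster around +1,
# current-policy pairs around -1.
expert = rng.normal(1.0, 0.5, size=(256, 2))   # samples from pi_E
policy = rng.normal(-1.0, 0.5, size=(256, 2))  # samples from pi_theta

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# Step 2: train D_w(s, a) = sigmoid(w . x + b) by gradient ascent on
# E_piE[log D] + E_pi[log(1 - D)], so D -> 1 on expert, D -> 0 on policy.
w, b = np.zeros(2), 0.0
for _ in range(500):
    d_e = sigmoid(expert @ w + b)
    d_p = sigmoid(policy @ w + b)
    grad_w = (1 - d_e) @ expert / len(expert) - d_p @ policy / len(policy)
    grad_b = (1 - d_e).mean() - d_p.mean()
    w += 0.5 * grad_w
    b += 0.5 * grad_b

# Step 3: reward signal for the policy-gradient update on pi_theta.
reward = -np.log(1.0 - sigmoid(policy @ w + b) + 1e-8)
print(reward.mean())  # small: current policy samples look nothing like the expert
```

As the policy improves, its samples become harder to distinguish from the expert's, the discriminator's confidence drops, and the reward \(-\log(1 - D)\) rises on policy samples.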

Advantages over Classical IRL

Connection Between IRL and GANs

Finn et al. (2016) established a formal equivalence between maximum entropy IRL, GANs, and energy-based models. This connection unifies these approaches under a common mathematical framework.

The Equivalence

| GAN | Maximum Entropy IRL | Energy-Based Model |
| --- | --- | --- |
| Generator \(G\) | Policy \(\pi\) | Sampler |
| Discriminator \(D\) | Reward function \(R\) | Energy function \(E\) |
| Real data | Expert demonstrations | Training data |
| Generated samples | Policy rollouts | MCMC samples |

Mathematical Connection

In MaxEnt IRL, the optimal policy under reward \(R\) is:

$$\pi^*(a|s) \propto \exp(Q^*(s,a)) \propto \exp(R(s,a) + \gamma \mathbb{E}[V^*(s')])$$
Why is this the optimal policy?

Under MaxEnt IRL, the policy optimization objective at each state is to maximize causal entropy plus expected value:

$$\max_{\pi(\cdot|s)}\; H\!\left(\pi(\cdot|s)\right) + \sum_a \pi(a|s)\,Q^*(s,a)$$

where \(H(\pi(\cdot|s)) = -\sum_a \pi(a|s)\log\pi(a|s)\) is the per-step entropy. Taking the derivative with respect to \(\pi(a|s)\) and setting it to zero (with a Lagrange multiplier for the simplex constraint) gives:

$$-\log\pi^*(a|s) - 1 + Q^*(s,a) = \lambda \;\Longrightarrow\; \pi^*(a|s) = \frac{\exp(Q^*(s,a))}{Z(s)}$$

i.e., a Boltzmann/softmax policy over Q-values. The soft Q-function is defined recursively as:

$$Q^*(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s'}[V^*(s')], \quad V^*(s) = \log\sum_a \exp(Q^*(s,a))$$

where \(V^*\) is the soft value function (a log-sum-exp instead of the hard max in standard RL). This is the foundation of Soft Actor-Critic (SAC) and maximum-entropy RL, where entropy regularization naturally yields a Boltzmann policy rather than a deterministic greedy one.
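The soft Bellman recursion above converges under the usual \(\gamma\)-contraction argument and can be iterated directly on a tiny MDP. A sketch with made-up rewards and transitions:

```python
import numpy as np

# Soft value iteration on a tiny MDP (assumed numbers):
# Q(s,a) = R(s,a) + gamma * E_{s'}[V(s')],  V(s) = log sum_a exp(Q(s,a)).
gamma = 0.9
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s']

V = np.zeros(2)
for _ in range(500):
    Q = R + gamma * P @ V                  # soft Bellman backup, shape (s, a)
    V_new = np.log(np.exp(Q).sum(axis=1))  # soft value: log-sum-exp over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Boltzmann/softmax policy over soft Q-values: pi(a|s) = exp(Q - V).
pi = np.exp(Q - V[:, None])
print(pi)        # rows sum to 1
print(pi[1])     # state 1 prefers its high-reward action a=1
```

Replacing the log-sum-exp with a hard max recovers standard value iteration and a deterministic greedy policy.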

This is analogous to the Boltzmann distribution in energy-based models:

$$p(x) \propto \exp(-E(x))$$ where \(R = -E\) (reward is negative energy).
What is an energy-based model (EBM)?

An energy-based model (EBM) defines a probability distribution over inputs \(x\) via a scalar energy function \(E_\theta(x)\): $$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \quad Z(\theta) = \int \exp(-E_\theta(x))\,dx$$ Low energy → high probability; high energy → low probability. The normalizing constant \(Z(\theta)\) (the partition function) is generally intractable; this is what makes EBMs difficult to train and sample from.

Training intuition: push down the energy of real/desired data points, push up the energy of undesired ones. In practice this requires sampling from the model (e.g., via MCMC) to estimate the gradient of \(\log Z(\theta)\).

Connection to IRL: the reward function \(R(s,a)\) plays the role of \(-E(s,a)\). The trajectory distribution \(p(\tau) \propto \exp(R(\tau))\) is exactly an EBM over trajectories. Learning the reward from demonstrations is therefore equivalent to fitting an EBM; this requires maximizing the (log-)likelihood of expert trajectories under the Boltzmann distribution, which then requires estimating the partition function \(Z\) (hard in large state spaces, motivating GAIL's discriminator-based shortcut).
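The push-down/push-up training intuition can be made concrete on a discrete domain small enough that \(Z(\theta)\) is exactly computable, so model expectations replace MCMC samples. A sketch with a made-up 8-state data distribution and a tabular energy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete EBM over K = 8 states with a tabular energy E_theta; Z is
# tractable here, so exact model expectations stand in for MCMC samples.
K = 8
data = rng.choice(K, size=2000,
                  p=[0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])
data_freq = np.bincount(data, minlength=K) / len(data)

E = np.zeros(K)  # energies (the parameters theta)
for _ in range(2000):
    p_model = np.exp(-E) / np.exp(-E).sum()
    # Maximum-likelihood gradient: push DOWN the energy of observed data,
    # push UP the energy wherever the model currently puts mass.
    E -= 0.5 * (data_freq - p_model)

p_model = np.exp(-E) / np.exp(-E).sum()
print(np.round(p_model, 3))   # close to the empirical data frequencies
```

The second term of the gradient is exactly \(\nabla \log Z\); when the domain is too large to enumerate, it must be estimated with samples from the model, which is where MCMC (or GAIL's discriminator shortcut) enters.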

The GAN discriminator optimal solution is:

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{gen}(x)}$$

When the generator density is available, the discriminator effectively learns an energy function that assigns low energy to real data and high energy to generated samples; this is exactly what the reward function does in IRL.
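The optimal-discriminator formula can be verified pointwise: at any \(x\), \(D^*\) maximizes \(p_{data}\log D + p_{gen}\log(1-D)\). A one-line numerical check with assumed density values:

```python
import numpy as np

# Assumed density values at one point x.
p_data, p_gen = 0.7, 0.2

# Scan the discriminator output and compare the argmax of the pointwise
# GAN objective against the closed form p_data / (p_data + p_gen).
D = np.linspace(1e-4, 1 - 1e-4, 100001)
objective = p_data * np.log(D) + p_gen * np.log(1 - D)
d_best = D[np.argmax(objective)]
d_star = p_data / (p_data + p_gen)
print(d_best, d_star)   # both ~ 0.778
```

Equivalently, \(\log\frac{D^*}{1-D^*} = \log p_{data} - \log p_{gen}\): the discriminator's logit is a difference of log-densities, i.e., a (negative) energy difference.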

Implications




@article{guo2026rl,
  title   = {Reinforcement Learning: Foundation and Modern Algorithms},
  author  = {Guo, Wenbo and Zhang, Ying},
  journal = {henrygwb.github.io},
  year    = {2026},
  url     = {https://henrygwb.github.io/posts/rl.htm}
}