
Reinforcement Learning: Foundation and Modern Algorithms

Mar 2026


This post summarizes the basics of reinforcement learning (RL), following David Silver's RL course. It covers RL modeling (MDPs and Bellman equations), Q-learning, policy gradient methods, bandits (exploration vs. exploitation), and more advanced topics (model-based RL, offline RL, and inverse RL). For the application of RL in LLM training, please refer to this post.

Table of Contents

RL Basics

RL Agents and Environment
Figure: simplified demonstration of RL agents and the environment. At each step the agent takes an action \(a_t\); the environment returns a reward \(r_{t+1}\) and the next state \(s_{t+1}\).

Reinforcement learning addresses sequential decision-making problems arising from interactions between an agent and its environment. Here, an agent refers to an autonomous system capable of understanding the state of its environment and taking different actions. The environment is the setting where the agent performs its tasks. For a self-driving car, the agent is the car itself, while the environment includes the road, traffic, and weather conditions. One task could be to navigate from location A to location B safely and efficiently. As shown in the figure, an RL system revolves around a small set of quantities exchanged between the agent and the environment:

At every time step \(t\), the agent observes the current state \(s_t\) from the environment, selects an action \(a_t\) based on its policy \(\pi\), and receives a reward \(r_{t+1}\) along with the next state \(s_{t+1}\) from the environment. The goal of the agent is to learn a policy that maximizes the cumulative reward over time. The RL process is formally modeled as a Markov Decision Process (MDP).

MDP and Bellman Equations

An MDP is defined as the tuple \( \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle\), where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(P(s' \mid s, a)\) is the transition probability, \(R(s,a)\) is the expected immediate reward, and \(\gamma \in [0, 1)\) is the discount factor.

Under this model, the agent's policy is defined as \(\pi(a\mid s) = P(a_t = a \mid s_t = s)\): mapping from states to actions. The objective of the agent is to find an optimal policy \(\pi^*\) that maximizes the expected cumulative discounted reward. In the following we introduce two ways of measuring the cumulative reward and their relationships (Bellman equations).

Value functions

State value function under policy \(\pi\), measuring the expected return starting from state \(s\) when the agent plays policy \(\pi\): $$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s \right].$$
Action value function under policy \(\pi\), measuring the expected return starting from state \(s\), taking action \(a\), and thereafter following policy \(\pi\): $$Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right].$$

Bellman Equations express the recursive relationship between value functions, connecting the current state value to future states.

The value of a state equals the expected immediate reward plus the discounted value of the next state. \[ \begin{aligned} V^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi(\cdot\mid s_t),\, s_{t+1} \sim P(\cdot\mid s_t,a_t)} \big[ r_{t+1} + \gamma V^\pi(s_{t+1}) \big] \\[4pt] &= \mathbb{E}_{a_t \sim \pi(\cdot\mid s_t)} \big[ Q^\pi(s_t, a_t) \big] \\[4pt] &= \sum_{a} \pi(a\mid s_t) \Big[ R(s_t,a) + \gamma \sum_{s'} P(s' \mid s_t,a)\, V^\pi(s') \Big]. \end{aligned} \]
The value of taking action \(a\) in state \(s\) equals the immediate reward plus the expected discounted value of following the policy thereafter. $$ \begin{aligned} Q^\pi(s,a) &= \mathbb{E}\big[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s,\, a_t = a \big] \\ &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s') \\ &= R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \end{aligned} $$
V-Q Relationship: The state value is the expected action value under the policy. $$V^\pi(s) = \sum_a \pi(a\mid s) Q^\pi(s,a)$$

There are also Bellman Optimality Equations for the optimal value functions \(V^{*}(s)\) and \(Q^{*}(s,a)\), which replace the expectation over the policy with a maximization over actions, i.e.,

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')$$
$$V^*(s) = \max_a Q^*(s,a), \quad \pi^*(a|s) \in \arg\max_a Q^*(s,a)$$
Maximum Entropy RL and Soft Bellman Equations

Maximum entropy RL modifies the standard RL objective by adding an entropy regularization term. This encourages exploration by preventing the policy from becoming too deterministic or collapsing. The agent aims to maximize not only the expected return but also the entropy of its policy, leading to more robust and exploratory behavior. Under maximum entropy RL, the value functions and Bellman equations are adjusted accordingly.

Soft Value Functions

Soft state value function under policy \(\pi\), incorporating both expected return and policy entropy: $$V_{\text{soft}}^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k (r_{t+k+1} + \alpha H(\pi(\cdot|s_{t+k}))) \mid s_t = s \right]$$ where \(\alpha > 0\) is the temperature parameter and \(H(\pi(\cdot|s)) = -\sum_a \pi(a|s) \log \pi(a|s)\) is the entropy.
Soft action value function under policy \(\pi\): \[ \begin{aligned} Q_{\text{soft}}^\pi(s_t,a_t) &= \mathbb{E}_\pi \Bigg[ r_{t+1}+ \sum_{k=1}^{\infty} \gamma^k \Big( r_{t+k+1} - \alpha \log \pi(a_{t+k}\mid s_{t+k}) \Big) \,\Big|\, s_t = s,\; a_t = a \Bigg] \\[6pt] &= R(s, a) + \mathbb{E}_\pi \Bigg[ \sum_{k=1}^\infty \gamma^k \Big( r_{t+k+1} + \alpha\,H(\pi(\cdot\mid s_{t+k})) \Big) \Bigg] \\ &= R(s, a) + \gamma \mathbb{E}_{\pi} V_{\text{soft}}^\pi(s_{t+1}). \end{aligned} \]

Soft Bellman Equations extend the standard Bellman equations by incorporating the entropy regularization term.

The soft state value equals the expected immediate reward plus the entropy bonus and the discounted soft value of the next state: \[ \begin{aligned} V_{\text{soft}}^\pi(s) &= \mathbb{E}_{a\sim\pi(\cdot|s),\, s'\sim P(\cdot|s,a)} \Big[ r_{t+1} + \alpha H(\pi(\cdot|s)) + \gamma V_{\text{soft}}^\pi(s') \Big] \\ &= \sum_a \pi(a\mid s) \Big[ R(s,a) - \alpha \log \pi(a\mid s) + \gamma \sum_{s'} P(s'\mid s,a)\, V_{\text{soft}}^\pi(s') \Big]. \end{aligned} \]
The soft action value incorporates the entropy of future states but not the current action selection: \[ \begin{aligned} Q_{\text{soft}}^\pi(s,a) &= \mathbb{E}\big[ r_{t+1} + \gamma\, V_{\text{soft}}^\pi(s_{t+1}) \mid s_t = s,\, a_t = a \big] \\ &= R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V_{\text{soft}}^\pi(s') \\ &= R(s,a) + \gamma \sum_{s'} P(s'\mid s,a) \sum_{a'} \pi(a'\mid s') \Big[ Q_{\text{soft}}^\pi(s',a') - \alpha \log \pi(a'\mid s') \Big]. \end{aligned} \]
Soft V-Q Relationship: \[ \begin{aligned} V_{\text{soft}}^\pi(s) &= \mathbb{E}_{a\sim\pi(\cdot|s)} \big[ Q_{\text{soft}}^{\pi}(s,a) - \alpha \log \pi(a|s) \big]\\ &=\sum_a \pi(a|s) \big[ Q_{\text{soft}}^{\pi}(s,a) - \alpha \log \pi(a|s) \big]. \end{aligned} \]
Optimal soft V-Q Relationship: Under the optimal policy, the soft state value and action value functions are related as: $$V_{\text{soft}}^{\pi^*}(s) = \alpha \log \sum_a \exp\left(\frac{Q_{\text{soft}}^{\pi^*}(s,a)}{\alpha}\right)$$
Optimal Soft Policy: the policy that maximizes the soft value function has the form (we will revisit this in the policy gradient section): $$\pi^*(a|s) = \exp\left(\frac{Q_{\text{soft}}^*(s,a) - V_{\text{soft}}^*(s)}{\alpha}\right)$$

The temperature parameter \(\alpha\) controls the trade-off between exploitation and exploration: when \(\alpha \to 0\), the soft policy approaches the deterministic optimal policy; when \(\alpha\) is large, the policy becomes more uniform and exploratory.
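The optimal soft V-Q relationship and the Boltzmann policy above are easy to sanity-check numerically. Below is a minimal sketch (the Q-values are arbitrary, chosen only for illustration) showing that \(V = \alpha \log\sum_a \exp(Q(s,a)/\alpha)\) and \(\pi(a|s) = \exp((Q(s,a)-V(s))/\alpha)\) form a valid distribution that sharpens toward the greedy policy as \(\alpha \to 0\):

```python
import numpy as np

def soft_value_and_policy(q, alpha):
    """Soft value V = alpha * logsumexp(Q / alpha) and the optimal soft
    policy pi(a) = exp((Q(a) - V) / alpha) for a single state."""
    v = alpha * np.log(np.sum(np.exp(q / alpha)))
    pi = np.exp((q - v) / alpha)
    return v, pi

q = np.array([1.0, 2.0, 3.0])        # illustrative Q-values for one state
for alpha in (0.05, 1.0, 100.0):
    v, pi = soft_value_and_policy(q, alpha)
    print(f"alpha={alpha}: V={v:.3f}, pi={np.round(pi, 3)}")
```

Note that \(V_{\text{soft}} \ge \max_a Q\) because of the entropy bonus, and the policy always sums to one by construction.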


Dynamic Programming

Dynamic Programming (DP) methods solve Markov Decision Processes (MDPs) under the assumption that the environment’s dynamics are fully known, i.e., \( \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle \). DP forms a foundational pillar of Reinforcement Learning, providing the theoretical basis for many later algorithms. DP focuses on two core problems:
  • Prediction: computing the value function of a given policy in order to quantify its expected long-term return.
  • Control: searching for an optimal policy, along with its corresponding optimal value function, using only the known model of the environment.

Dynamic Programming methods can be categorized as synchronous or asynchronous based on how value function updates are performed across the state space. In synchronous DP, all states are updated together in each iteration using the value estimates from the previous iteration, resulting in a sequence of full sweeps over the state space. In contrast, asynchronous DP updates states individually or in small subsets, immediately incorporating newly computed values without waiting for a complete sweep. Both approaches rely on the same Bellman equations and converge under standard assumptions, but they differ in how updates are scheduled and in their practical computational efficiency.

Synchronous Dynamic Programming

The following methods are the fundamental building blocks of synchronous dynamic programming, ranging from evaluating a fixed policy to iteratively improving it until the optimal policy is reached.

For example, value iteration sweeps over all states and applies the Bellman optimality backup:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s,a) \big[R(s,a) + \gamma V_k(s')\big].$$

Comparison of DP Methods:
  • Policy Evaluation: computes \(V^\pi\) for a given, fixed policy \(\pi\) using \(V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\); always converges to \(V^\pi\).
  • Policy Iteration: finds the optimal policy \(\pi^*\) by alternating evaluation and greedy improvement (Evaluate \(\rightarrow\) Improve \(\rightarrow\) Evaluate \(\cdots\)); the policy is updated after each evaluation and converges to \(\pi^*\) in a finite number of iterations.
  • Value Iteration: finds \(V^*\) (and hence \(\pi^*\)) using \(V_{k+1}(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V_k(s')]\); the policy is implicit (greedy with respect to the current value estimate), and \(V_k\) converges to \(V^*\) geometrically.
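To make the comparison concrete, here is a minimal sketch of synchronous policy evaluation and value iteration on a tiny hand-made MDP (the transition and reward numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

# Tiny hand-made MDP: 2 states, 2 actions (numbers are illustrative).
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_evaluation(pi, tol=1e-9):
    """Synchronous iterative policy evaluation for a policy pi[s, a]."""
    V = np.zeros(2)
    while True:
        V_new = np.einsum('sa,sa->s', pi, R + gamma * (P @ V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def value_iteration(tol=1e-9):
    """Synchronous value iteration; returns V* and the greedy policy."""
    V = np.zeros(2)
    while True:
        Q = R + gamma * (P @ V)          # one-step lookahead Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, greedy = value_iteration()
print(V_star, greedy)
```

Evaluating the greedy policy extracted from value iteration recovers \(V^*\), which is exactly the Bellman optimality relationship above.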
Asynchronous Dynamic Programming

Synchronous DP methods update all states in each iteration, which can be computationally expensive for large state spaces. Asynchronous DP methods selectively update states, improving computational efficiency while maintaining convergence guarantees. The key insight is that not all states need to be updated with equal frequency.

In-place Dynamic Programming

In-place value iteration keeps a single value array and updates it directly: $$V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R(s,a) + \gamma V(s')]$$ so each update immediately uses the most recently computed values of other states, rather than a separate copy from the previous sweep.

Prioritized Sweeping

Prioritized sweeping selects which state to update next based on the magnitude of its Bellman error, \(\big|\max_a \sum_{s'} P(s'\mid s,a)[R(s,a) + \gamma V(s')] - V(s)\big|\), typically via a priority queue, so the most impactful updates are performed first and value changes are propagated backward to predecessor states.

Real-time Dynamic Programming

Real-time DP performs backups only on the states the agent actually visits along (real or simulated) trajectories, concentrating computation on the region of the state space relevant to the agent's current behavior.

Asynchronous dynamic programming methods can be viewed as an important conceptual precursor to modern reinforcement learning. By updating value estimates incrementally, selectively, and often based on trajectories rather than full state sweeps, they introduce many of the computational ideas later adopted by model-free RL algorithms that learn from sampled experience.


Q Learning

Q-learning refers to methods that learn the action-value function \(Q(s,a)\) using a parametric function, most commonly a neural network, and then use an \(\epsilon\)-greedy policy for action selection. Below, we first introduce different ways to estimate value functions and then discuss \(\epsilon\)-greedy for control. We work in the model-free setting, meaning the agent does not assume or learn an explicit model of the environment’s transition dynamics or reward function.


Value Estimation

We start with value estimation given that the agent has no access to the environment’s dynamics or reward function beyond sampled experience, and therefore cannot reason directly about the long-term consequences of actions. Value estimation aims to predict long-term reward from a given state (or state–action pair), and different methods mainly differ in how they approximate this expectation.

There are three canonical value estimation strategies, which can be viewed along a bias–variance spectrum:

  1. Monte Carlo (MC): Estimates value by waiting until an episode terminates and using the realized return. MC methods are unbiased, but can be high variance since they depend on complete future trajectories.
  2. Temporal-Difference (TD): Updates value estimates after a single step by bootstrapping from the current estimate of the next state. TD methods typically have lower variance, but can be biased due to relying on the next-state value approximation.
  3. TD(λ) / n-step methods: Interpolate between Monte Carlo and TD by using multi-step returns. These methods smoothly trade off bias and variance, blending full-trajectory returns with bootstrapped estimates.
Why TD is biased but lower‑variance

MC (unbiased, higher variance): In episodic, on‑policy settings, the return \(G_t = \sum_{k=t}^T \gamma^{k-t} r_k\) satisfies \(\mathbb{E}[G_t\mid s_t] = V^\pi(s_t)\), so MC is an unbiased estimator of the true value, but it sums many random rewards, leading to higher variance.

TD(0) (biased, lower variance): The one‑step target \(Y_{TD} = r_t + \gamma\,\hat V(s_{t+1})\) uses a learned \(\hat V\) for bootstrapping. Its bias relative to \(V^\pi(s_t)\) is $$ \underbrace{\mathbb{E}[Y_{TD}\mid s_t] - V^\pi(s_t)}_{\text{bias}}\;=\;\gamma\, \mathbb{E}[\hat V(s_{t+1}) - V^\pi(s_{t+1})]. $$ If \(\hat V\) is biased, the target is biased. Variance is typically lower because it replaces the random future return with the smoother estimate \(\hat V(s_{t+1})\).

Quick Takeaways:

In deep RL, we approximate the value function with a neural network and train it to match chosen targets (MC, n-step, TD(0), or TD(\(\lambda\))). The target you choose sets the bias–variance trade‑off. We will reuse these estimators later inside actor–critic and policy‑gradient methods.

TD updates in practice: estimating \(\hat V\)

Tabular TD(0): Use this when the state space is small enough to keep one table entry per state. Maintain a table for \(V(s)\). For each transition \((s_t, a_t, r_t, s_{t+1})\), update $$ V(s_t) \leftarrow V(s_t) + \alpha\,\big[\underbrace{r_t + \gamma\,V(s_{t+1})}_{\text{TD target}} - V(s_t)\big]. $$ Here \(\hat V(s_{t+1})\) is just the current table entry of \(V(s)\) for \(s=s_{t+1}\).

  1. Initialize: set \(V(s)\) to zero (or a heuristic guess) for every state.
  2. Observe: collect a transition \((s_t, a_t, r_t, s_{t+1})\).
  3. Target: compute \(y_t = r_t + \gamma V(s_{t+1})\).
  4. Update: \(V(s_t) \leftarrow V(s_t) + \alpha (y_t - V(s_t))\).
  5. Repeat: keep looping over data until the table stops changing much.
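The loop above can be sketched on a standard 5-state random-walk example, whose true values under the uniform random policy are \(1/6, 2/6, \ldots, 5/6\) (the episode count and step size below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 non-terminal states 0..4; episodes start in the middle and move left
# or right uniformly at random; exiting right pays reward 1, exiting left 0.
n_states, gamma, alpha = 5, 1.0, 0.05
V = np.zeros(n_states)

for _ in range(5000):
    s = 2
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        if s_next < 0:                    # terminated left: target is r = 0
            V[s] += alpha * (0.0 - V[s])
            break
        if s_next >= n_states:            # terminated right: target is r = 1
            V[s] += alpha * (1.0 - V[s])
            break
        # TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)), with r = 0
        V[s] += alpha * (gamma * V[s_next] - V[s])
        s = s_next

print(np.round(V, 2))    # should approach [1/6, 2/6, 3/6, 4/6, 5/6]
```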

Function approximation (state value): When the state space is too large for tabular TD(0), swap the table with a neural network \(V_\theta(s)\) that outputs the value estimate. The TD idea stays the same, but we now solve a regression problem where the “labels” are bootstrapped targets.

$$ y_t = r_t + \gamma\,V_{\bar\theta}(s_{t+1}),\qquad \mathcal{L}(\theta) = \tfrac{1}{2}\big(V_\theta(s_t) - y_t\big)^2. $$

  1. Initialize nets: pick weights \(\theta\) and copy them into the slow target network \(\bar\theta\).
  2. Sample a batch: draw recent transitions from the online stream or a replay buffer.
  3. Compute targets: use the frozen target network, \(y_t = r_t + \gamma V_{\bar\theta}(s_{t+1})\).
  4. Gradient step: minimize \(\tfrac{1}{2}(V_\theta(s_t) - y_t)^2\), treating \(y_t\) as constant (a semi-gradient update).
  5. Refresh target: every \(K\) steps copy \(\theta\) into \(\bar\theta\) (hard update every \(K\) steps) or slowly blend them (Polyak-average) to keep \(y_t\) stable.
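A minimal sketch of this semi-gradient recipe, using a linear value function with one-hot features (so it reduces to tabular TD) on a toy deterministic 5-state cycle; the dynamics and constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(s):
    """One-hot features; stands in for any differentiable featurizer."""
    x = np.zeros(5)
    x[s] = 1.0
    return x

# Toy deterministic cycle 0 -> 1 -> ... -> 4 -> 0, with reward 1 when
# leaving state 4 (hypothetical dynamics, for illustration only).
theta = np.zeros(5)          # online parameters of V_theta(s) = theta . phi(s)
theta_bar = theta.copy()     # frozen target-network copy
gamma, lr, K = 0.9, 0.1, 50

for step in range(5000):
    s = int(rng.integers(5))
    s_next = (s + 1) % 5
    r = 1.0 if s == 4 else 0.0
    y = r + gamma * (theta_bar @ phi(s_next))    # target from frozen params
    # Semi-gradient step on 1/2 * (V_theta(s) - y)^2, treating y as constant.
    theta -= lr * (theta @ phi(s) - y) * phi(s)
    if step % K == 0:
        theta_bar = theta.copy()                 # hard target refresh
```

Because the features are one-hot, the learned weights converge to the exact values of the cycle, which can be computed in closed form from \(V(4) = 1 + \gamma^5 V(4)\).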

n-step and TD(\(\lambda\)): Use \(y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_{\bar\theta}(s_{t+n})\) to reduce bias from bootstrapping while keeping variance lower than pure MC. TD(\(\lambda\)) blends all \(n\)-step targets; online implementations use eligibility traces, and advantage estimators often use GAE.

  1. Roll out: gather a short trajectory \(s_t, r_t, \ldots, s_{t+n}\).
  2. Build the \(n\)-step target: sum the observed rewards and add \(\gamma^n V_{\bar\theta}(s_{t+n})\).
  3. Plug in: replace the one-step target with \(y_t^{(n)}\) inside the tabular or neural update.
  4. TD(\(\lambda\)): blend every \(n\)-step target using weights \((1-\lambda)\lambda^{n-1}\). In practice keep an eligibility trace per state, update it as \(z(s) \leftarrow \gamma\lambda z(s) + \mathbf{1}[s=s_t]\), compute the TD error \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\), then update all states by \(V(s) \leftarrow V(s) + \alpha\,\delta_t\,z(s)\).
  5. GAE note: Generalized Advantage Estimation mimics the same idea for actor–critic. After computing TD residuals \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\), maintain a running accumulator \(A_t = \delta_t + \gamma\lambda A_{t+1}\) (working backward through the trajectory) and feed these smoothed advantages into the policy update.
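The GAE recursion in step 5 is only a few lines. Here is a sketch; it has the standard properties that \(\lambda = 0\) recovers the one-step TD residual and \(\lambda = 1\) recovers the Monte Carlo return minus the baseline:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: A_t = delta_t + gamma*lam*A_{t+1},
    with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t). `values` has length
    T + 1: it includes the bootstrap value of the final state."""
    T = len(rewards)
    adv = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):           # accumulate backward in time
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        acc = delta + gamma * lam * acc
        adv[t] = acc
    return adv

r = np.array([1.0, 0.0, 1.0])              # illustrative trajectory
v = np.array([0.5, 0.4, 0.3, 0.0])
print(gae(r, v, gamma=0.9, lam=0.95))
```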
Non-parametric control

So far we focused on value estimation (learning how good states or actions are). Control methods decide which action to actually take given those estimates. Here “non‑parametric” just means we use simple rules and tables, including \(\varepsilon\)-greedy, SARSA, or tabular Q‑learning, rather than using a learned neural policy network. These methods are the basic building blocks for exploration and policy improvement before we move on to deep, parametric control.

ε-Greedy

With probability \(\varepsilon\), select a uniformly random action (exploration); otherwise select the greedy action \(\arg\max_a Q(s,a)\) (exploitation). Annealing \(\varepsilon\) toward zero (e.g., \(\varepsilon_t = 1/t\)) gives a GLIE schedule (greedy in the limit with infinite exploration), under which tabular control methods converge to the optimal policy under standard conditions.

On-Policy vs Off-Policy Learning

Reinforcement learning algorithms differ in whether they learn only from the same policy they are currently using to act (on‑policy), or whether they can also learn from data generated by a different policy or a fixed dataset (off‑policy).

On-Policy Learning: An RL algorithm is on-policy if it learns the value of the policy that is currently being used to generate data. In other words, the agent runs its current (exploratory) policy to collect trajectories, then directly uses those trajectories to evaluate and improve that same policy.

Off-Policy Learning: An algorithm is off-policy if it learns the value of a target policy (often greedy) while using a different behavior policy to generate experience.
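The distinction is easiest to see in the tabular update rules: Q-learning (off-policy) bootstraps from the greedy action regardless of what the behavior policy does, while SARSA (on-policy) bootstraps from the action actually taken. A minimal sketch, with arbitrary hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, eps = 0.9, 0.1, 0.1

def eps_greedy(Q, s):
    """Behavior policy used to collect data in both cases."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: target bootstraps with max over next actions (greedy).
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: target bootstraps with the action the behavior policy took.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```

The two targets coincide only when the behavior policy happens to pick the greedy next action.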


Policy Gradient Methods

Overview

Policy gradient methods directly optimize a parameterized policy \(\pi_\theta(a|s)\) rather than conducting non-parametric control. This approach offers several advantages: it handles continuous and high-dimensional action spaces naturally, it can represent stochastic policies, and the policy changes smoothly with \(\theta\), which often gives better convergence behavior than greedy value-based updates.

Learning Objective

The goal is to find parameters \(\theta\) that maximize the expected cumulative (long-term) reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right] = \mathbb{E}_{\tau \sim \pi_\theta} [G_0] = \mathbb{E}_{s_0} [ V^{\pi_\theta}(s_0)]$$

where \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) is a trajectory.

Policy Gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A_t^{\pi_\theta} \right]$$

where \(A_t^{\pi_\theta}\) can take several forms (all yielding unbiased gradient estimates with different variance properties): the full return \(G_t\), the return minus a baseline \(G_t - b(s_t)\), the action value \(Q^{\pi_\theta}(s_t, a_t)\), the advantage \(Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\), or the TD error \(r_t + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\).

Derivation of Policy Gradient

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[G_0] = \nabla_\theta \int p_\theta(\tau) G_0(\tau) d\tau = \int \nabla_\theta p_\theta(\tau) G_0(\tau) d\tau$$ $$= \int p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} G_0(\tau) d\tau = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) G_0(\tau) d\tau $$ $$= \mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log p_\theta(\tau) \cdot G_0(\tau)]$$ $$\text{where } \log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T} \log P(s_{t+1}|s_t, a_t)$$ $$\Rightarrow \nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \quad \text{(independent of environment model } P(s_{t+1}|s_t, a_t) \text{)}$$ $$\therefore \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t \right] \quad \text{(by causality: } G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k \text{)}$$
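The score-function estimator derived above can be checked numerically. The sketch below uses a one-step (bandit-style) softmax policy with made-up mean rewards and compares the Monte Carlo estimate \(\mathbb{E}[\nabla_\theta \log \pi_\theta(a)\, r]\) against the analytic gradient of \(J(\theta) = \sum_a \pi_\theta(a)\,\mu_a\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 0.5])        # hypothetical mean reward per action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pg_estimate(theta, n=200_000):
    """Monte Carlo score-function estimate of grad_theta E_{a~pi}[r]."""
    pi = softmax(theta)
    a = rng.choice(len(pi), size=n, p=pi)
    r = mu[a]                          # reward for each sampled action
    glogpi = np.eye(len(pi))[a] - pi   # grad_theta log softmax, per sample
    return (glogpi * r[:, None]).mean(axis=0)

theta = np.zeros(3)
pi = softmax(theta)
analytic = pi * (mu - pi @ mu)        # grad of J = sum_a pi(a) mu(a)
print(np.round(pg_estimate(theta), 3), np.round(analytic, 3))
```

The Monte Carlo average matches the analytic gradient up to sampling noise, which is the whole point of the log-derivative trick: the gradient is estimated from samples without differentiating through the environment.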
Actor-Critic Methods

Actor-Critic methods combine policy gradient (actor) with learned value functions (critic). The critic reduces variance by providing a learned baseline or advantage estimate, while the actor updates the policy using the critic's feedback.

Basic Actor-Critic Architecture

$$\text{Advantage estimate (TD error): } A_t^{\pi_\theta} = \delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$ $$\text{Critic update: } w \leftarrow w + \alpha_w A_t^{\pi_\theta} \nabla_w V_w(s_t)$$ $$\text{Actor update: } \theta \leftarrow \theta + \alpha_\theta A_t^{\pi_\theta} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

A2C (Advantage Actor-Critic)

A2C uses the advantage function estimated via GAE (Generalized Advantage Estimation) for lower-variance updates (see above for the details of GAE).

A3C (Asynchronous Advantage Actor-Critic)

A3C runs multiple actor-learners in parallel, each interacting with its own copy of the environment. Workers asynchronously update shared global parameters, providing decorrelated updates and natural exploration through policy diversity. This improves the training speed and yields larger rollouts.

Trust Region Policy Optimization (TRPO)

TRPO is designed to guarantee (approximately) monotonic improvement of the policy throughout training, which greatly improves learning stability.

TRPO Objective

TRPO maximizes this surrogate objective subject to a KL divergence constraint. The TRPO paper has two major parts: 1) deriving the objective function that ensures monotonicity; 2) applying a series of approximations to make the optimization tractable (mainly replacing the max KL constraint with an average KL constraint).

$$\max_{\theta_{new}} \mathbb{E}_{s,a \sim \pi_\theta} \left[ \frac{\pi_{\theta_{new}}(a|s)}{\pi_\theta(a|s)} A^{\pi_\theta}(s, a) \right]$$ $$\text{s.t. } \mathbb{E}_s [D_{KL}(\pi_\theta(\cdot|s) \| \pi_{\theta_{new}}(\cdot|s))] \leq \delta$$

The constraint ensures the new policy stays close to the old policy, preventing catastrophic updates. TRPO uses conjugate gradient and line search to approximately solve this constrained optimization problem.

Derivation of Monotonic Improvement

The expected return \(\eta\) of a new policy \(\tilde{\pi}\) can be expressed in terms of the advantage over the old policy \(\pi\):

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t) \right]$$

Define \(\rho_{\tilde{\pi}}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi})\) as the (unnormalized) discounted state visitation frequency under \(\tilde{\pi}\). Then the per-time-step expectation can be rewritten as a sum over states:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a)$$

The above requires sampling from \(\tilde{\pi}\), which is unavailable. Define a local approximation \(L_\pi\) using states from \(\pi\):

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a)$$

Note that \(L_\pi(\pi) = \eta(\pi)\) and \(\nabla_\theta L_\pi(\pi_\theta)|_{\theta=\theta_{old}} = \nabla_\theta \eta(\pi_\theta)|_{\theta=\theta_{old}}\) (first-order match).

Kakade & Langford (2002) showed that the difference between \(\eta(\tilde{\pi})\) and \(L_\pi(\tilde{\pi})\) can be bounded:

$$\eta(\tilde{\pi}) \geq L_\pi(\tilde{\pi}) - C \cdot D_{KL}^{\max}(\pi, \tilde{\pi})$$ where \(C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}\), \(\epsilon = \max_s |E_{a \sim \tilde{\pi}}[A_\pi(s,a)]|\), and \(D_{KL}^{\max} = \max_s D_{KL}(\pi(\cdot|s) \| \tilde{\pi}(\cdot|s))\).

Define \(M_\pi(\tilde{\pi}) = L_\pi(\tilde{\pi}) - C \cdot D_{KL}^{\max}(\pi, \tilde{\pi})\). Then:

$$\eta(\tilde{\pi}) \geq M_\pi(\tilde{\pi})$$ $$\eta(\pi) = M_\pi(\pi) \quad \text{(since } D_{KL}^{\max}(\pi, \pi) = 0 \text{)}$$ $$\therefore \eta(\tilde{\pi}) - \eta(\pi) \geq M_\pi(\tilde{\pi}) - M_\pi(\pi)$$

Thus, any policy \(\tilde{\pi}\) that improves \(M_\pi\) is guaranteed to improve \(\eta\). This yields a monotonically improving sequence: \(\eta(\pi_0) \leq \eta(\pi_1) \leq \eta(\pi_2) \leq \cdots\)

Proximal Policy Optimization (PPO)

PPO simplifies TRPO by replacing the hard KL constraint with a clipped surrogate objective. This achieves similar stability guarantees while being much simpler to implement and tune.

PPO-Clip Objective

Let \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) be the probability ratio. The clipped objective is:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$$

where \(\epsilon\) is typically 0.1 or 0.2. The clipping works as follows: when \(A_t > 0\), the objective is capped once \(r_t(\theta) > 1+\epsilon\), so there is no incentive to increase the action's probability further; when \(A_t < 0\), the objective is capped once \(r_t(\theta) < 1-\epsilon\), so there is no incentive to decrease it further. The outer \(\min\) makes the objective a pessimistic lower bound on the unclipped surrogate.
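The clipped surrogate is a one-liner; the sketch below evaluates it at a few ratios (arbitrary values) to show where the objective flattens:

```python
import numpy as np

def ppo_clip(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

ratios = np.array([0.5, 1.0, 1.5])
print(ppo_clip(ratios, np.ones(3)))    # A > 0: capped above at 1 + eps
print(ppo_clip(ratios, -np.ones(3)))   # A < 0: unclipped as ratio grows
```

With \(A=1\) the values are \([0.5, 1.0, 1.2]\): the gradient vanishes once the ratio exceeds \(1.2\). With \(A=-1\) they are \([-0.8, -1.0, -1.5]\): increasing the probability of a bad action is penalized without bound, while decreasing it is capped.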

Full PPO Loss

The complete PPO objective typically includes value function and entropy terms:

$$L(\theta) = \mathbb{E}_t \left[ L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$$

where \(L^{VF}\) is the value function MSE loss and \(S\) is an entropy bonus for exploration.

Bandits

Overview

The multi-armed bandit (MAB) problem is a fundamental framework for studying the exploration-exploitation trade-off. The name comes from imagining a gambler facing multiple slot machines ("one-armed bandits"), each with unknown reward distributions. The goal is to maximize cumulative reward by learning which arms are best while not spending too much time on suboptimal arms.

Problem Formulation

A stochastic \(K\)-armed bandit is defined by a set of arms \(\{1, \ldots, K\}\), each with an unknown reward distribution \(\nu_a\) with mean \(\mu_a\); at each round \(t\), the agent pulls one arm \(a_t\) and observes a reward \(r_t \sim \nu_{a_t}\).

Objective: Maximize cumulative reward over \(T\) rounds: \( \max \mathbb{E}\left[\sum_{t=1}^{T} r_t\right]\)

The key relationship between bandits and RL is that bandits can be viewed as a special case of RL with a single state, or equivalently, an MDP in which every episode terminates after one action, so there is no long-term credit assignment across states.

Regret Analysis

The standard measure of bandit algorithm performance is regret: the difference between the reward of always playing the optimal arm and the reward actually obtained.

Cumulative Regret: $$R_T = \sum_{t=1}^{T} (\mu^* - \mu_{a_t}) = T\mu^* - \sum_{t=1}^{T} \mu_{a_t}$$
Expected Regret: $$\mathbb{E}[R_T] = \sum_{a: \mu_a < \mu^*} \Delta_a \cdot \mathbb{E}[N_T(a)]$$ where \(\Delta_a = \mu^* - \mu_a\) is the suboptimality gap and \(N_T(a)\) is the number of times arm \(a\) is pulled.

Key regret bounds: any algorithm must incur \(\Omega(\log T)\) instance-dependent regret (the Lai-Robbins lower bound) and \(\Omega(\sqrt{KT})\) minimax regret; UCB and Thompson sampling both match the \(O(\log T)\) instance-dependent rate.

Key Methods

1. ε-Greedy

The simplest exploration strategy: exploit most of the time, explore randomly with small probability.

Estimated mean: \(\hat{\mu}_a = \frac{1}{N_t(a)} \sum_{s=1}^{t-1} r_s \cdot \mathbf{1}[a_s = a]\)

Properties: simple to implement; with constant \(\varepsilon\), exploration never stops, so regret grows linearly in \(T\); with an appropriately decaying schedule (e.g., \(\varepsilon_t \propto 1/t\)), logarithmic regret is achievable.
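A minimal ε-greedy simulation on a Bernoulli bandit (arm means below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])       # hypothetical true arm means (Bernoulli)
K, T, eps = len(mu), 20_000, 0.1

counts = np.zeros(K)
means = np.zeros(K)

for t in range(T):
    if rng.random() < eps:
        a = int(rng.integers(K))     # explore: uniformly random arm
    else:
        a = int(means.argmax())      # exploit: best empirical mean
    r = float(rng.random() < mu[a])  # Bernoulli reward draw
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]   # incremental mean update

print(counts, np.round(means, 2))
```

Because \(\varepsilon\) is constant, every suboptimal arm keeps being pulled at rate roughly \(\varepsilon/K\), which is exactly why the regret is linear in \(T\).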

2. Upper Confidence Bound (UCB)

The principle of "optimism in the face of uncertainty": construct upper confidence bounds on each arm's mean and select the arm with the highest UCB. This naturally balances exploitation (high estimated mean) with exploration (high uncertainty).

UCB1 Algorithm: $$a_t = \arg\max_a \left[ \hat{\mu}_a + \sqrt{\frac{2 \ln t}{N_t(a)}} \right]$$

The confidence bonus \(\sqrt{\frac{2 \ln t}{N_t(a)}}\) comes from Hoeffding's inequality, ensuring the true mean lies below the UCB with high probability.
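A matching sketch of UCB1 on the same kind of Bernoulli bandit (arm means are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])       # hypothetical true arm means (Bernoulli)
K, T = len(mu), 20_000

counts = np.zeros(K)
means = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                    # initialize: pull each arm once
    else:
        bonus = np.sqrt(2.0 * np.log(t) / counts)
        a = int((means + bonus).argmax())   # optimism in the face of uncertainty
    r = float(rng.random() < mu[a])
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]

print(counts)    # suboptimal arms are pulled only O(log T) times
```

In contrast to constant-ε exploration, the confidence bonus shrinks as an arm is pulled, so pulls of suboptimal arms grow only logarithmically in \(T\).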

3. Thompson Sampling

A Bayesian approach: maintain a posterior distribution over each arm's mean, sample from the posterior, and act greedily with respect to the samples. This implements "probability matching" — arms are selected proportionally to their probability of being optimal.

Thompson Sampling Algorithm:
  1. Maintain posterior \(P(\mu_a | \text{history})\) for each arm
  2. Sample \(\tilde{\mu}_a \sim P(\mu_a | \text{history})\) for each arm
  3. Select \(a_t = \arg\max_a \tilde{\mu}_a\)
  4. Observe reward, update posterior

For Bernoulli bandits with Beta prior:

Prior: \(\mu_a \sim \text{Beta}(\alpha_a, \beta_a)\)
After observing reward \(r \in \{0,1\}\): update \(\alpha_a \leftarrow \alpha_a + r\), \(\beta_a \leftarrow \beta_a + (1-r)\)
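The Beta-Bernoulli updates above make Thompson sampling very short; a sketch with illustrative arm means:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])       # hypothetical true arm means (Bernoulli)
K, T = len(mu), 20_000
a_post = np.ones(K)                  # Beta(1, 1) uniform priors
b_post = np.ones(K)

for t in range(T):
    theta = rng.beta(a_post, b_post)    # one posterior sample per arm
    a = int(theta.argmax())             # act greedily w.r.t. the samples
    r = float(rng.random() < mu[a])
    a_post[a] += r                      # conjugate Beta-Bernoulli update
    b_post[a] += 1.0 - r

posterior_mean = a_post / (a_post + b_post)
print(np.round(posterior_mean, 2))
```

As the posterior of the best arm concentrates, its samples dominate the argmax and suboptimal arms are pulled only rarely, which is the probability-matching behavior described above.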

Properties: empirically among the strongest bandit algorithms; matches the optimal logarithmic instance-dependent regret for Bernoulli bandits; extends naturally to richer reward models through the choice of prior and posterior.

Advanced Topics

Contextual Bandits

Contextual bandits extend the basic setting by introducing context (features) that inform the reward distribution. At each round, the agent observes a context \(x_t\), selects action \(a_t\), and receives reward \(r_t \sim P(r | x_t, a_t)\). For linear contextual bandits where \(\mathbb{E}[r_t | x_t, a_t] = x_t^\top \theta_{a_t}\), the LinUCB algorithm selects: \(a_t = \arg\max_a [ x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t} ]\). Applications include recommendation systems, clinical trials, ad placement, and LLM serving.

Bayesian Optimization

When arm rewards are correlated (e.g., nearby hyperparameters have similar performance), Bayesian optimization uses Gaussian Processes to model the reward function and guide exploration efficiently.

Adversarial Bandits

When rewards are chosen by an adversary rather than drawn from fixed distributions, algorithms like EXP3 (Exponential-weight algorithm for Exploration and Exploitation) achieve \(O(\sqrt{KT \log K})\) regret against any adversary.


Model-based RL

Overview

Model-based RL learns a model of the environment dynamics and uses it for planning, in contrast to model-free methods that learn policies or value functions directly from experience. The key idea is to learn \(\hat{P}(s'|s,a)\) (transition model) and \(\hat{R}(s,a)\) (reward model), then use these to simulate trajectories for policy optimization.

Core Framework

Model-based RL alternates between:
  1. Model learning: Fit \(\hat{P}, \hat{R}\) from collected transitions \(\{(s_t, a_t, r_t, s_{t+1})\}\)
  2. Planning/Policy optimization: Use the model to improve the policy via simulated rollouts
  3. Data collection: Execute the policy in the real environment to gather new data
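The three-step loop can be sketched in Dyna-Q style: a tabular Q-learner that also replays transitions from a learned model (here simply the last observed outcome per state-action pair). The chain environment and all constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, lr, eps = 4, 2, 0.9, 0.1, 0.1
Q = np.zeros((n_s, n_a))
model = {}                            # (s, a) -> last observed (r, s')

def env_step(s, a):
    """Hypothetical deterministic chain: a=1 moves right, a=0 moves left;
    reward 1 for reaching the rightmost state."""
    s2 = min(s + 1, n_s - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == n_s - 1 else 0.0), s2

s = 0
for step in range(2000):
    a = int(rng.integers(n_a)) if rng.random() < eps else int(Q[s].argmax())
    r, s2 = env_step(s, a)                                  # 3) real data collection
    model[(s, a)] = (r, s2)                                 # 1) model learning
    Q[s, a] += lr * (r + gamma * Q[s2].max() - Q[s, a])     # direct RL update
    keys = list(model)
    for _ in range(10):                                     # 2) planning on the model
        ps, pa = keys[int(rng.integers(len(keys)))]
        pr, ps2 = model[(ps, pa)]
        Q[ps, pa] += lr * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = 0 if s2 == n_s - 1 else s2                          # episodic reset
```

The ten planning updates per real step let the value of the goal propagate backward much faster than real experience alone would allow, which is the main appeal of model-based methods.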

Key Challenges

Model bias is the central difficulty: small errors in \(\hat{P}\) compound over long simulated rollouts, and the policy optimizer can exploit these inaccuracies, finding actions that look good in the model but fail in the real environment. Common mitigations include short rollout horizons, model ensembles, and uncertainty-aware planning.


Offline RL

Overview

Offline RL (also called batch RL) learns policies entirely from a fixed dataset of previously collected experience, without any online interaction with the environment. The difference between offline RL and behavior cloning (SFT) is that offline RL aims to learn an optimal policy that maximizes expected return, while behavior cloning simply mimics the behavior policy that generated the data.

Problem Setting

Given: a static dataset \(\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^N\) collected by behavior policy \(\pi_\beta\).
Goal: learn a policy \(\pi\) that maximizes expected return, using only \(\mathcal{D}\) (no new environment interactions).

Key Challenge: Distribution Shift

The fundamental challenge is distribution shift between the learned policy \(\pi\) and the behavior policy \(\pi_\beta\). When \(\pi\) selects actions outside the support of \(\mathcal{D}\), the Q-function has no data to learn from and can be arbitrarily wrong. This leads to overestimation of out-of-distribution actions and policy degradation.

Classic Offline RL Methods

Most classic methods counteract distribution shift in one of two ways: policy-constraint methods keep the learned policy close to the behavior policy \(\pi_\beta\) (e.g., BCQ, BEAR, TD3+BC), while conservative value methods penalize the Q-values of out-of-distribution actions so the learned policy avoids them (e.g., CQL).


Inverse RL

Overview

Inverse Reinforcement Learning (IRL) is a line of approaches closely related to offline RL: it explicitly learns the reward function together with the policy. This is useful when rewards are difficult to specify but expert demonstrations are available. The learned reward can then be used for continued training.

Problem Formulation

Given: expert demonstrations \(\mathcal{D} = \{\tau_1, \ldots, \tau_N\}\) generated by an (assumed near-optimal) expert policy, and optionally the environment dynamics. Goal: find a reward function \(R(s, a)\) such that the expert policy is optimal under \(R\).

Challenges

The problem is ill-posed: many reward functions (including the trivial \(R \equiv 0\)) make the expert's behavior optimal, so additional structure or regularization is needed to select a meaningful solution.

Maximum Entropy IRL

Maximum Entropy IRL (Ziebart et al., 2008) resolves the ambiguity by assuming the expert follows a stochastic policy that maximizes entropy while favoring high-reward trajectories. This leads to the Boltzmann distribution over trajectories:

$$p(\tau) \propto \exp\left(\sum_t R(s_t, a_t)\right) = \exp(R(\tau))$$
How do we derive the Boltzmann trajectory distribution?

We seek a trajectory distribution \(p(\tau)\) that is as uncommitted as possible, while still assigning high probability to high-reward trajectories. In MaxEnt IRL, this is formalized as:

  1. Maximize entropy over trajectories: $$\max_{p} \; H(p) = -\sum_{\tau} p(\tau)\log p(\tau).$$
  2. Satisfy normalization: $$\sum_{\tau} p(\tau) = 1.$$
  3. Fix expected trajectory reward: $$\sum_{\tau} p(\tau)\,R(\tau) = \bar{R}.$$ This means we constrain the average reward level and otherwise choose the maximum-entropy distribution.

This is a constrained optimization problem. Using Lagrange multipliers \(\alpha\) (normalization) and \(\lambda\) (reward constraint):

$$ \mathcal{L}(p,\alpha,\lambda)= -\sum_{\tau} p(\tau)\log p(\tau) + \alpha\left(\sum_{\tau} p(\tau)-1\right) + \lambda\left(\sum_{\tau} p(\tau)R(\tau)-\bar{R}\right). $$

Set derivative w.r.t. \(p(\tau)\) to zero:

$$ \frac{\partial \mathcal{L}}{\partial p(\tau)} = -(\log p(\tau)+1)+\alpha+\lambda R(\tau)=0 \Rightarrow \log p(\tau)=c+\lambda R(\tau). $$

Exponentiating both sides gives an exponential-family form (absorbing the multiplier \(\lambda\) into the scale of \(R\), i.e., taking \(\lambda = 1\)):

$$p(\tau)=\frac{1}{Z}\exp\big(R(\tau)\big) \propto \exp\big(R(\tau)\big).$$

Intuitively: among all distributions consistent with a target reward level, choose the highest-entropy one, which yields soft exponential weighting by trajectory reward.

Let the reward be parameterized as \(R_\theta(s,a)\) (any differentiable form). The objective is to maximize the likelihood of expert demonstrations:

$$\max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}[\log p_\theta(\tau)] = \max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}[R_\theta(\tau)] - \log Z_\theta$$ where \(Z_\theta = \int \exp(R_\theta(\tau)) d\tau\) is the partition function.

The gradient compares reward sensitivities under expert trajectories and model trajectories, which involves solving an RL problem:

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}[\nabla_\theta R_\theta(\tau)] - \mathbb{E}_{\tau \sim p_\theta}[\nabla_\theta R_\theta(\tau)]$$ (Match expert-induced and model-induced reward gradients)
How do we derive this gradient?

Start from the log-likelihood objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}}[\log p_\theta(\tau)] = \mathbb{E}_{\tau \sim \mathcal{D}}[R_\theta(\tau)] - \log Z_\theta$$

where \(Z_\theta = \int \exp(R_\theta(\tau))\,d\tau\). Taking the gradient w.r.t. \(\theta\):

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}[\nabla_\theta R_\theta(\tau)] - \nabla_\theta \log Z_\theta$$

The only non-trivial part is \(\nabla_\theta \log Z_\theta\). By the chain rule:

$$\nabla_\theta \log Z_\theta = \frac{\nabla_\theta Z_\theta}{Z_\theta} = \frac{1}{Z_\theta}\int \exp(R_\theta(\tau))\,\nabla_\theta R_\theta(\tau)\,d\tau$$

Recognizing that \(\frac{\exp(R_\theta(\tau))}{Z_\theta} = p_\theta(\tau)\), this becomes:

$$\nabla_\theta \log Z_\theta = \int p_\theta(\tau)\,\nabla_\theta R_\theta(\tau)\,d\tau = \mathbb{E}_{\tau \sim p_\theta}[\nabla_\theta R_\theta(\tau)]$$

Substituting back:

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}[\nabla_\theta R_\theta(\tau)] - \mathbb{E}_{\tau \sim p_\theta}[\nabla_\theta R_\theta(\tau)]$$

Intuition: the first term increases reward along directions that the expert frequently visits; the second term decreases reward in directions that the current model frequently visits. Gradient ascent therefore shapes \(R_\theta\) to score expert trajectories higher than model rollouts, which is a contrastive update that requires sampling from \(p_\theta\) (i.e., running an RL inner loop) at each step.
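For a linear reward \(R_\theta(\tau) = \theta^\top \phi(\tau)\) over a small, fully enumerable trajectory set, this gradient can be run directly; at the optimum the model's feature expectations match the expert's. A minimal sketch (all trajectory features and expert counts below are made up):

```python
import numpy as np

# Toy setting: 4 enumerable trajectories with feature vectors phi(tau),
# and a linear reward R_theta(tau) = theta . phi(tau).
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [0.5, 0.5]])

# "Expert" demonstrations: suppose the expert mostly produces trajectory 2.
expert_counts = np.array([1, 1, 6, 2])
expert_phi = (expert_counts / expert_counts.sum()) @ phi  # E_D[phi]

def grad(theta):
    # MaxEnt IRL log-likelihood gradient for a linear reward:
    # E_D[phi] - E_{p_theta}[phi], with p_theta = softmax(R_theta)
    # over the (here fully enumerable) trajectory set.
    r = phi @ theta
    p = np.exp(r - r.max())
    p /= p.sum()
    return expert_phi - p @ phi

# Plain gradient ascent on theta.
theta = np.zeros(2)
for _ in range(2000):
    theta += 0.5 * grad(theta)

r = phi @ theta
p = np.exp(r - r.max())
p /= p.sum()
model_phi = p @ phi
print(model_phi, expert_phi)  # the two feature expectations match at the optimum
```

No inner RL loop is needed here only because the trajectory set is enumerable; in realistic state spaces the second expectation must be estimated with policy rollouts.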

GAIL: Generative Adversarial Imitation Learning

GAIL (Ho & Ermon, 2016) reframes imitation learning as a GAN problem, bypassing explicit reward recovery. Instead of the two-step process (IRL → RL), GAIL directly learns a policy that matches expert behavior.

Key Insight: Occupancy Measure Matching

Define the occupancy measure as the distribution of state-action pairs visited by a policy:

$$\rho_\pi(s, a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s | \pi)$$

GAIL matches the occupancy measure of the learned policy to that of the expert, using a discriminator to measure the difference.
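For a tiny MDP the occupancy measure can be computed exactly by solving the discounted flow equation \(d = \mu_0 + \gamma P_\pi^\top d\) for the state visitation \(d(s) = \sum_t \gamma^t P(s_t = s \mid \pi)\), then multiplying by \(\pi(a|s)\). A sketch with made-up transition numbers:

```python
import numpy as np

# Tiny 2-state, 2-action MDP (all numbers assumed for illustration).
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])  # P[s, a, s']
pi = np.array([[0.5, 0.5],                # pi(a|s)
               [0.3, 0.7]])
mu0 = np.array([1.0, 0.0])                # initial state distribution

# State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum('sa,sat->st', pi, P)

# Discounted state visitation: solve d = mu0 + gamma * P_pi^T d.
d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu0)

# Occupancy measure rho(s, a) = pi(a|s) * d(s).
rho = pi * d[:, None]
print(rho)
print(rho.sum())   # total discounted mass: 1 / (1 - gamma) = 10
```

GAIL never forms \(\rho_\pi\) explicitly like this; the discriminator compares samples from the two occupancy measures instead, which is what makes the method scale.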

GAIL Objective

$$\min_\pi \max_D \mathbb{E}_{\pi_E}[\log D(s, a)] + \mathbb{E}_{\pi}[\log(1 - D(s, a))] - \lambda H(\pi)$$

Training Procedure

  1. Sample trajectories from current policy \(\pi_\theta\)
  2. Update discriminator \(D_w\) to classify expert vs. policy samples
  3. Update policy using \(-\log(1 - D_w(s,a))\) as reward signal (via policy gradient)
  4. Repeat until convergence
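The discriminator update (step 2) and the reward signal (step 3) can be sketched with a plain logistic discriminator on state-action features. Everything here is synthetic: expert and policy pairs are drawn from two made-up Gaussian clusters, standing in for real rollouts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (s, a) features: expert pairs cluster around +1,
# current-policy pairs around -1.
expert = rng.normal(1.0, 0.5, size=(256, 2))   # samples from pi_E
policy = rng.normal(-1.0, 0.5, size=(256, 2))  # samples from pi_theta

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# Step 2: train D_w(s, a) = sigmoid(w . x + b) by gradient ascent on
# E_piE[log D] + E_pi[log(1 - D)], so D -> 1 on expert, D -> 0 on policy.
w, b = np.zeros(2), 0.0
for _ in range(500):
    d_e = sigmoid(expert @ w + b)
    d_p = sigmoid(policy @ w + b)
    grad_w = (1 - d_e) @ expert / len(expert) - d_p @ policy / len(policy)
    grad_b = (1 - d_e).mean() - d_p.mean()
    w += 0.5 * grad_w
    b += 0.5 * grad_b

# Step 3: reward signal for the policy-gradient update on pi_theta.
reward = -np.log(1.0 - sigmoid(policy @ w + b) + 1e-8)
print(reward.mean())  # small: current policy samples look nothing like the expert
```

As the policy improves, its samples become harder to distinguish from the expert's, the discriminator's confidence drops, and the reward \(-\log(1 - D)\) rises on policy samples.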

Advantages over Classical IRL

Connection Between IRL and GANs

Finn et al. (2016) established a formal equivalence between maximum entropy IRL, GANs, and energy-based models. This connection unifies these approaches under a common mathematical framework.

The Equivalence

| GAN | Maximum Entropy IRL | Energy-Based Model |
| --- | --- | --- |
| Generator \(G\) | Policy \(\pi\) | Sampler |
| Discriminator \(D\) | Reward function \(R\) | Energy function \(E\) |
| Real data | Expert demonstrations | Training data |
| Generated samples | Policy rollouts | MCMC samples |

Mathematical Connection

In MaxEnt IRL, the optimal policy under reward \(R\) is:

$$\pi^*(a|s) \propto \exp(Q^*(s,a)) \propto \exp(R(s,a) + \gamma \mathbb{E}[V^*(s')])$$
Why is this the optimal policy?

Under MaxEnt IRL, the policy optimization objective at each state is to maximize causal entropy plus expected value:

$$\max_{\pi(\cdot|s)}\; H\!\left(\pi(\cdot|s)\right) + \sum_a \pi(a|s)\,Q^*(s,a)$$

where \(H(\pi(\cdot|s)) = -\sum_a \pi(a|s)\log\pi(a|s)\) is the per-step entropy. Taking the derivative with respect to \(\pi(a|s)\) and setting it to zero (with a Lagrange multiplier for the simplex constraint) gives:

$$-\log\pi^*(a|s) - 1 + Q^*(s,a) = \lambda \;\Longrightarrow\; \pi^*(a|s) = \frac{\exp(Q^*(s,a))}{Z(s)}$$

i.e., a Boltzmann/softmax policy over Q-values. The soft Q-function is defined recursively as:

$$Q^*(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s'}[V^*(s')], \quad V^*(s) = \log\sum_a \exp(Q^*(s,a))$$

where \(V^*\) is the soft value function (a log-sum-exp instead of the hard max in standard RL). This is the foundation of Soft Actor-Critic (SAC) and maximum-entropy RL, where entropy regularization naturally yields a Boltzmann policy rather than a deterministic greedy one.
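The soft Bellman recursion above converges under the usual \(\gamma\)-contraction argument and can be iterated directly on a tiny MDP. A sketch with made-up rewards and transitions:

```python
import numpy as np

# Soft value iteration on a tiny MDP (assumed numbers):
# Q(s,a) = R(s,a) + gamma * E_{s'}[V(s')],  V(s) = log sum_a exp(Q(s,a)).
gamma = 0.9
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s']

V = np.zeros(2)
for _ in range(500):
    Q = R + gamma * P @ V                  # soft Bellman backup, shape (s, a)
    V_new = np.log(np.exp(Q).sum(axis=1))  # soft value: log-sum-exp over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Boltzmann/softmax policy over soft Q-values: pi(a|s) = exp(Q - V).
pi = np.exp(Q - V[:, None])
print(pi)        # rows sum to 1
print(pi[1])     # state 1 prefers its high-reward action a=1
```

Replacing the log-sum-exp with a hard max recovers standard value iteration and a deterministic greedy policy.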

This is analogous to the Boltzmann distribution in energy-based models:

$$p(x) \propto \exp(-E(x))$$ where \(R = -E\) (reward is negative energy).
What is an energy-based model (EBM)?

An energy-based model (EBM) defines a probability distribution over inputs \(x\) via a scalar energy function \(E_\theta(x)\): $$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \quad Z(\theta) = \int \exp(-E_\theta(x))\,dx$$ Low energy → high probability; high energy → low probability. The normalizing constant \(Z(\theta)\) (the partition function) is generally intractable; this is what makes EBMs difficult to train and sample from.

Training intuition: push down the energy of real/desired data points, push up the energy of undesired ones. In practice this requires sampling from the model (e.g., via MCMC) to estimate the gradient of \(\log Z(\theta)\).

Connection to IRL: the reward function \(R(s,a)\) plays the role of \(-E(s,a)\). The trajectory distribution \(p(\tau) \propto \exp(R(\tau))\) is exactly an EBM over trajectories. Learning the reward from demonstrations is therefore equivalent to fitting an EBM; this requires maximizing the (log-)likelihood of expert trajectories under the Boltzmann distribution, which then requires estimating the partition function \(Z\) (hard in large state spaces, motivating GAIL's discriminator-based shortcut).
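The push-down/push-up training intuition can be made concrete on a discrete domain small enough that \(Z(\theta)\) is exactly computable, so model expectations replace MCMC samples. A sketch with a made-up 8-state data distribution and a tabular energy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete EBM over K = 8 states with a tabular energy E_theta; Z is
# tractable here, so exact model expectations stand in for MCMC samples.
K = 8
data = rng.choice(K, size=2000,
                  p=[0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02])
data_freq = np.bincount(data, minlength=K) / len(data)

E = np.zeros(K)  # energies (the parameters theta)
for _ in range(2000):
    p_model = np.exp(-E) / np.exp(-E).sum()
    # Maximum-likelihood gradient: push DOWN the energy of observed data,
    # push UP the energy wherever the model currently puts mass.
    E -= 0.5 * (data_freq - p_model)

p_model = np.exp(-E) / np.exp(-E).sum()
print(np.round(p_model, 3))   # close to the empirical data frequencies
```

The second term of the gradient is exactly \(\nabla \log Z\); when the domain is too large to enumerate, it must be estimated with samples from the model, which is where MCMC (or GAIL's discriminator shortcut) enters.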

The GAN discriminator optimal solution is:

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{gen}(x)}$$

When the generator density is available, the discriminator effectively learns an energy function that assigns low energy to real data and high energy to generated samples; this is exactly what the reward function does in IRL.
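The optimal-discriminator formula can be verified pointwise: at any \(x\), \(D^*\) maximizes \(p_{data}\log D + p_{gen}\log(1-D)\). A one-line numerical check with assumed density values:

```python
import numpy as np

# Assumed density values at one point x.
p_data, p_gen = 0.7, 0.2

# Scan the discriminator output and compare the argmax of the pointwise
# GAN objective against the closed form p_data / (p_data + p_gen).
D = np.linspace(1e-4, 1 - 1e-4, 100001)
objective = p_data * np.log(D) + p_gen * np.log(1 - D)
d_best = D[np.argmax(objective)]
d_star = p_data / (p_data + p_gen)
print(d_best, d_star)   # both ~ 0.778
```

Equivalently, \(\log\frac{D^*}{1-D^*} = \log p_{data} - \log p_{gen}\): the discriminator's logit is a difference of log-densities, i.e., a (negative) energy difference.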

Implications




@article{guo2026rl,
  title   = {Reinforcement Learning: Foundation and Modern Algorithms},
  author  = {Guo, Wenbo and Zhang, Ying},
  journal = {henrygwb.github.io},
  year    = {2026},
  url     = {https://henrygwb.github.io/posts/rl.htm}
}