Reinforcement Learning in LLM and Agentic Training: Methods, Tricks, and Applications in SWE
Mar 2026
This post discusses the application of reinforcement learning in LLM post-training, including the whole pipeline, representative methods (e.g., GRPO, DAPO, DPO), and key tricks for effective RL-based fine-tuning.
At a high level, the training pipeline includes data and environment selection and construction, SFT warm-up, RL reward design, value estimation methods, rollout strategies, policy optimization, and other tricks.
In the following, we discuss the key methods and tricks for each stage with an emphasis on software engineering and coding tasks.
LLM post-training can be naturally framed as a reinforcement learning problem.
Given a prompt \(x\), the language model (agent) generates a response autoregressively by producing one token at a time.
At each decoding step \(t\), the state \(s_t = (x, y_1, \dots, y_{t-1})\) consists of the original prompt concatenated with all previously generated tokens.
The action \(a_t = y_t\) is the next token selected from the vocabulary \(\mathcal{V}\), and the policy \(\pi_\theta(y_t \mid s_t)\) is the language model's conditional distribution over tokens.
After the full response \(y = (y_1, \dots, y_T)\) is generated, a reward \(r(x, y)\) is assigned — typically by a reward model trained on human preferences or by a rule-based verifier.
The objective is to find parameters \(\theta\) that maximize the expected reward:
$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big].$$
In this formulation, the token-level MDP has a deterministic transition (appending the chosen token to the sequence), and the reward is typically sparse — received only at the end of generation.
A KL-divergence penalty \(D_{\mathrm{KL}}(\pi_\theta \| \pi_{\mathrm{ref}})\) against a reference policy \(\pi_{\mathrm{ref}}\) (the supervised fine-tuned model) is often added to prevent the policy from degenerating or exploiting the reward model.
As discussed later in the reward section, rather than a token-level MDP, we can also model RL as a sequence-level MDP or turn-level MDP, where the action and state are defined as a partial response or the response of one turn.
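To make the formulation concrete, here is a toy REINFORCE loop on this objective. The "policy" is a context-free categorical distribution over a five-token vocabulary and the verifier reward is made up (success iff token 0 appears somewhere); nothing here mirrors a real LLM, it only illustrates the update direction \(\nabla_\theta \log \pi_\theta(y \mid x) \cdot A\) under the expected-reward objective above.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, T = 5, 4              # toy vocabulary size and fixed response length
theta = np.zeros(VOCAB)      # "policy" parameters: one shared logit vector

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_response():
    """Sample T tokens; this toy policy ignores the state entirely."""
    probs = softmax(theta)
    return [int(rng.choice(VOCAB, p=probs)) for _ in range(T)]

def reward(y):
    """Hypothetical verifier standing in for r(x, y): success iff token 0 appears."""
    return float(0 in y)

def reinforce_step(lr=0.5, n_rollouts=64):
    """One REINFORCE update with a batch-mean baseline:
    theta += lr * mean_i[(r_i - baseline) * grad log pi(y_i)]."""
    global theta
    rollouts = [sample_response() for _ in range(n_rollouts)]
    rewards = np.array([reward(y) for y in rollouts])
    baseline = rewards.mean()
    probs = softmax(theta)
    grad = np.zeros(VOCAB)
    for y, r in zip(rollouts, rewards):
        for tok in y:
            g = -probs.copy()          # grad of log-softmax: one_hot(tok) - probs
            g[tok] += 1.0
            grad += (r - baseline) * g
    theta += lr * grad / n_rollouts
    return rewards.mean()

for _ in range(50):
    reinforce_step()
```

After a few dozen updates the policy concentrates mass on the rewarded token, which is exactly the degenerate exploitation behavior the KL penalty against \(\pi_{\mathrm{ref}}\) is meant to limit in real training.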
Base model selection
Here are some considerations for selecting a base model for doing RL fine-tuning:
General models work well: base models without any domain-specific fine-tuning provide a good foundation; as RL starting points, base models sometimes even outperform instruct models.
In general, base models serve as better starting points because: 1) they avoid the format out-of-distribution issues introduced by instruction tuning; 2) they are not overly biased toward domain-specific knowledge.
Domain-specific models are not necessarily better: coding-specific models do not provide clear advantages for SWE and security tasks.
Previously RL-fine-tuned models can be problematic: they may have reduced exploration capacity, making further RL less effective.
For pure LLM reasoning tasks, the environment is typically a static dataset of prompts and responses, and the agent interacts with this dataset by generating responses to the prompts.
Here the key is to select proper datasets for RL training. The decision can be made based on the following dimensions:
Domain composition:
Math only: simple, but may limit generalization.
Math → Coding: sequential curriculum; can enable smooth transfer to coding tasks.
Math + Code mixed: joint training; sometimes worse than the math-to-coding sequential curriculum.
Difficulty curriculum:
Easy to hard: builds foundational reasoning patterns.
Hard to easy: accelerates learning if base capability is sufficient.
Hard only with PRM: recent research shows that process reward models can help when training only on hard tasks.
SFT warm-up equips the model with foundational capabilities — instruction following, domain-specific knowledge for the downstream task,
and proper output formatting. Without SFT, the model often fails to produce any meaningful output in the early stages of RL,
resulting in near-zero reward signals that prevent the policy from improving. SFT ensures that RL training receives positive reward early on.
Data and training duration: It is important to train SFT to a proper stage without overfitting to the given data.
Another open question is how to partition data between SFT and RL—whether to use the same data for both stages, reserve more challenging tasks for RL,
or curate separate datasets with different difficulty levels.
Distribution shift: SFT may cause a distribution shift from the base model. This is why starting with a base model without any fine-tuning can provide more flexibility.
The distribution of the SFT-trained model and the RL data may differ significantly.
Mitigation strategies include KL divergence constraints and reweighting the SFT loss with downstream RL in mind — e.g., PEAR.
Reward is the core of RL training. Below are the three most widely adopted types of reward design.
Final verifiable reward: this is the most widely used type, which gives a final reward at the end of an episode. It is easy to obtain,
typically indicating whether the RL agent has finished the given task or not. However, training only with a final reward can be challenging,
especially for long-horizon tasks. At the early training stage, the reward signal may be all zeros and thus provide no meaningful feedback.
Moreover, training tends to be unstable for long-horizon tasks with sparse rewards. Reward hacking and credit assignment (determining
which states contribute most to the final outcome) are two further challenges when training with only a final reward.
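As a concrete instance of a final verifiable reward for coding tasks, the sketch below runs a candidate solution against unit tests in a subprocess and returns a binary reward. The `final_reward` name and timeout are illustrative choices, and a production pipeline would execute this inside an isolated sandbox or container rather than on the host.

```python
import os
import subprocess
import sys
import tempfile

def final_reward(candidate_code: str, test_code: str, timeout: float = 5.0) -> float:
    """Binary final reward: 1.0 iff the candidate passes every assert.
    Toy sketch -- real pipelines run this inside an isolated container."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0   # hung candidates count as failures
    finally:
        os.unlink(path)

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
```

Note that such a verifier is exactly the kind of signal that is all zeros early in training if the model cannot yet produce compilable code, which is one motivation for the SFT warm-up discussed above.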
Non-parametric intermediate reward: having intermediate rewards is useful for addressing the challenges faced by final-only rewards, including providing meaningful reward signals
at early training stages, mitigating reward hacking, and improving credit assignment. However, the design of intermediate rewards is important, and poorly designed rewards may constrain the search space
of the learning process. Some popular non-parametric intermediate rewards are as follows:
Verifiable sub-goals: designing intermediate checkpoints for a given task that can be automatically verified (e.g., reaching a stable standing posture in robotic control tasks).
Self-distillation: using the policy's own predictions as a reward signal, e.g., SDPO uses \(\log \frac{\pi(y \mid x, f, y_{\text{prev}})}{\pi(y \mid x, y_{\text{prev}})}\) as the intermediate reward, where \(f\) is additional context.
Entropy/confidence-based reward: using policy entropy or confidence as a reward signal; mainly effective during the early training stage, depends on the base model's capabilities, and may cause entropy collapse.
Parametric intermediate reward: training a reward model to give intermediate rewards, or using an LLM-as-judge.
For training a reward model, a few methods exist:
MCTS: roll out from the current state and assign an intermediate reward based on the rollout success rate; can be noisy and time-consuming.
LLM-as-judge: write a set of rubrics and use another LLM to give rewards; can be expensive.
Inverse RL: learn a reward function from expert trajectories.
Note that in LLMs, intermediate rewards can be either token-wise or turn-wise. However, the challenge is that different rollouts may have different
dynamics, making baseline computation difficult.
In LLM post-training, the popular methods for estimating the value function are as follows.
Parametric value estimation methods:
REINFORCE (Monte Carlo policy gradient)
Return: for each time step, compute the full discounted return \(G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k\).
Advantage: form \(A_t = G_t - b_t\), where \(b_t\) is a baseline; pure REINFORCE uses \(b_t = 0\).
With value baseline: choosing \(b_t = V(s_t)\) gives the Monte Carlo advantage \(A_t = G_t - V(s_t)\), which is equivalent to GAE with \(\lambda = 1\).
Policy update: for each \(t\), move policy parameters in the direction \(\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,A_t\) to increase the probability of actions with high advantage.
Generalized Advantage Estimation (GAE)
Smoothed advantages: work backward through a trajectory to build
\(A_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta_{t+l}\), where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD residual; this can be implemented with the recursion \(A_t = \delta_t + \gamma\lambda A_{t+1}\).
Bias–variance trade-off: \(\lambda = 0\) gives TD(0)-style, low-variance but more biased advantages; \(\lambda = 1\) recovers Monte Carlo advantages (same as REINFORCE with value baseline), which are low-bias but higher-variance.
Use in actor–critic: plug \(A_t^{GAE}\) into the policy gradient update instead of using raw returns, giving a smoother learning signal.
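The backward recursion above is short to implement; a minimal sketch, assuming a `values` array of length \(T+1\) that includes a bootstrap value for the terminal state:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion
    A_t = delta_t + gamma*lam*A_{t+1}, with TD residual
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    `values` must have length T+1 (bootstrap value for the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    next_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_adv = delta + gamma * lam * next_adv
        adv[t] = next_adv
    return adv

# Sparse final reward, as in LLM post-training: r = (0, 0, 1).
rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.2, 0.4, 0.7, 0.0])   # V(s_T) = 0 at the terminal state
mc_adv = compute_gae(rewards, values, gamma=1.0, lam=1.0)   # Monte Carlo: G_t - V(s_t)
td_adv = compute_gae(rewards, values, gamma=1.0, lam=0.0)   # TD(0): delta_t
```

With \(\lambda = 1\) this returns exactly \(G_t - V(s_t)\) (here \([0.8, 0.6, 0.3]\)); with \(\lambda = 0\) it returns the one-step residuals \([0.2, 0.3, 0.3]\), illustrating the bias-variance knob described above.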
Non-parametric value estimation methods:
RLOO: REINFORCE with a leave-one-out mean as baseline and (optionally) a PPO clip. It is intended for settings where you have only a small number of rollouts per prompt and limited accelerator memory, so you cannot keep large batches or a big critic around. RLOO stays simple and lightweight by using small groups and basic statistics (leave-one-out mean, optional clipping) instead of large value networks or replay buffers.
Given \(K\) rollouts \(\{o_1, \dots, o_K\}\) for a prompt \(x\) with rewards \(\{r_1, \dots, r_K\}\), the advantage for the \(i\)-th rollout is:
\[
A_i^{\text{RLOO}} = r_i - \frac{1}{K-1} \sum_{j \neq i} r_j
\]
GRPO: REINFORCE with a group mean as baseline, where the centered advantage is divided by the group standard deviation to normalize the advantage estimate.
Given the same \(K\) rollouts, GRPO normalizes advantages by the group standard deviation:
\[
A_i^{\text{GRPO}} = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^K)}{\text{std}(\{r_j\}_{j=1}^K)}
\]
Group mean vs. batch mean baseline: the baseline can be computed per-prompt (group mean, as in GRPO/RLOO) or across all prompts in a batch (batch mean).
Group mean removes prompt-level difficulty variation — each prompt's rollouts compete only against each other, so easy and hard prompts contribute equally to the gradient.
Batch mean, by contrast, lets easy prompts (with higher average reward) dominate the gradient, which can bias learning toward prompts the model already handles well.
In practice, batch mean can be preferable when the number of rollouts per prompt \(K\) is very small (e.g., \(K=2\)), since the group statistics become noisy.
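Both group-baseline advantage estimators are a few lines of NumPy; a minimal sketch (the `eps` guard against zero-variance groups is an implementation detail, not part of the GRPO definition):

```python
import numpy as np

def rloo_advantages(rewards):
    """Leave-one-out baseline: A_i = r_i - mean(r_j for j != i)."""
    r = np.asarray(rewards, dtype=float)
    K = len(r)
    return r - (r.sum() - r) / (K - 1)

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: (r_i - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

rewards = [1.0, 0.0, 0.0, 1.0]   # K = 4 rollouts for one prompt
```

A useful identity: the leave-one-out baseline is just a rescaled group-mean baseline, \(A_i^{\text{RLOO}} = \frac{K}{K-1}(r_i - \bar{r})\), so the two estimators differ mainly in whether the centered reward is divided by the group standard deviation.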
When applied to LLM post-training, recent methods introduce variations of PPO to improve efficiency and stability, as well as to prevent entropy collapse and reward hacking.
Below, we introduce some representative methods. For more recent works, please refer to our paper list:
Agentic AI: Model, Agent, Training, and Applications.
At a high level, recent works mainly focus on the following aspects: designing process/intermediate rewards, improving sample efficiency, improving stability, and training the model for
self-criticism.
Group Relative Policy Optimization (GRPO)
GRPO drops the learned critic and instead plugs the group-normalized advantage \(A_i^{\text{GRPO}}\) defined above into a PPO-style clipped objective:
\[
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( \rho_{i,t}\, A_i,\; \text{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i \right) \right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\]
where \(\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid x, o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid x, o_{i,<t})}\) is the importance sampling ratio, \(G\) is the number of rollouts per prompt, and \(|o_i|\) is the length of the \(i\)-th response.
Key benefits: no critic network needed, reduced memory, simpler training pipeline.
Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
DAPO improves upon GRPO by addressing entropy collapse and training efficiency. Key innovations include:
Clip-Higher: uses asymmetric clipping bounds \([1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}]\) with \(\epsilon_{\text{high}} > \epsilon_{\text{low}}\).
This allows the policy to more aggressively increase the probabilities of good actions while being conservative about decreasing probabilities, preventing entropy collapse.
Dynamic Sampling: filters out prompts where all samples have the same reward (all correct or all incorrect), as these provide no learning signal for relative ranking.
Token-level loss: computes the loss per token rather than per sequence for better credit assignment.
No KL penalty: removes the KL divergence term from the reference policy, relying solely on clipping for stability.
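Two of DAPO's components, dynamic sampling and Clip-Higher, can be sketched as follows. The \(\epsilon\) values mirror commonly reported settings, and the group-dictionary layout is a made-up convention for illustration:

```python
import numpy as np

def dynamic_sampling_filter(groups):
    """DAPO-style dynamic sampling: drop prompts whose rollouts all received
    the same reward (all-correct or all-incorrect groups give zero advantage
    under a group baseline, hence no learning signal)."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Asymmetric PPO clipping: upward probability moves may go to
    1 + eps_high while downward moves are cut at 1 - eps_low."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

groups = [
    {"prompt": "p1", "rewards": [1.0, 1.0, 1.0]},   # no signal: filtered out
    {"prompt": "p2", "rewards": [1.0, 0.0, 1.0]},   # mixed outcomes: kept
]
kept = dynamic_sampling_filter(groups)
```

With these defaults, a good action (positive advantage) can have its ratio pushed up to 1.28 before clipping, while a bad action's ratio is clipped at 0.8, which is the asymmetry that keeps low-probability tokens explorable.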
Removing Variance Normalization in GRPO (Dr. GRPO)
Dr. GRPO identifies that dividing by the group standard deviation in GRPO introduces two problems:
Upweights low-signal groups: when all rollouts for a prompt have similar rewards (low std), the division inflates their advantage, even though these groups carry little useful learning signal.
Downweights high-signal groups: groups with diverse outcomes (high std) — which are the most informative — get their advantages shrunk.
Dr. GRPO simply removes the std normalization, using only the mean-centered reward as the advantage:
\[
A_i^{\text{Dr.GRPO}} = r_i - \text{mean}(\{r_j\}_{j=1}^{K})
\]
It also removes the KL penalty (similar to DAPO). Critically, it replaces the per-token loss normalization (dividing by response length \(|o_i|\)) with a
constant normalization factor (e.g., the max completion length \(C\)). In GRPO, dividing by \(|o_i|\) under-penalizes long incorrect responses, causing the model
to generate progressively longer wrong outputs — a form of reward hacking. Using a constant instead ensures that long incorrect sequences receive properly
scaled penalties, improving token efficiency. The full Dr. GRPO objective is:
\[
\mathcal{J}_{\text{Dr.GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{C} \sum_{t=1}^{|o_i|} \min\!\left( \rho_{i,t}\, A_i,\; \text{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i \right) \right]
\]
where \(\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid x, o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid x, o_{i,<t})}\) is the importance sampling ratio and \(C\) is a constant (e.g., max completion length) replacing the per-response length \(|o_i|\).
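The effect of replacing the \(|o_i|\) normalizer with a constant \(C\) shows up in a toy aggregation. Here we assume, as a simplification, that every token of an incorrect response carries the same unit penalty:

```python
import numpy as np

def response_objective(token_terms, norm="length", C=1024):
    """Aggregate per-token objective terms for one response.
    norm='length' divides by |o_i| (GRPO); norm='const' divides by a
    fixed C, e.g. the max completion length (Dr. GRPO)."""
    denom = len(token_terms) if norm == "length" else C
    return np.sum(token_terms) / denom

# Every token of a wrong answer carries a unit penalty (simplification).
short_bad = np.full(100, -1.0)    # 100-token wrong answer
long_bad = np.full(1000, -1.0)    # 1000-token wrong answer
```

Under length normalization both responses contribute the same penalty (\(-1.0\)), so making a wrong answer 10x longer costs nothing; under constant normalization the 10x-longer response is penalized 10x more, removing the incentive to pad incorrect outputs.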
On/Off-policy
On-policy: learns only from data generated by the current policy; this is stable but discards old trajectories when the policy changes.
Off-policy: reuses past experience (e.g., from a replay buffer or a different behavior policy), which is sample-efficient but introduces a mismatch between the data distribution and the current policy.
Stability tricks: methods like importance sampling weights, target networks, and careful objective design (e.g., DQN-style targets) help correct this mismatch and keep off-policy learning stable with function approximation.
This part summarizes some useful tricks along the learning pipeline.
Environment and data selection: Quality and diversity are important, but the choice of curriculum learning or domain composition is task-dependent.
SFT warm-up: avoid overfitting to SFT data, which may differ significantly from RL data; use importance sampling to constrain SFT
so the model does not drift too far from the target policy.
Value estimation:
Parametric vs. non-parametric: if we can estimate a good value function, parametric methods are better suited to long-horizon tasks, as
they can assign credit at every token. However, estimating a good value function is costly, similar to MCTS.
One way to estimate the token-level reward is to distribute the final reward across tokens, e.g., proportionally to each token's log-probability ratio: \(r_t = \beta \log \frac{\pi_\phi(y_t \mid y_{<t})}{\pi_{\text{ref}}(y_t \mid y_{<t})}\).
KL approximation: a few widely used options are: (1) no KL (GLM-5);
(2) the k2 approximation (used by Kimi-2.5) or the k3 approximation (used by DeepSeek-V3.2).
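The k1/k2/k3 estimators (following the naming in Schulman's "Approximating KL Divergence" note) can be computed per token from the two log-probabilities; a sketch:

```python
import numpy as np

def kl_estimators(logp_policy, logp_ref):
    """Per-token estimators of KL(pi_theta || pi_ref) for tokens sampled
    from pi_theta (the usual on-policy setting)."""
    logr = logp_ref - logp_policy        # log(pi_ref / pi_theta)
    k1 = -logr                           # unbiased, high variance, can go negative
    k2 = 0.5 * logr ** 2                 # biased, low variance, always >= 0
    k3 = np.expm1(logr) - logr           # unbiased AND always >= 0
    return k1, k2, k3
```

k3 is a common default for the on-policy KL penalty: each per-token estimate is nonnegative (since \(r - 1 \ge \log r\)) while the estimator remains unbiased, whereas k1 is unbiased but noisy and k2 trades a little bias for low variance.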
Token weighting: a growing body of work studies how to assign different weights to tokens during RL training.
The key principles are: incentivize tokens with high advantage (encourage good actions), upweight tokens with high entropy (combat entropy collapse),
and downweight tokens that are out-of-distribution relative to the current policy. Common approaches include:
Entropy-based weighting: upweight tokens where the policy is uncertain (high entropy) to prevent entropy collapse.
Advantage-entropy combination: encourage tokens with high advantage but low confidence (high entropy), focusing learning on informative yet undecided positions.
Importance sampling ratio: use the ratio between the current and old policies (or the current and sampling policies) as per-token weights, naturally downweighting OOD tokens.
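A toy combination of these signals might look like the following; the entropy coefficient and ratio thresholds are arbitrary illustrative values, not taken from any specific paper:

```python
import numpy as np

def token_weights(logits, ratios, entropy_coef=0.5, ratio_band=(0.2, 5.0)):
    """Toy per-token training weights: upweight high-entropy positions and
    zero out tokens whose importance ratio marks them as far off-policy.
    All coefficients and thresholds here are illustrative only."""
    # Per-position entropy of the policy distribution.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    # Drop tokens with extreme current/old-policy ratios (OOD for the policy).
    in_band = (ratios > ratio_band[0]) & (ratios < ratio_band[1])
    return in_band * (1.0 + entropy_coef * entropy)

logits = np.array([[10.0, 0.0, 0.0],    # confident token (low entropy)
                   [0.0, 0.0, 0.0],     # uncertain token (high entropy)
                   [0.0, 0.0, 0.0]])    # uncertain but far off-policy
ratios = np.array([1.0, 1.0, 10.0])
w = token_weights(logits, ratios)
```

The uncertain in-band token gets the largest weight, the confident token a smaller one, and the off-policy token is masked out entirely, matching the three principles above.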
Rollout strategies: the key goals are to avoid out-of-distribution data and improve sampling efficiency, especially for long-horizon tasks.
Use a large number of rollouts but filter out trajectories with low entropy or uniform reward (no learning signal).
For off-policy training, filter out trajectories that deviate significantly from the current policy to reduce distribution mismatch.
Leverage previous trajectories to bootstrap later rollouts, rather than performing independent rollouts across training iterations.
For example, partial answer bootstrapping provides partial solutions from earlier rollouts as starting points, and more advanced strategies reuse full previous trajectories.
Learning rate:
Standard scheduling: cosine or linear decay over the course of training; warmup at the beginning to stabilize early updates.
Use a smaller learning rate for long-horizon tasks to mitigate large gradient norms caused by long sequences.
Adaptive LR based on output length: scale down the learning rate when response length is large, since longer outputs accumulate more gradient.
Stability:
Per-token baseline rather than shared baseline to reduce variance (reference).
Mask out extreme gradients and use consistent experts for training and inference in MoE models (reference).
Large batch size: averaging gradients over more rollouts.
In the following, we discuss the unique aspects and challenges of agentic RL, compared to LLM reasoning.
Environments and data
The research community currently lacks robust environments suitable for large-scale agentic training. The key limitations are as follows:
Limited domain coverage: Most well-developed environments target web agents;
environments for software engineering, security, and other agentic domains remain scarce.
Lack of training support: Many environments do not support reliable state reset between episodes and in-episode state management,
making them unsuitable for RL training.
Infrastructure overhead: RL environments for agentic tasks typically run inside Docker containers.
Current Docker setups are slow and resource-intensive at scale, creating a bottleneck for large-scale training.
Complex container orchestration: For SWE and security tasks, each rollout may require a standalone Docker container per target repository to avoid race conditions.
Domain-specific tools (e.g., compilers, debuggers, security scanners) often need separate containers as well,
making scheduling and inter-container communication complex.
Trajectory collection.
High-quality trajectories are essential for training effective agents. There are two main approaches:
Real trajectories: Collecting traces by actually running the agent in the environment.
Simulated trajectories: Using existing LLMs to construct synthetic agentic trajectories.
While easier to scale, simulated trajectories are less effective for policy training (due to hallucination and lack of real tool calls).
As a result, using LLMs to generate new environments and then running agents in those environments is a better approach than using LLMs to directly simulate trajectories.
However, even synthetic environments can suffer from a lack of diversity. The model tends to reinforce high-density regions of its own distribution while failing to generate long-tail or low-density examples, which are often the most valuable for training.
Connection between reasoning and agentic training.
The typical practice is a two-stage approach: train reasoning first, then fine-tune on agentic tasks.
This is motivated by the observation that reasoning ability (e.g., chain-of-thought, planning, self-correction) provides a foundation that agentic skills build upon.
During the agentic training stage, mixing in reasoning traces alongside agentic trajectories helps prevent catastrophic forgetting of reasoning capabilities.
The ratio of reasoning to agentic data during the second stage is a hyperparameter that depends on the target task distribution.
Training infrastructure.
The key challenge for agentic training infrastructure is environment management and its coordination with other components in the training pipeline.
The training pipeline consists of four interconnected layers: environment creation and management, the agent framework, the training framework, and the inference framework.
Their relationships are illustrated in the figure below.
Training task and rollout
Agentic tasks are typically long-horizon and multi-turn, which introduces the following challenges:
High variance from long sequences: The policy gradient estimator used in LLM post-training is:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ A_i \cdot \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t | a_{<t}) \right]
\]
where \(A_i\) is the sequence-level advantage shared across all tokens in the \(i\)-th rollout.
The variance of the inner sum grows with the horizon \(T\). Letting \(X_t = \nabla_\theta \log \pi_\theta(a_t | a_{<t})\), we have:
\[
\text{Var}\!\left(\sum_{t=1}^{T} X_t\right) = \sum_{t=1}^{T} \text{Var}(X_t) + 2\sum_{i < j} \text{Cov}(X_i, X_j)
\]
If the terms were independent, the variance would grow linearly, as \(O(T)\). In practice, autoregressive dependencies make the terms positively correlated,
so the variance can grow faster than \(T\), up to \(O(T^2)\) in the worst case.
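This variance behavior is easy to verify empirically. The sketch below builds \(T\) standardized terms with a tunable pairwise correlation \(\rho\), for which the variance of the sum is exactly \(T(1-\rho) + T^2\rho\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sum_variance(T, rho, n_trials=20000):
    """Empirical variance of sum_{t=1}^T X_t where each X_t has unit variance
    and pairwise correlation rho, built from a shared and an independent part:
    X_t = sqrt(rho)*z_shared + sqrt(1-rho)*z_t."""
    shared = rng.standard_normal((n_trials, 1))
    indep = rng.standard_normal((n_trials, T))
    X = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * indep
    return X.sum(axis=1).var()

v_indep = sum_variance(50, rho=0.0)   # ~ T = 50
v_corr = sum_variance(50, rho=0.5)    # ~ T(1-rho) + T^2*rho = 1275
```

For \(T = 50\), independent terms give a variance of about 50 while \(\rho = 0.5\) gives about 1275, an over-25x blowup from correlation alone, which is why long multi-turn rollouts need stronger variance reduction than single-turn reasoning tasks.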
Sparse reward: long-horizon tasks typically provide only a final success/fail signal, resulting in low training efficiency — especially in early stages.
This also makes credit assignment difficult.
Low rollout efficiency: since trajectories are long, the standard rollout strategy — which treats each trajectory independently
and waits for all rollouts to finish before updating — becomes highly inefficient.
Context management: long-horizon tasks involve long contexts and may require context compression during execution, creating extra challenges.
Common mitigation strategies include the following.
For reducing variance, using a baseline (e.g., group mean or leave-one-out) is the standard approach; bounding the advantage values and increasing the number of rollouts per prompt also help.
For large gradient norms, small learning rates and gradient clipping are widely used techniques.
Sparse reward and rollout efficiency remain two areas where active research is most needed.
@article{guo2026rl_llm,
title = {Reinforcement Learning in LLM and Agentic Training: Methods, Tricks, and Applications in SWE},
author = {Guo, Wenbo and Zhang, Ying},
journal = {henrygwb.github.io},
year = {2026},
url = {https://henrygwb.github.io/posts/rl_llm.htm}
}