
Reinforcement Learning in LLM and Agentic Training: Methods, Tricks, and Applications in SWE

Mar 2026


This post discusses the application of reinforcement learning in LLM post-training, including the whole pipeline, representative methods (e.g., GRPO, DAPO, DPO), and key tricks for effective RL-based fine-tuning. At a high level, the training pipeline includes data and environment selection and construction, SFT warm-up, RL reward design, value estimation methods, rollout strategies, policy optimization, and other tricks. In the following, we discuss the key methods and tricks for each stage with an emphasis on software engineering and coding tasks.


Modeling post-training as an RL problem

LLM post-training can be naturally framed as a reinforcement learning problem. Given a prompt \(x\), the language model (agent) generates a response autoregressively by producing one token at a time. At each decoding step \(t\), the state \(s_t = (x, y_1, \dots, y_{t-1})\) consists of the original prompt concatenated with all previously generated tokens. The action \(a_t = y_t\) is the next token selected from the vocabulary \(\mathcal{V}\), and the policy \(\pi_\theta(y_t \mid s_t)\) is the language model's conditional distribution over tokens. After the full response \(y = (y_1, \dots, y_T)\) is generated, a reward \(r(x, y)\) is assigned — typically by a reward model trained on human preferences or by a rule-based verifier. The objective is to find parameters \(\theta\) that maximize the expected reward: $$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big].$$ In this formulation, the token-level MDP has a deterministic transition (appending the chosen token to the sequence), and the reward is typically sparse — received only at the end of generation. A KL-divergence penalty \(D_{\mathrm{KL}}(\pi_\theta \| \pi_{\mathrm{ref}})\) against a reference policy \(\pi_{\mathrm{ref}}\) (the supervised fine-tuned model) is often added to prevent the policy from degenerating or exploiting the reward model.

As discussed later in the reward section, rather than a token-level MDP, we can also model RL as a sequence-level MDP or turn-level MDP, where the action and state are defined as a partial response or the response of one turn.
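Concretely, the KL penalty is often folded into the reward itself: each token receives a small penalty for drifting from the reference policy, while the sparse task reward arrives only on the final token. A minimal sketch (the function name, the default beta, and the per-token approximation \(\beta(\log\pi_\theta - \log\pi_{\mathrm{ref}})\) are illustrative choices, not a fixed standard):

```python
def kl_shaped_rewards(token_logprobs, ref_logprobs, final_reward, beta=0.05):
    """Fold the KL penalty into per-token rewards for the token-level MDP.

    token_logprobs: log pi_theta(y_t | s_t) for each generated token
    ref_logprobs:   log pi_ref(y_t | s_t) from the reference (SFT) model
    final_reward:   r(x, y), the sparse terminal reward
    """
    assert len(token_logprobs) == len(ref_logprobs)
    # Per-token KL approximation: beta * (log pi_theta - log pi_ref)
    rewards = [-beta * (lp - ref) for lp, ref in zip(token_logprobs, ref_logprobs)]
    rewards[-1] += final_reward  # the task reward arrives only at the end
    return rewards
```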

Base model selection

Here are some considerations when selecting a base model for RL fine-tuning:


Data and environment creation

LLM reasoning

For pure LLM reasoning tasks, the environment is typically a static dataset of prompts and responses, and the agent interacts with this dataset by generating responses to the prompts. Here the key is to select proper datasets for RL training. The decision can be made based on the following dimensions:


SFT warm-up

SFT warm-up equips the model with foundational capabilities — instruction following, domain-specific knowledge for the downstream task, and proper output formatting. Without SFT, the model often fails to produce any meaningful output in the early stages of RL, resulting in near-zero reward signals that prevent the policy from improving. SFT ensures that RL training receives positive reward early on.


Reward design

Reward is the core of RL training. Below are the three most widely adopted types of reward design.

Note that in LLMs, intermediate rewards can be assigned either token-wise or turn-wise. The challenge is that different rollouts can have different dynamics (e.g., different lengths and numbers of turns), which makes it difficult to compute a consistent baseline across them.
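As one concrete instance, outcome rewards from a rule-based verifier (mentioned in the problem formulation above) assign a single scalar at the end of generation. A toy sketch, assuming the prompt instructs the model to wrap its final answer in `\boxed{...}` (the tag and the reward values are illustrative):

```python
import re

def verifier_reward(response: str, gold_answer: str) -> float:
    """Toy outcome reward from a rule-based verifier.

    Assumes the model was asked to wrap its final answer in \\boxed{...};
    both the tag and the reward values are illustrative choices.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return -1.0  # format penalty: no parseable answer at all
    return 1.0 if match.group(1).strip() == gold_answer else 0.0
```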


Value estimation

In LLM post-training, the popular methods for estimating the value function are as follows.

Parametric value estimation methods:

Non-parametric value estimation methods:


Policy learning

When used for LLM post-training, newer methods introduce variations on PPO to improve efficiency or stability and to prevent entropy collapse and reward hacking. Below, we introduce some representative methods. For more recent work, please refer to our paper list: Agentic AI: Model, Agent, Training, and Applications. At a high level, recent work mainly focuses on the following aspects: designing process/intermediate rewards, improving sample efficiency, improving stability, and training the model for self-criticism.

Group Relative Policy Optimization (GRPO)

For each prompt, multiple responses are sampled, and rewards are normalized within the group (a group is the set of rollouts for the same prompt):

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^G)}{\text{std}(\{r_j\}_{j=1}^G)}$$

The GRPO objective combines this with PPO-style clipping and a KL penalty:

$$J_{\text{GRPO}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min\left( \rho_{i,t} \, \hat{A}_i,\; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) \, \hat{A}_i \right) - \beta \, D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right]$$

where \(\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid x, o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid x, o_{i,<t})}\) is the importance sampling ratio, \(G\) is the number of rollouts per prompt, and \(|o_i|\) is the length of the \(i\)-th response.

Key benefits: no critic network needed, reduced memory, simpler training pipeline.
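The group normalization above can be sketched in a few lines (a minimal illustration; real implementations differ in details such as sample vs. population standard deviation and the epsilon used to avoid division by zero):

```python
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's G rollouts.

    Each reward is normalized by the mean and standard deviation of its
    group, so no learned critic is needed.
    """
    mu, sigma = mean(rewards), stdev(rewards)
    # eps guards against division by zero when all rewards are identical
    return [(r - mu) / (sigma + eps) for r in rewards]
```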

Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)

DAPO improves upon GRPO by addressing entropy collapse and training efficiency. Key innovations include:

The DAPO objective with clip-higher:

$$J_{\text{DAPO}}(\theta) = \frac{1}{|\mathcal{G}_{\text{filtered}}|} \sum_{i \in \mathcal{G}_{\text{filtered}}} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( \rho_{i,t} \, \hat{A}_i,\; \text{clip}(\rho_{i,t},\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}) \, \hat{A}_i \right)$$
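The two mechanisms in this objective, asymmetric (clip-higher) clipping and the filtered group \(\mathcal{G}_{\text{filtered}}\), can be sketched as follows (a minimal illustration; the 0.2/0.28 epsilon defaults mirror a common choice but are assumptions here):

```python
def clip_higher_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """One token's surrogate with asymmetric clipping.

    A larger upper bound (eps_high > eps_low) leaves more room to
    up-weight low-probability tokens, counteracting entropy collapse.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

def dynamic_sampling_filter(reward_groups):
    """Drop prompts whose rollouts all receive the same reward: such
    groups have zero advantage everywhere and contribute no gradient."""
    return [g for g in reward_groups if max(g) != min(g)]
```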

Removing Variance Normalization in GRPO (Dr. GRPO)

Dr. GRPO identifies that dividing by the group standard deviation in GRPO introduces two problems:

Dr. GRPO simply removes the std normalization, using only the mean-centered reward as the advantage:

$$\hat{A}_i^{\text{Dr.GRPO}} = r_i - \text{mean}(\{r_j\}_{j=1}^G)$$

It also removes the KL penalty (similar to DAPO). Critically, it replaces the per-token loss normalization (dividing by response length \(|o_i|\)) with a constant normalization factor (e.g., the max completion length \(C\)). In GRPO, dividing by \(|o_i|\) under-penalizes long incorrect responses, causing the model to generate progressively longer wrong outputs — a form of reward hacking. Using a constant instead ensures that long incorrect sequences receive properly scaled penalties, improving token efficiency. The full Dr. GRPO objective is:

$$J_{\text{Dr.GRPO}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{C} \sum_{t=1}^{|o_i|} \min\left( \rho_{i,t} \, \hat{A}_i^{\text{Dr.GRPO}},\; \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) \, \hat{A}_i^{\text{Dr.GRPO}} \right)$$

where \(\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid x, o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid x, o_{i,<t})}\) is the importance sampling ratio and \(C\) is a constant (e.g., max completion length) replacing the per-response length \(|o_i|\).
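Both modifications are small enough to sketch directly (an illustration only; the function names are ours):

```python
def dr_grpo_advantages(rewards):
    """Mean-centered advantages: GRPO's division by the std is removed."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

def dr_grpo_length_norm(token_losses, max_completion_len):
    """Normalize by a constant C instead of the response length |o_i|,
    so a long incorrect response accumulates a proportionally larger
    penalty rather than being averaged away."""
    return sum(token_losses) / max_completion_len
```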

On/Off-policy


Learning tricks

This part summarizes some useful tricks along the learning pipeline.


Agentic RL and SWE tasks

In the following, we discuss the unique aspects and challenges of agentic RL, compared to LLM reasoning.

Environments and data

The research community currently lacks robust environments suitable for large-scale agentic training. The key limitations are as follows:

Trajectory collection. High-quality trajectories are essential for training effective agents. There are two main approaches:

While easier to scale, simulated trajectories are less effective for policy training (due to hallucination and lack of real tool calls). As a result, using LLMs to generate new environments and then running agents in those environments is a better approach than using LLMs to directly simulate trajectories. However, even synthetic environments can suffer from a lack of diversity. The model tends to reinforce high-density regions of its own distribution while failing to generate long-tail or low-density examples, which are often the most valuable for training.

Connection between reasoning and agentic training. The typical practice is a two-stage approach: train reasoning first, then fine-tune on agentic tasks. This is motivated by the observation that reasoning ability (e.g., chain-of-thought, planning, self-correction) provides a foundation that agentic skills build upon. During the agentic training stage, mixing in reasoning traces alongside agentic trajectories helps prevent catastrophic forgetting of reasoning capabilities. The ratio of reasoning to agentic data during the second stage is a hyperparameter that depends on the target task distribution.

Training infrastructure. The key challenge for agentic training infrastructure is environment management and its coordination with other components in the training pipeline. The training pipeline consists of four interconnected layers: environment creation and management, the agent framework, the training framework, and the inference framework. Their relationships are illustrated in the figure below.

[Figure: the layered agentic training pipeline. An Environment Layer (creation, reset, state management; Docker containers) exchanges observations and actions with the Agent Framework (tool use, multi-turn action/observation loop). Rollout Collection gathers trajectories and rewards, which feed the Training Framework (veRL, SLIME, A-real, etc.) and the Inference Engine (vLLM / SGLang) through the RL update (gradient and reward computation). Distributed orchestration (scheduling, resource management, multi-node coordination) is handled by Ray.]
Training task and rollout

Agentic tasks are typically long-horizon and multi-turn, which introduces the following challenges:

Common mitigation strategies include the following. For reducing variance, using a baseline (e.g., group mean or leave-one-out) is the standard approach; bounding the advantage values and increasing the number of rollouts per prompt also help. For large gradient norms, small learning rates and gradient clipping are widely used techniques. Sparse reward and rollout efficiency remain two areas where active research is most needed.
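The group-mean and leave-one-out baselines differ only in whether a rollout's own reward enters its baseline. A minimal sketch of the leave-one-out variant (RLOO-style; the function name is ours):

```python
def leave_one_out_advantages(rewards):
    """Each rollout is baselined by the mean reward of the *other*
    rollouts in its group, which keeps the policy-gradient estimator
    unbiased while reducing variance."""
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```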




@article{guo2026rl_llm,
  title   = {Reinforcement Learning in LLM and Agentic Training: Methods, Tricks, and Applications in SWE},
  author  = {Guo, Wenbo and Zhang, Ying},
  journal = {henrygwb.github.io},
  year    = {2026},
  url     = {https://henrygwb.github.io/posts/rl_llm.htm}
}