Comprehensive Pipeline and Key Takeaways on SFT Fine-tuning for LLM Reasoning
Oct. 2025
This post summarizes the key takeaways from existing papers and our experience on SFT fine-tuning for LLM reasoning.
Although RL has dominated reasoning post-training, SFT is still useful for many scenarios, e.g.,
(1) when training models for specific applications that do not require generalizability across applications;
(2) as a warm-start for RL training, especially for transferring models to new domains or tasks (given that RL
receives few positive signals at the beginning of training);
(3) when RL is too expensive to run or cannot find good reward functions.
In the following, we first go over the overall workflow of SFT fine-tuning for LLM reasoning, and then discuss the
recipes for each step.
We mainly focus on the experience we have drawn from SWE and security applications; some insights actually conflict
with the conventional wisdom for general math and coding reasoning, which is interesting!
Please refer to this repo for
notable papers in this space.
The typical SFT training process consists of the following steps:
1. Data selection and processing: select data/questions \(\mathcal{X}\) for generating supervised data.
2. Teacher model selection: select the proper teacher model(s) \(\mathcal{M}\) to generate reasoning data.
3. Reasoning data generation: feed the selected data into the teacher model(s) and generate corresponding
responses \(\mathcal{D} = (\mathcal{X}, \mathcal{Y})\), where \(\mathcal{Y} = \mathcal{M}(\mathcal{X})\).
4. Reasoning data filtering: filter the generated data based on certain criteria to ensure training
data quality.
5. Base model selection: select proper model(s) as the base model(s) for the student model(s).
6. SFT training: fine-tune the base model on the filtered reasoning data \(\mathcal{D}\) via next-token
prediction using the cross-entropy loss.
7. Testing/inference: generate responses for new inputs using the fine-tuned model.
Steps 1-5 are the key steps that admit different strategies. Step 6 follows the standard supervised fine-tuning
process, and Step 7 is the standard inference process with many tricks introduced
here.
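The seven steps above can be sketched as a toy pipeline. This is only an illustration of the control flow; every helper here (the quality check, the filter, the teacher, and the trainer) is a hypothetical stand-in, not a real API.

```python
# Toy sketch of the seven-step SFT pipeline; all helpers are stand-ins.

def is_high_quality(question):
    # Placeholder for step 1's quality check: keep non-trivial questions.
    return len(question) > 10

def passes_filters(question, response):
    # Placeholder for step 4: cap trajectory length, require an answer tag.
    return len(response) < 4000 and "ANSWER:" in response

def sft_pipeline(raw_questions, teachers, finetune):
    questions = [q for q in raw_questions if is_high_quality(q)]    # step 1
    dataset = [(q, t(q)) for q in questions for t in teachers]      # steps 2-3
    dataset = [(q, y) for q, y in dataset if passes_filters(q, y)]  # step 4
    return finetune(dataset)                                        # steps 5-6

# Usage with a dummy teacher and an identity "trainer":
teacher = lambda q: f"reasoning... ANSWER: 42 ({q})"
model = sft_pipeline(["What is 6 * 7 in this toy task?"], [teacher],
                     finetune=lambda dataset: dataset)
```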
Data selection and processing
- Data quality is more important than quantity: a small amount of high-quality data is more useful
than a large amount of randomly selected data. This has been consistently observed in many applications, including general
coding, math, and SE and security tasks.
Here, high-quality data typically refers to well-structured and difficult problems that are closely related to
the target application.
These questions also need to have clear and unambiguous answers.
- There is no consensus on whether curriculum learning is helpful for SFT training, but some works show that it
is helpful for RL-based fine-tuning.
- There are mixed takes on whether data diversity is helpful; some works show that repeatedly using the same
high-quality data is more helpful than using diverse data.
However, in our SE and security tasks, we find that data diversity can be beneficial because it covers more
scenarios, as long as all the included data are of high quality.
- Avoiding data contamination is critical for SFT training. We make sure that the training and testing data come
from different projects, domains, vulnerabilities, and even programming languages.
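The contamination rule above amounts to splitting at the project (or domain, or language) level rather than at the sample level. A minimal sketch, with illustrative field names:

```python
# Contamination-aware split: hold out entire projects rather than splitting
# randomly at the sample level. The "project" field name is illustrative.

def split_by_project(samples, test_projects):
    train = [s for s in samples if s["project"] not in test_projects]
    test = [s for s in samples if s["project"] in test_projects]
    # Sanity check: no project appears on both sides of the split.
    assert not {s["project"] for s in train} & {s["project"] for s in test}
    return train, test

samples = [
    {"project": "openssl", "code": "..."},
    {"project": "curl", "code": "..."},
    {"project": "libpng", "code": "..."},
]
train, test = split_by_project(samples, test_projects={"libpng"})
```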
Teacher model selection
We typically select strong open-source reasoning models as our teacher models, mainly Qwen and DeepSeek
models. For agentic tasks, Kimi and GLM models can also be used.
- Existing studies have found that stronger models are not necessarily better teacher models; DeepSeek-R1
is a notable example. We also find that using DeepSeek-R1 as the sole teacher model can lead to model collapse,
as its reasoning trajectories contain many transition words (e.g., "wait" and "but").
- We observe in some security tasks that using multiple teacher models is better than using a single
teacher model. This is probably because different models have different strengths and weaknesses, and using multiple
models can provide more diverse and comprehensive reasoning data. For example, in vulnerability detection, Qwen
and DeepSeek models have different emphases on false positives and false negatives.
- For agentic tasks, we need teacher models that can conduct tool calling; commercial models may be better
choices.
- Some open-source models tend to produce long reasoning trajectories, which are not good for SFT training because
1) small models are not good at handling long sequences; 2) training on long trajectories is more likely to trigger
model collapse.
Reasoning data generation
After selecting the teacher models, we generate reasoning data by prompting the teacher models with the selected
inputs and collecting their outputs. The prompts are in general not that important unless some specific model
behaviors are observed in the generated data. For example, models tend to hallucinate on program run-time states
(variable values and dependencies). Some specific prompts can be helpful but are not always useful.
For cases with multiple teacher models, the data mixture ratio is not that important. For example, we tried
different ratios of Qwen and DeepSeek outputs and found that the model performance was relatively stable. An even
distribution is an easy choice.
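An even mixture across teachers can be implemented with a simple round-robin assignment. The sketch below uses lambda stubs in place of real model API calls:

```python
# Sketch: generate reasoning data from multiple teachers with an even mixture.
# `teachers` maps a name to a callable; real teachers would be API calls.

def generate_mixture(questions, teachers):
    names = sorted(teachers)
    data = []
    for i, question in enumerate(questions):
        name = names[i % len(names)]  # round-robin => even mixture ratio
        data.append({"question": question, "teacher": name,
                     "response": teachers[name](question)})
    return data

teachers = {
    "qwen": lambda q: f"[qwen reasoning] {q}",
    "deepseek": lambda q: f"[deepseek reasoning] {q}",
}
data = generate_mixture([f"q{i}" for i in range(4)], teachers)
```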
Reasoning data filtering
This is the key step for SFT training with the most tricks:
- Two typical criteria for reasoning data filtering are reasoning trajectory length and answer correctness.
- For length filtering, there is consensus that setting a maximum length threshold is helpful for SFT
training. The threshold can vary across applications. For example, in vulnerability detection, it
can be around 2K to 4K tokens.
- For SE and security tasks, within a certain length constraint, longer and more diverse reasoning is better.
- For answer correctness, early works show that it is not necessary for coding and math reasoning. The argument
is that models mainly learn how to perform reasoning rather than learning the specific answers. However, in our
SE and security tasks, we find that answer correctness is very important for SFT training. This is
probably because the reasoning process in these tasks requires domain knowledge that base models do not have;
the learning process also acquires this knowledge along with the reasoning patterns.
- There are other criteria on the quality of the reasoning trajectories that are not easy to quantify, e.g.,
whether the reasoning is coherent and whether it contains hallucinated content. In coding-related tasks, having reasoning
processes that actually reason about program states is better than pure text-based reasoning.
- For certain applications, after filtering out wrong answers, we may not have enough training data. Here, two
methods are useful for recovering more training data. Their high-level idea is to prompt the teacher
models with more context and see if they can give correct answers.
- Rationalization: give the correct answers to the teacher models, ask them to reason about why they are
correct, and use the reasoning process as the training data. This method is prone to introducing
hallucination, especially in SE and security tasks. For example, the teacher models may assume
non-existent program states or dependencies.
- Constitution: prompt the teacher model with additional context, e.g., basic knowledge about certain
vulnerabilities and how humans typically analyze them.
This is useful for many cases, but requires human effort to write the constitutions.
- Sampling multiple candidate responses for the same input question and filtering among them is helpful. Some
math-specific filtering methods are not useful here.
- Summarization: in SE and security tasks, we find that asking another model (typically a commercial model)
to summarize the raw reasoning of the teacher model, and training on the summarized output, is beneficial for both accuracy and
inference efficiency.
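The two core filters (length and answer correctness) can be sketched as follows. The token count here is a whitespace approximation and the `ANSWER:` tag is an assumed output convention; a real pipeline would use the student's tokenizer and its own answer-extraction format.

```python
# Sketch of reasoning-data filtering: drop trajectories over a max token
# length or with an incorrect final answer. Several candidates may be
# sampled per question; any that pass are kept.

MAX_TOKENS = 4000  # e.g., 2K-4K for vulnerability detection

def approx_tokens(text):
    # Whitespace approximation; use the student tokenizer in practice.
    return len(text.split())

def extract_answer(response):
    # Assumes the teacher ends with a line like "ANSWER: <label>".
    for line in reversed(response.strip().splitlines()):
        if line.startswith("ANSWER:"):
            return line[len("ANSWER:"):].strip()
    return None

def filter_samples(candidates, gold):
    """candidates: list of (question_id, response); gold: id -> answer."""
    kept = []
    for qid, response in candidates:
        if approx_tokens(response) > MAX_TOKENS:
            continue  # length filter
        if extract_answer(response) != gold[qid]:
            continue  # correctness filter
        kept.append((qid, response))
    return kept

candidates = [
    ("q1", "step 1... step 2...\nANSWER: vulnerable"),
    ("q1", "hasty guess\nANSWER: benign"),  # wrong answer -> dropped
]
kept = filter_samples(candidates, {"q1": "vulnerable"})
```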
Data format and objective function
We need to separate the prompt (system, user), model response (assistant), and tool response (tool) using special tokens,
and only compute the loss on model responses (via masking).
SFT uses the next-token prediction loss. The loss function itself is fixed; the key is the token weighting.
For short responses, applying a small but non-zero loss weight to prompt tokens
can also help. There are three types of token weight assignments for multi-turn conversations:
- Token-equal weight: each token has the same weight; longer conversations receive a larger total weight.
- Turn-equal weight: each turn has the same weight; multi-turn samples receive a larger total weight.
- Sample-equal weight: each sample has the same weight regardless of length or number of turns.
With padding, we need an attention mask and must adjust the token weights accordingly, so that padding tokens
do not distort the intended turn or sample weights.
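The three weighting schemes can be illustrated concretely. In this sketch, `mask` marks assistant tokens (prompt and tool tokens get zero weight, mirroring the usual loss masking) and `turn_ids` labels which assistant turn each token belongs to; both names are illustrative:

```python
# Per-token loss weights for multi-turn SFT. Masked (prompt/tool) tokens get
# weight 0; the three schemes differ in how weight is distributed across the
# remaining assistant tokens.

def token_weights(mask, turn_ids, scheme="sample"):
    """mask[i]=1 for assistant tokens; turn_ids[i] labels the assistant turn."""
    if scheme == "token":    # token-equal: every response token has weight 1
        return [float(m) for m in mask]
    if scheme == "sample":   # sample-equal: weights in a sample sum to 1
        n = sum(mask)
        return [m / n for m in mask]
    if scheme == "turn":     # turn-equal: weights in each turn sum to 1
        turn_sizes = {}
        for m, t in zip(mask, turn_ids):
            if m:
                turn_sizes[t] = turn_sizes.get(t, 0) + 1
        return [m / turn_sizes[t] if m else 0.0
                for m, t in zip(mask, turn_ids)]
    raise ValueError(f"unknown scheme: {scheme}")

mask  = [0, 0, 1, 1, 0, 1, 1, 1]   # two assistant turns (2 and 3 tokens)
turns = [0, 0, 1, 1, 0, 2, 2, 2]
w = token_weights(mask, turns, scheme="turn")
```

Under the turn-equal scheme, each assistant turn contributes the same total weight regardless of its length, so a two-turn sample gets twice the total weight of a single-turn one.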
Base model selection
Typically, we select open-source small models as the base models (mainly Qwen models).
- The base model does not have to be a reasoning model. Sometimes using a non-reasoning model is better than a reasoning
model, as the reasoning trajectories of the teacher models may be very different from those of the base models, leading to
an OOD issue during training.
- For SE and security tasks, we don't need a coding model as the base model. For example,
Qwen-2.5-Instruct is better than Qwen-2.5-Coder.
- For complex tasks, we can use multiple small models as base models, each responsible for one or more
specific sub-tasks.
Case studies
In this section, we discuss two cases where we apply the above recipes for SFT training and achieve better results than
large open-source or commercial models.
GitHub issue resolving
In the Co-PatcheR paper, we train three 7B models and achieve better
performance than large open-source models trained via SFT or RL (e.g., SWE-RL-70B).
The key recipe is as follows:
- Data selection: we select 500 cases from the SWE-bench training set, prioritizing difficult cases. We
also find that increasing the number of cases to 2K does not help much.
- Teacher model selection: we use Claude as the teacher model, as it can call tools better than
open-source models.
- Reasoning data filtering: filtering out samples with wrong answers helps all components, but rationalization
does not.
- Having an additional critic step is helpful. That is, we mix generation data with
critic data to train the base model, where critic data is generated by prompting the teacher model to critique
its own generated outputs.
- Base model selection: we find that using one model for both localization and generation performs similarly to using
separate models; using multiple models for PoC generation provides the necessary diversity that a single model cannot
achieve.
Static vulnerability detection
We also propose a new SFT recipe for vulnerability detection. The trained 7B model outperforms SOTA commercial
models like o3 and Claude-3.7. The paper will be out soon.
The key recipe is as follows:
- Data selection: we construct our training set by mixing data from multiple existing benchmarks. Note that
existing benchmarks have false positives, especially for function-level code. We filter out these cases as
much as possible, as they may cause the trained model to hallucinate. We also mainly include patched code as
benign samples rather than normal functions, as this helps the model genuinely reason about vulnerabilities.
- Teacher model selection: we use multiple teacher models to balance false positives and false negatives.
- Reasoning data filtering: filtering out wrong answers and long trajectories is helpful; constitution and
summarization are also helpful.
- Base model selection: we find that using code-specific models as base models does not bring extra benefits.
- Agentic SFT: we further collect agentic trajectories that involve using basic search and
program analysis tools. We find that small models can learn to use these tools through SFT and thus can work
on large projects with context retrieval capability.
@article{guo2024mlbasis,
title = {Comprehensive Pipeline and Key Takeaways on SFT Finetuning for LLM Reasoning},
author = {Guo, Wenbo and Nie, Yuzhou and Tang, Yuheng and Zhu, Kaijie and Li, Hongwei},
journal = {henrygwb.github.io},
year = {2025},
url = {https://henrygwb.github.io/posts/sft.htm}
}