Takeaways on SFT finetuning for LLM reasoning
Oct. 2025
This post summarizes the key takeaways from our experience on SFT finetuning for LLM reasoning.
Although RL has dominated reasoning post-training, SFT is still useful in many scenarios, e.g.,
(1) When training models for specific applications that do not require generalizability across applications;
(2) As a warm start for RL training, especially when transferring models to new domains or tasks (given that RL
receives fewer positive signals at the beginning of training);
(3) When RL is too expensive to run or no good reward function can be found.
In the following, we first go over the overall workflow of SFT finetuning for LLM reasoning, and then discuss the
recipes for each step.
We mainly focus on the experience we draw from SWE and security applications; interestingly, some of these insights
actually conflict with common findings in general math and coding reasoning!
Please refer to this repo for
notable papers in this space.
The typical SFT training process consists of the following steps:
- Data selection and processing: Select data/questions \(\mathcal{X}\) for generating supervised data.
- Teacher model selection: Select the proper teacher model(s) \(\mathcal{M}\) to generate reasoning data.
- Reasoning data generation: Feed the selected data into the teacher model(s) and generate corresponding
responses \(\mathcal{D} = \{\mathcal{X}, \mathcal{Y}\}\), \(\mathcal{Y} = \mathcal{M}(\mathcal{X})\).
- Reasoning data filtering: Filter the generated data based on certain criteria to ensure the quality of the
training data.
- Base model(s) selection: Select the proper base model(s) for the student model(s).
- SFT training: Finetune the base model on the filtered reasoning data \(\mathcal{D}\) via next-token
prediction with the cross-entropy loss (a minimal training sketch is shown below).
- Testing/inference: Generate responses for new inputs using the fine-tuned model.
The key steps that admit different strategies are Steps 1-5. Step 6 follows the standard supervised finetuning
process, and Step 7 is the standard inference process with a lot of tricks introduced
here.
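To make Step 6 concrete, here is a minimal training sketch using Hugging Face TRL. The data path, base model, and hyperparameters below are placeholders for illustration, not our exact recipe.

```python
# Minimal SFT sketch (Step 6): next-token prediction with cross-entropy on the
# filtered reasoning data D. File name, model name, and hyperparameters are
# illustrative placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Each record: {"prompt": ..., "response": ...}, where "response" is the
# filtered teacher reasoning trajectory plus the final answer.
dataset = load_dataset("json", data_files="filtered_reasoning_data.jsonl", split="train")
dataset = dataset.map(
    lambda ex: {"text": ex["prompt"] + "\n" + ex["response"]},
    remove_columns=["prompt", "response"],
)

config = SFTConfig(
    output_dir="sft-student",
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # example base model (see "Base model selection")
    train_dataset=dataset,
    args=config,
)
trainer.train()
```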
Data selection and processing
- Data quality is more important than quantity: a small amount of high-quality data is more useful than a large
amount of randomly selected data. This has been consistently observed in many applications, including general
coding, math, and SE and security tasks.
Here, high-quality data typically refers to well-structured and difficult problems that are closely related to the
target application. These questions also need to have clear and unambiguous answers.
- There is no consensus on whether curriculum learning is helpful for SFT training, but some works show that it is
helpful for RL-based fine-tuning.
- There are mixed takes on whether data diversity is helpful; some works show that repeatedly using the same
high-quality data is more helpful than using diverse data.
However, in our SE and security tasks, we find that data diversity can be beneficial as it covers more
scenarios, as long as all the included data are of high quality.
- Avoiding data contamination is critical for SFT training: make sure that the training and testing data come from
different projects, domains, vulnerabilities, and even programming languages (a minimal project-level check is
sketched below).
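As a concrete example, the sketch below enforces a project-level train/test split; the file paths and the `repo` field are hypothetical names for illustration.

```python
# Project-level decontamination sketch: training and testing data must not
# share projects. Field names ("repo") and file paths are hypothetical.
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

train = load_jsonl("train_candidates.jsonl")
test = load_jsonl("test.jsonl")

# Drop any candidate whose source project also appears in the test set.
test_repos = {ex["repo"] for ex in test}
clean_train = [ex for ex in train if ex["repo"] not in test_repos]
print(f"kept {len(clean_train)} / {len(train)} examples after decontamination")
```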
Teacher model selection
We typically select strong open-source reasoning models as our teacher models, which are mainly Qwen and DeepSeek
models. For agentic tasks, Kimi and GLM models can also be used.
- Existing studies discovered that stronger models are not necessarily good teacher models, especially DeepSeek-r1.
We also find that using DeepSeek-r1 solely as the teacher model can introduce model collapse, as its reasoning
trajectories contain many transition words (e.g., "wait" and "but"); a rough heuristic for flagging such
trajectories is sketched below.
- We observe in some security tasks that using multiple teacher models is better than using a single teacher
model. This is probably because different models have different strengths and weaknesses, and using multiple
models can provide more diverse and comprehensive reasoning data. For example, in vulnerability detection, Qwen
and DeepSeek models strike different trade-offs between false positives and false negatives.
- For agentic tasks, we need teacher models that can perform tool calling; commercial models may be better
choices here.
- Some open-source models tend to produce long reasoning trajectories, which are not good for SFT training because
1) small models are not good at handling long sequences, and 2) training on long trajectories is more likely to
trigger model collapse.
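One way to catch the transition-word issue mentioned above is a simple density heuristic over the generated trajectories. This is our own rough illustration; the word list and threshold are arbitrary choices, not a validated filter.

```python
# Rough heuristic for flagging collapse-prone trajectories: compute the
# fraction of words that are transition words. The word list and the 2%
# threshold are illustrative.
import re

TRANSITION_WORDS = {"wait", "but", "however", "alternatively", "hmm"}

def transition_density(trajectory: str) -> float:
    words = re.findall(r"[a-z']+", trajectory.lower())
    if not words:
        return 0.0
    return sum(w in TRANSITION_WORDS for w in words) / len(words)

def keep_trajectory(trajectory: str, max_density: float = 0.02) -> bool:
    # Discard trajectories dominated by "wait"/"but"-style backtracking.
    return transition_density(trajectory) <= max_density
```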
Reasoning data generation
After selecting the teacher models, we generate reasoning data by prompting the teacher models with the selected
inputs and collecting their outputs. The prompts are in general not that important unless specific model
behaviors show up in the generated data. For example, models tend to hallucinate about program run-time states
(variable values and dependencies); targeted prompts can help in such cases, but they are not always effective.
For cases with multiple teacher models, the exact mixture ratio is not that important. For example, we tried
different ratios of Qwen and DeepSeek outputs and found that model performance was relatively stable. An even
split is an easy choice; a generation sketch is given below.
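Below is a sketch of how the generation step can be wired up with two teachers behind OpenAI-compatible endpoints (e.g., vLLM servers), round-robining over teachers to get a roughly even mixture. Model names, endpoints, and sampling settings are illustrative.

```python
# Reasoning data generation (Step 3) with multiple teachers. Endpoints, model
# names, and sampling parameters are illustrative.
import itertools
import json
from openai import OpenAI

# One OpenAI-compatible endpoint per teacher (e.g., separate vLLM servers).
TEACHERS = {
    "Qwen/QwQ-32B": OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY"),
    "deepseek-ai/DeepSeek-R1": OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY"),
}

def generate(client: OpenAI, model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

with open("selected_questions.jsonl") as f:
    questions = [json.loads(line)["prompt"] for line in f]

# Round-robin over the teacher pool gives an approximately even data mixture.
records = []
for question, model in zip(questions, itertools.cycle(TEACHERS)):
    records.append({
        "prompt": question,
        "response": generate(TEACHERS[model], model, question),
        "teacher": model,
    })

with open("raw_reasoning_data.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in records)
```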
Reasoning data filtering
This is the key step for SFT training with the most tricks:
- Two typical criteria for reasoning data filtering are reasoning trajectory length and answer correctness.
- For length filtering, there is a consensus that setting a maximum length threshold is helpful for SFT
training. The threshold can differ across applications; for example, in vulnerability detection, it can be
around 2K to 4K tokens (a minimal filtering sketch is given at the end of this list).
- For SE and security tasks, within a certain length constraint, longer and more diverse reasoning is better.
- For answer correctness, early works show that it is not necessary for coding and math reasoning. The argument
is that models mainly learn how to perform reasoning rather than learning the specific answers. However, in our
SE and security tasks, we find that answer correctness is very important for SFT training. This is probably
because the reasoning process in these tasks requires domain knowledge that base models do not have, so the
model has to learn this knowledge together with the reasoning patterns.
- There are other criteria on the quality of the reasoning trajectories that are not easy to quantify, e.g.,
whether the reasoning is coherent and free of hallucinated content. In coding-related tasks, reasoning that
actually tracks program states is better than purely text-based reasoning.
- For certain applications, filtering out wrong answers does not leave enough training data. Here, two
methods are useful for bringing back more training data. Both share the high-level idea of prompting teacher
models with more context and checking whether they can then give correct answers; illustrative prompt sketches
are given at the end of this list.
- Rationalization: Give the correct answers to the teacher models, ask them to reason about why they are
correct, and use the resulting reasoning as training data. This method easily introduces
hallucinations, especially in SE and security tasks. For example, the teacher models may assume
non-existent program states or dependencies.
- Constitution: Prompt the teacher model with additional context, e.g., basic knowledge about certain
vulnerabilities and how humans typically analyze them.
This is useful for many cases, but requires human effort to write the constitutions.
- Sampling multiple answers from the same input question is helpful. Some math-specific filtering methods are not
useful here.
- Summarization: In SE and security tasks, we find that asking another model (typically a commercial model)
to summarize the raw reasoning of the teacher model and training on the summarized output is beneficial for
both accuracy and inference efficiency.
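The sketch below combines the length and correctness filters and keeps at most one (preferably short) correct trajectory per question. The 4K cap follows the vulnerability-detection range above; the field names, tokenizer choice, and answer parsing are assumptions about the data format rather than a fixed recipe.

```python
# Minimal filtering sketch (Step 4): keep trajectories that fit the length
# budget and end with the correct answer. Field names, answer parsing, and the
# 4K cap are illustrative.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
MAX_TOKENS = 4096

def extract_answer(response: str) -> str:
    # Task-specific parsing, e.g., the text after a final "Answer:" marker.
    return response.rsplit("Answer:", 1)[-1].strip().lower()

def keep(example: dict) -> bool:
    n_tokens = len(tokenizer(example["response"])["input_ids"])
    correct = extract_answer(example["response"]) == example["ground_truth"].lower()
    return n_tokens <= MAX_TOKENS and correct

with open("raw_reasoning_data.jsonl") as f:
    raw = [json.loads(line) for line in f]

# With multiple samples per question, keep one correct trajectory per prompt,
# preferring shorter ones.
filtered, seen = [], set()
for ex in sorted(raw, key=lambda e: len(e["response"])):
    if ex["prompt"] not in seen and keep(ex):
        filtered.append(ex)
        seen.add(ex["prompt"])

with open("filtered_reasoning_data.jsonl", "w") as f:
    f.writelines(json.dumps(ex) + "\n" for ex in filtered)
```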
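For rationalization, constitution, and summarization, the prompt skeletons below illustrate the idea; treat the wording as a sketch rather than the exact prompts we used.

```python
# Illustrative prompt skeletons for the data-recovery and post-processing
# methods above. The wording is a sketch, not our exact prompts.

RATIONALIZATION_PROMPT = """{question}

The correct answer is: {answer}
Explain step by step why this answer is correct. Rely only on facts explicitly
present in the code/context above; do not assume unstated program states or
dependencies."""

CONSTITUTION_PROMPT = """Background knowledge written by a human analyst:
{constitution}

Using the background above, analyze the following case step by step and give a
final answer.

{question}"""

SUMMARIZATION_PROMPT = """Summarize the reasoning below into a concise trajectory
that keeps every step needed to reach the final answer and removes repetition
and dead ends.

{raw_reasoning}"""
```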
Base model selection
Typically, we select open-source small models as the base models (mainly Qwen models).
- The base model does not have to be a reasoning model. Sometimes a non-reasoning model is better than a reasoning
one, as the reasoning trajectories of the teacher models may be very different from those of the base model,
leading to OOD issues during training.
- For SE and security tasks, we don't need a coding model as the base model. For example, Qwen-2.5-instruct is
better than Qwen-2.5-coder.
- For complex tasks, we can use multiple small models as base models, each responsible for one or multiple
specific sub-tasks.
In this part, we discuss two cases where we apply the above recipes for SFT training and achieve better results than
large open-source or commercial models.
GitHub issue resolving
In the Co-PatcheR paper, we train three 7B models and achieve better
performance than large open-source models trained from SFT or RL (e.g., SWE-RL-70B).
The key recipe is as follows:
- Data selection: we selected 500 cases from the SWE-bench training set, prioritizing difficult cases. We
also found that increasing the number of cases to 2K does not help much.
- Teacher model selection: here we use a Claude model as the teacher, as it is better at calling tools than
open-source models.
- Reasoning data filtering: Filtering out samples with wrong answers helps all components, but rationalization
does not.
- Having an additional critic step is helpful. That is, we mix generation data with
critic data to train the base model, where the critic data is generated by prompting the teacher model to
critique its own generations.
- Base model selection: We found that using one model for localization and generation performs similarly to using
separate models; multiple models for PoC generation provide the necessary diversity that a single model cannot
achieve.
Static vulnerability detection
We also proposed a new SFT recipe for vulnerability detection. The trained 7B model outperforms SOTA commercial
models like O3 and Claude-3.7. The paper will be out soon.
The key recipe is as follows:
- Data selection: We constructed our training set by mixing data from multiple existing benchmarks. Note that
existing benchmarks have false positives, especially for function-level code. We filtered out these cases as
much as possible, as they may cause the trained model to hallucinate. We also mainly include patched code as
benign samples rather than normal functions, as this helps the model really reason about vulnerabilities.
- Teacher model selection: We used multiple teacher models to balance FP and FN.
- Reasoning data filtering: Filtering out wrong answers and long trajectories is helpful; constitution and
summarization are also helpful.
- Base model selection: We found that using code-specific models as base models does not bring extra benefits.
- Agentic SFT: We further collect agentic trajectories that involve using basic search and
program analysis tools. We found that small models learn to use these tools through SFT and thus can work
on large projects thanks to this context-retrieval capability.
@article{guo2025sft,
title = {Takeaways on SFT finetuning for LLM reasoning},
author = {Guo, Wenbo and Nie, Yuzhou and Tang, Yuheng and Zhu, Kaijie and Li, Hongwei},
journal = {henrygwb.github.io},
year = {2025},
url = {https://henrygwb.github.io/posts/sft.htm}
}