
Comprehensive Pipeline and Key Takeaways on SFT Fine-tuning for LLM Reasoning

Oct. 2025


This post summarizes the key takeaways from existing papers and our experience on SFT fine-tuning for LLM reasoning. Although RL has dominated reasoning post-training, SFT is still useful for many scenarios, e.g., (1) when training models for specific applications that do not require generalizability across applications; (2) as a warm-start for RL training, especially for transferring models to new domains or tasks (given that RL receives few positive signals at the beginning of training); (3) when RL is too expensive to run or cannot find good reward functions. In the following, we first go over the overall workflow of SFT fine-tuning for LLM reasoning, and then discuss the recipes for each step. We mainly focus on the experience we have drawn from SWE and security applications; some insights actually conflict with the conventional wisdom for general math and coding reasoning, which is interesting! Please refer to this repo for notable papers in this space.

Table of Contents

  - Overall process for SFT training
  - Recipes for key steps
  - Case studies

Overall process for SFT training

The typical SFT training process consists of the following steps:

  1. Data selection and processing: select data/questions \(\mathcal{X}\) for generating supervised data.
  2. Teacher model selection: select the proper teacher model(s) \(\mathcal{M}\) to generate reasoning data.
  3. Reasoning data generation: feed the selected data into the teacher model(s) and collect the corresponding responses \(\mathcal{D} = (\mathcal{X}, \mathcal{Y})\), where \(\mathcal{Y} = \mathcal{M}(\mathcal{X})\).
  4. Reasoning data filtering: filter the generated data based on certain criteria to ensure training-data quality.
  5. Base model selection: select the proper model(s) to serve as the base for the student model(s).
  6. SFT training: fine-tune the base model on the filtered reasoning data \(\mathcal{D}\) via next-token prediction with the cross-entropy loss.
  7. Testing/inference: generate responses for new inputs using the fine-tuned model.
Steps 1-5 are where the different strategies come into play. Step 6 follows the standard supervised fine-tuning process, and Step 7 is the standard inference process with the usual decoding tricks.
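The data-facing steps (1-4) can be sketched as a minimal pipeline. This is an illustrative toy, not a real training API: the field names (`quality`, `answer`), the quality threshold, and the word-count length proxy are all assumptions for the sketch, and the "teacher" is a stand-in callable.

```python
# Hypothetical sketch of pipeline steps 1-4; all helpers and field
# names are illustrative, not a real library API.

def select_data(raw_questions, min_quality=0.8):
    """Step 1: keep only well-structured, high-quality questions."""
    return [q for q in raw_questions if q["quality"] >= min_quality]

def generate_reasoning(questions, teachers):
    """Steps 2-3: Y = M(X); pair each question with each teacher's response."""
    return [
        {"question": q["text"], "response": teacher(q["text"]), "answer": q["answer"]}
        for q in questions
        for teacher in teachers
    ]

def filter_reasoning(samples, max_len=4096):
    """Step 4: drop over-long trajectories (word count as a rough token
    proxy) and trajectories whose final answer is wrong."""
    return [
        s for s in samples
        if len(s["response"].split()) <= max_len
        and s["response"].strip().endswith(s["answer"])
    ]

# Usage with a toy "teacher" that returns a canned trajectory.
questions = [
    {"text": "Is this function vulnerable?", "answer": "yes", "quality": 0.9},
    {"text": "Ambiguous question", "answer": "?", "quality": 0.3},
]
teacher = lambda text: "Reasoning about the code... final answer: yes"
data = filter_reasoning(generate_reasoning(select_data(questions), [teacher]))
print(len(data))  # 1: the low-quality question was dropped
```

Steps 5-6 then fine-tune a base model on `data` with the usual next-token objective.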

Recipes for key steps

Data selection and processing
  1. Data quality matters more than quantity: a small amount of high-quality data is more useful than a large amount of randomly selected data. This has been consistently observed across applications, including general coding, math, and SWE and security tasks. Here, high-quality data typically means well-structured, difficult problems that are closely related to the target application and have clear, unambiguous answers.
  2. There is no consensus on whether curriculum learning is helpful for SFT training, but some works show that it is helpful for RL-based fine-tuning.
  3. There are mixed takes on whether data diversity helps; some works show that repeatedly using the same high-quality data is more helpful than using diverse data. However, in our SWE and security tasks, we find that data diversity can be beneficial because it covers more scenarios, as long as all the included data are of high quality.
  4. Avoiding data contamination is critical for SFT training. We make sure that the training and testing data come from different projects, domains, vulnerabilities, and even programming languages.
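Point 4 can be enforced mechanically with a cross-split overlap check. A minimal sketch, assuming each sample carries metadata fields like `project` and `language` (the field names are illustrative):

```python
# Sketch of a contamination guard: training and testing data must not
# share any value on the chosen metadata keys (e.g., same project).

def assert_no_overlap(train, test, keys=("project", "language")):
    """Raise if any project/language in the test set also appears in training."""
    for key in keys:
        train_vals = {s[key] for s in train}
        test_vals = {s[key] for s in test}
        overlap = train_vals & test_vals
        if overlap:
            raise ValueError(f"contaminated {key}: {sorted(overlap)}")

train = [{"project": "openssl", "language": "C"}]
test = [{"project": "django", "language": "Python"}]
assert_no_overlap(train, test)  # passes: disjoint projects and languages
```

Extending `keys` with vulnerability type (CWE) gives the stricter split described above.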
Teacher model selection

We typically select strong open-source reasoning models as our teacher models, mainly Qwen and DeepSeek models. For agentic tasks, Kimi and GLM models can also be used.

  1. Existing studies have found that stronger models are not necessarily better teachers; DeepSeek-R1 is a notable example. We also find that using DeepSeek-R1 as the sole teacher model can lead to model collapse, as its reasoning trajectories contain many transition words (e.g., "wait" and "but").
  2. We observe in some security tasks that using multiple teacher models is better than using a single teacher model. This is probably because different models have different strengths and weaknesses, and using multiple models can provide more diverse and comprehensive reasoning data. For example, in vulnerability detection, Qwen and DeepSeek models have different emphases on false positives and false negatives.
  3. For agentic tasks, we need teacher models that support tool calling; commercial models may be better choices.
  4. Some open-source models tend to produce long reasoning trajectories, which are bad for SFT training because (1) small models struggle with long sequences, and (2) training on long trajectories is more likely to trigger model collapse.
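Points 1 and 4 suggest screening teacher trajectories before training. A hedged sketch of such a screen, dropping outputs that are over-long or dense in backtracking transition words; the word list, length cap, and 5% density threshold are illustrative choices, not values from the papers:

```python
import re

# Illustrative trajectory screen: reject teacher outputs that are too
# long or too dense in transition words ("wait", "but", ...), which the
# points above associate with model collapse during SFT.
TRANSITIONS = ("wait", "but", "however", "alternatively")

def keep_trajectory(text, max_tokens=4096, max_transition_ratio=0.05):
    tokens = re.findall(r"\w+", text.lower())
    if not tokens or len(tokens) > max_tokens:
        return False
    n_trans = sum(t in TRANSITIONS for t in tokens)
    return n_trans / len(tokens) <= max_transition_ratio

print(keep_trajectory("The function frees ptr twice, so it is a double free."))  # True
print(keep_trajectory("Wait, but wait, alternatively wait, but maybe..."))       # False
```

In practice the threshold should be tuned per teacher, since some backtracking is a normal part of long-form reasoning.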

Reasoning data generation

After selecting the teacher models, we generate reasoning data by prompting them with the selected inputs and collecting their outputs. The prompts are generally not that important unless specific model behaviors show up in the generated data. For example, models tend to hallucinate about program run-time states (variable values and dependencies); targeted prompts can mitigate this, but are not always useful. When multiple teacher models are used, the data mixture ratio is also not that important: we tried different ratios of Qwen and DeepSeek outputs and found the model performance relatively stable, so an even distribution is an easy choice.
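Mixing data from multiple teachers with a configurable ratio can be sketched as follows. This is a toy illustration (the teacher names and seed are placeholders), defaulting to the even split that worked well in our experiments:

```python
import random

# Sketch of mixing reasoning samples from multiple teachers (e.g., Qwen
# and DeepSeek outputs) according to a target ratio; the total size is
# capped by whichever teacher pool runs out first.

def mix_teacher_data(per_teacher, ratios=None, seed=0):
    """per_teacher: {teacher_name: [samples]}; ratios default to even."""
    names = sorted(per_teacher)
    if ratios is None:
        ratios = {n: 1.0 / len(names) for n in names}
    # Largest mixed set each pool can support at its target ratio.
    total = min(len(per_teacher[n]) / ratios[n] for n in names)
    rng = random.Random(seed)
    mixed = []
    for n in names:
        take = int(total * ratios[n])
        mixed.extend(rng.sample(per_teacher[n], take))
    rng.shuffle(mixed)
    return mixed

data = mix_teacher_data({"qwen": ["q1", "q2", "q3", "q4"], "deepseek": ["d1", "d2"]})
print(len(data))  # 4: two samples from each teacher, capped by the smaller pool
```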

Reasoning data filtering

This is the key step for SFT training with the most tricks:

  1. Two typical criteria for reasoning data filtering are reasoning trajectory length and answer correctness.
  2. For length filtering, there is consensus that setting a maximum length threshold is helpful for SFT training. The threshold can vary across applications. For example, in vulnerability detection, it can be around 2K to 4K tokens.
  3. For SWE and security tasks, within a certain length constraint, longer and more diverse reasoning is better.
  4. For answer correctness, early works show that it is not necessary for coding and math reasoning; the argument is that models mainly learn how to perform reasoning rather than the specific answers. However, in our SWE and security tasks, we find that answer correctness is very important for SFT training. This is probably because the reasoning process in these tasks requires domain knowledge that base models lack, and the learning process acquires this knowledge along with the reasoning patterns.
  5. There are other criteria on the quality of the reasoning trajectories that are not easy to quantify, e.g., whether the reasoning is coherent and whether it contains hallucinated content. In coding-related tasks, having reasoning processes that actually reason about program states is better than pure text-based reasoning.
  6. For certain applications, after filtering out wrong answers, we may not have enough training data. Here, two methods are useful for recovering more training data. Their high-level idea is to prompt the teacher models with more context and see if they can give correct answers.
  7. Sampling multiple answers from the same input question is helpful. Some math-specific filtering methods are not useful here.
  8. Summarization: in SWE and security tasks, we find that asking another model (typically a commercial model) to summarize the teacher model's raw reasoning, and training on the summarized output, benefits both accuracy and inference efficiency.
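The two main filters (length, point 2, and answer correctness, point 4) combine into a short pass over the sampled trajectories. A minimal sketch, assuming teachers end their output with a `final answer: <label>` marker (the marker, gold-label field, and word-count length proxy are all assumptions):

```python
# Sketch of the two core filters from the list above; `extract_answer`
# is a hypothetical helper tied to an assumed output format.

def extract_answer(response):
    """Assumes the teacher ends with 'final answer: <label>'."""
    return response.rsplit("final answer:", 1)[-1].strip()

def filter_samples(samples, max_tokens=4096):
    kept = []
    for s in samples:  # s: {"question", "response", "gold"}
        if len(s["response"].split()) > max_tokens:
            continue  # length filter (point 2)
        if extract_answer(s["response"]) != s["gold"]:
            continue  # answer-correctness filter (point 4)
        kept.append(s)
    return kept

samples = [
    {"question": "q", "response": "reasoning... final answer: vulnerable", "gold": "vulnerable"},
    {"question": "q", "response": "reasoning... final answer: benign", "gold": "vulnerable"},
]
print(len(filter_samples(samples)))  # 1: only the correct sample survives
```

With multi-sampling (point 7), both `samples` here would come from the same question, and the filter naturally keeps whichever attempts landed on the right answer.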

Data format and objective function

We need to separate the prompt (system, user), model response (assistant), and tool response (tool) using special tokens, and only compute the loss on model responses (via masking).

SFT uses the next-token prediction loss. The loss function itself is fixed; the key is the token weighting. For short responses, applying a small but non-zero loss weight to prompt tokens can also help. There are three types of token weight assignments for multi-turn conversations:

  1. Token-equal weight: each token has the same weight; longer conversations receive a larger total weight.
  2. Turn-equal weight: each turn has the same weight; multi-turn samples receive a larger total weight.
  3. Sample-equal weight: each sample has the same weight regardless of length or number of turns.

With padding, we need an attention mask and must adjust the token weights accordingly, since padding changes the effective turn or sample length. Care must be taken to ensure that padding does not alter the intended sample/turn weight.
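The three weighting schemes can be made concrete with a small sketch. It assumes each sample is reduced to a list of assistant-turn token counts, with prompt and tool tokens already masked out (weight 0) and thus omitted:

```python
# Sketch of the three token-weight assignments for multi-turn SFT.
# samples = [[turn_len, ...], ...]: per-sample lists of assistant-turn
# token counts (masked prompt/tool tokens are not listed).

def token_weights(samples, scheme):
    """Return nested per-token weights: one list per sample, one per turn."""
    out = []
    for turns in samples:
        n_tokens = sum(turns)
        sample_w = []
        for turn_len in turns:
            if scheme == "token":      # every token weighs 1
                w = 1.0
            elif scheme == "turn":     # each turn's tokens sum to 1
                w = 1.0 / turn_len
            else:                      # "sample": each sample's tokens sum to 1
                w = 1.0 / n_tokens
            sample_w.append([w] * turn_len)
        out.append(sample_w)
    return out

# One 2-turn sample (3 and 1 response tokens) and one 1-turn sample (2 tokens).
ws = token_weights([[3, 1], [2]], scheme="sample")
totals = [sum(sum(turn) for turn in s) for s in ws]
print(totals)  # each sample sums to 1.0: [1.0, 1.0]
```

Because the weights are derived from real (unpadded) turn and sample lengths, padding tokens added later simply get weight 0 via the attention mask and do not distort the per-turn or per-sample totals.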

Base model selection

Typically, we select open-source small models as the base models (mainly Qwen models).

  1. The base model does not have to be a reasoning model. Sometimes using a non-reasoning model is better than a reasoning model, as the reasoning trajectories of the teacher models may be very different from those of the base models, leading to an OOD issue during training.
  2. For SWE and security tasks, we don't need a coding model as the base model; for example, Qwen-2.5-Instruct works better than Qwen-2.5-Coder.
  3. For complex tasks, we can use multiple small models as base models, each responsible for one or more specific sub-tasks.

Case studies

In this section, we discuss two cases where we apply the above recipes for SFT training and achieve better results than large open-source or commercial models.
GitHub issue resolving

In the Co-PatcheR paper, we train three 7B models and achieve better performance than large open-source models trained via SFT or RL (e.g., SWE-RL-70B). The key recipe is as follows:

  1. Data selection: we select 500 cases from the SWE-bench training set, prioritizing difficult cases. We also find that increasing the number of cases to 2K does not help much.
  2. Teacher model selection: we use Claude as the teacher model, as it can call tools better than open-source models.
  3. Reasoning data filtering: filtering out samples with wrong answers helps all components, but rationalization does not.
  4. Having an additional critic step is helpful. That is, we mix generation data with critic data to train the base model, where critic data is generated by prompting the teacher model to critique its own generated outputs.
  5. Base model selection: we find that using one model for both localization and generation performs similarly to using separate models; using multiple models for PoC generation provides the necessary diversity that a single model cannot achieve.

Static vulnerability detection

We also propose a new SFT recipe for vulnerability detection. The trained 7B model outperforms SOTA commercial models like o3 and Claude-3.7. The paper will be out soon. The key recipe is as follows:

  1. Data selection: we construct our training set by mixing data from multiple existing benchmarks. Note that existing benchmarks have false positives, especially for function-level code. We filter out these cases as much as possible, as they may cause the trained model to hallucinate. We also mainly include patched code as benign samples rather than normal functions, as this helps the model genuinely reason about vulnerabilities.
  2. Teacher model selection: we use multiple teacher models to balance false positives and false negatives.
  3. Reasoning data filtering: filtering out wrong answers and long trajectories is helpful; constitution and summarization are also helpful.
  4. Base model selection: we find that using code-specific models as base models does not bring extra benefits.
  5. Agentic SFT: we further collect agentic trajectories that involve using basic search and program analysis tools. We find that small models can learn to use these tools through SFT and thus can work on large projects with context retrieval capability.




@article{guo2025sft,
 title   = {Comprehensive Pipeline and Key Takeaways on SFT Finetuning for LLM Reasoning},
 author  = {Guo, Wenbo and Nie, Yuzhou and Tang, Yuheng and Zhu, Kaijie and Li, Hongwei},
 journal = {henrygwb.github.io},
 year    = {2025},
 url     = {https://henrygwb.github.io/posts/sft.htm}
}