
Comprehensive Pipeline and Key Takeaways on SFT Fine-tuning for LLM Reasoning

Oct. 2025


This post summarizes the key takeaways from existing papers and our experience on SFT fine-tuning for LLM reasoning. Although RL has dominated reasoning post-training, SFT is still useful for many scenarios, e.g., (1) when training models for specific applications that do not require generalizability across applications; (2) as a warm-start for RL training, especially for transferring models to new domains or tasks (given that RL receives few positive signals at the beginning of training); (3) when RL is too expensive to run or cannot find good reward functions. In the following, we first go over the overall workflow of SFT fine-tuning for LLM reasoning, and then discuss the recipes for each step. We mainly focus on the experience we have drawn from SWE and security applications; some insights actually conflict with the conventional wisdom for general math and coding reasoning, which is interesting! Please refer to this repo for notable papers in this space.

Table of Contents

  - Overall process for SFT training
  - Recipes for key steps
  - Case studies

Overall process for SFT training

The typical SFT training process consists of the following steps:

  1. Data selection and processing: select data/questions \(\mathcal{X}\) for generating supervised data.
  2. Teacher model selection: select the proper teacher model(s) \(\mathcal{M}\) to generate reasoning data.
  3. Reasoning data generation: feed the selected data into the teacher model(s) and collect the corresponding responses \(\mathcal{D} = (\mathcal{X}, \mathcal{Y})\), where \(\mathcal{Y} = \mathcal{M}(\mathcal{X})\).
  4. Reasoning data filtering: filter the generated data based on certain criteria to ensure training-data quality.
  5. Base model selection: select the proper model(s) to serve as the base for the student model(s).
  6. SFT training: fine-tune the base model on the filtered reasoning data \(\mathcal{D}\) via next-token prediction with the cross-entropy loss.
  7. Testing/inference: generate responses for new inputs using the fine-tuned model.
Steps 1-5 are where the different strategies come into play. Step 6 follows the standard supervised fine-tuning process, and Step 7 is the standard inference process with the usual decoding tricks.
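The data-facing steps (1-4) can be sketched as a minimal pipeline. This is an illustrative toy, not a real training API: the field names (`quality`, `answer`), the quality threshold, and the word-count length proxy are all assumptions for the sketch, and the "teacher" is a stand-in callable.

```python
# Hypothetical sketch of pipeline steps 1-4; all helpers and field
# names are illustrative, not a real library API.

def select_data(raw_questions, min_quality=0.8):
    """Step 1: keep only well-structured, high-quality questions."""
    return [q for q in raw_questions if q["quality"] >= min_quality]

def generate_reasoning(questions, teachers):
    """Steps 2-3: Y = M(X); pair each question with each teacher's response."""
    return [
        {"question": q["text"], "response": teacher(q["text"]), "answer": q["answer"]}
        for q in questions
        for teacher in teachers
    ]

def filter_reasoning(samples, max_len=4096):
    """Step 4: drop over-long trajectories (word count as a rough token
    proxy) and trajectories whose final answer is wrong."""
    return [
        s for s in samples
        if len(s["response"].split()) <= max_len
        and s["response"].strip().endswith(s["answer"])
    ]

# Usage with a toy "teacher" that returns a canned trajectory.
questions = [
    {"text": "Is this function vulnerable?", "answer": "yes", "quality": 0.9},
    {"text": "Ambiguous question", "answer": "?", "quality": 0.3},
]
teacher = lambda text: "Reasoning about the code... final answer: yes"
data = filter_reasoning(generate_reasoning(select_data(questions), [teacher]))
print(len(data))  # 1: the low-quality question was dropped
```

Steps 5-6 then fine-tune a base model on `data` with the usual next-token objective.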

Recipes for key steps

Data selection and processing
  1. Data quality matters more than quantity: a small amount of high-quality data is more useful than a large amount of randomly selected data. This has been consistently observed across applications, including general coding, math, and SWE and security tasks. Here, high-quality data typically means well-structured, difficult problems that are closely related to the target application and have clear, unambiguous answers.
  2. There is no consensus on whether curriculum learning is helpful for SFT training, but some works show that it is helpful for RL-based fine-tuning.
  3. There are mixed takes on whether data diversity helps; some works show that repeatedly using the same high-quality data is more helpful than using diverse data. However, in our SWE and security tasks, we find that data diversity can be beneficial because it covers more scenarios, as long as all the included data are of high quality.
  4. Avoiding data contamination is critical for SFT training. We make sure that the training and testing data come from different projects, domains, vulnerabilities, and even programming languages.
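Point 4 can be enforced mechanically with a cross-split overlap check. A minimal sketch, assuming each sample carries metadata fields like `project` and `language` (the field names are illustrative):

```python
# Sketch of a contamination guard: training and testing data must not
# share any value on the chosen metadata keys (e.g., same project).

def assert_no_overlap(train, test, keys=("project", "language")):
    """Raise if any project/language in the test set also appears in training."""
    for key in keys:
        train_vals = {s[key] for s in train}
        test_vals = {s[key] for s in test}
        overlap = train_vals & test_vals
        if overlap:
            raise ValueError(f"contaminated {key}: {sorted(overlap)}")

train = [{"project": "openssl", "language": "C"}]
test = [{"project": "django", "language": "Python"}]
assert_no_overlap(train, test)  # passes: disjoint projects and languages
```

Extending `keys` with vulnerability type (CWE) gives the stricter split described above.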
Teacher model selection

We typically select strong open-source reasoning models as our teacher models, mainly Qwen and DeepSeek models. For agentic tasks, Kimi and GLM models can also be used.

  1. Existing studies have found that stronger models are not necessarily better teachers; DeepSeek-R1 is a notable example. We also find that using DeepSeek-R1 as the sole teacher model can lead to model collapse, as its reasoning trajectories contain many transition words (e.g., "wait" and "but").
  2. We observe in some security tasks that using multiple teacher models is better than using a single teacher model. This is probably because different models have different strengths and weaknesses, and using multiple models can provide more diverse and comprehensive reasoning data. For example, in vulnerability detection, Qwen and DeepSeek models have different emphases on false positives and false negatives.
  3. For agentic tasks, we need teacher models that support tool calling; commercial models may be better choices.
  4. Some open-source models tend to produce long reasoning trajectories, which are bad for SFT training because (1) small models struggle with long sequences, and (2) training on long trajectories is more likely to trigger model collapse.
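Points 1 and 4 suggest screening teacher trajectories before training. A hedged sketch of such a screen, dropping outputs that are over-long or dense in backtracking transition words; the word list, length cap, and 5% density threshold are illustrative choices, not values from the papers:

```python
import re

# Illustrative trajectory screen: reject teacher outputs that are too
# long or too dense in transition words ("wait", "but", ...), which the
# points above associate with model collapse during SFT.
TRANSITIONS = ("wait", "but", "however", "alternatively")

def keep_trajectory(text, max_tokens=4096, max_transition_ratio=0.05):
    tokens = re.findall(r"\w+", text.lower())
    if not tokens or len(tokens) > max_tokens:
        return False
    n_trans = sum(t in TRANSITIONS for t in tokens)
    return n_trans / len(tokens) <= max_transition_ratio

print(keep_trajectory("The function frees ptr twice, so it is a double free."))  # True
print(keep_trajectory("Wait, but wait, alternatively wait, but maybe..."))       # False
```

In practice the threshold should be tuned per teacher, since some backtracking is a normal part of long-form reasoning.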

Reasoning data generation

After selecting the teacher models, we generate reasoning data by prompting them with the selected inputs and collecting their outputs. The prompts are generally not that important unless specific model behaviors show up in the generated data. For example, models tend to hallucinate about program run-time states (variable values and dependencies); targeted prompts can mitigate this, but are not always useful. When multiple teacher models are used, the data mixture ratio is also not that important: we tried different ratios of Qwen and DeepSeek outputs and found the model performance relatively stable, so an even distribution is an easy choice.
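Mixing data from multiple teachers with a configurable ratio can be sketched as follows. This is a toy illustration (the teacher names and seed are placeholders), defaulting to the even split that worked well in our experiments:

```python
import random

# Sketch of mixing reasoning samples from multiple teachers (e.g., Qwen
# and DeepSeek outputs) according to a target ratio; the total size is
# capped by whichever teacher pool runs out first.

def mix_teacher_data(per_teacher, ratios=None, seed=0):
    """per_teacher: {teacher_name: [samples]}; ratios default to even."""
    names = sorted(per_teacher)
    if ratios is None:
        ratios = {n: 1.0 / len(names) for n in names}
    # Largest mixed set each pool can support at its target ratio.
    total = min(len(per_teacher[n]) / ratios[n] for n in names)
    rng = random.Random(seed)
    mixed = []
    for n in names:
        take = int(total * ratios[n])
        mixed.extend(rng.sample(per_teacher[n], take))
    rng.shuffle(mixed)
    return mixed

data = mix_teacher_data({"qwen": ["q1", "q2", "q3", "q4"], "deepseek": ["d1", "d2"]})
print(len(data))  # 4: two samples from each teacher, capped by the smaller pool
```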

Reasoning data filtering

This is the key step for SFT training with the most tricks:

  1. Two typical criteria for reasoning data filtering are reasoning trajectory length and answer correctness.
  2. For length filtering, there is consensus that setting a maximum length threshold is helpful for SFT training. The threshold can vary across applications. For example, in vulnerability detection, it can be around 2K to 4K tokens.
  3. For SWE and security tasks, within a certain length constraint, longer and more diverse reasoning is better.
  4. For answer correctness, early works show that it is not necessary for coding and math reasoning; the argument is that models mainly learn how to perform reasoning rather than the specific answers. However, in our SWE and security tasks, we find that answer correctness is very important for SFT training. This is probably because the reasoning process in these tasks requires domain knowledge that base models lack, and the learning process acquires this knowledge along with the reasoning patterns.
  5. There are other criteria on the quality of the reasoning trajectories that are not easy to quantify, e.g., whether the reasoning is coherent and whether it contains hallucinated content. In coding-related tasks, having reasoning processes that actually reason about program states is better than pure text-based reasoning.
  6. For certain applications, after filtering out wrong answers, we may not have enough training data. Here, two methods are useful for recovering more training data. Their high-level idea is to prompt the teacher models with more context and see if they can give correct answers.
  7. Sampling multiple answers from the same input question is helpful. Some math-specific filtering methods are not useful here.
  8. Summarization: in SWE and security tasks, we find that asking another model (typically a commercial model) to summarize the teacher model's raw reasoning, and training on the summarized output, benefits both accuracy and inference efficiency.
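The two main filters (length, point 2, and answer correctness, point 4) combine into a short pass over the sampled trajectories. A minimal sketch, assuming teachers end their output with a `final answer: <label>` marker (the marker, gold-label field, and word-count length proxy are all assumptions):

```python
# Sketch of the two core filters from the list above; `extract_answer`
# is a hypothetical helper tied to an assumed output format.

def extract_answer(response):
    """Assumes the teacher ends with 'final answer: <label>'."""
    return response.rsplit("final answer:", 1)[-1].strip()

def filter_samples(samples, max_tokens=4096):
    kept = []
    for s in samples:  # s: {"question", "response", "gold"}
        if len(s["response"].split()) > max_tokens:
            continue  # length filter (point 2)
        if extract_answer(s["response"]) != s["gold"]:
            continue  # answer-correctness filter (point 4)
        kept.append(s)
    return kept

samples = [
    {"question": "q", "response": "reasoning... final answer: vulnerable", "gold": "vulnerable"},
    {"question": "q", "response": "reasoning... final answer: benign", "gold": "vulnerable"},
]
print(len(filter_samples(samples)))  # 1: only the correct sample survives
```

With multi-sampling (point 7), both `samples` here would come from the same question, and the filter naturally keeps whichever attempts landed on the right answer.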

Data format and objective function

We need to separate the prompt (system, user), model response (assistant), and tool response (tool) using special tokens, and only compute the loss on model responses (via masking).

SFT uses the next-token prediction loss. The loss function itself is fixed; the key is the token weighting. For short responses, applying a small but non-zero loss weight to prompt tokens can also help. There are three types of token weight assignments for multi-turn conversations:

  1. Token-equal weight: each token has the same weight; longer conversations receive a larger total weight.
  2. Turn-equal weight: each turn has the same weight; multi-turn samples receive a larger total weight.
  3. Sample-equal weight: each sample has the same weight regardless of length or number of turns.

With padding, we need an attention mask and must adjust the token weights accordingly, since padding changes the effective turn or sample length. Care must be taken to ensure that padding does not alter the intended sample/turn weight.
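The three weighting schemes can be made concrete with a small sketch. It assumes each sample is reduced to a list of assistant-turn token counts, with prompt and tool tokens already masked out (weight 0) and thus omitted:

```python
# Sketch of the three token-weight assignments for multi-turn SFT.
# samples = [[turn_len, ...], ...]: per-sample lists of assistant-turn
# token counts (masked prompt/tool tokens are not listed).

def token_weights(samples, scheme):
    """Return nested per-token weights: one list per sample, one per turn."""
    out = []
    for turns in samples:
        n_tokens = sum(turns)
        sample_w = []
        for turn_len in turns:
            if scheme == "token":      # every token weighs 1
                w = 1.0
            elif scheme == "turn":     # each turn's tokens sum to 1
                w = 1.0 / turn_len
            else:                      # "sample": each sample's tokens sum to 1
                w = 1.0 / n_tokens
            sample_w.append([w] * turn_len)
        out.append(sample_w)
    return out

# One 2-turn sample (3 and 1 response tokens) and one 1-turn sample (2 tokens).
ws = token_weights([[3, 1], [2]], scheme="sample")
totals = [sum(sum(turn) for turn in s) for s in ws]
print(totals)  # each sample sums to 1.0: [1.0, 1.0]
```

Because the weights are derived from real (unpadded) turn and sample lengths, padding tokens added later simply get weight 0 via the attention mask and do not distort the per-turn or per-sample totals.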

Base model selection

Typically, we select open-source small models as the base models (mainly Qwen models).

  1. The base model does not have to be a reasoning model. Sometimes using a non-reasoning model is better than a reasoning model, as the reasoning trajectories of the teacher models may be very different from those of the base models, leading to an OOD issue during training.
  2. For SWE and security tasks, we don't need a coding model as the base model; for example, Qwen-2.5-Instruct works better than Qwen-2.5-Coder.
  3. For complex tasks, we can use multiple small models as base models, each responsible for one or more specific sub-tasks.

Case studies

In this section, we discuss two cases where we apply the above recipes for SFT training and achieve better results than large open-source or commercial models.
GitHub issue resolving

In the Co-PatcheR paper, we train three 7B models and achieve better performance than large open-source models trained via SFT or RL (e.g., SWE-RL-70B). The key recipe is as follows:

  1. Data selection: we select 500 cases from the SWE-bench training set, prioritizing difficult cases. We also find that increasing the number of cases to 2K does not help much.
  2. Teacher model selection: we use Claude as the teacher model, as it can call tools better than open-source models.
  3. Reasoning data filtering: filtering out samples with wrong answers helps all components, but rationalization does not.
  4. Having an additional critic step is helpful. That is, we mix generation data with critic data to train the base model, where critic data is generated by prompting the teacher model to critique its own generated outputs.
  5. Base model selection: we find that using one model for both localization and generation performs similarly to using separate models; using multiple models for PoC generation provides the necessary diversity that a single model cannot achieve.

Static vulnerability detection

We also propose a new SFT recipe for vulnerability detection. The trained 7B model outperforms SOTA commercial models like o3 and Claude-3.7. The paper will be out soon. The key recipe is as follows:

  1. Data selection: we construct our training set by mixing data from multiple existing benchmarks. Note that existing benchmarks have false positives, especially for function-level code. We filter out these cases as much as possible, as they may cause the trained model to hallucinate. We also mainly include patched code as benign samples rather than normal functions, as this helps the model genuinely reason about vulnerabilities.
  2. Teacher model selection: we use multiple teacher models to balance false positives and false negatives.
  3. Reasoning data filtering: filtering out wrong answers and long trajectories is helpful; constitution and summarization are also helpful.
  4. Base model selection: we find that using code-specific models as base models does not bring extra benefits.
  5. Agentic SFT: we further collect agentic trajectories that involve using basic search and program analysis tools. We find that small models can learn to use these tools through SFT and thus can work on large projects with context retrieval capability.




@article{guo2025sft,
 title   = {Comprehensive Pipeline and Key Takeaways on SFT Finetuning for LLM Reasoning},
 author  = {Guo, Wenbo and Nie, Yuzhou and Tang, Yuheng and Zhu, Kaijie and Li, Hongwei},
 journal = {henrygwb.github.io},
 year    = {2025},
 url     = {https://henrygwb.github.io/posts/sft.htm}
}