Understanding Reasoning Large Models (Understanding Reasoning LLMs)
Overview:
- Explain the meaning of "reasoning model"
- Discuss the advantages and disadvantages of reasoning models
- Give an overview of the DeepSeek R1 training methodology
- Describe the four main approaches to building and improving reasoning models
- Share thoughts on the DeepSeek V3 and R1 releases in the LLM space
- Provide tips for training reasoning models on a small budget
How to define "reasoning model"?
If you work in AI (or machine learning), you are probably familiar with vague and contested definitions. The term "reasoning model" is no exception. Eventually someone will define it formally in a paper, only for it to be redefined in the next one.
For the purposes of this article, I define "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, a factual question like "What is the capital of France?" does not involve reasoning. In contrast, a question like "If a train travels at 60 miles per hour for 3 hours, how far does it go?" requires some simple reasoning: the model has to recognize the relationship between distance, speed, and time before arriving at the answer.

While an ordinary LLM may provide only a short answer (as shown on the left), reasoning models typically include intermediate steps that reveal part of the thought process. (Note that many LLMs not specifically trained for reasoning tasks can also provide intermediate reasoning steps in their answers.)
Most LLMs have basic reasoning skills and can answer questions like "If a train travels at 60 miles per hour for 3 hours, how far does it go?". So when we refer to reasoning models today, we mean LLMs that excel at more complex reasoning tasks, such as solving puzzles, riddles, and mathematical proofs.
In addition, most LLMs that are branded as reasoning models today include a "thought" or "thinking" process as part of their responses.
The intermediate steps of a reasoning model can appear in two ways. First, they may be explicitly included in the response, as shown in the figure above. Second, some reasoning LLMs, such as OpenAI's o1, run many iterations with intermediate steps that are not shown to the user.

When to use a reasoning model?
Reasoning models are designed to excel at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks such as summarization, translation, or knowledge-based question answering. Using reasoning models for everything can be inefficient and expensive, and sometimes even more error-prone due to "overthinking". The advantages and disadvantages of reasoning models are shown in the figure below; we need to choose the right tool (or type of LLM) for the task at hand.

Advantages and disadvantages of reasoning models
Overview of the DeepSeek training process
DeepSeek released three different variants: DeepSeek-R1-Zero, DeepSeek-R1, and DeepSeek-R1-Distill.
The training process of these models is summarized in the figure below.

- DeepSeek-R1-Zero: applies reinforcement learning directly on top of the DeepSeek-V3 base model, without using any SFT data for a cold start.
- DeepSeek-R1: improves on the "cold-start" R1-Zero approach by starting from the DeepSeek-V3 base model, refining it with additional SFT stages, and then applying further RL training.
- DeepSeek-R1-Distill: fine-tunes Qwen and Llama models on the SFT data generated in the previous steps to enhance their reasoning abilities, using pure SFT (no RL).
Four approaches to building and improving reasoning models
This section gives an overview of the key techniques currently used to enhance the reasoning capabilities of LLMs and to build specialized reasoning models such as DeepSeek-R1 and OpenAI's o1 and o3.
Note: The exact workings of o1 and o3 are not publicly known, so descriptions of them here are speculative.
Inference-time scaling
Inference-time scaling refers to adding computational resources at inference time to improve the quality of the output.
A rough analogy is that humans tend to generate better answers when they have more time to think about complex questions. Similarly, we can apply techniques that encourage LLMs to "think more" when generating answers.
A straightforward way to scale inference is prompt engineering. A classic example is chain-of-thought (CoT) prompting, where phrases such as "think step by step" are added to the input prompt. Encouraging the model to generate intermediate reasoning steps, rather than jumping straight to the final answer, often (but not always) leads to more accurate results on complex problems.

From https://arxiv.org/abs/2205.11916
The CoT approach above can be considered a form of inference-time scaling, since it makes inference more expensive by generating more output tokens.
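To make this concrete, here is a minimal sketch of CoT prompting using the OpenAI Python SDK; the model name and the exact wording of the instruction are assumptions, and any chat-capable LLM client would work the same way.

```python
# Minimal chain-of-thought prompting sketch (assumed model name; any chat LLM works).
# The only change versus a plain query is the "think step by step" instruction,
# which trades extra output tokens for (often) better answers on multi-step problems.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

question = "If a train travels at 60 miles per hour for 3 hours, how far does it go?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": question + "\nLet's think step by step, then state the final answer.",
    }],
)
print(response.choices[0].message.content)
```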
Another approach to inference-time scaling is to use voting and search strategies. A simple example is majority voting: the LLM generates multiple answers, and the final answer is selected by majority vote. Similarly, beam search and other search algorithms can be used to generate better answers.
For more details on these strategies, see the paper Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.

Different search-based approaches rely on a process-reward based model to select the best answer
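To illustrate the voting idea, here is a minimal sketch of majority voting (self-consistency); the `sample_answer` function is a placeholder for whichever LLM call you use, sampled with a non-zero temperature.

```python
# Minimal majority-voting (self-consistency) sketch.
# sample_answer() is a placeholder: it should call an LLM with temperature > 0
# and return only the final answer string (e.g. "180 miles").
from collections import Counter

def sample_answer(question: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def majority_vote(question: str, n_samples: int = 8) -> str:
    answers = [sample_answer(question) for _ in range(n_samples)]
    # The most frequent final answer wins; ties resolve to the earliest-counted answer.
    return Counter(answers).most_common(1)[0][0]
```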
Pure reinforcement learning (RL)
One of the highlights of the DeepSeek R1 paper was the discovery that reasoning can emerge from pure reinforcement learning (RL). Unlike typical RL pipelines (where supervised fine-tuning is applied before RL), DeepSeek-R1-Zero was trained using only reinforcement learning, without an initial SFT stage, which is why it is "pure" RL, as shown in the figure below.

The training process of DeepSeek-R1-Zero model
For rewards, instead of using a trained reward model based on human preferences, two types of rewards were used: accuracy and format rewards.
- Accuracy rewards use the LeetCode compiler to verify coding answers and a deterministic system to evaluate math responses.
- Format rewards rely on an LLM judge to ensure that responses follow the expected format, such as placing the reasoning steps inside <think> tags (a minimal sketch of both checks appears below).
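Below is a minimal sketch of what such rule-based reward functions could look like; the tag name, the boxed-answer convention, and the scoring values are illustrative assumptions, not DeepSeek's published implementation.

```python
# Illustrative rule-based rewards in the spirit of DeepSeek-R1-Zero's training
# (assumed tag and answer conventions; not the actual DeepSeek implementation).
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.+?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Deterministic check for math questions: compare the extracted final answer."""
    match = re.search(r"\\boxed\{(.+?)\}", response)  # assumes a \boxed{...} final answer
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```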
Surprisingly, this approach was enough to make the LLM develop stronger reasoning abilities. Although R1-Zero is not the best-performing reasoning model, it demonstrates reasoning by generating intermediate "thinking" steps, as shown in the figure above. This confirms that it is possible to train reasoning models using pure RL, and the DeepSeek team was the first to demonstrate (or at least publish) this approach.

From the DeepSeek R1 Technical Report
Supervised fine-tuning and reinforcement learning (SFT + RL)
It is actually quite common to include an SFT stage before RL; OpenAI's o1 was likely trained with a similar approach.

Training process of DeepSeek-R1 model
As shown above, so-called "cold-start" SFT data were generated using DeepSeek-R1-Zero.
The model was first instruction-fine-tuned on this "cold-start" SFT data and then went through another RL stage. This stage retained the accuracy and format rewards used in DeepSeek-R1-Zero's RL process and added a language-consistency reward to prevent the model from mixing or switching between languages within a response.
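As a rough illustration of what a language-consistency reward could look like, here is a minimal sketch based on a simple script-detection heuristic; the rule is an assumption for illustration, not DeepSeek's published method.

```python
# Illustrative language-consistency reward: penalize responses that mix scripts,
# e.g. CJK characters appearing inside an answer that should be in English.
# This heuristic is an assumption, not DeepSeek's actual reward definition.
import re

CJK = re.compile(r"[\u4e00-\u9fff]")

def language_consistency_reward(response: str, target_lang: str = "en") -> float:
    contains_cjk = bool(CJK.search(response))
    if target_lang == "en":
        return 0.0 if contains_cjk else 1.0  # English target: CJK characters count as mixing
    return 1.0 if contains_cjk else 0.0      # Chinese target: expect CJK characters
```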
This RL stage was followed by another round of SFT data collection: 600K chain-of-thought (CoT) SFT samples were generated using the latest model checkpoint, and an additional 200K knowledge-based SFT samples were created using the DeepSeek-V3 base model. A final round of RL followed, in which a rule-based approach provided accuracy rewards for math and coding questions, while human preference labels were used for other question types.
The final model, DeepSeek-R1, shows a significant performance improvement compared to DeepSeek-R1-Zero, as shown in the table below.

From the DeepSeek-R1 Technical Report
Pure supervised fine-tuning (SFT) and distillation
DeepSeek also released smaller models trained through a "distillation" process. Distillation here refers to instruction fine-tuning of smaller LLMs (Llama 8B and 70B, and Qwen 2.5 models from 0.5B to 32B) on an SFT dataset generated by larger LLMs. The distillation part is highlighted in the figure below.

Training process of DeepSeek-R1-Distill model
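To make the distillation recipe concrete, here is a minimal sketch of the two steps involved: a larger "teacher" reasoning model generates chain-of-thought SFT records, and a smaller "student" is then instruction-fine-tuned on them. The model identifier, prompt format, and record layout are illustrative assumptions, not DeepSeek's actual pipeline.

```python
# Sketch of SFT-based distillation: a teacher reasoning model generates training
# data, and a smaller student is fine-tuned on it with plain supervised learning.
# Model name and record format are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "deepseek-ai/DeepSeek-R1"   # large teacher (placeholder identifier)

tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER, device_map="auto")

def make_sft_record(question: str) -> dict:
    """Ask the teacher for a full reasoning trace and package it as one SFT record."""
    prompt = f"Question: {question}\nShow your reasoning in <think> tags, then answer."
    inputs = tok(prompt, return_tensors="pt").to(teacher.device)
    output_ids = teacher.generate(**inputs, max_new_tokens=1024)
    completion = tok.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    return {"prompt": prompt, "completion": completion}

# The resulting records are then used for ordinary instruction fine-tuning of a
# smaller model (e.g. a Qwen or Llama checkpoint) with a standard SFT trainer;
# no reinforcement learning is involved in this step.
```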
Why train these distillation models? There are two key reasons:
- Smaller models are more efficient. They are cheaper to run and can be deployed on lower-end hardware, which makes them appealing to researchers and LLM enthusiasts.
- A case study of pure SFT. These distilled models serve as an interesting benchmark showing how far pure supervised fine-tuning (SFT), without reinforcement learning, can take a model.
The following table compares the performance of these distillation models with other popular models, as well as DeepSeek-R1-Zero and DeepSeek-R1.

Benchmarking distillation vs. non-distillation models
As we can see, the distilled models are noticeably weaker than DeepSeek-R1, but they are surprisingly strong compared to DeepSeek-R1-Zero despite being orders of magnitude smaller. They also perform quite well compared to o1-mini (I suspect o1-mini itself may be a similarly distilled version of o1).
The DeepSeek team also tested whether the reasoning behavior seen in DeepSeek-R1-Zero could emerge in smaller models by applying the same pure-RL recipe directly to Qwen-32B.

Benchmark comparison of distillation and pure RL for the 32B model
The results show that distillation is more effective than pure RL for smaller models.
Summary of the four approaches
- Inference-time scaling requires no additional training but increases inference cost, making large-scale deployment more expensive as the number of users or queries grows. Still, it is an obvious choice for improving models that already perform strongly. I strongly suspect that o1 relies on inference-time scaling, which would help explain why it costs more per token than DeepSeek-R1.
- Pure RL is interesting for research purposes because it provides insights into reasoning as an emergent behavior. For practical model training, however, RL + SFT is the preferred approach because it produces more robust reasoning models.
- RL + SFT is the key approach for building high-performance reasoning models.
- Distillation is an attractive approach, especially for creating smaller, more efficient models. However, its limitation is that it does not drive innovation or produce the next generation of reasoning models: distillation always depends on an existing, stronger model to generate the supervised fine-tuning (SFT) data.
Reflections on DeepSeek R1
In recent weeks, many people have asked me what I think of the DeepSeek-R1 models. In short, I think they are a marvelous piece of work. As an engineer, I especially appreciate detailed technical reports that provide methodological insights I can learn from.
One of the most interesting takeaways is the emergence of reasoning as a behavior from pure RL. It is also impressive that DeepSeek open-sourced their models under a permissive MIT license, which is less restrictive than Meta's Llama license.
How it compares to o1
Is DeepSeek-R1 superior to o1?
I would say they are roughly at the same level, but DeepSeek-R1 is more efficient at inference time. This suggests that DeepSeek invested more in the training process, while OpenAI may rely more heavily on inference-time scaling for o1.
That said, directly comparing o1 and DeepSeek-R1 is difficult because OpenAI does not disclose much information about o1. For example, we don't know:
- Is o1 also a Mixture of Experts (MoE) model?
- How large is o1?
- Could o1 just be a slightly improved version of GPT-4o with minimal RL + SFT and extensive inference-time scaling?
Without knowing these details, the direct comparison remains apples to oranges.
Cost of training DeepSeek-R1
Another point of discussion has been the cost of training DeepSeek-R1. Some people have cited a training cost of about $6 million, but they may be confusing DeepSeek-V3 with DeepSeek-R1.
The $6 million estimate is based on an assumption of $2 per GPU hour and the number of GPU hours needed for the final training run of DeepSeek-V3, which was originally discussed in December 2024.
However, the DeepSeek team never disclosed the exact GPU hours or training costs for R1, so any cost estimates are purely speculative.
Either way, DeepSeek-R1 is an important milestone for open-source reasoning models, and its efficiency at inference time makes it an interesting alternative to OpenAI's o1.
Training Reasoning Models on a Small Budget
Training a reasoning model at the level of DeepSeek-R1 can cost hundreds of thousands to millions of dollars, even when starting from an open-source base model such as DeepSeek-V3. This can be discouraging for researchers or engineers with limited budgets.
Good news: distillation can go a long way
Model distillation offers a more cost-effective alternative, as demonstrated by the DeepSeek team's R1 distillation models, which deliver strong reasoning performance despite being much smaller than DeepSeek-R1. Of course, this approach is not exactly cheap either: distilling with 800K SFT samples still requires significant computational resources.
Just a few days before DeepSeek-R1 was released, I came across an article about Sky-T1, which trained an open-source 32B model using only 17K SFT samples at a total cost of only $450.
This shows that smaller, targeted fine-tuning efforts can still achieve impressive results.

Sky-T1: Train Your Own Inference Model in $450 - https://novasky-ai.github.io/posts/sky-t1/
According to their benchmarks, Sky-T1 performs roughly on par with o1.
Pure RL on a Budget: TinyZero
While Sky-T1 focuses on model distillation, there has also been interesting work in the "pure RL" space: TinyZero, a 3B-parameter model that replicates DeepSeek-R1-Zero's methodology (with a training cost of less than $30).
Even with only 3B parameters, TinyZero exhibits emerging self-verification abilities, which supports the idea that reasoning can emerge in small models through pure RL.
GitHub repository: https://github.com/Jiayi-Pan/TinyZero

Both of the above projects train reasoning models on limited budgets, one focusing on pure RL (TinyZero) and the other on pure SFT (Sky-T1).
Beyond Traditional SFT: Journey Learning
I came across a particularly interesting paper last year, O1 Replication Journey: A Strategic Progress Report - Part 1.
The key idea in the paper is "journey learning" as an alternative to "shortcut learning".
- Shortcut learning refers to the traditional instruction fine-tuning approach, where the model is trained only on correct solution paths.
- Journey learning, in contrast, also includes incorrect solution paths and their corrections, allowing the model to learn from its mistakes.
This approach is related to the self-verification abilities observed in TinyZero's pure RL training, but it focuses on improving the model entirely through SFT. By exposing the model to faulty reasoning paths and their corrections, journey learning may also strengthen self-correction, making reasoning models more reliable in this respect.
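A small sketch of how the two kinds of SFT training examples might differ is shown below; the record format and the sample content are hypothetical, for illustration only.

```python
# Hypothetical SFT records contrasting shortcut learning with journey learning.
# Shortcut learning: only the clean, correct solution path is shown.
shortcut_example = {
    "prompt": "A train travels at 60 mph for 3 hours. How far does it go?",
    "completion": "Distance = speed x time = 60 x 3 = 180 miles.",
}

# Journey learning: the trace also contains a wrong turn and its correction,
# so the model sees how to detect and fix its own mistakes.
journey_example = {
    "prompt": "A train travels at 60 mph for 3 hours. How far does it go?",
    "completion": (
        "Distance = speed + time = 60 + 3 = 63 miles. "
        "Wait, that is wrong: distance is speed multiplied by time, not added. "
        "Correcting: 60 x 3 = 180 miles."
    ),
}
```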

This could be an exciting direction for future work, especially for low-budget reasoning model training where RL methods may not be computationally feasible.
Closing thoughts
There is a lot of interesting work on reasoning models happening right now, and I'm sure we'll see more exciting developments in the coming months!
Original: https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
This article is from NLP Workstation, by Liu Cong NLP.