AI enters the era of reasoning models: understanding the chain of thought in one article
Recently, the reasoning model DeepSeek-R1 has arguably been the number one topic in AI. Anyone who has used it knows that the model outputs a stretch of chain-of-thought content before giving its final answer, which improves the accuracy of that answer.
Today's article will take you through the research and technology related to Chain of Thought (CoT).
(Figure Note) Some forms of reasoning skills.
Chain of Thought (CoT) has been around for quite some time. Technically, it is a form of advanced prompt engineering. The various forms of CoT essentially force the large language model to reason before answering.
After OpenAI released a preview of its model o1 in September 2024, we saw the hype around CoT intensify.
Other than OpenAI, no one fully knows how o1 works, whether it's a combinatorial system, what kind of data it's fine-tuned with, whether it uses reinforcement learning, or whether there are several models working together.
Perhaps one model does the planning, another does the thinking, and a third does the scoring. But we know that they all use some kind of stepwise reasoning.
There has been a lot of published research on this. This post will present the existing research so you know what you can use. I will also test different techniques to see if we can realize real improvements.
Researchers have published many papers in the last two years. You can see the reasoning techniques they talked about in the chart below.

The CoT techniques most discussed over the past two years.
Most of the work comes directly from DeepMind or Princeton University. Kudos to them for their open source work.
The term CoT was coined by DeepMind in 2022, when it referred purely to a prompting technique. The latest research explores 'Tree of Thoughts' (ToT) combined with Monte Carlo search, as well as CoT without any prompting at all.
Next, we will walk through simple Chain of Thought (CoT), CoT chaining, greedy decoding, CoT-SC, decoding CoT, and Tree of Thoughts (ToT) combined with Monte Carlo Tree Search.
Baseline scores for LLMs
To understand how to improve LLM results, we first need to establish some sort of baseline score.
When models are introduced, they are usually accompanied by evaluation scores on popular benchmarks such as MMLU (language understanding), BigBench (reasoning), and HellaSwag (commonsense reasoning).

Some datasets of interest.
However, some of these datasets are outdated and may be somewhat contaminated.
Hugging Face introduced a new LLM leaderboard in December, based on newer evaluation datasets. You can clearly see that most models score much lower on it than they did on the original benchmarks.
It's worth doing some research here to understand how to think about model evaluation and what you and your organization should base your evaluations on. Testing on an internal, private dataset is not the worst idea.
I took about 350 questions from different datasets, plus some popular questions I found online, and then evaluated 11 different models.
I wanted to see for myself what these datasets, and the answers LLMs generate for them, actually look like.
So I wrote my own script that loops through the questions and scores each LLM with a 0 or 1 for every question.
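For reference, the scoring loop was roughly like the sketch below; this is a simplified illustration rather than the exact script, and `ask_model`, `is_correct`, and the `questions.json` layout are hypothetical stand-ins:

```python
# Simplified sketch of the scoring loop (ask_model is a stand-in for a real API client;
# questions.json is assumed to hold items with "question", "answer", and "category" keys).
import json

def ask_model(model_name: str, question: str) -> str:
    """Call the model's API and return its raw answer (fill in with your own client code)."""
    raise NotImplementedError

def is_correct(answer: str, expected: str) -> int:
    """Very naive scoring: 1 if the expected answer appears in the response, else 0."""
    return int(expected.strip().lower() in answer.lower())

def evaluate(model_name: str, path: str = "questions.json") -> float:
    with open(path) as f:
        questions = json.load(f)
    score = sum(is_correct(ask_model(model_name, q["question"]), q["answer"]) for q in questions)
    return score / len(questions)   # fraction of questions answered correctly
```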
Here are the results I found.

You can find the entire dataset and results in this Google Sheet: https://docs.google.com/spreadsheets/d/1awPb5klHx-v1oafgZrV_-hdFHnibxla1BGHZ8vCY2CE/edit
There's not much we can read into it.
I used questions from BigBench, MMLU, and Putnam, as well as popular questions like 'How many r's are in strawberry?', but there is no way to know whether the models have been contaminated with these exact questions. It is also a fairly small dataset.
However, we can clearly see that bigger models perform better.
We are interested in whether these scores can be improved by having the model reason and think before answering.
Chain of Thought (CoT)
The Chain of Thought (CoT) prompt was proposed by DeepMind in the paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, published in 2022. Thus, the concept of CoT has been around for a long time.
However, this first paper investigated how to activate a model's inherent reasoning capabilities using prompting strategies, thereby forcing the model to reason about the problem.
At the time, people simply prompted the model in the right way, asking it to 'think step by step', either zero-shot (providing no examples) or few-shot (providing a few examples).

Zero-shot vs. few-shot.
For Claude, ChatGPT, or various other models, just add "Let's think step by step" to the end of the prompt. If you want to try few-shot learning, you can give the model a few examples in the prompt.
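As a minimal illustration (using the arithmetic examples popularized in the original CoT paper), the two prompt styles look roughly like this:

```python
# Zero-shot CoT: just append the trigger phrase to the question.
zero_shot_prompt = (
    "A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "Let's think step by step."
)

# Few-shot CoT: show a worked example (with its reasoning) before the real question.
few_shot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
    "Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
```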
DeepMind reported that they could confirm a significant improvement from CoT techniques, provided the prompts were right.
Since then, many papers have built on these techniques, branching into increasingly advanced methods.
Building a chain of reasoning
There are a lot of people in the prompt engineering community experimenting with CoT-style techniques. I've collected most of the repositories I've found here so readers can locate them easily.

Some implementations of CoT style techniques are detailed at https://github.com/ilsilfverskiold/Awesome-LLM-Resources-List
Benjamin Klieger built a prompting-style app using Groq and Llama 3.1 70B that triggers chains of thought by breaking the thinking process into further steps. It can be accessed here: https://github.com/bklieger-groq/g1
The idea is to ask the LLM to break its thinking into a chain of steps and keep thinking until it is confident in the answer.
The system then makes a separate LLM call for each part of the chain, rather than putting the entire thought process into one response.
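In spirit, the loop looks roughly like the sketch below. This is a simplified reconstruction rather than Klieger's actual code; `call_llm` and the exact JSON step format are assumptions:

```python
# Minimal sketch of a CoT chaining loop: one LLM call per reasoning step,
# continuing until the model declares a final answer.
import json

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for a chat-completion call that returns the model's JSON step."""
    raise NotImplementedError

SYSTEM = (
    "You are solving a problem step by step. For each step, return JSON with keys "
    "'title', 'content', and 'next_action' ('continue' or 'final_answer')."
)

def reasoning_chain(question: str, max_steps: int = 10) -> list[dict]:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    steps = []
    for _ in range(max_steps):
        step = call_llm(messages)                      # one LLM call per link in the chain
        steps.append(step)
        messages.append({"role": "assistant", "content": json.dumps(step)})
        if step["next_action"] == "final_answer":
            break
        messages.append({"role": "user", "content": "Continue with the next step."})
    return steps
```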
The following example shows this technique applied to Grok-Beta, with the question "How many R's are in strawberry?"

Answering "How many R's are in strawberry?" with a CoT chain on Grok.
Each part is set up by the model itself, giving it a title and deciding whether it needs to move on to another 'thought' or whether it has arrived at a final answer.
This is still a CoT style technique because it's linear, but it's slightly more advanced than simply asking the model to 'think step by step'.
I used parts of his code to build a script that loops through some of the basic questions I had tested the LLMs on, to see how much improvement such a system actually provides. I adapted the script for Claude and Grok to evaluate how the strategy affects them. Here's what the improvement looks like:

For the first three categories, Llama 3.1 70B showed the most improvement. Grok performed poorly on the popular questions (as did Haiku).
The Putnam dataset is advanced mathematics, and few LLMs perform well on it, so I was genuinely surprised when Claude Sonnet 3.5 reached 68.75% with these CoT chains, while o1-preview only managed 63%.
Overall, Sonnet improved its score on advanced math by 81% after applying CoT.
Keep in mind that I'm using a very small dataset here, and this is just to get an idea of where they perform well and if we can improve the scores. It needs to be tested on a larger scale of data to get more specific information.
However, I have also observed that smaller models produce worse results if they start to over-analyze simple problems, as exemplified by the performance of Grok-Beta and Haiku on common 'simpler' problems.
Simpler non-mathematical problems may not get the same benefit from CoT.
We must also remember that we can push a model toward the top of its abilities, but rarely beyond them. If it doesn't know the answer, it simply doesn't know it.
Fine-tuning for reasoning
Here we need to mention fine-tuning. There is a very interesting research direction in AI: fine-tuning smaller models on CoT datasets to lift their accuracy to the level of models 1-2 times their size.
I've found multiple resources, but unfortunately none showed a significant improvement over the base model. The open-source models I found are listed below:


That's not to say that fine-tuning doesn't work for CoT, just that better models need to be constructed and well documented.
If you like to try fine-tuning on your own, check out these resources. I'm sure there are more resources out there.
Other generation techniques
We have discussed chain-of-thought techniques, but there are other ways to optimize a language model's output accuracy that don't involve prompting.
This involves the sampler settings we mostly ignore when calling an LLM, parameters like temperature, top_p, and do_sample, which control how the output is generated.
Now, we don't always have access to all of these settings through commercial APIs, but we usually do have access to temperature. Technically, temperature scales the logits: setting it high increases the chance that a low-probability token gets selected. This is shown below:

How temperature moves the probabilities up and down.
Suppose the token 'mat' starts with the highest logit; as the temperature rises, its probability falls, while tokens with lower logits gain probability.
What does this mean? It means that if the temperature is high, the model is more likely to choose a word that feels less 'safe'.
Most people call it randomness or creativity.
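As a minimal numerical sketch (with made-up logits, not values from any real model), this is what temperature does to the probabilities before sampling:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Scale logits by 1/temperature, then apply softmax."""
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([4.0, 2.0, 1.0])           # e.g. 'mat', 'rug', 'floor' (made-up values)
print(softmax_with_temperature(logits, 0.5))  # low temperature:  ~[0.98, 0.02, 0.00]
print(softmax_with_temperature(logits, 1.0))  # default:          ~[0.84, 0.11, 0.04]
print(softmax_with_temperature(logits, 2.0))  # high temperature: ~[0.63, 0.23, 0.14]
```

The ordering of the tokens never changes; a higher temperature just flattens the distribution, so less likely tokens get sampled more often.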
As for top_p, which not all commercial APIs expose, it shrinks or expands the candidate token pool based on the value you set: a low value restricts the pool to only the highest-probability tokens, while a higher value lets lower-probability tokens into the candidate pool as well.
Combining a high top_p with a high temperature produces more novel and creative output, since more tokens become candidates.
The do_sample parameter determines whether the model samples when generating the next token. When set to True, the model samples from the candidate pool and has more freedom. When set to False, it always picks the token with the highest probability (and ignores temperature and top_p entirely).
We can use this setting to force the model to produce a more deterministic output, i.e., the token with the highest probability at each stage.
This is known as Greedy Decoding.
This is a strategy: the model chooses the token with the highest probability at each step, which may produce a more accurate answer (if it has the required intrinsic knowledge).
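With a local Hugging Face model, greedy decoding just means turning sampling off. A minimal sketch, where the model name and prompt are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("How many r's are in strawberry?", return_tensors="pt")

# do_sample=False -> greedy decoding: always pick the highest-probability token;
# temperature and top_p are ignored.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```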
I also used do_sample to apply greedy decoding to Llama 3 8B, to test whether it could give a boost on the baseline questions. The results were as follows:

As you can see, there is an improvement in MMLU and Big-Bench, but very little progress in higher math.
Now, commercial APIs mostly don't expose do_sample, so to approximate this behavior without access to the model, you can set the temperature to 0, but that is not guaranteed to be equivalent.
So, you may now have a question: if we do see small improvements, why not always use greedy decoding?
Even if we ignore the need for creativity in the output, less capable LLMs can get stuck in repetitive loops, such as 'The color is blue blue blue blue blue', where 'blue' keeps being the highest-probability token and therefore repeats.
Advanced CoT
So far we have covered linear techniques, where the model produces its output along a single thread or chain.
But shortly after the first CoT paper was published, DeepMind proposed another, more advanced technique called Chain of Thought with Self-Consistency (CoT-SC).
The technique creates multiple paths of reasoning and uses some method to select the most consistent answer (or path) at the end.

CoT-SC Demo
They reported a 1-8% improvement in arithmetic reasoning using this method.
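The core of CoT-SC fits in a few lines: sample several reasoning paths at a non-zero temperature and keep the final answer that appears most often. In the sketch below, `sample_cot_answer` is a hypothetical helper that runs one CoT prompt and extracts the final answer:

```python
from collections import Counter

def sample_cot_answer(question: str, temperature: float = 0.7) -> str:
    """Run one chain-of-thought completion and return just the extracted final answer."""
    raise NotImplementedError   # stand-in for an actual API call plus answer parsing

def cot_self_consistency(question: str, n_paths: int = 10) -> str:
    # Each call samples a different reasoning path because temperature > 0.
    answers = [sample_cot_answer(question) for _ in range(n_paths)]
    # Marginalize over the reasoning paths: the most frequent final answer wins.
    return Counter(answers).most_common(1)[0][0]
```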
Another approach, quite similar in spirit, uses multiple paths but does not rely on any prompting.
Remember the greedy decoding I talked about in the previous section?
This approach is similar, except that instead of only forcing the selection of the most probable token, it also looks at the confidence score of the entire response.

Assessment of internal confidence scores
To do this, the system first branches on some number k of the most likely initial tokens and then generates a path from each of them. Once the answers are generated, it computes a confidence score by analyzing the probability (logit) of each token along the different paths.
The result returned is the answer (or path) with the highest probability.
This method is called Decoding CoT and was proposed by DeepMind. The idea is to look at the model's internal confidence in the answer it returns.
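A rough sketch of the idea with a Hugging Face model follows. It is simplified in at least one way: the paper computes confidence only over the answer tokens, while this sketch averages the top-1 vs. top-2 probability gap over everything generated; the model name is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def cot_decoding(prompt: str, k: int = 5, max_new_tokens: int = 200):
    inputs = tokenizer(prompt, return_tensors="pt")
    first_logits = model(**inputs).logits[0, -1]
    top_k_tokens = torch.topk(first_logits, k).indices   # branch on the k most likely first tokens

    candidates = []
    for token in top_k_tokens:
        branch_ids = torch.cat([inputs.input_ids, token.view(1, 1)], dim=-1)
        out = model.generate(branch_ids, max_new_tokens=max_new_tokens,
                             do_sample=False, output_scores=True,
                             return_dict_in_generate=True)
        # Confidence: average gap between the top-1 and top-2 probabilities of each generated
        # token (the paper restricts this to the answer span; averaged over everything here).
        probs = [torch.softmax(score[0], dim=-1).topk(2).values for score in out.scores]
        confidence = sum((p[0] - p[1]).item() for p in probs) / len(probs)
        text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
        candidates.append((confidence, text))

    return max(candidates)   # the path with the highest internal confidence
```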
But what happens if the model lacks the inherent knowledge needed to answer the question? As with CoT-SC, this approach depends heavily on the model having the right answer in the first place.
However, that doesn't mean we shouldn't test it.
For all of these techniques, someone has open-sourced an implementation, and this one is no exception.
So it was easy for me to set up a system to test these methods and compare how they perform on the smaller open-source model Llama 3 8B.

Thanks to Codelion for open sourcing his implementation so I can easily reproduce it:https://github.com/codelion/optillm
As the results above show, Decoding CoT clearly produces the best results for this particular model, compared with the other methods (such as entropy-based decoding) or plain greedy decoding.
Newer techniques
Research is moving so fast these days that it's hard to keep up with it all. I won't go into too much detail here, but I do want to mention Tree of Thoughts (ToT), especially in combination with Monte Carlo search.
ToT was proposed by Princeton and DeepMind in late 2023 and builds on earlier tree-based reasoning methods.
ToT differs from self-consistency (CoT-SC): instead of generating multiple paths and evaluating them only after they have been generated, ToT evaluates the thoughts dynamically as it progresses.

A simple ToT demo.
We can think of ToT as four different people coming together to solve a problem. At each step, they propose their own ideas and jointly evaluate which ones seem most promising. If one person's reasoning seems flawed, they drop out and the others carry on with their solutions.
In the end, the person who reasoned correctly will be able to provide you with the answer.
This allows the model to dynamically prune seemingly lackluster paths and focus on more promising threads, thus saving resources.
But, one might ask, how does the system decide which thread is right and which is wrong? This is determined by the model itself.
This is why extensions like Monte Carlo Tree Search (MCTS) can provide a more unbiased evaluation mechanism. MCTS allows backpropagation, meaning it can revisit and improve earlier steps in light of new information, whereas simple ToT only moves forward.
In the four-person analogy, MCTS lets someone with a less-than-optimal idea stay in the game longer, so their thinking keeps being re-evaluated.
MCTS can simulate multiple future paths, assess their potential, and look back to improve early decisions. It introduces external metrics (rewards) rather than relying solely on the model itself.
Statistics like UCB (upper confidence bound) use these rewards to decide which thoughts to explore further or revisit.
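For reference, the standard UCB1 rule scores a node by its average reward plus an exploration bonus for rarely visited nodes; a minimal sketch:

```python
import math

def ucb1(total_reward: float, visits: int, parent_visits: int, c: float = 1.41) -> float:
    """UCB1 score: exploitation (average reward) plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        return float("inf")            # always try unvisited thoughts first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)
```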
MCTS is slightly more complex than simple ToT and deserves its own article.
Economics of CoT
So far, you might be thinking: well, we've made some improvements, so why not always use the more advanced forms of chain of thought?
First, the cost (and thinking time).
Here is the average number of reasoning steps per question when the chains are applied to the different models.

To put that in perspective, you're paying on average up to 8 times as much per question. For Sonnet, which performs best on advanced math questions, you'd pay up to $15 per 500 questions.
This may not seem like much, but once you start using such a system every day to generate answers for customer service or your team, it can amount to hundreds or even thousands of dollars per month.
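To make that concrete: $15 per 500 questions is about $0.03 per question, so at, say, 1,000 questions a day you would spend roughly 0.03 × 1,000 × 30 ≈ $900 a month, versus something closer to $110 a month without the chains (at roughly one eighth of the per-question cost).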
In some cases, it makes sense to use advanced reasoning, but not always.
In some cases, CoT fine-tuning may essentially eliminate the need for multiple calls.
There is a trade-off to consider here: we want to increase the thinking time to give the model enough time to reason effectively, but doing so also increases user frustration and costs.
Building smarter systems
Last September, a paper entitled "To CoT or not to CoT?" argued that most of the improvement from applying CoT comes in math and complex reasoning.
We see this here as well, where CoT brings limited lift on simple problems.
When we apply these chains, we wait longer for an answer. Is that worth it? For simple tasks, all of these strategies can be overkill.
However, if you are building a system where you need to make sure that the answers are correct, some form of CoT or decoding may be of great benefit.
Perhaps an approach worth considering is to have a first model set up the initial steps based on the difficulty of the task and assess whether it is confident it can answer at all, then let a model reason through the chain, and finally have another model score the response.
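One way to sketch that kind of pipeline is shown below; every helper here is a hypothetical stand-in for your own model calls and thresholds:

```python
def assess_difficulty(question: str) -> str:
    """Ask a small, cheap model to classify the question as 'simple' or 'hard'."""
    raise NotImplementedError

def quick_answer(question: str) -> str:
    """Direct answer with no chain of thought."""
    raise NotImplementedError

def chain_of_thought_answer(question: str) -> str:
    """Answer via a reasoning chain (e.g. the CoT chaining approach above)."""
    raise NotImplementedError

def score_answer(question: str, answer: str) -> float:
    """Have a separate model grade the answer, returning a confidence between 0 and 1."""
    raise NotImplementedError

def answer(question: str) -> str:
    if assess_difficulty(question) == "simple":
        return quick_answer(question)                  # cheap path for easy questions
    candidate = chain_of_thought_answer(question)      # reason through harder ones
    if score_answer(question, candidate) < 0.5:        # low confidence: retry or escalate
        candidate = chain_of_thought_answer(question)
    return candidate
```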
(Original text by Ida Silfverskiöld, via Heart of the Machine.)