Domestic AI's Most Competitive Night Yet! Dark-Horse Models DeepSeek and Kimi Take On OpenAI o1: How Strong Are They in Hands-On Testing?


Just in time for the holiday, domestic AI model vendors who had been holding back their big moves released a burst of Spring Festival gifts all at once.

Hot on the heels of DeepSeek-R1's release, Kimi's new model k1.5 was also officially launched, claiming performance on par with the full-strength multimodal version of OpenAI o1.

Add the earlier strong debuts of Zhipu's GLM-Zero, StepFun's reasoning model Step R-mini, and iFLYTEK Spark's deep reasoning model X1, and the year-end showdown among domestic large models has well and truly begun, putting real pressure on overseas vendors led by OpenAI.

DeepSeek-R1: outperforms OpenAI o1 on math, code, natural language reasoning, and other tasks.

Moonshot AI (Dark Side of the Moon) k1.5: math, code, visual multimodal, and general capabilities comprehensively surpass GPT-4o and Claude 3.5 Sonnet.

Zhipu GLM-Zero: specializes in mathematical logic, code, and complex problems requiring deep reasoning.

Step-2 mini: extremely responsive, with an average first-token latency of 0.17 seconds; released alongside the Step-2 Literary Master Edition.

Spark X1: shines in math, with a thorough thinking process, covering math from elementary school through college.

This blowout is no accident but the release of long-accumulated strength. It is fair to say that the breakout of domestic AI models on the eve of the Spring Festival may well redefine China's coordinates on the world map of AI development.

China's "Source God" goes viral overseas: this is the real OpenAI!

The DeepSeek-R1 released last night is already live on DeepSeek's website and app: open it and it is ready to use.

The classic trick questions, which is larger, 9.8 or 9.11, and how many r's are in "strawberry", posed no problem in our first test. The chain of thought was a bit lengthy, but the correct answers speak for themselves.
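
Both warm-up questions reduce to one-line checks, so the expected answers are easy to verify mechanically:

```python
# Sanity checks for the two classic trick questions: counting the
# letter r in "strawberry" and comparing the decimals 9.8 and 9.11.
assert "strawberry".count("r") == 3  # there are three r's
assert 9.8 > 9.11                    # as decimals, 9.8 is the larger
print("both checks pass")
```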

Faced with a Ruozhiba-style trick question, "How high do I need to jump to skip the ads on my phone?", the quick-responding DeepSeek-R1 not only sidesteps the language trap but also offers plenty of practical suggestions for avoiding ads, a very user-friendly touch.

A while back, a logic puzzle went viral: "If yesterday were tomorrow, then today would be Friday. What day is it actually today?"

Grilled with the same question, OpenAI o1 answered Sunday while DeepSeek-R1 answered Wednesday; at the very least, DeepSeek-R1's answer is closer to the commonly accepted one.
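
The o1/R1 disagreement maps onto two readings of the riddle, and a brute-force pass over the week makes both explicit (this framing of the ambiguity is our own, not the article's):

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday"]

def shift(day: str, n: int) -> str:
    """Day of the week n days after `day` (negative n goes backwards)."""
    return DAYS[(DAYS.index(day) + n) % 7]

# Reading 1: actual yesterday plays the role of tomorrow, so the
# hypothetical "today" is two days before the actual today.
reading_1 = [d for d in DAYS if shift(d, -2) == "Friday"]

# Reading 2: actual tomorrow plays the role of yesterday, so the
# hypothetical "today" is two days after the actual today.
reading_2 = [d for d in DAYS if shift(d, 2) == "Friday"]

print(reading_1, reading_2)  # ['Sunday'] ['Wednesday']
```

Each model's answer corresponds to one consistent reading, which is why the riddle keeps tripping people up.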

According to the technical report, DeepSeek-R1 outperforms the official version of OpenAI o1 on tasks such as math, code, and natural language reasoning; on paper it leans toward the strengths of a "science student".

Just in time for the friendly exchange of math homework between Chinese and American netizens on Xiaohongshu, we also had DeepSeek-R1 help answer some of the questions.

As a bit of trivia: the last time DeepSeek went viral overseas, some users discovered it also supports image recognition, so we could feed it the exam-paper images directly for analysis.

There are two questions in total; for the first it chose C, and for the second it chose A. What's more, the "confident" DeepSeek-R1 noticed that its computed answer of 18 was missing from the second question's options and, judging from the choices given, inferred that the original question likely contains a typo.

In the subsequent linear algebra proof question, DeepSeek-R1 produced logically rigorous proof steps and offered multiple verification methods for the same problem, demonstrating deep mathematical skill.

Leading on performance, undercutting on cost, and staying true to open source: DeepSeek-R1 is officially released, with its model weights open-sourced as well. One might well declare that DeepSeek, from the East, is the real "Open" AI.

DeepSeek-R1 is released under the MIT License, which permits users to train other models from R1 via distillation. The R1 API also exposes the chain-of-thought output to users and can be invoked by setting model='deepseek-reasoner'.
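
As a minimal sketch, a deepseek-reasoner request body looks like a standard OpenAI-style chat completion payload. The base URL `https://api.deepseek.com` and the extra `reasoning_content` response field follow DeepSeek's public docs, but the exact names should be checked against the current API reference:

```python
import json

def build_reasoner_request(question: str) -> dict:
    """OpenAI-style chat payload selecting DeepSeek-R1 via the
    model name 'deepseek-reasoner'."""
    return {
        "model": "deepseek-reasoner",
        "messages": [{"role": "user", "content": question}],
    }

payload = build_reasoner_request("Which is larger, 9.8 or 9.11?")
print(json.dumps(payload, indent=2))
```

POSTing this body to the chat completions endpoint with a bearer API key should return, alongside the final `content`, a separate `reasoning_content` field carrying the chain of thought.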

Moreover, DeepSeek-R1's training techniques are fully public, with the paper available at:
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf

One noteworthy finding in the DeepSeek-R1 technical report is the "aha moment" that occurs during R1-Zero's training.

In the middle phase of training, DeepSeek-R1-Zero begins to actively re-evaluate its initial solution approach and allocate more time to optimizing its strategy (for example, attempting several different solutions). In other words, within the RL framework, the AI can spontaneously develop human-like reasoning capabilities, even beyond the limits of preset rules.

It may also point the way toward more autonomous, adaptive AI models that can, for example, dynamically adjust strategies in complex decisions (medical diagnosis, algorithm design). As the report puts it: "This moment is not only an 'aha moment' for the model, but also for the researchers observing its behavior."

In addition to the main big models, DeepSeek's small models are equally strong.

DeepSeek has also open-sourced six smaller models distilled from the two 660B models, DeepSeek-R1-Zero and DeepSeek-R1. Among them, the 32B and 70B models match OpenAI o1-mini across several capabilities.

What's more, DeepSeek-R1-Distill-Qwen-1.5B, with only 1.5B parameters, outperforms GPT-4o and Claude 3.5 Sonnet on math benchmarks, scoring 28.9% on AIME and 83.9% on MATH.

HuggingFace Link:
https://huggingface.co/deepseek-ai

On API pricing, DeepSeek, nicknamed the "Pinduoduo of AI" for its budget-friendly reputation, again adopts flexible tiered pricing: 1 to 4 yuan per million input tokens depending on cache hits, and a flat 16 yuan per million output tokens, once more slashing the cost of development and use.
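
Taken at face value, those tiers make cost estimates trivial. A hypothetical helper, with rates hard-coded from the figures above (verify against DeepSeek's current pricing page before relying on them):

```python
def r1_api_cost_yuan(input_tokens: int, output_tokens: int,
                     cache_hit: bool = False) -> float:
    """Rough cost in yuan from the tiered pricing quoted above:
    1 yuan/million input tokens on cache hits, 4 yuan/million on misses,
    and a flat 16 yuan/million output tokens."""
    input_rate = 1.0 if cache_hit else 4.0   # yuan per million input tokens
    output_rate = 16.0                       # yuan per million output tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A million tokens in and out, with no cache hits
print(r1_api_cost_yuan(1_000_000, 1_000_000))  # 20.0 yuan
```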

After its release, DeepSeek-R1 once again caused a sensation in overseas AI circles and attracted a flood of spontaneous fans. Blogger Bindu Reddy even crowned DeepSeek "open-source AGI and the future of civilization".

The glowing reviews stem from the model's excellent performance in netizens' real-world use. From a 30-second explanation of the Pythagorean theorem to a 9-minute visualized deep dive into the principles of quantum electrodynamics, DeepSeek-R1 handles it all without a misstep.

Some netizens particularly appreciate the chain of thought DeepSeek-R1 displays, finding it "like a human's inner monologue, at once professional and adorable".

Jim Fan, a senior research scientist at NVIDIA, spoke highly of DeepSeek-R1. He noted that it is a non-US company keeping alive OpenAI's original open mission, achieving impact by making raw algorithms and learning curves public, and, in passing, taking a subtle jab at OpenAI.

DeepSeek-R1 not only open-sources a family of models but also discloses all of its training secrets. It may be the first open-source project to demonstrate significant and sustained growth from the RL flywheel.

Impact can be achieved through legendary projects such as "internal ASI deployment" or "Project Strawberry", or simply by making raw algorithms and matplotlib learning curves public.

After delving into the paper, Jim Fan highlighted a few key findings:

Driven entirely by reinforcement learning, with no SFT "cold start". This is reminiscent of AlphaZero, which mastered Go, shogi, and chess from scratch rather than first imitating the moves of human masters, and it is the paper's most critical finding.

Real rewards are computed with hard-coded rules, avoiding learned reward models that reinforcement learning can easily exploit. As training progresses, the model's thinking time gradually increases; this is not pre-programmed behavior but an emergent property, accompanied by the emergence of self-reflection and exploration.

GRPO replaces PPO: it removes PPO's critic network and instead averages rewards over multiple samples, a simple way to reduce memory usage. Notably, GRPO is an innovative approach proposed by the authors' own team.
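
The group-relative core of GRPO can be sketched in a few lines: score a group of sampled answers with a rule-based reward, then normalize each reward against the group's own mean and spread instead of consulting a learned critic. A toy sketch (the real objective also includes a clipped policy-ratio term and a KL penalty, omitted here):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward by the
    group's mean and standard deviation, replacing PPO's learned critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if a hard-coded rule
# (e.g. exact-match on the final answer) marks them correct, else 0.0
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # correct samples get positive advantage
```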

Overall, this work demonstrates the groundbreaking potential of reinforcement learning in large-scale practical settings and shows that certain complex behaviors can emerge from simpler algorithmic structures, without tedious tuning or human intervention.

A picture is worth a thousand words, and a more obvious comparison is below:

With this, DeepSeek has completed its second viral moment at home and abroad, a victory not only for its technology but for the open-source spirit shared by China and the world, and it has earned many loyal overseas fans along the way.

A new model rivals OpenAI o1, with three breakthroughs in three months: Kimi has overseas circles buzzing

The Kimi k1.5 multimodal thinking model launched the same day.

This is the third major K-series release since Kimi shipped the k0-math model last November and the k1 visual thinking model in December.

In short-CoT mode, Kimi k1.5 demonstrates math, coding, visual multimodal, and general capabilities that decisively outstrip industry leaders GPT-4o and Claude 3.5 Sonnet.

In long-CoT mode, Kimi k1.5's code and multimodal reasoning already rival the official OpenAI o1, making it the first model outside OpenAI to reach o1-level multimodal reasoning performance.

Alongside the model launch, Kimi also published a full technical report on its training.

GitHub Link:
https://github.com/MoonshotAI/kimi-k1.5

According to the official introduction, the core technological breakthroughs of the k1.5 model are mainly reflected in four key dimensions:

Long context scaling. We scale the RL context window to 128k and observe continued performance gains as context length increases. A key idea behind our approach is using partial rollouts to improve training efficiency: reusing large chunks of previous trajectories when sampling new ones, avoiding the cost of regenerating trajectories from scratch. Our observations identify context length as a key dimension in the continued scaling of RL with LLMs.

Improved policy optimization. We derive an RL formulation for long-CoT and employ a variant of online mirror descent for robust policy optimization. The algorithm is further improved by our efficient sampling strategy, a length penalty, and optimized data recipes.

A concise framework. Long-context scaling combined with improved policy optimization yields a concise RL framework for learning with LLMs. Because we can scale the context length, the learned CoTs exhibit planning, reflection, and revision; increasing the context length effectively increases the number of search steps. As a result, we show that strong performance can be achieved without relying on more sophisticated techniques such as Monte Carlo tree search, value functions, or process reward models.

Multimodal capability. The model is trained jointly on text and visual data and can reason jointly over both modalities. Its mathematical ability is outstanding, but geometry problems that depend on reading figures remain difficult, since it mainly supports text input in formats such as LaTeX.
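
As an illustration of the length-penalty idea from the second point (our toy sketch; the actual k1.5 formulation differs), one can discount correct answers by how much of a token budget their chain of thought consumed, so that shorter correct solutions earn higher reward:

```python
def length_penalized_rewards(samples, max_len=8192, weight=0.5):
    """samples: list of (correct, num_tokens) pairs for one prompt.
    Wrong answers get zero reward; correct ones are discounted by the
    fraction of the token budget their chain of thought used."""
    rewards = []
    for correct, num_tokens in samples:
        if not correct:
            rewards.append(0.0)
            continue
        penalty = weight * min(num_tokens / max_len, 1.0)
        rewards.append(1.0 - penalty)
    return rewards

# A short correct answer outranks a long correct one, which outranks a wrong one
print(length_penalized_rewards([(True, 1000), (True, 8000), (False, 500)]))
```

The design intent is simply to make verbosity costly without ever rewarding an incorrect but short answer.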

The preview version of the k1.5 multimodal thinking model will roll out gradually on Kimi's website and app. Notably, the k1.5 release has also drawn a strong response overseas, with netizens praising the model and hailing the rise of China's AI power.

In fact, China's flurry of reasoning-model releases at year's end is no coincidence: it is a clear sign that the ripples set off across global AI by the release of OpenAI's o1 model last year have finally reached China.

From catching up to competing head-to-head in just a few months, domestic large models are proving "China speed" through action.

Fields Medalist Terence Tao has suggested that this type of reasoning model may need only one or two more rounds of iteration and capability gains to reach the level of a "competent graduate student". And the vision for AI development goes far beyond that.

We are now witnessing a critical transformation for AI agents. Crossing from mere "knowledge enhancement" to "execution enhancement", they are beginning to actively participate in decision-making and task execution.

At the same time, AI is breaking through the limits of a single modality and rapidly evolving toward multimodal integration. When execution meets thinking, AI truly gains the power to change the world.

Building on this, models that think like humans are opening up more possibilities for AI's practical deployment.

On the surface, this year-end wave of domestic reasoning models may carry the shadow of "Chinese-style follower innovation".

Look closer, however, and you will find that, in both the depth of their open-source strategies and the precision of their technical details, Chinese vendors are carving out a development path all their own.
