Alibaba's new-generation Qwen3 models released: surpassing DeepSeek R1 to top the open-source charts


Amid all the speculation over which would come first, DeepSeek V4 (or R2) or Qwen3, the answer has arrived: Qwen3 was released first.

In the early morning of April 29, Alibaba open-sourced its new-generation Tongyi Qianwen Qwen3 series, covering eight different sizes. The flagship, Qwen3-235B, adopts a Mixture-of-Experts (MoE) architecture with 235B total parameters (only about one-third of DeepSeek-R1's), just 22B activated parameters, and 36 trillion tokens of pre-training data.
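To make those sizes concrete, here is a minimal sketch of loading a Qwen3 checkpoint with Hugging Face transformers. The repo id follows Qwen's published naming for the flagship (235B total, 22B activated) but should be treated as an assumption here; the smaller variants load the same way.

```python
# Minimal sketch: loading a Qwen3 checkpoint with Hugging Face transformers.
# "Qwen/Qwen3-235B-A22B" is an assumed repo id following Qwen's naming scheme;
# substitute a smaller variant for local experiments.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-235B-A22B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # shard across available devices
)
```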


In terms of performance, according to the official introduction, Qwen3 excels across a number of evaluations, surpassing mainstream models such as DeepSeek-R1 and OpenAI-o1 to become the best-performing open-source large language model currently available.


Specifically, Qwen3 shows big gains in reasoning, instruction following, tool invocation, and multilingual capability: in the Olympiad-level AIME25 math evaluation, Qwen3 scored 81.5, setting a new open-source record; in LiveCodeBench, which examines coding ability, it broke the 70-point mark, outperforming even Grok3; and in ArenaHard, which assesses alignment with human preferences, it surpassed OpenAI-o1 and DeepSeek-R1 with 95.6 points.

Repeated leaderboard wins have become routine, but something else sets Qwen3 apart this time: it is more interested in showing how to do more with less than in simply pushing model boundaries. And, in taking its turn with DeepSeek to push open-source models forward, it has once again handed the community a recipe different from R1's.

Not quite R1's method, but it surpassed R1

Similar to R1, Qwen3 also follows the idea of "using models to train models".

An important source of Qwen3's performance gains in the pre-training phase is a large amount of high-quality synthetic data.

Quantitatively, the Qwen3 dataset is significantly larger than Qwen2.5's: Qwen2.5 was pre-trained on 18 trillion tokens, while Qwen3 uses nearly twice that, about 36 trillion tokens covering 119 languages and dialects. Part of the data comes from text extracted from PDF documents; another part is synthesized by the Qwen2.5 series of models.

The technical report explicitly mentions, "We used Qwen2.5-VL to extract text from these documents and used Qwen2.5 to improve the quality of the extracted content. To increase the amount of math and code data, we synthesized data using Qwen2.5-Math and Qwen2.5-Coder, two expert models in the fields of math and code, synthesizing data in a variety of forms including textbooks, question-answer pairs, and code snippets."
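As an illustration only, not Qwen's actual pipeline, an expert-model synthesis loop of the kind described above might look like the sketch below; the checkpoint name is an assumption.

```python
# Illustrative sketch of expert-model data synthesis: prompt a math-specialist
# model to produce textbook-style exercises with worked solutions.
# The checkpoint name is an assumption; this is not Qwen's actual pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-Math-7B-Instruct")

seed_topics = ["modular arithmetic", "geometric series", "binary search"]
synthetic_corpus = []
for topic in seed_topics:
    prompt = (f"Write a textbook-style exercise on {topic}, "
              "then solve it step by step.")
    out = generator(prompt, max_new_tokens=512, do_sample=True)
    synthetic_corpus.append(out[0]["generated_text"])
```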

This also means that during pre-training, Qwen3 leverages the advantages of its own model ecosystem to build a data system for self-iterative improvement.

While pre-training lays the foundation of Qwen3's capabilities, the post-training phase on top of it is Qwen3's most crucial innovation: a multi-stage training method that integrates reasoning ability and direct-response ability in a single model.

[Image: thinking and non-thinking modes implemented within the same model, as shown in the Qwen app interface]

The image above shows thinking and non-thinking modes implemented within the same model. In the official application interface, the choice of mode still appears to be left to the user, but with Deep Thinking selected, the user can additionally set a thinking budget, allowing compute to be allocated dynamically according to the difficulty of the problem.
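On the API side, Qwen's published usage examples expose the same switch through the chat template. A minimal sketch follows; the checkpoint name and the enable_thinking keyword are taken from those examples and should be treated as assumptions here.

```python
# Sketch: toggling thinking vs. non-thinking mode in a single Qwen3 model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are below 100?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for direct answers without a <think> trace
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```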

For post-training, Qwen3 uses a "back-and-forth" iteration similar to R1's overall pipeline: fine-tuning, RL, fine-tuning again, then more targeted RL.

[Image: Qwen3's multi-stage post-training pipeline]

Like DeepSeek, it distills small models from a large one, except that Qwen distills entirely from its own models.

Another particularly noteworthy point is the Phase 2 RL, in which the Qwen team used rule-based rewards to strengthen the model's ability to explore and dig deeper.

"The second phase focuses on large-scale reinforcement learning, utilizing rule-based rewards to enhance the model's ability to explore and drill down." The official blog writes. This is in stark contrast to GRPO (results-based reward optimization), which is currently considered key to the success of models such as DeepSeek R1. qwen3 does not rely exclusively on results-based reward mechanisms such as GRPO.

Next, in the third phase, Qwen3 fine-tuned the model on a combination of long chain-of-thought data and common instruction-tuning data, folding the non-thinking mode into the thinking model and ensuring a seamless blend of reasoning and rapid-response capabilities.

Finally, in the fourth phase, Qwen3 applied reinforcement learning on tasks across more than 20 general domains, including instruction following, format following, and agent capabilities.

Qwen3 did not follow R1's recipe exactly, yet it managed to surpass R1.

A full range of sizes, but ever-"smaller" parameter counts

Continuing Qwen's ecosystem-first approach, Qwen3 ships eight model versions in one go: two MoE models at 30B and 235B, and six dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B, each achieving SOTA (state-of-the-art performance) among open-source models of the same size.

This time, the full lineup did not disappoint: the community that had long awaited it responded with cheers.

MLX is an efficient machine-learning framework built specifically for Apple Silicon, and the MLX team had Qwen3 support ready before the model was even released: the 0.6B and 4B versions can run on phones, while the 8B, 30B, and 30B-MoE versions can run on computers...
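For reference, running a small Qwen3 variant through mlx-lm takes only a few lines; the quantized repo id below is an assumption, since community conversions vary.

```python
# Sketch: running a small Qwen3 variant on Apple Silicon via mlx-lm.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-4B-4bit")  # assumed repo id
print(generate(model, tokenizer,
               prompt="Explain MoE routing in one paragraph.",
               max_tokens=200))
```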


Size is one thing; what matters more is that Qwen keeps delivering, at ever smaller sizes, the performance it used to need larger sizes to reach. In many scenarios, these models are now capable of running on-device.

According to the official blog, Qwen3's 30B-parameter MoE model achieves more than a 10x improvement in performance leverage: activating only 3B parameters, it matches the previous-generation Qwen2.5-32B. Qwen3's dense models also keep breaking through, reaching the same performance with half the parameters; the 32B version of Qwen3, for instance, punches above its class to exceed Qwen2.5-72B.
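The back-of-envelope arithmetic behind those claims, using only the numbers quoted above:

```python
# Illustrative arithmetic for the efficiency claims (numbers from the blog post).
moe_active, dense_matched = 3e9, 32e9    # Qwen3-30B-A3B activates ~3B; matches Qwen2.5-32B
print(f"activated-parameter leverage: ~{dense_matched / moe_active:.1f}x")  # ~10.7x

qwen3_dense, qwen25_dense = 32e9, 72e9   # Qwen3-32B reportedly exceeds Qwen2.5-72B
print(f"parameter ratio: ~{qwen25_dense / qwen3_dense:.2f}x fewer params")  # ~2.25x
```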


Qwen3 will clearly be one of the hottest models for the open-source community to play with and take apart for some time, and its fuller technical report, once released, is expected to reveal more "secret recipes" that keep driving open-source model advances and innovation.
