New MoE architecture! Alibaba open-sources Qwen3-Next, slashing training costs by 90%!
Large language models (LLMs) are moving to the next level.
In the early hours of Friday morning, the Alibaba Tongyi team officially released and open-sourced its next-generation base model architecture, Qwen3-Next. With 80B total parameters but only 3B activated, the model delivers performance comparable to the flagship Qwen3-235B and also surpasses Gemini-2.5-Flash-Thinking, marking a major breakthrough in model computational efficiency.

The new model immediately went live on Qwen.ai and was uploaded to HuggingFace (a minimal loading sketch follows the links below).
- New model web version: https://chat.qwen.ai/
- HuggingFace: https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
- Kaggle: https://www.kaggle.com/models/qwen-lm/qwen3-next-80b
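For reference, here is a minimal sketch of loading the instruct checkpoint with the Hugging Face transformers library. The repository id and generation settings are assumptions inferred from the collection above, not details confirmed by the article, and a recent transformers release with Qwen3-Next support is assumed.

```python
# Minimal loading sketch (assumptions: repo id "Qwen/Qwen3-Next-80B-A3B-Instruct"
# and a transformers version that already ships Qwen3-Next support).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed repo id from the collection above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Briefly explain what a sparse MoE layer is."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```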
Qwen3-Next is designed for the coming trends of context length scaling and total parameter scaling in large models. According to the Tongyi team, compared with the Qwen3 MoE models launched at the end of April, the new structure adds several new technologies and core improvements: a hybrid attention mechanism, a highly sparse MoE structure, a series of optimizations for training stability, and a multi-token prediction (MTP) mechanism for more efficient inference.
Schematic diagram of the model structure:

The Tongyi team described some of the mechanisms used in the new architecture.
- Hybrid Architecture: Gated DeltaNet + Gated Attention
Linear attention breaks the quadratic complexity of standard attention and is more efficient for long contexts. The Tongyi team found that using either linear attention or standard attention alone has limitations: the former is efficient at modeling long sequences but weak at recall, while the latter carries high computational overhead and is unfriendly to inference.
Through systematic experiments, the team found that Gated DeltaNet has stronger in-context learning ability than the commonly used sliding window attention and Mamba2, and that a 3:1 mixing ratio (75% of layers using Gated DeltaNet, 25% retaining standard attention) consistently outperforms either architecture alone, achieving both better performance and higher efficiency.
Tongyi further introduces several enhancements to the standard attention layers it retains (a small layout sketch follows this list):
(1) Following the output gating mechanism from earlier work to mitigate the low-rank problem in attention;
(2) Extending the single attention head dimension from 128 to 256;
(3) Applying rotary position encoding only to the first 25% of the attention head's positional dimensions, which improves length extrapolation.
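To make the 3:1 interleaving and the retained-attention tweaks concrete, here is a small, purely illustrative Python sketch. The layer names, the 4-layer repeating pattern, and the layer count are assumptions for illustration, not the actual Qwen3-Next implementation.

```python
# Illustrative sketch only (assumptions: layer names and a 4-layer repeating
# pattern; the real Qwen3-Next block layout may differ in detail).

def build_layer_plan(num_layers: int = 48) -> list[str]:
    """Interleave layers at the reported 3:1 ratio:
    3 Gated DeltaNet (linear attention) layers per 1 gated standard-attention layer."""
    plan = []
    for i in range(num_layers):
        # every 4th layer keeps full (gated) softmax attention, the rest use Gated DeltaNet
        plan.append("gated_attention" if (i + 1) % 4 == 0 else "gated_deltanet")
    return plan


HEAD_DIM = 256            # single attention head dimension, extended from 128
ROPE_FRACTION = 0.25      # rotary position encoding applied only to the leading 25% of dims
ROPE_DIMS = int(HEAD_DIM * ROPE_FRACTION)   # -> 64 rotary dimensions per head

if __name__ == "__main__":
    print(build_layer_plan(12))   # ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'gated_attention', ...]
    print(f"RoPE applied to {ROPE_DIMS}/{HEAD_DIM} head dims")
```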
- Extremely sparse MoE: only 3.7% of parameters activated
Qwen3-Next employs a highly sparse Mixture-of-Experts (MoE) architecture with 80B total parameters, activating only about 3B parameters per inference step. Experiments show that, with global load balancing in place, continuously increasing the total expert parameters while keeping the number of activated experts fixed leads to a steady decrease in training loss.
Compared with Qwen3 MoE's 128 total experts and 8 routed experts, Qwen3-Next expands to 512 total experts with 10 routed experts plus 1 shared expert, maximizing resource utilization without sacrificing quality.
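As a rough illustration of how such a sparse layer selects experts per token, here is a minimal top-k routing sketch in PyTorch. The load-balancing loss, the shared-expert path, and the actual gating details of Qwen3-Next are omitted, and the tensor sizes are made up.

```python
# Minimal top-k MoE routing sketch (assumption: real Qwen3-Next routing details
# such as gate normalization and auxiliary losses are not shown here).
import torch
import torch.nn.functional as F

NUM_EXPERTS = 512   # total routed experts
TOP_K = 10          # routed experts activated per token
# plus 1 shared expert that every token always passes through (not shown)

def route(hidden: torch.Tensor, router_weight: torch.Tensor):
    """hidden: [tokens, d_model], router_weight: [num_experts, d_model]."""
    logits = hidden @ router_weight.t()                  # [tokens, 512]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(TOP_K, dim=-1)     # pick 10 experts per token
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)  # renormalize gate weights
    return topk_idx, topk_probs

if __name__ == "__main__":
    h = torch.randn(4, 2048)                  # 4 tokens, hidden size 2048 (made up)
    w = torch.randn(NUM_EXPERTS, 2048)
    idx, gate = route(h, w)
    print(idx.shape, gate.shape)              # torch.Size([4, 10]) torch.Size([4, 10])
```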
- Training Stability Friendly Design
Tongyi's team found that the attention output gating mechanism eliminates attention-sink and extreme-activation phenomena and keeps the numerics of every part of the model stable. Qwen3 adopted QK-Norm, where the norm weights of some layers could grow abnormally large. To alleviate this and further improve stability, Qwen3-Next adopts Zero-Centered RMSNorm and, on top of that, applies weight decay to the norm weights to prevent unbounded growth.
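The following is a minimal sketch of a zero-centered RMSNorm layer, assuming the core idea is to parameterize the scale as 1 + w with w initialized at zero; the exact Qwen3-Next implementation may differ.

```python
# Minimal Zero-Centered RMSNorm sketch (assumption: the scale is written as
# 1 + weight so that weight decay on `weight` bounds it around zero instead of
# pulling the effective scale toward 0).
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # starts at 0; effective scale is 1 + weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * (1.0 + self.weight)

# Because the learnable part is centered at zero, standard weight decay keeps
# the norm scale close to 1 rather than decaying it away.
```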
Tongyi also normalizes the MoE router's parameters at initialization, so that every expert is selected without bias early in training, reducing the noise that initialization introduces into experimental results.
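One way to picture this is the tiny sketch below, which simply gives every expert's router row the same initial scale; the concrete normalization used by Qwen3-Next is not specified in the article.

```python
# Illustrative only: normalize router rows at init so no expert is favored
# by initialization noise (assumption: not the exact Qwen3-Next scheme).
import torch

num_experts, d_model = 512, 2048
router = torch.randn(num_experts, d_model)
router = router / router.norm(dim=-1, keepdim=True)   # each expert row has unit norm
```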
- Multi-Token Prediction
Qwen3-Next introduces a native Multi-Token Prediction (MTP) mechanism, which not only yields an MTP module with a high acceptance rate for speculative decoding, but also improves the overall performance of the backbone itself. In addition, Qwen3-Next applies inference-consistent multi-step training to the MTP module, further raising the acceptance rate of speculative decoding in practical scenarios.
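To show why acceptance rate matters, here is a toy sketch of the draft-and-verify loop behind speculative decoding. It uses a simplified greedy acceptance rule and stand-in callables, not the real MTP head or the Qwen3-Next backbone.

```python
# Toy draft-and-verify loop (assumptions: `draft_next_tokens` stands in for an
# MTP/draft head and `target_argmax` for the full model; greedy acceptance is a
# simplification of the usual probabilistic acceptance rule).
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next_tokens: Callable[[List[int], int], List[int]],
    target_argmax: Callable[[List[int]], int],
    k: int = 3,
) -> List[int]:
    """Propose k tokens cheaply, verify with the full model, keep the accepted prefix."""
    proposal = draft_next_tokens(prefix, k)
    accepted: List[int] = []
    for tok in proposal:
        if target_argmax(prefix + accepted) == tok:   # verification passes
            accepted.append(tok)
        else:
            break                                     # first mismatch stops acceptance
    # always append one token from the target so decoding makes progress
    accepted.append(target_argmax(prefix + accepted))
    return accepted

if __name__ == "__main__":
    # Toy deterministic "models": both predict (last_token + 1) mod 100.
    draft = lambda ctx, k: [(ctx[-1] + i + 1) % 100 for i in range(k)]
    target = lambda ctx: (ctx[-1] + 1) % 100
    print(speculative_step([5], draft, target))   # [6, 7, 8, 9] -> all drafts accepted
```

The higher the fraction of drafted tokens the verifier accepts, the more tokens each full-model forward pass yields, which is exactly the quantity the MTP training is tuned to improve.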
Junyang Lin, head of the Tongyi Qianwen (Qwen) large model team, shared details about the development of the next-generation models on X. He said the team has been experimenting with hybrid models and linear attention mechanisms for about a year, and that the new approach should be stable and reliable enough to cope with ultra-long contexts.
The Gated DeltaNet hybrid was settled on only after a great deal of trial and error, while the Gated Attention implementation turned out to be almost a free lunch, bringing extra benefits.

Thanks to the innovative hybrid model architecture, Qwen3-Next shows significant advantages in inference efficiency. Compared to Qwen3-32B, Qwen3-Next-80B-A3B shows excellent throughput in the prefill phase: nearly seven times the throughput at a context length of 4k tokens, and more than ten times when the context length exceeds 32k.
In the decode phase, the model also performs well - achieving nearly four times the throughput improvement in 4k contexts, and maintaining more than ten times the throughput advantage in long context scenarios of more than 32k.


Based on the Qwen3-Next model structure, the Tongyi team trained the Qwen3-Next-80B-A3B-Base model, which has 80 billion total parameters (with only 3 billion activated) and achieves performance similar to, or even slightly better than, the Qwen3-32B dense model. Its training cost (GPU hours) is less than one tenth of Qwen3-32B's, and its inference throughput is more than ten times that of Qwen3-32B at context lengths above 32k, delivering an extremely high price/performance ratio for both training and inference.
The Tongyi team has open-sourced the Instruct and Thinking variants of Qwen3-Next-80B-A3B. The new model solves the long-standing stability and efficiency problems of the hybrid attention + highly sparse MoE architecture in reinforcement learning training, improving both RL training efficiency and final results.
Qwen3-Next-Instruct even outperforms the flagship open-source Qwen model in programming (LiveCodeBench v6), human preference alignment (Arena-Hard v2), and general capability (LiveBench), and surpasses the SOTA dense model Qwen3-32B on core benchmarks including general knowledge (SuperGPQA) and mathematical reasoning (AIME25). Qwen3-Next-Thinking outperforms Gemini-2.5-Flash-Thinking across the board, scoring 87.8 on the AIME25 mathematical reasoning benchmark. This level of performance is achieved while activating only 3B of Qwen3-Next's 80B total parameters.


The Qwen3-Next models are also now available on many third-party platforms.
Example of vibe coding in anycoder using the new model:
