Qwen3-Next


Alibaba open-sources an 80-billion-parameter large model with 1:50 ultra-sparse activation and million-token contexts; costs drop by 90% while performance rivals hundred-billion-parameter models.

Language: zh, en
Collection time: 2025-09-12

What is Qwen3-Next?

Qwen3-Next is the next-generation base model architecture released by Alibaba Cloud's Tongyi team on September 12, 2025, aiming to deliver extreme long-context capability and parameter efficiency through architectural innovation. Its core model, Qwen3-Next-80B-A3B, has 80 billion total parameters but activates only 3 billion during inference (a 1:50 activation ratio), which significantly reduces computational cost while maintaining high performance. The model supports ultra-long contexts at the million-token level, cuts training cost by more than 90% compared with the previous-generation dense model Qwen3-32B, improves long-text inference throughput by more than 10x, and performs on par with the 235-billion-parameter flagship Qwen3 model.

Qwen3-Next's core technologies

  1. High-sparsity MoE architecture (a routing sketch follows this list)
    • Dual-track expert design: the model contains 512 expert modules, of which 10 routed (sparse) experts plus 1 shared expert are dynamically selected for each token. The shared expert provides a stable computational base while the sparse experts handle specialized tasks, a "general practitioner + specialist" collaboration.
    • Extreme sparsity: the activation ratio reaches 1:50, well above the industry average (e.g., 1:10 for Qwen3), improving computational efficiency by over 90%.
  2. Hybrid attention mechanism (Hybrid Attention; a delta-rule sketch follows this list)
    • Gated DeltaNet (linear attention): models long-range dependencies (e.g., the thread of an entire book) at O(N) complexity, cutting memory consumption by about 50%.
    • Gated Attention (standard attention): efficiently captures local information such as phrases and keywords; the two layer types are mixed at a 3:1 ratio to balance performance and efficiency.
  3. Multi-Token Prediction (MTP; a speculative-decoding sketch follows this list)
    • During pre-training, the model predicts multiple future tokens (t+1, t+2, ..., t+n) simultaneously, improving its understanding of causal structure.
    • At inference time this pairs with speculative decoding: multiple candidate tokens are generated at once and verified in parallel, speeding up decoding severalfold.
  4. Training stability optimizations (a normalization sketch follows this list)
    • Zero-Centered RMSNorm: constrains the normalization-layer weights to avoid exploding or vanishing gradients and improve training stability.
    • MoE router initialization: ensures experts are selected without bias early in training, reducing initialization perturbations.
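
The sketch below illustrates the "shared + routed experts" layer from item 1. Only the 512-expert pool, top-10 routing, and always-on shared expert come from the description above; every name and size here (Expert, SharedExpertMoE, D_MODEL, D_FF) is a hypothetical stand-in, not Qwen3-Next's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS = 512          # routed expert pool (from the description above)
TOP_K = 10                 # routed experts activated per token (plus 1 shared)
D_MODEL, D_FF = 2048, 512  # hypothetical sizes, for illustration only

class Expert(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.SiLU(),
                                 nn.Linear(D_FF, D_MODEL))
    def forward(self, x):
        return self.net(x)

class SharedExpertMoE(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(D_MODEL, NUM_EXPERTS, bias=False)
        self.experts = nn.ModuleList(Expert() for _ in range(NUM_EXPERTS))
        self.shared = Expert()  # always-on "general practitioner"

    def forward(self, x):                    # x: (tokens, D_MODEL)
        top_w, top_i = self.router(x).topk(TOP_K, dim=-1)
        top_w = F.softmax(top_w, dim=-1)     # normalize over chosen experts
        out = self.shared(x)                 # stable base from the shared expert
        for slot in range(TOP_K):            # add the routed "specialists"
            for e in top_i[:, slot].unique():
                m = top_i[:, slot] == e
                out[m] = out[m] + top_w[m, slot, None] * self.experts[int(e)](x[m])
        return out
```

The unbiased router initialization mentioned in item 4 would apply to `router` here, keeping early expert selection close to uniform.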
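Item 2's O(N) claim comes from replacing softmax attention with a recurrent fast-weight update. Below is a hedged sketch of a DeltaNet-style delta rule in its sequential form; production kernels are chunk-parallel, and `beta` (the per-token write strength produced by a gate) is assumed here rather than taken from Qwen3-Next's code.

```python
import torch

def delta_rule_recurrence(q, k, v, beta):
    """Sequential form of the delta rule behind DeltaNet-style linear
    attention: a (d x d) fast-weight state S is updated once per token,
    so cost grows linearly with sequence length instead of quadratically.
    q, k, v: (seq, d); beta: (seq,) per-token write strength in [0, 1]."""
    seq, d = q.shape
    S = q.new_zeros(d, d)                  # fast-weight memory
    out = torch.empty_like(v)
    for t in range(seq):
        kt, vt, bt = k[t], v[t], beta[t]
        # S <- S (I - b k k^T) + b v k^T: erase the old value stored under
        # this key, then write the new one
        S = S - bt * torch.outer(S @ kt, kt) + bt * torch.outer(vt, kt)
        out[t] = S @ q[t]                  # read the memory with the query
    return out
```

In the hybrid stack, three such linear-attention layers would alternate with one standard gated-attention layer, matching the 3:1 mix described above.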
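Item 3's decoding speed-up can be made concrete with a single greedy speculative-decoding step. `draft` and `target` below are placeholder callables returning logits (in Qwen3-Next's case the MTP head would play the drafting role); this simplified sketch uses greedy verification and is not the production algorithm.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    """One speculative-decoding step. target/draft map token ids of shape
    (1, len) to logits of shape (1, len, vocab)."""
    L = ids.shape[1]
    cand = ids
    for _ in range(k):                       # 1) cheap draft proposes k tokens
        nxt = draft(cand)[:, -1].argmax(-1, keepdim=True)
        cand = torch.cat([cand, nxt], dim=-1)
    logits = target(cand)                    # 2) one parallel full-model pass
    checks = logits[:, L - 1:-1].argmax(-1)  # target's pick at each draft slot
    drafted = cand[:, L:]
    agree = (checks == drafted)[0].int()     # 3) accept longest agreeing prefix
    n_ok = int(agree.cumprod(0).sum())
    fix = checks[:, n_ok:n_ok + 1]           # 4) target's own token at the
    return torch.cat([ids, drafted[:, :n_ok], fix], dim=-1)  # first mismatch
```

Each verification pass yields at least one new token even when every draft is rejected, and several tokens are often accepted at once, which is where the severalfold speed-up comes from.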
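Zero-Centered RMSNorm (item 4) is commonly implemented by storing the learnable gain as a zero-initialized offset around 1; the sketch below reflects that general pattern and is an assumption, not the model's exact code.

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-centered gain

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * (1.0 + self.weight)          # effective gain starts at 1
```

Parameterizing the gain around zero means weight decay pulls the effective scale toward 1 rather than toward 0, which helps keep normalization-layer weights bounded during long training runs.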

Scenarios for Qwen3-Next

  1. Long Text Processing
    • Legal document analysis: million-token contexts support complete parsing of long documents such as contracts and judgments.
    • Scientific literature review: efficiently processes long papers and lab reports, extracting key information and generating summaries.
  2. Efficient Inference
    • Real-time interactive applications: the low activation-parameter count lets it run efficiently on domestic compute hardware, making it suitable for intelligent customer service, online education, and similar scenarios.
    • Low-latency generation: MTP accelerates the decoding process and improves conversational fluency.
  3. Complex reasoning tasks
    • Math and programming: scores 87.8 on the AIME25 math-reasoning benchmark, approaching SOTA; outperforms the flagship open-source Qwen model on the LiveCodeBench coding benchmark.
    • Multi-step logic-chain construction: the reasoning (Thinking) versions excel at problems that require step-by-step reasoning, such as logic puzzles and strategic planning.

Qwen3-Next project address

Reasons to Recommend

  1. Extreme cost-effectiveness
    • Training costs fall by more than 90% and inference throughput rises more than 10x, significantly lowering the barrier to enterprise AI adoption.
  2. Technological leadership
    • Innovations such as the hybrid attention mechanism, high-sparsity MoE, and MTP represent the industry's cutting edge and set a new standard for long-context processing.
  3. Open-source ecosystem advantage
    • More than 170,000 models have been derived from Tongyi Qianwen (Qwen), the most of any model family worldwide, and developers can quickly customize applications based on the open-source code.
  4. Strong scenario adaptability
    • Supports diverse scenarios from long-text analysis to real-time interaction, covering industries such as law, scientific research, education, and customer service.
