DeepSeek's "Open Source Week" fires off five releases in a row: the power of software reshapes the AI compute landscape
"OpenAI isn't truly open; DeepSeek is truly deep."
This week, with its "Open Source Week" in full swing, DeepSeek has dropped a new piece of "black tech" every single day, leaving programmers around the world exclaiming: this run is on another level entirely!
From computation to communication to storage, DeepSeek's "five consecutive releases" cover almost the entire AI development chain, squeezing the most out of existing hardware without any upgrades and delivering a leap in training efficiency. It has been called the "strongest support player", and a remarkably generous one at that.
We combed through the technical components DeepSeek released over these days and were pleasantly surprised to find that they fit together, almost as if by design, into a sophisticated and synergistic system.
If you picture the system as a "central kitchen", then whenever the big-model chef starts "cooking", every stage of the process (ingredient prep, order handling, plating, and cooking) meshes together precisely, and the feast is turned out efficiently.
Day 1: FlashMLA - The Ingredient-Prep Robot
FlashMLA is DeepSeek's deep optimization for NVIDIA's Hopper-generation H800 accelerator cards, designed to speed up GPU decoding, handle variable-length sequences, and improve computational efficiency.
Simply put, FlashMLA's biggest strength is dynamically allocating compute resources across text sequences of varying lengths.
Like an intelligent kitchen robot that portions ingredients according to each order, FlashMLA automatically adjusts its chopping speed (GPU resource allocation) to orders of different sizes (long and short text/speech sequences): "short orders" finish in seconds, long orders go into pressure-cooker mode, and processing time is saved all around.
According to benchmarks, FlashMLA can push the H800 to 580 trillion floating-point operations per second, roughly the equivalent of writing out an entire trilogy in one second, while cutting the memory footprint to one fifth of traditional solutions.
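Padding-free handling of variable-length sequences is the core idea here. As a rough illustration (a toy Python sketch, not FlashMLA's actual CUDA kernels; the sequence lengths below are invented), compare the compute a naive padded batch wastes with what packed variable-length scheduling actually needs:

```python
# Illustrative sketch of why variable-length scheduling matters.
# This is NOT FlashMLA code; it only models the accounting.

def padded_cost(seq_lens):
    # Naive batching pads every sequence to the longest one,
    # so the GPU does work proportional to batch_size * max_len.
    return len(seq_lens) * max(seq_lens)

def packed_cost(seq_lens):
    # Variable-length scheduling spends compute only on real tokens.
    return sum(seq_lens)

lengths = [12, 850, 64, 2048, 300]   # hypothetical mixed-length batch
waste = 1 - packed_cost(lengths) / padded_cost(lengths)
print(f"padded: {padded_cost(lengths)} token-slots, "
      f"packed: {packed_cost(lengths)} tokens, "
      f"{waste:.0%} of padded compute wasted")
```

With a batch this skewed, well over half the padded compute is spent on padding tokens, which is exactly the slack a length-aware kernel reclaims.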
Day 2: DeepEP - The Delivery Dispatch Desk
DeepEP is the world's first open-source high-performance communication library customized for Mixture-of-Experts (MoE) models and expert parallelism (EP), designed to address communication bottlenecks in large-scale AI model training and inference.
In the AI central kitchen, DeepEP is a new kind of delivery dispatcher. Where traditional walkie-talkies (legacy communication protocols) tend to garble orders, DeepEP uses FP8 compression to slim down the instructions it broadcasts for complex tasks and can even update the menu in real time.
RDMA technology, meanwhile, acts as a "conveyor belt" for moving ingredients (parameters) between kitchens (server nodes), delivering them straight to the stove (GPU).
The performance numbers are striking too: on H800 GPUs, DeepEP achieves extremely fast GPU-to-GPU communication within a single node via NVLink, with bandwidth up to roughly 150 GB/s, the equivalent of transferring 30 HD movies in one second.
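To make the MoE communication pattern concrete, here is a toy Python sketch (not DeepEP's API; the router, hidden size, and token count are all assumptions) of the "dispatch" step that groups tokens by their assigned expert, plus why carrying activations as FP8 halves the bytes on the wire versus BF16:

```python
# Toy model of MoE dispatch: bucket each token by the expert it routes to.
# Real DeepEP does this across GPUs with NVLink/RDMA; here it is in-process.

def dispatch(token_ids, expert_of):
    buckets = {}
    for t in token_ids:
        buckets.setdefault(expert_of(t), []).append(t)
    return buckets

tokens = list(range(8))
route = lambda t: t % 3            # toy router over 3 experts
print(dispatch(tokens, route))     # tokens grouped per expert

# Why FP8 helps: every dispatched activation is half the size of BF16.
hidden, n_tokens = 7168, 4096      # assumed sizes, not DeepSeek's real config
bf16_bytes = n_tokens * hidden * 2
fp8_bytes = n_tokens * hidden * 1
print(f"BF16 dispatch: {bf16_bytes / 2**20:.0f} MiB, "
      f"FP8: {fp8_bytes / 2**20:.0f} MiB")
```

Halving the payload matters most on the cross-node RDMA hop, where bandwidth, not compute, is the bottleneck.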
Day 3: DeepGEMM - The Smart Stovetop
DeepGEMM is a library focused on efficient FP8 general matrix multiplication (GEMM), serving both dense matrix computation and Mixture-of-Experts (MoE) grouped-GEMM scenarios.
Sticking with the central-kitchen example, DeepGEMM is a universal cooktop with dynamic heat control: it can sear steak over high heat (FP8 precision for dense computation) and simmer soup over low heat (BF16 precision for MoE gating networks), and thanks to JIT technology, one square meter of cooktop can handle up to 10 dishes at the same time.
Where a traditional cooker (the stock CUDA library) needs 3 hours to make Buddha Jumps Over the Wall, DeepGEMM, through tricks such as dynamic precision switching, finishes in just 1 hour and uses half the gas (GPU memory).
Notably, DeepGEMM employs a lightweight just-in-time (JIT) compilation module that compiles kernels dynamically at runtime, eliminating the need to compile and install ahead of time.
In other words, with only about 300 lines of CUDA code, DeepGEMM can outperform traditional libraries running to millions of lines. One quip making the rounds: DeepSeek seems to understand GPUs better than NVIDIA does.
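The core FP8-GEMM trick, scaling operands into a narrow format, multiplying cheaply, then rescaling a high-precision accumulator, can be simulated in a few lines of Python (a toy model, not DeepGEMM's CUDA; 448 is the maximum representable value of the FP8 E4M3 format):

```python
# Toy simulation of scaled low-precision matmul: quantize with a
# per-tensor scale, multiply in the narrow range, rescale the result.

def quantize(xs, max_code=448.0):   # 448 = FP8 E4M3 maximum
    scale = max(abs(x) for x in xs) / max_code
    return [round(x / scale) for x in xs], scale

def dot_fp8(a, b):
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))   # high-precision accumulate
    return acc * sa * sb                       # undo both scales

a = [0.5, -1.25, 2.0]
b = [1.0, 0.75, -0.5]
exact = sum(x * y for x, y in zip(a, b))
print(f"exact: {exact}, fp8-style: {dot_fp8(a, b)}")
```

The real library does this per tile with fine-grained scaling factors so outliers in one block don't destroy precision everywhere else; the principle, though, is the same rescaling dance.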
Day 4: DualPipe & EPLB - The Back-of-House Pipeline Commander
DualPipe and EPLB are two core technologies for large-scale AI model training, focusing respectively on distributed training efficiency and expert-parallel load balancing; both were designed for V3/R1.
In fact, what large-model training fears most is "slacking on the assembly line": the idle time compute units spend waiting for data is commonly called the "bubble", and DualPipe and EPLB are designed to minimize it.
In the central kitchen, DualPipe is a "two-way conveyor belt": the dishwashers (backward propagation) and the food servers (forward computation) work on two parallel belts at the same time. It is the equivalent of washing dishes while cooking, which removes the awkward wait for the dishes to be washed before the next course can go out.
EPLB, for its part, acts as a "smart scheduler" that clones busy chefs (redundant experts) onto idle cooktops (GPUs), making sure the French chef doesn't collapse from exhaustion during the Valentine's Day set-menu rush (load balancing).
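The replication idea behind EPLB can be sketched with a simple greedy toy (an illustration only, not EPLB's actual algorithm; the load numbers are invented): keep cloning the hottest expert onto spare capacity, splitting its traffic, until the spare slots run out.

```python
# Toy expert load balancing: replicate the hottest expert onto a spare
# slot so its traffic splits in half, lowering the per-device peak.

def balance(loads, spare_slots):
    loads = list(loads)
    for _ in range(spare_slots):
        hot = loads.index(max(loads))   # find the hottest expert
        loads[hot] /= 2                 # clone it: traffic splits in half
        loads.append(loads[hot])        # replica goes on a spare slot
    return loads

tokens_per_expert = [900, 120, 100, 80]   # hypothetical skewed routing
before = max(tokens_per_expert)
after = max(balance(tokens_per_expert, 2))
print(f"peak load: {before} -> {after}")
```

Since the slowest device sets the pace of a synchronized training step, cutting the peak load translates almost directly into shorter step times.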
Day 5: 3FS File System - Central Cold Storage + Lightning Delivery
The grand finale, the Fire-Flyer File System (3FS), is a high-performance distributed file system built for high-performance computing. It is designed to meet the demands of AI training and inference workloads, solving the pain point of combining "high-throughput writes" with "low-latency reads".
For the central kitchen, 3FS plays the back-office storage role, with two main technical advantages.
First, light-speed access: 6.6 TB/s of aggregate throughput, the equivalent of emptying 300 freezers (traditional hard drives) of ingredients (data) per minute.
Second, freshness black tech: a combination of SSD and RDMA technology ensures that the steak you see at the Beijing branch and the Shanghai branch is always the same, which is what "strong data consistency" means.
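Where does a headline figure like 6.6 TB/s come from? In a parallel file system, each file is striped across many SSD nodes, so aggregate read bandwidth is roughly nodes times SSDs per node times per-SSD bandwidth. A back-of-the-envelope sketch (the node count, SSD count, and per-SSD figure below are assumptions for illustration, not 3FS's published configuration):

```python
# Back-of-envelope: aggregate bandwidth of a striped parallel file system.
# All three inputs are assumed values, not 3FS's real deployment numbers.

nodes = 180            # storage nodes (assumed)
ssd_per_node = 16      # NVMe SSDs per node (assumed)
per_ssd_gbps = 2.3     # sustained read per SSD in GB/s (assumed)

aggregate = nodes * ssd_per_node * per_ssd_gbps
print(f"aggregate read bandwidth ~ {aggregate / 1000:.1f} TB/s")
```

The point of the sketch is that no single drive is fast; the throughput comes from striping every read across thousands of SSDs and letting RDMA deliver the stripes in parallel.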
The opening shot of AI's "open-source heyday" keeps toppling the ivory tower
Whether it is the delivery dispatch desk or the ingredient-prep robot, the technical components DeepSeek has open-sourced this time are all aimed at further reducing compute costs and optimizing training efficiency.
Some analysts believe the most hardcore significance of this wave of open sourcing is that systematic optimization of the software stack (from the file system up to the communication protocols) can deliver multiplicative efficiency gains on top of existing hardware.
This means AI performance gains no longer rely solely on breakthroughs in chip process nodes. Squeezing more compute out of software rather than stacking hardware is precisely the secret behind DeepSeek's ultra-low-cost "overtaking" of a number of top overseas models.
Some users joked that OpenAI should "donate" its domain name to DeepSeek, the one that is truly open source.

Other netizens say that open-source AI itself is nothing rare, but that DeepSeek embodies a "combination of garage spirit and AGI ambition":

Others posted meme images in tribute to the occasion:

We also asked DeepSeek to comment on the Open Source Week event, and here's what it had to say:

As DeepSeek previously declared:
"There are no lofty ivory towers in this field, just pure garage entrepreneurship and the power of community-built innovation."
"Sharing our small but sincere progress without reservation."
And an even bolder conjecture is emerging: as DeepSeek keeps breaking through hardware bottlenecks with software optimizations, will it redefine what compute means for AI?
The tech frenzy that began in China's garages continues to rewrite the rules of global AI.
This article is from the WeChat account "Hard AI".
© Copyright notes
Copyright belongs to the author; please do not reproduce without permission.