A decade of Cambricon iteration: from large models to search and recommendation, domestic computing power takes on the "software moat"
Computing power has become the core engine driving the continuous evolution of the AI industry. With the rapid rise of domestic large models, building an autonomous, controllable, and sustainable closed loop for domestic computing power — one in which domestic compute platforms are deeply adapted to the domestic AI ecosystem — has become an industry consensus and a key direction.
Encouragingly, news of "Day 0" adaptation and joint innovation between domestic compute platforms and the domestic AI ecosystem has surfaced repeatedly this year. As a representative domestic compute company, Cambricon is embracing the domestic AI ecosystem with an increasingly open posture, maintaining deep collaboration with mainstream AI communities and leading companies.
Cambricon announced completed adaptation on the very day Alibaba's Qwen 3 series and the DeepSeek-V3.2-Exp model were released this year, which implies that in-depth cooperation began before the models shipped. Cambricon has also announced partnerships with SenseTime, Zhipu AI, and StepFun to advance deep adaptation between domestic compute and domestic large models.
These collaborations let developers complete migration and deployment on the Cambricon platform at low cost, significantly lowering the barrier to compute substitution and ecosystem integration.
Cambricon was founded to provide the underlying compute for the "great explosion of artificial intelligence," which demands not only powerful hardware but also versatile, easy-to-use software. Cambricon has consistently pursued a unified training-and-inference strategy on a unified base software platform, building a complete system from self-developed chip architectures to a high-performance software stack. The result is deep integration of compute architecture, compiler optimization, and algorithm scheduling: the hardware delivers extreme parallel performance and energy efficiency for algorithms, while the software releases every bit of that compute through intelligent compilation, scheduling, and adaptation.
Cambricon NeuWare, the company's base software platform, lets users and developers work across different Cambricon hardware and application scenarios, lowering the learning curve, improving development efficiency, and enabling rapid migration and deployment of AI applications.
After years of investment, Cambricon NeuWare has matured considerably: it is fully compatible with the community's latest PyTorch releases and the Triton operator development language, supports rapid migration of user models and custom operators, and has reached industry-leading levels on a number of metrics.
To address the demanding practice of operating and maintaining large clusters, Cambricon NeuWare has further enriched its suite of cluster tools, providing a solid foundation for deploying, debugging, and tuning large-scale training and inference jobs in cluster environments.
The trust that many domestic AI vendors place in Cambricon compute validates the stability and competitiveness of Cambricon NeuWare, which already meets real-world commercial requirements.
I. From large models to search-and-recommendation training and inference, Cambricon has completed large-scale technology and product validation
Large-model technology is becoming the core driving force of the intelligent economy, profoundly reshaping human-computer interaction. The "search and recommendation" scenario — search, advertising, and recommendation systems — has become one of the most valuable frontiers for commercializing large-model technology. Search and recommendation systems empowered by large models not only improve user experience markedly but also reshape the logic of traffic distribution: "finding information," "watching content," and "buying things" move from passive recommendation to active understanding, from keyword matching to intent insight.
The integration of large models with search and recommendation is not only a technical innovation but a re-engineering of the business model. Cambricon has completed large-scale technology and product validation for both large-model and search-and-recommendation training and inference.
In search-and-recommendation training, Cambricon has steadily advanced technology and product validation. Results show the solution supports streaming training tasks across multiple scenarios and runs stably for months on end, meeting requirements for both accuracy and stability. On the performance side, graph-pattern fusion of LayerNorm, RMSNorm, L2Norm, and similar operators has been completed, yielding significant speedups; building on graph fusion, further XLA optimizations deliver an even larger acceleration ratio.
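The fusion described above can be illustrated with a toy model. The sketch below (plain Python, not Cambricon code) contrasts an unfused RMSNorm — where squaring, reduction, and scaling would each be a separate kernel with its own memory round-trip — with a fused version that reads the input once; the two are numerically equivalent, which is what makes graph-pattern fusion a safe rewrite:

```python
import math

def rmsnorm_unfused(x, weight, eps=1e-6):
    # Three separate passes: square, mean, scale — each pass would be
    # its own kernel launch with a full read/write of the data.
    sq = [v * v for v in x]
    mean = sum(sq) / len(sq)
    inv = 1.0 / math.sqrt(mean + eps)
    return [v * inv * w for v, w in zip(x, weight)]

def rmsnorm_fused(x, weight, eps=1e-6):
    # One accumulation pass plus one scaling pass: a fused kernel
    # reads x once and writes the output once.
    acc = 0.0
    for v in x:
        acc += v * v
    inv = 1.0 / math.sqrt(acc / len(x) + eps)
    return [v * inv * w for v, w in zip(x, weight)]

x, w = [1.0, 2.0, 3.0, 4.0], [1.0] * 4
assert all(abs(a - b) < 1e-12
           for a, b in zip(rmsnorm_unfused(x, w), rmsnorm_fused(x, w)))
```

On real hardware the win comes from eliminating the intermediate memory traffic between the passes, which is exactly what graph-level pattern matching automates.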
In large-model training, Cambricon focuses on MoE-class models such as DeepSeek V3/V3.1 and Qwen2.5/Qwen3/Qwen3-Next, while extending training support to models such as GLM-4.5, Flux, Wan2.1/2.2, Qwen3-VL, and Hunyuan-Video. Building on native FP8 compute, FP8 training support has been added for the Qwen and DeepSeek series, with accuracy meeting expectations.
In large-model inference, Cambricon researches and applies new data types such as W4A4 and MX-FP8/MX-FP4, and explores a variety of efficient attention mechanisms, including Sparse Attention and Linear Attention.
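The article does not specify which quantization recipe is used, so as a generic illustration, the sketch below shows the basic shape of 4-bit weight quantization (the "W4" half of W4A4): symmetric per-tensor rounding into the signed range [-8, 7], with reconstruction error bounded by half a quantization step:

```python
def quantize_int4(weights):
    """Symmetric per-tensor 4-bit quantization: map floats to ints in [-8, 7]."""
    scale = (max(abs(w) for w in weights) / 7.0) or 1.0  # avoid scale == 0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    return [v * scale for v in q]

w = [0.10, -0.42, 0.33, 0.05]
q, s = quantize_int4(w)
assert all(-8 <= v <= 7 for v in q)
w_hat = dequantize_int4(q, s)
# Round-to-nearest keeps the error within half a step of the scale.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, w_hat))
```

Production schemes (per-group scales, MX block formats, activation quantization) layer more structure on top, but the round-trip above is the core contract any such format must satisfy.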
Cambricon keeps pace with model evolution, supporting multimodal fusion models such as Qwen-Omni, 3D generation models such as Hunyuan3D, speech generation models such as CosyVoice, and emerging architectures such as DLM and VLM, keeping its technology stack advanced and complete.
Notably, through deep ecosystem cooperation on the DeepSeek-V3.2-Exp model, Cambricon achieved adaptation on release day and open-sourced the adaptation code in sync with its partners.
At the same time, Cambricon continues to optimize the vLLM inference engine: refining mixed-precision low-bit quantized inference, supporting communication-computation overlap, prefill-decode (PD) disaggregated deployment, IBGDA-style ultra-low-latency large-scale expert parallelism, and torch.compile-based mitigation of host-side bottlenecks — accelerating large-model applications across the board.
Cambricon continues to push extreme performance optimization for the latest open-source model series — DeepSeek, Qwen, Wan, Hunyuan, and others — with targeted work on long sequences and ultra-low decoding latency to maintain its performance lead.
Cambricon's rapid breakthroughs in large-model and search-and-recommendation training and inference, and its completion of large-scale technology and product validation, stem from long-term technical depth and hardware-software synergy. It is this combination of integrated hardware and software, leading performance, and efficient deployment that has won Cambricon rapid market trust and recognition.

▲ Cambricon NeuWare, Cambricon's base software platform. Only some components are shown in the figure; see the appendix at the end of the article for the acronyms.
II. Highly stable drivers and runtime libraries let AI enterprises scale without worry
High stability in the underlying driver is a precondition for business deployment: Cambricon's driver sustains enterprise workloads for months without downtime. Its throughput has also improved dramatically across optimization iterations, largely eliminating host-side bottlenecks in demanding search-and-recommendation and large-model inference scenarios and laying the groundwork for leading end-to-end compute efficiency.
By decoupling data dependencies from scheduling dependencies through fine-grained parallelism, and layering on multi-queue DSA asynchronous scheduling and co-optimization, kernel scheduling throughput reaches hundreds of thousands of tasks per second — an industry-leading level.
The runtime fully supports batched kernel-graph launch, which folds multiple operators into a single launch, and supports device-side residency and launch, achieving multi-kernel dispatch at extremely low latency — on par with international competitors.
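Why batched graph launch helps can be seen from a back-of-the-envelope cost model. The numbers below are illustrative assumptions, not Cambricon measurements; the point is that a fixed per-launch host overhead paid once per graph, rather than once per kernel, dominates for short kernels:

```python
LAUNCH_OVERHEAD_US = 5.0   # assumed fixed host-side cost per launch
KERNEL_TIME_US = 2.0       # assumed device time per (short) kernel

def one_by_one(n_kernels):
    # Every kernel pays the host-side launch overhead.
    return n_kernels * (LAUNCH_OVERHEAD_US + KERNEL_TIME_US)

def graph_launch(n_kernels):
    # One launch submits the whole pre-built kernel graph.
    return LAUNCH_OVERHEAD_US + n_kernels * KERNEL_TIME_US

# For a single kernel the two are identical; at scale the graph wins.
assert graph_launch(1) == one_by_one(1)
assert graph_launch(100) < one_by_one(100)
```

The shorter the kernels (as in low-latency decoding), the larger the fraction of wall time the per-launch overhead represents, which is why launch batching matters most in exactly those scenarios.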
An IBGDA-style interface has been added, giving the communication library a system-level guarantee for ultra-low-latency expert-parallel communication.
Cambricon's driver and runtime libraries support rich device-partitioning scenarios:
(1) Visible cluster: runtime-programmable elastic partitioning for rapid deployment;
(2) sMLU: time-division multiplexing, suitable for rapid Docker deployment;
(3) MIM: physical partitioning, fully benchmarked against the international competitor's MIG technology.
III. Compiler and debugging/tuning tools iterate toward industry-leading efficiency
BANG C is the programming language of Cambricon's BANG heterogeneous parallel programming model. It extends C/C++ for the characteristics of the MLU architecture, enabling efficient parallel programs that fully exploit the MLU's massively parallel design to accelerate computation.
BANG C supports a rich set of compiler optimizations, including link-time optimization (LTO), profile-guided optimization (PGO), function-level on-chip memory reuse based on call graphs, device-side dynamic linking, static derivation of the address space of memory-access instructions, automatic synchronization for intra-task parallel instruction streams, refined memory-dependence analysis, instruction-level parallelism via local and global instruction scheduling, and high-performance instruction-layout optimization for the MLU architecture.
Together these techniques unlock the chip's full potential; operators such as matrix multiplication achieve industry-leading efficiency.
The Triton operator development language is iterated rapidly, supporting all features of Triton 3.4, including the FP8/FP4 data types. A fast libentry was introduced to cut the host-side overhead of Triton kernels, yielding significant gains on small workloads. Multiple optimizations have been implemented in the Triton compiler backend:
(1) optimizing the on-chip RAM occupancy and concurrency of software pipelining, with an automatic pipelining scheme that balances pipeline performance against single-instruction performance;
(2) multi-objective instruction scheduling that trades off instruction parallelism, on-chip RAM occupancy, and latency hiding;
(3) auto-tuning and auto-scheduling of task parallelism;
(4) automatic loop fusion;
(5) memory-access and compute optimizations based on operator semantics, such as transpose propagation and fusion, and slice/broadcast propagation;
(6) performance modeling for instruction fusion and instruction selection.
These optimizations improve the performance generality of Triton kernels: Matmul, FlashAttention-class, and HSTU-class operators are significantly faster, and some hotspot operators now match handwritten operators.
Debugging and tuning tools for the system and operators have been further improved: operator core dumps capture the faulting site, map it precisely to debug information, and come with a core-dump parsing tool that rapidly analyzes and localizes the root cause of operator anomalies.
For host- and device-side parallelism tuning, CNPerf collects full-dimensional performance data with very low tracing overhead, accurately capturing host and device execution flows, PMU performance counters, and function call stacks. It traces multiple task types — kernel computation, memory copies, communication — covering full-stack performance data from the hardware up to the application.
CNPerf-GUI offers strong intelligent tuning: a built-in expert-suggestion system automatically detects problems such as device idle bubbles, underutilization, and collective-communication waits, and pinpoints hotspot operators and performance bottlenecks. For multi-node, multi-card scenarios it provides multi-log and cluster-level iterative analysis, further simplifying tuning in complex settings.
For single-operator tuning, CNPerf samples hardware state at GHz frequencies, precisely recording MLU front-end and back-end activity. Users can analyze inter-stream/inter-core synchronization and operator software scheduling on this basis to maximize utilization of hardware back-end resources.
CNPerf-GUI runs on Linux, macOS, and Windows; it reads CNPerf, PyTorch Profiler, TensorFlow Profiler, CNTrainKit, and other log formats, and loads very large logs (hundreds of millions of function records) quickly and smoothly.
CNSanitizer, a new program-correctness analysis tool, uses runtime instrumentation to automatically detect racing accesses across cores, racing accesses between instruction streams within a single core, device-side out-of-bounds accesses, undefined program behavior, and use of uninitialized memory.
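The common thread in such sanitizers is that every memory access is routed through an instrumented wrapper that validates it before it happens. A minimal sketch of the out-of-bounds case, in plain Python (a toy model, not CNSanitizer's implementation):

```python
class CheckedBuffer:
    """Toy device buffer whose accesses are instrumented with bounds checks."""
    def __init__(self, size):
        self.data = [0] * size

    def store(self, idx, value):
        # The instrumentation: validate before touching memory.
        if not 0 <= idx < len(self.data):
            raise IndexError(f"out-of-bounds store at index {idx}")
        self.data[idx] = value

buf = CheckedBuffer(4)
buf.store(3, 42)              # in bounds: fine
try:
    buf.store(4, 99)          # one past the end: caught by the check
    caught = False
except IndexError:
    caught = True
assert caught and buf.data[3] == 42
```

Race detection works on the same principle but records which core or instruction stream performed each access, flagging conflicting pairs that lack a synchronization edge between them.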
CNAdvisor, a new performance-analysis and tuning-suggestion tool, combines runtime instrumentation with hardware performance counters to capture program state; against a library of tuning heuristics it automatically identifies performance problems, marks the corresponding source locations, and proposes optimizations.
IV. Continuously polishing core operators and building a reliable testing and diagnostics platform
The Cambricon compute library actively embraces open-source community evolution, continuously polishing the functionality, performance, and stability of core operators so that open-source and proprietary models run efficiently and stably on Cambricon chips. The library has undergone deep functional extension and performance optimization for hot scenarios such as search and recommendation, large language models, text-to-image, and text-to-video:
Sparse access and computation over large-scale embedding tables are heavily optimized, with performance comparable to GPU competitors;
Performance generality of matrix-multiplication operators such as GEMM/BatchGEMM/GroupGEMM is significantly enhanced, with large-shape matmul HFU reaching industry-leading levels;
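For readers unfamiliar with the metric, HFU (hardware FLOPs utilization) is simply the FLOP rate a kernel achieves divided by the chip's peak FLOP rate. The helper below computes it for a single GEMM; the numbers in the example are illustrative, not Cambricon benchmark data:

```python
def hfu(m, n, k, elapsed_s, peak_tflops):
    """Hardware FLOPs utilization of one GEMM: achieved / peak FLOP rate.

    An (m, k) @ (k, n) matmul performs 2 * m * n * k floating-point ops
    (one multiply and one add per inner-product term).
    """
    achieved = 2 * m * n * k / elapsed_s       # FLOP/s actually delivered
    return achieved / (peak_tflops * 1e12)     # fraction of peak

# Illustrative only: an 8192^3 GEMM in 2.5 ms on a 500 TFLOPS-peak chip.
u = hfu(8192, 8192, 8192, 2.5e-3, 500)
assert 0 < u < 1    # a utilization is a fraction of peak
```

Because small or skinny matrices cannot keep all compute units busy, HFU is usually quoted for large shapes, which is why the text specifies "large-scale matrix multiplication."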
These matmul operators support a wide range of community-public and privately customized low-precision quantization schemes;
Extended development and auto-tuning of a CUTLASS-like GEMM template library are supported;
R&D on Attention-class operators in low-precision acceleration and other directions has passed verification with good speedups;
To support MTP (multi-token prediction) in large language models, fused operators such as Top-k/Top-p sampling and random sampling were developed to optimize MTP performance.
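To make the fused Top-k/Top-p sampling concrete, here is the reference logic in plain Python (a readable model, not the fused kernel): keep at most k highest-probability tokens, cut the tail once cumulative mass reaches p, renormalize, then draw. A fused operator performs these steps in a single kernel instead of several:

```python
import random

def top_k_top_p_filter(probs, k, p):
    """Keep at most k top tokens, truncated once cumulative mass >= p,
    then renormalize the kept probabilities."""
    ranked = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)[:k]
    kept, total = [], 0.0
    for idx, pr in ranked:
        kept.append((idx, pr))
        total += pr
        if total >= p:
            break
    norm = sum(pr for _, pr in kept)
    return [(idx, pr / norm) for idx, pr in kept]

def sample(filtered, rng):
    # Inverse-CDF draw over the renormalized distribution.
    r, acc = rng.random(), 0.0
    for idx, pr in filtered:
        acc += pr
        if r <= acc:
            return idx
    return filtered[-1][0]

filtered = top_k_top_p_filter([0.05, 0.50, 0.30, 0.10, 0.05], k=3, p=0.9)
assert abs(sum(pr for _, pr in filtered) - 1.0) < 1e-9
assert sample(filtered, random.Random(0)) in {1, 2, 3}
```

In MTP, several candidate tokens are sampled per step, so this filter-and-draw path runs far more often than in ordinary decoding — which is what makes fusing it worthwhile.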
To sustain rapid iteration of the compute library while guaranteeing software quality — with no regressions in accuracy or performance — the Cambricon compute library team has also built a reliable testing and diagnostics platform, providing a rich set of diagnostic tools, high-coverage functional and performance test cases, and scientific acceptance criteria.
V. Communication-library scalability on par with international mainstream competitors; cluster tools empower 10,000-card scenarios
The communication library is specially optimized for large scale: new Allreduce algorithms such as HDR and DBT prioritize communication bandwidth at large scale, and Alltoall has been deeply optimized so that its large-scale scalability matches international mainstream competitors.
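The article does not describe HDR or DBT internals, but the yardstick for any allreduce algorithm is the standard bandwidth accounting, illustrated here for the classic ring algorithm: each node sends 2(N-1)/N of the buffer, so per-node traffic approaches twice the buffer size as the cluster grows, and better algorithms compete on latency and link utilization under that bound:

```python
def ring_allreduce_traffic(n_nodes, bytes_per_node):
    """Per-node bytes sent by ring allreduce: 2 * (N - 1) / N * buffer size.

    Each of the N-1 reduce-scatter steps plus N-1 allgather steps sends
    a 1/N-sized chunk, so traffic approaches 2x the buffer as N grows.
    """
    chunk = bytes_per_node / n_nodes
    return 2 * (n_nodes - 1) * chunk

# Two nodes: each sends exactly one full buffer (half down, half back).
assert ring_allreduce_traffic(2, 100) == 100.0
# At scale the per-node traffic saturates near 2x the buffer.
assert ring_allreduce_traffic(1024, 100) < 200.0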
The communication library also strengthens maintainability and testability — online instrumentation, modular logging, a high-reliability service module — helping users quickly analyze communication send errors, hangs, and similar problems, and improving cluster communication availability. By supporting in-kernel RDMA operations over RoCE NICs (IBGDA-style), it markedly reduces All-to-All latency in large-scale expert-parallel scenarios and raises the end-to-end throughput of MoE model inference.
CNTrainKit-Accu (large-scale cluster accuracy-localization tool): provides end-to-end accuracy localization for 10,000-card distributed training — online monitoring of accuracy metrics, automated triage of accuracy problems, information collection, and intelligent analysis with suggested fixes. It fully supports NaN/Inf anomaly detection and fast localization, with second-level traceability at the anomaly site, greatly improving the efficiency of troubleshooting accuracy problems in large-model and search-and-recommendation training so that every accuracy issue is precisely captured.
CNTrainKit-Monitor (large-scale cluster monitoring and tuning tool): provides real-time communication and compute performance profiles of 10,000-card training jobs, millisecond-level task-health visualization, operator-granularity performance analysis, and identification of performance bottlenecks in AI jobs. With visualization, investigation, and optimization at 10,000-card scale, it gives large clusters genuine "self-awareness."
CNCE (cluster supervision platform): builds a panoramic data-center monitoring system covering compute, network, and storage, with second-level state collection and topology visualization for 100,000-card compute clusters. It provides closed-loop fault management — automatic discovery, intelligent diagnosis, automatic handling — and multi-dimensional anomaly diagnosis and root-cause localization for 10,000-card jobs, letting users focus on algorithm innovation and model training rather than fluctuations in the underlying hardware. CNCE moves cluster operations from "manual inspection" to "intelligent autonomy," significantly improving the availability and stability of large-scale AI training.
CNAnalyzeInsight (fault-analysis tool): an intelligent log-analysis and root-cause diagnosis engine supporting second-level retrieval over GB-scale logs and multi-dimensional aggregate analysis. With both online real-time diagnostic alerts and offline rapid analysis, it closes the loop from anomaly discovery to problem localization, cause summarization, and repair suggestions, significantly improving training-job stability and problem-handling efficiency.
VI. Embracing open source with near-zero-cost GPU migration tools
Cambricon tracks community PyTorch closely, supporting every community version from PyTorch 2.1 through PyTorch 2.8 and adapting key features including DDP, FSDP, FSDP2, HSDP, Tensor Parallelism, Context Parallel, Pipeline Parallelism, SDPA, Inductor, MLU Graph, AOTInductor, and the Inductor cppwrapper.
torch.compile performance matches GPU compile-acceleration ratios, efficiently supporting product validation across multiple training and inference scenarios.
Cambricon also provides GPU Migration, a one-click tool that moves models from GPU to MLU at near-zero cost, along with the TorchDump accuracy-debugging tool and the Torch Profiler performance-debugging tool, helping users efficiently localize and fix accuracy and performance problems.
In addition, Cambricon supports community ecosystems such as PyTorch Lightning, TorchTitan, and TorchRec, with a long-term mechanism for tracking community releases: an MLU-adapted version ships within two weeks of each community release.
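The article does not explain how GPU Migration works internally; one common approach for such tools is to intercept and rewrite framework device identifiers so unmodified `cuda` code lands on the new backend. The sketch below is a hypothetical illustration of that idea, not Cambricon's implementation:

```python
# Hypothetical device-string remapping, the kind of shim a one-click
# migration tool can install around a framework's device handling.
DEVICE_MAP = {"cuda": "mlu"}

def remap_device(device):
    """Rewrite a torch-style device string like 'cuda:0' to its MLU twin;
    strings for other backends pass through unchanged."""
    name, _, index = device.partition(":")
    name = DEVICE_MAP.get(name, name)
    return f"{name}:{index}" if index else name

assert remap_device("cuda:0") == "mlu:0"
assert remap_device("cuda") == "mlu"
assert remap_device("cpu") == "cpu"
```

A shim like this lets existing model scripts run without source changes, which is what "near-zero cost" migration amounts to in practice; accuracy and performance are then verified with the debugging tools named above.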
VII. Nearly a decade of continuous iteration: Cambricon helps AI reach a thousand industries
Through nearly a decade of continuous polishing and iteration, Cambricon has built an efficient, easy-to-use, stable, mature, and highly scalable integrated hardware-software product system. With leading chip technology and a complete base software platform, Cambricon products have been validated in large-model, search-and-recommendation, image and video generation, and various multimodal training and inference scenarios, earning wide recognition.
Along the way, Cambricon products have withstood high-intensity testing in ever larger scenarios, driving the continuous evolution of the software platform and chip system and forming a virtuous cycle in which applications drive optimization and optimization enables stronger applications.
By providing users with more efficient, stable, and broader support, Cambricon accelerates the intelligent transformation of industries and brings AI capabilities to thousands of sectors. Its vision — "enabling machines to better understand and serve humans" — is becoming reality step by step.
Appendix:
Full names of the Cambricon NeuWare acronyms labeled in the figure
1. Cambricon HLO: the Cambricon backend for the High-Level Operations (HLO) set for machine learning models;
2. CNNL: Cambricon Network Library, the Cambricon AI compute library;
3. CNNL-Extra: Cambricon CNNL Extra, an extension library for the Cambricon AI compute library;
4. CNCV: Cambricon Computer Vision Library;
5. CNCL: Cambricon Communications Library, the Cambricon high-performance communication library;
6. CNFFmpeg: Cambricon FFmpeg, a hardware-acceleration library developed on top of open-source FFmpeg;
7. CNCC: Cambricon Compiler Collection, the Cambricon BANG C language compiler;
8. CNAS: Cambricon Assembler, the Cambricon assembler component;
9. CNGDB: Cambricon GNU Debugger, the Cambricon BANG C language debugging tool;
10. CNSanitizer: Cambricon Sanitizer, the Cambricon code-inspection tool;
11. CNPAPI: Cambricon Profiling API, the Cambricon performance-analysis interface library;
12. CNPerf: Cambricon Performance, the Cambricon performance-analysis tool;
13. CNPerf-GUI: Cambricon Performance GUI, the graphical tool for Cambricon performance profiling;
14. CNMon: Cambricon Monitor, the Cambricon device monitoring and management command-line tool;
15. CNVS: Cambricon Validation Suite, the Cambricon device validation toolset;
16. CNFieldiag: Cambricon Field Diagnostic, the Cambricon field diagnostic tool;
17. CNAnalyzeInsight: the Cambricon fault-analysis tool;
18. CNCL-benchmark: Cambricon Communications Library Benchmark, the communication-library performance benchmarking tool;
19. Cambricon Device Plugin: the Cambricon device plugin;
20. CCOMP: Cambricon Cluster Operation Management Platform.
Article source: Zhidongxi (Smart Stuff) | Author: Chen Junda
© Copyright notice
Copyright belongs to the author; please do not reproduce without permission.