
SuperCLUE is an open-source evaluation system focused on Chinese large models. Through a multi-dimensional, multi-perspective evaluation framework, it aims to faithfully reflect the general capabilities of large models.
Sponsoring organization and background
The SuperCLUE evaluation system was jointly built by Tsinghua University, ModelBest (面壁智能), Zhihu, and other organizations in the OpenBMB open-source community. Its predecessor is the third-party Chinese language understanding benchmark CLUE (Chinese Language Understanding Evaluation), which has been committed to scientific, objective, and neutral evaluation of language models since its launch in 2019.
Evaluation features
- Multi-dimensional comprehensive assessment: SuperCLUE evaluates models across multiple dimensions, including basic ability, professional ability, and Chinese-specific ability. Basic ability covers ten skills such as semantic comprehension, dialogue, and logical reasoning. Professional ability draws on secondary school, university, and professional examinations, spanning more than 50 subjects from mathematics, physics, and geography to the social sciences. Chinese-specific ability targets tasks with distinctly Chinese characteristics, such as idioms and classical poetry.
- Automated assessment technology: As an independent third-party evaluation organization, SuperCLUE uses automated assessment techniques to eliminate the uncertainty introduced by human judgment and to deliver unbiased, objective results.
- Open subjective question evaluation: To stay consistent with real user experience, SuperCLUE incorporates open-ended subjective questions. Using a multi-dimensional, multi-perspective, multi-level evaluation framework in dialogue form, it realistically simulates application scenarios of large models and effectively examines their generation capability.
- Multi-round dialogue scenario evaluation: SuperCLUE constructs multi-round dialogue scenarios to probe, at a deeper level, how a large model performs in realistic multi-turn conversations, evaluating its handling of context, memory, and sustained dialogue (see the sketch after this list).
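The mechanics of such an automated, multi-turn evaluation can be illustrated with a short sketch. This is a hypothetical illustration, not SuperCLUE's actual implementation: it assumes a `judge` callable (for example, a strong reference model prompted with a scoring rubric) that returns a score for each assistant reply given the conversation so far.

```python
from typing import Callable, Dict, List

# Hypothetical types: a dialogue is a list of {"role": ..., "content": ...} turns,
# and `judge` scores one assistant reply, in context, against a rubric.
Dialogue = List[Dict[str, str]]
Judge = Callable[[Dialogue, str], float]


def score_dialogue(dialogue: Dialogue, judge: Judge) -> float:
    """Average the judge's score over every assistant turn, giving each
    reply the full preceding context (which is what exercises memory and
    cross-turn coherence)."""
    scores = []
    for i, turn in enumerate(dialogue):
        if turn["role"] != "assistant":
            continue
        context = dialogue[:i]  # everything the model had seen before replying
        scores.append(judge(context, turn["content"]))
    return sum(scores) / len(scores) if scores else 0.0


# Example usage with a trivial stand-in judge that always returns 4.0.
if __name__ == "__main__":
    demo = [
        {"role": "user", "content": "帮我写一首关于春天的诗"},
        {"role": "assistant", "content": "春风拂柳绿……"},
        {"role": "user", "content": "改成七言绝句"},
        {"role": "assistant", "content": "春风拂柳万条新……"},
    ]
    print(score_dialogue(demo, judge=lambda ctx, reply: 4.0))
```

In a real pipeline the stand-in judge would be replaced by a model call with a fixed rubric, which is what makes the process automated and repeatable rather than dependent on human raters.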
Evaluation data sets and tasks
SuperCLUE's evaluation dataset contains 2,194 questions covering ten basic tasks: computation, logical reasoning, code, tool use, knowledge and encyclopedic facts, language comprehension, long text, role play, generation and creation, and safety. For example, in the April 2024 evaluation, Unisound's Shanhai large model achieved a total score of 69.51, placing it among the top ten large models in China. On long-text capability, which matters for industrial deployment, the Shanhai model scored 68.2, ranking fourth worldwide and third in China.
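As a rough illustration of how per-task results might roll up into a single leaderboard number, the sketch below averages scores over the ten basic tasks. The per-task values (apart from the 68.2 long-text score quoted above) and the equal weighting are assumptions made for illustration; the published SuperCLUE weighting scheme may differ.

```python
# Hypothetical per-task scores (0-100) for one model; only "long_text"
# reflects a figure quoted in the text, the rest are illustrative.
task_scores = {
    "computation": 70.0,
    "logical_reasoning": 66.5,
    "code": 72.0,
    "tool_use": 64.0,
    "knowledge_encyclopedia": 75.0,
    "language_comprehension": 78.0,
    "long_text": 68.2,
    "role_play": 71.0,
    "generation_and_creation": 74.0,
    "safety": 80.0,
}

# Equal-weight average over the ten basic tasks (an assumed aggregation rule).
overall = sum(task_scores.values()) / len(task_scores)
print(f"Overall score: {overall:.2f}")
```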
Impact and significance
The SuperCLUE evaluation system provides important guidance for the technical progress and application of large models. By comparing how different models perform on SuperCLUE, researchers and developers gain a clearer picture of each model's strengths and weaknesses and can optimize and improve them in a targeted way. At the same time, the results serve as an important reference for deploying large models in industry, helping to advance the practical application and industrialization of large model technology.
In conclusion, SuperCLUE is a comprehensive, objective, and fair evaluation tool for Chinese large models, providing strong support for research on and application of large model technology.
Related Navigation

A comprehensive, scientific, and fair evaluation system and open platform for large models, designed to help researchers assess the performance of base models and training algorithms in an all-round way through multi-dimensional evaluation tools and methods.

OpenCompass
An open-source evaluation system for large models, designed to comprehensively and quantitatively assess their capabilities in knowledge, language, understanding, reasoning, and more, and to drive iterative optimization of the models.

AGI-Eval Review Community
A comprehensive evaluation platform, jointly created by well-known universities and organizations, that focuses on assessing the general ability of large models on human cognition and problem-solving tasks, offering diverse evaluation methods and authoritative rankings to support the development and application of AI technology.

MMBench
A multimodal benchmarking framework designed to comprehensively assess and understand the performance of multimodal models in different scenarios, providing robust and reliable evaluation results through a carefully designed evaluation process and annotated datasets.

C-Eval
A Chinese foundation model evaluation suite jointly launched by Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh. It covers objective questions across multiple domains and difficulty levels and aims to measure large models' Chinese comprehension and reasoning abilities.

HELM
An evaluation benchmark initiated by Stanford University that aims to comprehensively assess the capabilities of large language models across multiple dimensions and scenarios, driving technological progress and model optimization.