
The sponsoring organization of the HELM benchmark is Stanford University. The benchmark was designed and implemented by Stanford researchers to comprehensively assess the capabilities of large language models across multiple dimensions and scenarios.
The HELM evaluation system comprises three modules: scenarios, adaptations, and metrics. The evaluation focuses on English and benchmarks models on a set of core scenarios and tasks. Its metrics cover seven main aspects: accuracy, calibration/uncertainty, robustness, fairness, bias, toxicity, and inference efficiency. In addition, HELM defines 16 core scenarios covering a range of user tasks and domains, and dense benchmarking was conducted across them.
The benchmark not only gives researchers an important tool for assessing the performance of large models, but also advances the field's understanding of model capabilities, limitations, and risks, thereby promoting technological progress. To date, the HELM evaluation system has accumulated nearly 700 citations on Google Scholar, reflecting its wide influence and recognition in the field.
The HELM evaluation system is a comprehensive benchmark for large language models, designed to test their generalization ability across multiple dimensions and scenarios. It is systematically designed and implemented to evaluate models on multiple core scenarios and metrics, with the goal not only of measuring performance accurately, but also of providing insight into fairness, robustness, and efficiency.
Evaluation Design
- Task design: The HELM evaluation system covers several core scenarios, including question answering, information retrieval, summarization, sentiment analysis, and toxicity detection. Each scenario is paired with a specific dataset and evaluation metrics to ensure comprehensive and accurate assessment.
- Metric dimensions: The HELM evaluation system uses a multi-metric approach with seven main categories: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. Together these metrics are intended to reflect model performance from multiple angles; a sketch of this scenario-by-metric design follows this list.
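To make the scenario-by-metric design concrete, the minimal Python sketch below organizes an evaluation as a grid: each scenario supplies a small dataset, each metric a scoring function, and a model is scored on every cell. All names here (the toy scenarios, the exact_match metric, the evaluate helper) are hypothetical placeholders for illustration, not HELM's actual API.

```python
# Minimal sketch of a scenario x metric evaluation grid.
# Hypothetical structure for illustration; not HELM's actual implementation.
from typing import Callable, Dict, List, Tuple


def exact_match(prediction: str, reference: str) -> float:
    """Toy accuracy metric: 1.0 if the prediction matches the reference exactly."""
    return float(prediction.strip().lower() == reference.strip().lower())


# Each scenario is a list of (input, reference) pairs; in HELM these would be
# full datasets for tasks such as question answering or summarization.
SCENARIOS: Dict[str, List[Tuple[str, str]]] = {
    "question_answering": [("What is the capital of France?", "Paris")],
    "sentiment_analysis": [("I loved this movie.", "positive")],
}

# Each metric maps (prediction, reference) to a score in [0, 1]; HELM's seven
# metric categories (accuracy, calibration, robustness, ...) each contain
# several such functions.
METRICS: Dict[str, Callable[[str, str], float]] = {
    "accuracy": exact_match,
}


def evaluate(model: Callable[[str], str]) -> Dict[str, Dict[str, float]]:
    """Run one model over every (scenario, metric) cell and average the scores."""
    results: Dict[str, Dict[str, float]] = {}
    for scenario_name, examples in SCENARIOS.items():
        predictions = [(model(prompt), reference) for prompt, reference in examples]
        results[scenario_name] = {
            metric_name: sum(fn(p, r) for p, r in predictions) / len(predictions)
            for metric_name, fn in METRICS.items()
        }
    return results


if __name__ == "__main__":
    # A trivial stand-in model that always answers "Paris".
    print(evaluate(lambda prompt: "Paris"))
```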
Model Implementation
- Model selection: The HELM evaluation system covers a wide range of well-known language models, including open-source models (such as GPT-NeoX, OPT, and BLOOM) and closed-source models (such as GPT-3 and Anthropic-LM). These models were densely benchmarked across the core scenarios and metrics during the evaluation.
- Evaluation deployment: To ensure fairness and consistency, all models were evaluated under the same conditions, including the same datasets, evaluation metrics, and experimental setup.
Assessment Results
Through large-scale experiments and evaluations, the HELM evaluation system has produced several findings about the interactions between scenarios, metrics, and models. Key findings include:
- The InstructGPT davinci v2 model excels in accuracy, achieving a head-to-head win rate of over 90%; the win-rate statistic is sketched after this list.
- There is a clear threshold effect between model size and accuracy: models that perform well tend to be large, but the best-performing models are not necessarily the largest.
- Instruction tuning and human feedback are effective means of improving model accuracy.
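The win rate mentioned above is a head-to-head statistic: for each scenario, the fraction of competing models that a given model outperforms, averaged over all scenarios. The minimal sketch below computes such a statistic from a table of per-scenario scores; the model names and numbers are made-up placeholders rather than HELM's published results, and the aggregation is a plain reading of the win-rate idea, not HELM's exact implementation.

```python
# Head-to-head win rate from per-scenario scores.
# Illustrative data only; the models and numbers are placeholders.
from typing import Dict


def win_rates(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each model, average over scenarios the fraction of rivals it beats."""
    models = list(scores)
    scenarios = sorted({s for per_model in scores.values() for s in per_model})
    rates: Dict[str, float] = {}
    for model in models:
        per_scenario_fractions = []
        for scenario in scenarios:
            rivals = [r for r in models if r != model]
            beaten = sum(scores[model][scenario] > scores[r][scenario] for r in rivals)
            per_scenario_fractions.append(beaten / len(rivals))
        rates[model] = sum(per_scenario_fractions) / len(per_scenario_fractions)
    return rates


if __name__ == "__main__":
    toy_scores = {
        "model_a": {"qa": 0.81, "summarization": 0.44},
        "model_b": {"qa": 0.74, "summarization": 0.40},
        "model_c": {"qa": 0.69, "summarization": 0.47},
    }
    # model_a beats both rivals on qa and one of two on summarization -> 0.75
    print(win_rates(toy_scores))
```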
The HELM evaluation system provides a comprehensive and systematic framework for evaluating large language models. By assessing models across multiple dimensions and scenarios, it not only helps researchers better understand and improve them, but also promotes progress across the field. As the technology advances and models are refined, the HELM evaluation system will continue to improve and evolve.
Relevant Navigation

A comprehensive, scientific, and fair big model evaluation system and open platform aims to help researchers assess the performance of basic models and training algorithms in an all-round way by providing multi-dimensional evaluation tools and methods.

MMBench
A multimodal benchmarking framework designed to comprehensively assess and understand the performance of multimodal models in different scenarios, providing robust and reliable evaluation results through a well-designed evaluation process and labeled datasets.

AGI-Eval Evaluation Community
A comprehensive evaluation platform, jointly created by well-known universities and organizations, that focuses on assessing the general ability of large models on human cognition and problem-solving tasks, providing diverse evaluation methods and authoritative rankings to support the development and application of AI technology.

OpenCompass
An open-source big model capability assessment system designed to comprehensively and quantitatively assess the capabilities of big models in knowledge, language, understanding, reasoning, etc., and to drive iterative optimization of the models.

C-Eval
The Chinese Basic Model Assessment Suite, jointly launched by Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh, covers objective questions across multiple domains and difficulty levels, aiming to measure the ability of large models in Chinese comprehension and reasoning.

SuperCLUE
A comprehensive evaluation tool for Chinese big models, which truly reflects the general ability of big models through a multi-dimensional and multi-perspective evaluation system, and helps technical progress and industrialization development.