
Background and Purpose of Publication
C-Eval is a comprehensive Chinese-language model evaluation suite, jointly released in May 2023 by researchers from Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh. It aims to provide a standardized benchmark that helps large-model developers in the Chinese-speaking community continuously refine and iterate on their models. Similar to the English-language benchmark MMLU, C-Eval uses objective (multiple-choice) questions to assess large models' Chinese comprehension and reasoning abilities.
Dataset Composition
The C-Eval dataset contains 13,948 multiple-choice questions covering 52 subjects, grouped into four difficulty levels: middle school, high school, college, and professional exams. The subjects span STEM (Science, Technology, Engineering, and Mathematics), Social Science, Humanities, and Other fields (e.g., environment, fire safety, taxation, physical education, and medicine).
The questions in each subject are split into three sets: dev, validation, and test. The dev set contains five exemplar questions with explanations, intended for few-shot and chain-of-thought prompting; the validation set, whose answers are public, is mainly used for hyperparameter tuning; and the test set is used for the actual evaluation. Test answers are not released, and users must submit their predictions to the official website to obtain their scores.
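For reference, the splits of a single subject can be inspected programmatically. The sketch below is a minimal example, assuming the dataset is mirrored on the Hugging Face Hub under the ID "ceval/ceval-exam" with one configuration per subject, and that the field names (question, A–D, answer, explanation) match that mirror; these identifiers are assumptions, not details stated above.

```python
# Minimal sketch: inspect one C-Eval subject with the Hugging Face `datasets`
# library. The dataset ID "ceval/ceval-exam", the per-subject configuration
# names, and the field names below are assumptions based on common mirrors.
from datasets import load_dataset

subject = "computer_network"                 # example subject configuration
data = load_dataset("ceval/ceval-exam", subject)

print(data)                                  # expected splits: dev, val, test
example = data["dev"][0]                     # dev items include answers and explanations
print(example["question"])
for choice in ("A", "B", "C", "D"):
    print(f"{choice}. {example[choice]}")
print("answer:", example["answer"])
print("explanation:", example["explanation"])
```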
Assessment Methods and Criteria
C-Eval supports two submission templates: answer-only and chain-of-thought. Answer-only requires the model to output the answer directly, while chain-of-thought requires the model to show its reasoning process before giving the final answer. Both templates support zero-shot and few-shot modes.
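As an illustration, a few-shot answer-only prompt can be assembled from the dev exemplars roughly as follows. This is a sketch only: the Chinese instruction line and the "答案：" answer cue mirror the commonly used C-Eval format, but the exact official template may differ in wording.

```python
# Illustrative sketch: build a few-shot, answer-only prompt from dev exemplars.
# The instruction line and the answer cue follow a commonly used C-Eval-style
# format; the official template may differ.
def build_answer_only_prompt(dev_examples, item, subject_name):
    """dev_examples/item: dicts with keys question, A, B, C, D (+ answer for exemplars)."""
    lines = [f"以下是中国关于{subject_name}考试的单项选择题，请选出其中的正确答案。", ""]
    for ex in dev_examples:                      # few-shot exemplars with gold answers
        lines.append(ex["question"])
        for c in ("A", "B", "C", "D"):
            lines.append(f"{c}. {ex[c]}")
        lines.append(f"答案：{ex['answer']}")
        lines.append("")
    lines.append(item["question"])               # the question to be answered
    for c in ("A", "B", "C", "D"):
        lines.append(f"{c}. {item[c]}")
    lines.append("答案：")                        # the model completes the choice letter
    return "\n".join(lines)
```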
During evaluation, the model must understand and reason about each question within a given context window (e.g., 2,048 tokens) and produce an answer. Results are scored and ranked by the correctness of the model's answers.
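A minimal sketch of the scoring step, assuming each answer-only prediction has been reduced to a single choice letter and is compared against the gold label; accuracy is simply the fraction of exact matches.

```python
# Minimal sketch of answer-only scoring: accuracy is the fraction of predicted
# choice letters that exactly match the gold labels (case-insensitive).
def accuracy(predictions, references):
    assert len(predictions) == len(references), "length mismatch"
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy(["A", "c", "B"], ["A", "C", "D"]))  # 2 of 3 correct -> 0.666...
```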
Features & Benefits
- Comprehensiveness and diversity: The C-Eval dataset covers a wide range of subjects and difficulty levels, enabling a comprehensive assessment of large models' Chinese comprehension and reasoning abilities.
- Standardization and objectivity: C-Eval uses objective (multiple-choice) questions, which avoids the influence of subjective judgment and makes the results more objective and reliable.
- Facilitating model iteration: C-Eval provides a standardized benchmark for large-model developers in the Chinese-speaking community, helping them continuously refine and iterate on their models to improve performance and accuracy.
- Preventing overfitting: wherever possible, C-Eval draws its questions from non-public sources and avoids using past exam questions verbatim, minimizing the risk of models overfitting to the benchmark.
Application Scenarios and Impacts
C-Eval's main application scenarios include the evaluation and iteration of large models, academic research, and education. Its evaluations provide an objective measure of a model's Chinese comprehension and reasoning abilities and offer guidance for model iteration and optimization. It also serves as a standardized benchmark for academic research, helping to advance progress in related fields. In education, C-Eval can act as an auxiliary tool to help teachers assess students' learning and comprehension.
Relevant Navigation

A comprehensive, scientific, and fair large-model evaluation system and open platform that aims to help researchers assess the performance of foundation models and training algorithms in an all-round way through multi-dimensional evaluation tools and methods.

MMBench
A multimodal benchmarking framework designed to comprehensively assess and understand the performance of multimodal models across different scenarios, providing robust and reliable evaluation results through a carefully designed evaluation pipeline and annotated datasets.

AGI-Eval Review Community
A comprehensive assessment platform, jointly created by well-known universities and organizations, that focuses on evaluating the general ability of large models on human cognition and problem-solving tasks, offering diverse assessment methods and authoritative leaderboards to support the development and application of AI technology.

HELM
An evaluation benchmark initiated by Stanford University that aims to comprehensively assess the capabilities of large language models across multiple dimensions and scenarios, in order to drive technological progress and model optimization.

OpenCompass
An open-source large-model evaluation system designed to comprehensively and quantitatively assess models' capabilities in knowledge, language, understanding, reasoning, and other areas, and to drive iterative optimization of the models.

SuperCLUE
A comprehensive evaluation tool for Chinese large models that reflects their general capabilities through a multi-dimensional, multi-perspective evaluation system, supporting technical progress and industrial adoption.