C-Eval


A Chinese foundational model evaluation suite jointly launched by Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh. It uses objective questions spanning multiple domains and difficulty levels to measure the Chinese comprehension and reasoning abilities of large models.

Location: China
Language: zh
Collection time: 2024-12-27

Background and Purpose of Publication

C-Eval is a comprehensive Chinese foundational model evaluation suite, jointly launched in May 2023 by researchers from Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh. Its release aims to provide a standardized benchmark that helps the Chinese large-model community continuously refine and iterate on their models. Similar to the English-language benchmark MMLU, C-Eval uses objective questions to assess a model's Chinese comprehension and reasoning abilities.

Dataset Composition

The C-Eval dataset contains 13,948 multiple-choice questions covering 52 subjects, grouped into four difficulty levels: middle school, high school, college, and professional exams. The subjects span STEM (science, technology, engineering, and mathematics), social sciences, humanities, and other fields (e.g., environment, fire safety, taxation, physical education, and medicine).

The questions for each subject are split into three sets: dev, validation, and test. The dev set contains five exemplar questions per subject, with explanations, for use as few-shot demonstrations in the chain-of-thought format; the validation set includes answers and is mainly used for development and hyperparameter tuning; and the test set is used for the final model evaluation. Test answers are not public, and users must submit their results to the official website to obtain scores.
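
For concreteness, the split layout can be explored programmatically. The sketch below assumes the dataset is published on the Hugging Face Hub under the ID `ceval/ceval-exam` with the subject name `computer_network` and the field names shown (all taken from the public release, but treat them as assumptions):

```python
# Sketch: loading one C-Eval subject and inspecting its splits, assuming the
# Hub ID "ceval/ceval-exam" and the subject "computer_network".
from datasets import load_dataset

subject = load_dataset("ceval/ceval-exam", "computer_network")

# dev:  5 exemplar questions with explanations (few-shot / chain-of-thought)
# val:  labelled questions for development and hyperparameter tuning
# test: unlabelled questions; answers are held back by the organizers
for split in ("dev", "val", "test"):
    print(split, len(subject[split]))

example = subject["dev"][0]
print(example["question"])          # question stem
print(example["A"], example["B"],   # the four answer options
      example["C"], example["D"])
print(example["answer"])            # gold label, e.g. "A" (hidden in test)
print(example["explanation"])       # rationale, present in the dev split
```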

Assessment Methods and Criteria

C-Eval supports two submission templates: answer-only and chain-of-thought. Answer-only requires the model to give the answer directly, while chain-of-thought requires the model to show its reasoning process before giving the final answer. Both templates support zero-shot and few-shot modes.
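
As an illustration, a few-shot, answer-only prompt can be assembled by prepending the labelled dev exemplars to the target question. The template wording and field names below are a hypothetical sketch, not the official prompt:

```python
def format_question(q: dict, with_answer: bool) -> str:
    """Render one multiple-choice record (assumed fields: question, A-D, answer)."""
    text = (f"{q['question']}\n"
            f"A. {q['A']}\nB. {q['B']}\nC. {q['C']}\nD. {q['D']}\n"
            "Answer:")
    if with_answer:
        text += f" {q['answer']}"
    return text

def build_answer_only_prompt(dev_examples: list[dict], target: dict) -> str:
    """Few-shot answer-only prompt: labelled dev exemplars, then the target."""
    shots = "\n\n".join(format_question(q, with_answer=True) for q in dev_examples)
    return shots + "\n\n" + format_question(target, with_answer=False)
```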

During evaluation, the model must understand and reason about each question within a given context (e.g., a 2,048-character context) and produce an answer. Results are scored and ranked by answer accuracy.
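
Scoring then reduces to comparing the option letter extracted from the model's output against the gold label. A minimal accuracy computation might look like the following (the extraction rule here is a simplifying assumption; real evaluation harnesses are more robust):

```python
import re

def extract_choice(model_output: str) -> str | None:
    """Take the first standalone A-D letter in the model's output."""
    match = re.search(r"\b([ABCD])\b", model_output)
    return match.group(1) if match else None

def accuracy(outputs: list[str], gold_labels: list[str]) -> float:
    """Fraction of questions where the extracted choice matches the label."""
    correct = sum(extract_choice(out) == gold
                  for out, gold in zip(outputs, gold_labels))
    return correct / len(gold_labels)

# Example: two model outputs scored against gold labels.
print(accuracy(["The answer is B", "I choose C"], ["B", "D"]))  # 0.5
```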

Features & Benefits

  1. Comprehensiveness and diversity: covering a wide range of subjects and difficulty levels, C-Eval can comprehensively assess a large model's Chinese comprehension and reasoning abilities.
  2. Standardization and objectivity: the objective-question format avoids the influence of subjective judgment, making the results more objective and reliable.
  3. Support for model iteration: C-Eval gives large-model developers in the Chinese community a standardized benchmark against which to continuously refine their models and improve performance and accuracy.
  4. Mitigating overfitting: wherever possible, C-Eval draws questions from non-public sources and avoids widely circulated exam questions, reducing the risk that models have already seen the test items during training.

Application Scenarios and Impacts

C-Eval's main application scenarios include large-model evaluation and iteration, academic research, and education. Its evaluations provide an objective measure of a model's Chinese comprehension and reasoning, guiding model iteration and optimization. It also serves as a standardized benchmark for academic research, helping to advance work in related fields. In education, C-Eval can be used as an auxiliary tool to help teachers assess students' learning and comprehension.
