an expert-driven benchmark for Chinese LLMs.
An automatic evaluator for instruction-following language models using the Nous benchmark suite.
a benchmark for evaluating the performance of large language models (LLMs) in various tasks related to both textual and visual imagination.
an evaluation benchmark focused on ancient Chinese language comprehension.
a benchmark designed to evaluate large language models (LLMs) specifically in their ability to answer real-world coding-related questions.
a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures that produces highly capable model rankings while running locally and quickly.
focuses on understanding how LLMs perform in various scenarios and analyzing the results from an interpretability perspective.