an evaluation benchmark focused on ancient Chinese language comprehension.
a meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
a benchmark designed to evaluate large language models in the legal domain. It focuses on understanding how these models perform in various scenarios and on analyzing the results from an interpretability perspective.
aims to track, rank, and evaluate LLMs and chatbots as they are released.
a benchmark for evaluating the performance of large language models (LLMs) on a range of tasks involving both textual and visual imagination.
a benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.