a benchmark designed to evaluate large language models in the legal domain.
aims to track, rank, and evaluate LLMs and chatbots as they are released.
a meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
a benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.
a benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.
focuses on understanding how these models perform in various scenarios and analyzing results from an interpretability perspective.
a comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
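To make the evaluation idea behind benchmarks like these concrete, here is a minimal sketch of a scoring loop that computes per-field exact-match accuracy over a set of math problems. The dataset, the model_answer() stub, and the field names are hypothetical placeholders for illustration only; they are not the actual data or API of any platform described above.

```python
from collections import defaultdict

# Hypothetical benchmark items: (field, question, reference answer).
PROBLEMS = [
    ("algebra", "Solve for x: 2x + 3 = 11", "4"),
    ("geometry", "Area of a 3 by 4 rectangle?", "12"),
    ("arithmetic", "7 * 8 = ?", "56"),
]

def model_answer(question: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    return "4"  # a real harness would query the LLM here

def evaluate(problems):
    """Return exact-match accuracy broken down by field."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for field, question, reference in problems:
        total[field] += 1
        if model_answer(question).strip() == reference:
            correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

if __name__ == "__main__":
    for field, accuracy in evaluate(PROBLEMS).items():
        print(f"{field}: {accuracy:.0%}")
```

Real harnesses typically add prompt templating, answer normalization (e.g., stripping units or LaTeX), and aggregation across many more fields and problems, but the core loop is the same: compare model outputs against references and report accuracy per category.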