A challenging, contamination-free LLM benchmark.
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
A benchmark for evaluating AI models across multiple academic disciplines, including math, physics, chemistry, and biology.
A benchmark for evaluating question-answering methods that operate over a mixture of heterogeneous input sources (knowledge bases, text, tables, and infoboxes).
A meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
Tracks, ranks, and evaluates LLMs and chatbots as they are released.
A biomedical question-answering benchmark designed for answering research-related questions using PubMed abstracts.
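Most of the question-answering benchmarks above are scored the same basic way: the model's answer to each item is compared against a reference answer, and an aggregate accuracy is reported. The sketch below illustrates that pattern; it is a generic illustration only, with a hypothetical model_answer stub and a toy two-item dataset, not the official evaluation harness of any benchmark listed here.

```python
# Generic sketch of exact-match scoring for a QA benchmark.
# `model_answer` and the toy dataset are hypothetical placeholders.

def model_answer(question: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    return "42"  # a real harness would query an LLM here

def exact_match_accuracy(dataset: list[dict]) -> float:
    """Score predictions by case-insensitive exact match against references."""
    correct = 0
    for example in dataset:
        prediction = model_answer(example["question"]).strip().lower()
        reference = example["answer"].strip().lower()
        correct += prediction == reference
    return correct / len(dataset)

if __name__ == "__main__":
    toy_dataset = [
        {"question": "What is 6 * 7?", "answer": "42"},
        {"question": "What is the capital of France?", "answer": "Paris"},
    ]
    print(f"exact-match accuracy: {exact_match_accuracy(toy_dataset):.2f}")
```

Real harnesses differ mainly in the comparison step: math benchmarks typically normalize expressions before matching, while open-ended or factuality benchmarks often replace exact match with a judge model or human rating.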