A challenging, contamination-free LLM benchmark.
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
A biomedical question-answering benchmark for answering research-related questions using PubMed abstracts (see the loading sketch after this list).
A benchmark platform for evaluating large language models (LLMs) on a range of tasks, particularly natural language understanding, reasoning, and generalization. It examines how models behave across scenarios and analyzes results from an interpretability perspective.
A large-scale Document Visual Question Answering (VQA) dataset designed for complex document understanding, particularly in financial reports.
A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures; it produces highly accurate model rankings while running locally and quickly (a minimal evaluation sketch follows this list).
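To make the biomedical entry concrete, here is a minimal sketch of loading and inspecting a PubMed-abstract QA dataset with the Hugging Face `datasets` library. The dataset id `pubmed_qa`, the config `pqa_labeled`, and the field names are assumptions based on the public PubMedQA release, not details taken from the text above.

```python
# Hypothetical sketch: inspecting a biomedical QA benchmark built on
# PubMed abstracts. Dataset id, config, and field names are assumptions.
from datasets import load_dataset

ds = load_dataset("pubmed_qa", "pqa_labeled", split="train")

example = ds[0]
print(example["question"])        # research-style question
print(example["context"])         # supporting PubMed abstract text
print(example["final_decision"])  # gold answer: yes / no / maybe
```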
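For the mixture-based entry, the sketch below illustrates what "ground-truth-based, local, and fast" evaluation can look like: sample items from a mixture of source benchmarks and score model outputs by exact match against gold answers. `query_model` and the toy mixture are hypothetical placeholders, not the benchmark's actual composition or scoring rule.

```python
# Minimal sketch of ground-truth-based evaluation over a mixture of
# off-the-shelf benchmarks, run locally. All names here are illustrative.
import random

def query_model(question: str) -> str:
    # Placeholder: replace with a call into a locally hosted model.
    return "42"

# Each source benchmark contributes (question, gold answer) pairs.
mixture = {
    "math": [("What is 6 * 7?", "42"), ("What is 12 + 16?", "28")],
    "qa":   [("Capital of France?", "Paris")],
}

def evaluate(mixture: dict[str, list[tuple[str, str]]], n: int = 3) -> float:
    """Sample n items across the mixture and score by exact match."""
    pool = [item for items in mixture.values() for item in items]
    sample = random.sample(pool, min(n, len(pool)))
    correct = sum(
        query_model(q).strip().lower() == gold.strip().lower()
        for q, gold in sample
    )
    return correct / len(sample)

print(f"accuracy: {evaluate(mixture):.2f}")
```

Because scoring is a string comparison against stored ground truth, no judge model is needed, which is what lets this style of benchmark run locally and quickly.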