a challenging, contamination-free LLM benchmark.
a comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures; it produces a highly accurate model ranking while running locally and quickly.
evaluates LLMs' ability to call external functions and tools.
a benchmark that evaluates large language models' ability to answer medical questions across multiple languages.
aims to track, rank, and evaluate LLMs and chatbots as they are released.
an expert-driven benchmark for Chinese LLMs.