An evaluation benchmark focused on ancient Chinese language comprehension.
A benchmark dataset testing AI's ability to reason about visual commonsense through images that defy normal expectations.
Aims to track, rank, and evaluate LLMs and chatbots as they are released.
A benchmark designed to evaluate large language models (LLMs) on complex, college-level scientific problems from domains such as chemistry, physics, and mathematics.
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
A Challenging, Contamination-Free LLM Benchmark.
CompassRank is dedicated to evaluating the most advanced language and visual models, offering a comprehensive, objective, and neutral evaluation reference for industry and research.