an evaluation benchmark focused on ancient Chinese language comprehension.
a meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
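A minimal sketch (with hypothetical data and function names, not the benchmark's actual code) of what such a meta-benchmark measures: an evaluator's factuality verdicts on LLM outputs are compared against human-annotated gold labels, and agreement is reported.

```python
def score_evaluator(evaluator_verdicts, gold_labels):
    """Fraction of LLM outputs on which the factuality evaluator agrees with human labels."""
    assert len(evaluator_verdicts) == len(gold_labels)
    agree = sum(v == g for v, g in zip(evaluator_verdicts, gold_labels))
    return agree / len(gold_labels)

# Example: True = "output is factual", False = "output contains a factual error".
verdicts = [True, False, True, True]   # what the factuality evaluator decided
gold     = [True, False, False, True]  # what human annotators decided
print(score_evaluator(verdicts, gold))  # 0.75
```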
a benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.
a large-scale question-answering benchmark focused on real-world financial data, integrating both tabular and textual information.
a challenging, contamination-free benchmark for large language models.
a comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner.
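Crowdsourced battle platforms of this kind typically aggregate pairwise win/loss votes into Elo-style ratings. The following is a minimal illustrative sketch of such an online Elo update (an assumption for clarity, not the platform's actual implementation; model names are hypothetical).

```python
def update_elo(ratings, winner, loser, k=32, base=1500):
    """Apply a single online Elo update for one pairwise battle outcome."""
    ra = ratings.get(winner, base)
    rb = ratings.get(loser, base)
    expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))  # winner's expected score
    ratings[winner] = ra + k * (1.0 - expected_win)
    ratings[loser] = rb - k * (1.0 - expected_win)
    return ratings

# Example: each vote is a (winning model, losing model) pair from an anonymous battle.
battles = [("model_a", "model_b"), ("model_b", "model_c"), ("model_a", "model_c")]
ratings = {}
for winner, loser in battles:
    update_elo(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # ranking by rating
```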