a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner.
a challenging, contamination-free LLM benchmark.
aims to track, rank, and evaluate LLMs and chatbots as they are released.
an automatic evaluator for instruction-following language models using the Nous benchmark suite.
an evaluation benchmark focused on ancient Chinese language comprehension.
a pioneering benchmark specifically designed to comprehensively assess honesty in LLMs.
evaluates LLMs' ability to call external functions/tools (see the sketch after this list).
an expert-driven benchmark for Chinese LLMs.
CompassRank is dedicated to exploring the most advanced language and visual models, offering a comprehensive, objective, and neutral evaluation reference for industry and research.
a benchmark evaluating QA methods that operate over a mixture of heterogeneous input sources (KB, text, tables, infoboxes).
a benchmark for evaluating the performance of large language models (LLMs) in various tasks related to both textual and visual imagination.
a meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
a benchmark designed to evaluate large language models (LLMs) specifically in their ability to answer real-world coding-related questions.
a benchmark designed to evaluate large language models in the legal domain.
focuses on understanding how large language models perform in various scenarios and on analyzing their results from an interpretability perspective.
a benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.
a comprehensive benchmarking platform designed to evaluate large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures that produces a highly accurate model ranking while running locally and quickly.
a benchmark that evaluates large language models' ability to answer medical questions across multiple languages.
a multimodal question-answering benchmark designed to evaluate AI models' cognitive ability to understand human beliefs and goals.
a benchmark for evaluating AI models across multiple academic disciplines like math, physics, chemistry, biology, and more.
a biomedical question-answering benchmark designed for answering research-related questions using PubMed abstracts.
a benchmark designed to evaluate large language models (LLMs) on solving complex, college-level scientific problems from domains like chemistry, physics, and mathematics.
a benchmark platform for evaluating large language models (LLMs) on a range of tasks, with a particular focus on natural language understanding, reasoning, and generalization.
a Swedish language understanding benchmark that evaluates natural language processing (NLP) models on various tasks such as argumentation analysis, semantic similarity, and textual entailment.
a large-scale Document Visual Question Answering (VQA) dataset designed for complex document understanding, particularly in financial reports.
a large-scale question-answering benchmark focused on real-world financial data, integrating both tabular and textual information.
a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks.
a benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.
a benchmark dataset testing AI's ability to reason about visual commonsense through images that defy normal expectations.
a benchmark for LLMs.
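As a rough illustration of the kind of check a function-calling evaluation performs, here is a minimal sketch in Python. The JSON schema, the `score_tool_call` helper, and the exact-match scoring rule are illustrative assumptions, not the implementation of any specific benchmark listed above.

```python
import json

# Minimal sketch: score a model's proposed tool call against a ground-truth call.
# The {"name": ..., "arguments": {...}} schema and exact-match rule are assumptions,
# not the spec of any particular function-calling benchmark.

def score_tool_call(model_output: str, expected: dict) -> float:
    """Return 1.0 if the model emitted a JSON tool call matching the expected name and arguments."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # unparsable output counts as a failed call
    if call.get("name") != expected["name"]:
        return 0.0
    return 1.0 if call.get("arguments") == expected["arguments"] else 0.0


if __name__ == "__main__":
    expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
    output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
    print(score_tool_call(output, expected))  # prints 1.0
```

Real benchmarks in this space typically go further, e.g. checking argument types against a declared function schema or executing the call, but the core idea of comparing the parsed call to a reference answer is the same.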