FELM | LLMWay – The Way To LLM

FELM

a meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).

Relevant Sites

MixEval

a ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures, which evaluates LLMs and yields a high-quality model ranking while running locally and quickly.

InfiBench

a benchmark that evaluates how well large language models (LLMs) answer real-world coding questions.

M3CoT

a benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.

SuperBench

a benchmark platform for evaluating large language models (LLMs) across a range of tasks, with a focus on natural language understanding, reasoning, and generalization.
