Leaderboard
FELM
a meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
a benchmark for evaluating AI models across multiple academic disciplines, including math, physics, chemistry, and biology.