Leaderboard
FELM
a meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
a benchmark designed to evaluate large language models (LLMs) on their ability to answer real-world coding-related questions.