Leaderboard
FELM
a meta-benchmark that evaluates how well factuality evaluators assess the outputs of large language models (LLMs).
a benchmark evaluating QA methods that operate over a mixture of heterogeneous input sources (KB, text, tables, infoboxes).