Leaderboard
LLMEval focuses on understanding how large language models perform in various scenarios and on analyzing the results from an interpretability perspective.
A benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.
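To make the listing concrete, here is a minimal sketch of how per-category results for a benchmark like this might be aggregated into a leaderboard score, assuming simple exact-match accuracy and a macro average over categories; the category names mirror the task areas above, and all records, function names, and values are illustrative placeholders rather than real benchmark data.

```python
from collections import defaultdict

# Hypothetical per-example results: (category, model_answer, gold_answer).
# These records are illustrative placeholders, not real benchmark data.
results = [
    ("language", "B", "B"),
    ("natural-science", "A", "C"),
    ("social-science", "D", "D"),
    ("physical-commonsense", "A", "A"),
    ("temporal-reasoning", "C", "B"),
    ("algebra", "B", "B"),
    ("geometry", "D", "A"),
]

def per_category_accuracy(records):
    """Aggregate exact-match accuracy for each task category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        correct[category] += int(predicted == gold)
    return {cat: correct[cat] / total[cat] for cat in total}

if __name__ == "__main__":
    scores = per_category_accuracy(results)
    for category, acc in sorted(scores.items()):
        print(f"{category:22s} {acc:.2f}")
    # Leaderboards typically report a single headline number; a macro
    # average weights every category equally regardless of its size.
    print(f"{'macro-average':22s} {sum(scores.values()) / len(scores):.2f}")
```

Reporting the macro average alongside per-category accuracy keeps small categories (e.g., geometry) from being drowned out by larger ones, which matters for leaderboards that span task areas of very different sizes.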