Leaderboard
OlympicArena
a benchmark for evaluating AI models across multiple academic disciplines like math, physics, chemistry, biology, and more.
a benchmark for evaluating AI models across multiple academic disciplines like math, physics, chemistry, biology, and more.
a benchmark that evaluates large language models on a variety of multimodal reasoning tasks, including language, natural and social sciences, physical and social commonsense, temporal reasoning, algebra, and geometry.