Leaderboard
WHOOPS!
A benchmark dataset that tests AI's ability to reason about visual commonsense using images that defy normal expectations.
A comprehensive benchmarking platform designed to evaluate large models' mathematical abilities, with nearly 30,000 math problems spanning 20 fields.