Leaderboard
We-Math
A benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.
A benchmark designed to evaluate large language models (LLMs) on solving complex, college-level scientific problems from domains such as chemistry, physics, and mathematics.