Leaderboard
We-Math
A benchmark that evaluates large multimodal models (LMMs) on their ability to perform human-like mathematical reasoning.
An automatic evaluator for instruction-following language models using the Nous benchmark suite.