A benchmark for LLMs
An evaluation benchmark focused on ancient Chinese language comprehension.
A benchmark designed to evaluate large language models in the legal domain.
A comprehensive benchmarking platform that evaluates large models' mathematical abilities across 20 fields and nearly 30,000 math problems.
A benchmark dataset testing AI's ability to reason about visual commonsense through images that defy normal expectations.
A pioneering benchmark designed to comprehensively assess honesty in LLMs.
A benchmark designed to evaluate large language models (LLMs) on complex, college-level scientific problems from domains such as chemistry, physics, and mathematics.