an evaluation benchmark focused on ancient Chinese language comprehension.
benchmark designed to evaluate large language models (LLMs) on solving complex, college-level scientific problems from domains like chemistry, physics, and mathematics.
an expert-driven benchmark for Chinese LLMs.
a large-scale question-answering benchmark focused on real-world financial data, integrating both tabular and textual information.
a challenging, contamination-free LLM benchmark.
a benchmark evaluating QA methods that operate over a mixture of heterogeneous input sources (KB, text, tables, infoboxes).
a benchmark dataset testing AI's ability to reason about visual commonsense through images that defy normal expectations.