Berkeley Function-Calling Leaderboard | LLMWay

Berkeley Function-Calling Leaderboard

The Berkeley Function-Calling Leaderboard evaluates LLMs' ability to call external functions and tools.
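To make "calling external functions" concrete, here is a minimal sketch of one way a single model response could be checked against an expected call. It assumes the model emits its call as a JSON object with `name` and `arguments` keys; that format, the `call_matches` helper, and the `get_weather` example are all hypothetical illustrations, not the leaderboard's actual scoring code.

```python
import json


def call_matches(model_output: str, expected: dict) -> bool:
    """Return True if the model's JSON call names the expected function
    and passes exactly the expected arguments (hypothetical format)."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        # Output that is not valid JSON cannot be a well-formed call.
        return False
    if not isinstance(call, dict):
        return False
    # Dict comparison ignores key order, so argument order does not matter.
    return (
        call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )


# Hypothetical reference call and two candidate model outputs.
expected = {"name": "get_weather", "arguments": {"city": "Berkeley", "unit": "celsius"}}
good = '{"name": "get_weather", "arguments": {"unit": "celsius", "city": "Berkeley"}}'
bad = '{"name": "get_weather", "arguments": {"city": "Oakland", "unit": "celsius"}}'

print(call_matches(good, expected))
print(call_matches(bad, expected))
```

Exact matching like this is deliberately strict. A real evaluation would likely also need to handle type coercion, optional parameters, and several acceptable argument values.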

a multimodal question-answering benchmark designed to evaluate AI models' cognitive ability to understand human beliefs and goals.

focuses on understanding how these models perform in various scenarios and analyzing results from an interpretability perspective.

a large-scale question-answering benchmark focused on real-world financial data, integrating both tabular and textual information.

a large-scale Document Visual Question Answering (VQA) dataset designed for complex document understanding, particularly in financial reports.

a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks.

a benchmark that evaluates large language models' ability to answer medical questions across multiple languages.