ToolBench

ToolBench is an open-source benchmark platform for evaluating large language models on API tool-use tasks, with 16,000+ APIs and 12,000+ task instances.

Reviewed by Mathijs Bronsdijk · Updated Apr 13, 2026

Tool · Open Source · Updated 1 month ago

What is ToolBench?

ToolBench is an open-source platform for training, serving, and evaluating large language models (LLMs) on tool-use tasks, developed by OpenBMB and recognized as an ICLR 2024 spotlight project. It addresses a concrete gap in AI research: open-source LLMs tend to perform significantly worse than closed models like those from OpenAI when it comes to calling APIs and manipulating software tools. ToolBench gives researchers and developers a shared infrastructure to benchmark model performance, study where that gap comes from, and build better tool-using models. The project is hosted on GitHub under the Apache-2.0 license and is written in Python.

Key Features

  • Benchmark Dataset: Covers 16,000+ RESTful APIs sourced from RapidAPI Hub, with 12,000+ task instances generated using ChatGPT across real-world scenarios such as weather queries, air pollution data retrieval, and image management.
  • Task Structure: Each benchmark task is organized by API category and includes an examples/ folder with question-answer pairs expressed as executable code, plus a functions/ folder containing API signatures, descriptions, and curl usage examples (see the layout sketch after this list).
  • ToolEval Evaluation Infrastructure: Provides built-in tools for measuring execution success rates, pass rates, and win rates of LLMs on tool-calling tasks, giving a direct and quantifiable way to compare model performance.
  • Open Training and Serving Platform: Supports not just evaluation but also training and serving LLMs for tool learning, which distinguishes it from benchmarks that only measure without helping users improve their models.
  • DFSDT Integration: Integrates with a Depth-First Search-Based Decision Tree (DFSDT) algorithm for action generation, supporting structured decision-making during tool use.
  • ToolLLM Compatibility: Designed to work with the ToolLLM framework, which targets open-source LLMs such as LLaMA and Vicuna for tool-use fine-tuning and evaluation.
  • Extensibility: Accepts contributions for new APIs, tasks, and action generators via pull requests, and datasets can be downloaded through provided scripts.
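
To make the task structure concrete, a single API category folder might look roughly like this. The file and folder names are illustrative, based on the description above; the repository's README documents the actual layout.

    WeatherAPI/                          (one folder per API category; name is illustrative)
        functions/
            get_current_weather.json     (API signature, description, curl usage example)
            get_air_quality.json
        examples/
            example_001.py               (question-answer pair expressed as executable code)
            example_002.py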

Use Cases

  • AI Researchers Benchmarking LLMs: Researchers studying tool-use capabilities in language models can use ToolBench's 12,000+ task instances and evaluation infrastructure to measure and compare model performance across a broad set of real-world API tasks.
  • ML Engineers Fine-Tuning Open-Source Models: Engineers working with open-source LLMs like LLaMA or Vicuna can use the platform's training support and DFSDT-based action generation to improve their models' ability to call APIs correctly.
  • Academic Teams Studying the Open/Closed Model Gap: Groups investigating why open-source models underperform closed models on tool tasks can use ToolBench's dataset and leaderboard (ToolEval) to quantify and diagnose that gap.
  • Developers Building Tool-Calling Agents: Developers who want to evaluate how well an LLM-based agent handles real-world API calls, such as querying weather data or managing image libraries, can use ToolBench's task examples as a starting point (a minimal loading sketch follows this list).
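
As a minimal sketch of how a developer might iterate over downloaded task instances, assuming the data ships as JSON files where each file holds a list of tasks with a query field and a list of candidate APIs (the path and field names here are assumptions; the repository's data documentation defines the real schema):

    import json
    from pathlib import Path

    # Hypothetical download location and field names; adjust to match
    # the schema documented in the ToolBench repository.
    data_dir = Path("data/instruction")

    for path in sorted(data_dir.glob("*.json")):
        with path.open() as f:
            tasks = json.load(f)             # assumed: a list of task objects
        for task in tasks:
            query = task.get("query", "")    # natural-language request
            apis = task.get("api_list", [])  # candidate APIs for this task
            print(f"{path.name}: {query!r} ({len(apis)} candidate APIs)")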

Strengths and Weaknesses

Strengths:

  • Recognized as an ICLR 2024 spotlight project, indicating peer-reviewed academic significance in the LLM tool-learning space.
  • Large and varied dataset with 16,000+ APIs and 12,000+ task instances spanning multiple real-world domains.
  • Goes beyond evaluation by also supporting model training and serving, making it a more complete research platform than a pure benchmark.
  • Open-source under the Apache-2.0 license with 5,588 GitHub stars and 482 forks, reflecting substantial community interest.
  • Actively maintained, with 16 contributors and a most recent update in May 2025.

Weaknesses:

  • API key access has been a persistent and frequently reported problem, with multiple users submitting requests through forms and receiving no response.
  • Dataset download links have led to empty folders on Google Drive, and downloaded data volumes have not matched what the accompanying paper describes.
  • Server timeouts have been reported by users attempting to access the platform's services.
  • Installation errors, including a "ModuleNotFoundError: No module named 'triton.ops'" problem, have been documented in open GitHub issues. As of the latest available data, 158 issues remain open with limited visible resolution activity.

Getting Started

ToolBench is free and open-source. It is available at github.com/OpenBMB/ToolBench under the Apache-2.0 license, meaning there is no cost to access, clone, or use the repository. No paid tiers or subscription plans exist for ToolBench itself.

To get started, clone the repository and install the Python dependencies listed in requirements.txt. Key libraries include accelerate for distributed training, fastapi for API serving, gradio for UI interfaces, and rouge for evaluation metrics. Datasets can be downloaded using the provided scripts. Note that some features require an API key for the underlying RapidAPI services, and users have reported delays or non-responses when requesting these keys through the project's form.
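
Concretely, the standard clone-and-install sequence looks like this (assuming a working Python and pip; check the repository's README for exact version requirements):

    git clone https://github.com/OpenBMB/ToolBench.git
    cd ToolBench
    pip install -r requirements.txt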

FAQ

What is ToolBench?

ToolBench is an open-source platform developed by OpenBMB for training, serving, and evaluating large language models on tool-use and API-calling tasks. It was recognized as an ICLR 2024 spotlight project.

Who made ToolBench?

ToolBench was created by the OpenBMB organization. The repository has 16 contributors and is hosted at github.com/OpenBMB/ToolBench.

What programming language is ToolBench written in?

ToolBench is written primarily in Python and is available under the Apache-2.0 open-source license.

How large is the ToolBench dataset?

The benchmark dataset covers 16,000+ RESTful APIs sourced from RapidAPI Hub and includes 12,000+ task instances generated using ChatGPT across a range of real-world scenarios.

What is ToolEval?

ToolEval is the evaluation infrastructure included in ToolBench. It measures execution success rates, pass rates, and win rates of LLMs on tool-calling tasks, and functions as a leaderboard for comparing model performance.
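
To make the metrics concrete, here is a minimal illustration of the arithmetic behind a pass rate and a pairwise win rate over per-task outcomes. This is not ToolEval's actual API, just a sketch of what the numbers mean:

    # Illustrative only: ToolEval's real implementation lives in the repository.
    def pass_rate(outcomes):
        """Fraction of tasks a model solved (outcomes are True/False)."""
        return sum(outcomes) / len(outcomes)

    def win_rate(model_a, model_b):
        """Fraction of head-to-head tasks where model A beats model B;
        ties count as half a win for each side."""
        wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
                   for a, b in zip(model_a, model_b))
        return wins / len(model_a)

    a = [True, True, False, True]    # model A per-task success
    b = [True, False, False, False]  # model B per-task success
    print(pass_rate(a))              # 0.75
    print(win_rate(a, b))            # 0.75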

Is ToolBench free to use?

Yes, ToolBench is fully free and open-source under the Apache-2.0 license. There are no paid tiers or subscription costs associated with the project itself.

What models does ToolBench support?

ToolBench is designed to work with open-source LLMs, including LLaMA and Vicuna, through its integration with the ToolLLM framework. It is intended to help close the performance gap between open-source models and closed models like those from OpenAI.

What is DFSDT in ToolBench?

DFSDT stands for Depth-First Search-Based Decision Tree. It is an algorithm integrated into ToolBench to support structured action generation when LLMs are deciding which tool or API call to make next.
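
As a rough sketch of the depth-first idea (a deliberate simplification: the real DFSDT uses the LLM itself to propose and rank candidate actions at each node and prunes unpromising branches):

    # Simplified depth-first search over tool-call sequences.
    # propose_actions, is_solved, and apply_action are hypothetical placeholders.
    def dfs_solve(state, propose_actions, is_solved, apply_action,
                  depth=0, max_depth=5):
        if is_solved(state):
            return state                       # successful tool-call trace found
        if depth >= max_depth:
            return None                        # give up on this branch
        for action in propose_actions(state):  # e.g. candidate API calls
            result = dfs_solve(apply_action(state, action), propose_actions,
                               is_solved, apply_action, depth + 1, max_depth)
            if result is not None:
                return result                  # first successful branch wins
        return None                            # backtrack

    # Toy usage: find a sequence of +1/+2 steps from 0 that reaches 5.
    print(dfs_solve(0,
                    propose_actions=lambda s: [1, 2],
                    is_solved=lambda s: s == 5,
                    apply_action=lambda s, a: s + a))  # 5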

What are the known problems with ToolBench?

Users have reported several persistent issues, including difficulty obtaining API keys, dataset download links leading to empty Google Drive folders, server timeouts, and installation errors related to missing Python modules. As of the latest available data, 158 GitHub issues remain open.

Who is ToolBench intended for?

ToolBench is aimed at AI researchers studying LLM tool use, machine learning engineers fine-tuning open-source models, academic teams studying the performance gap between open and closed models, and developers building and evaluating tool-calling agents.

What is StableToolBench?

StableToolBench is a related project that builds on ToolBench. It adds features such as MirrorAPI simulators and solvable query filtering to improve stability. It is a separate repository maintained by THUNLP-MT.

Can I contribute to ToolBench?

Yes, the project welcomes contributions for new APIs, tasks, and action generators. The contribution workflow involves forking from the main branch, adding tests, and running the black formatter (black .) for code formatting before submitting a pull request.

How does ToolBench compare to a standard LLM benchmark?

Unlike most benchmarks that only measure performance, ToolBench also supports model training and serving and is a more complete research platform for teams that want to both evaluate and improve their models' tool-calling capabilities.

What are the best alternatives to ToolBench?

StableToolBench is a direct derivative that addresses some of ToolBench's stability issues. For broader tool-use evaluation, researchers may also look at other API-calling benchmarks in the academic literature, though specific alternatives are not detailed in the available research data.
