ToolBench

What is ToolBench?

ToolBench is an open platform for training, serving, and evaluating tool-using language models, aimed at ML teams and research engineers who study how models choose and call tools. It combines open-source code and data with a Web Demo, Tool Eval, and solution-path annotations. The project spans 16,464 APIs across 3,451 tools, is Apache-2.0 licensed, and is self-hostable. Plans are Free at $0/user/month, Team at $4/user/month, and Enterprise at $21/user/month.

At a glance

Best for
ToolBench is best for ML teams who need a dataset and benchmark for tool-using models.
Pricing
Free: $0 USD per user/month; Team: $4 USD per user/month; Enterprise: $21 USD per user/month
Free trial
30 days, no credit card

What does ToolBench do?

ToolBench is an open platform for training, serving, and evaluating tool-using language models. It organizes tool-learning data into dialogues, APIs, reasoning traces, and solution-path annotations so teams can study how models choose and call tools. The Web Demo and Tool Eval components make it easier to inspect model behavior and benchmark tool-use capability without rebuilding the pipeline from scratch. The repository is Apache-2.0 licensed and self-hostable, so teams can run it on their own infrastructure. ToolBench is part of the broader OpenBMB ecosystem and has been used in research, including an ICLR'24 spotlight paper; its public GitHub repository has 5.6k stars.

Why use ToolBench?

  • Its open-source license lets teams inspect, modify, and run the platform on their own infrastructure.
  • The dataset scale gives researchers a large corpus of dialogues, APIs, and API calls for tool-learning work.
  • Tool Eval and the Web Demo support faster iteration on evaluation and behavior inspection.

Who is ToolBench for?

  • Research engineers who need a benchmark for tool-use behavior and evaluation.
  • ML teams who want training data for tool-learning experiments.
  • Applied AI developers who need to inspect reasoning traces and solution paths.
  • Platform teams who prefer a self-hostable research stack for internal experimentation.

What are ToolBench's key features?

Open-source

Open-source code and data let teams inspect, modify, and self-host ToolBench for internal research workflows without depending on a closed platform.

Large-scale

Covers 16,464 APIs across 3,451 tools, giving buyers a broad benchmark for testing agent behavior against many real tool-use scenarios.

High-quality

Includes 120,000 solution-path annotations with an average of 4.0 reasoning traces, which help train and evaluate agents on more reliable step-by-step tool use.
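
To make the annotation terms concrete, here is a minimal Python sketch of how one annotated instance could be modelled and how an average like the 4.0 figure might be computed. The `Step` and `Instance` classes and their field names are hypothetical, chosen only for illustration; they are not ToolBench's actual data schema, and the sketch assumes the average counts steps per instance.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step in a solution path (hypothetical schema)."""
    thought: str     # the model's reasoning before acting
    api_name: str    # which API was called
    arguments: dict  # arguments passed to the API


@dataclass
class Instance:
    """One annotated instruction with its solution path (hypothetical schema)."""
    instruction: str
    steps: list[Step] = field(default_factory=list)


def average_steps(instances: list[Instance]) -> float:
    """Mean number of steps per instance; 0.0 for an empty corpus."""
    if not instances:
        return 0.0
    return sum(len(inst.steps) for inst in instances) / len(instances)
```

Keeping the reasoning (`thought`) next to the call it led to is what makes a path usable for both training and failure analysis.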

Web Demo

Provides a web demo for trying ToolBench in the browser, which helps teams review tool-use examples before integrating them into their own workflows.

Tool Eval

Offers evaluation tooling for comparing agent performance on the dataset, so buyers can measure tool-use quality against the same 16,464-API benchmark.

What does ToolBench integrate with?

  • RapidAPI
  • ChatGPT
  • Google Drive
  • Discord
  • Hugging Face

What are ToolBench's use cases?

Benchmarking tool-use models

Research engineers use ToolBench to benchmark tool-use behavior and compare models on realistic tasks, using Tool Eval to score outcomes against the same evaluation setup. The large-scale dataset and 469K total API calls help them test whether a system can actually choose and use tools reliably.

Training tool-learning systems

ML teams use ToolBench to build training data for tool-learning experiments, using the high-quality dialogues and 120,000 solution path annotations to teach models how to reach the right API call sequence. The open-source stack makes it easier to adapt the data for internal research workflows.

Inspecting reasoning traces

Applied AI developers use ToolBench to inspect reasoning traces and solution paths when a model fails or succeeds on a task, so they can study the decision points where it chose one tool or argument over another. The Web Demo gives them a quick way to review examples before wiring experiments into their own systems.
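
One simple way to study decision points is to walk a recorded trace and stop at the first step whose tool call failed. The sketch below assumes each step is a dict with hypothetical `api_name` and `error` keys, and the API names are made up for the example; ToolBench's real trace format may differ.

```python
from typing import Optional


def first_failure(trace: list[dict]) -> Optional[tuple[int, dict]]:
    """Return (index, step) of the first step with a non-empty error, or None."""
    for index, step in enumerate(trace):
        if step.get("error"):
            return index, step
    return None


# Illustrative trace; the API names and error text are invented.
trace = [
    {"api_name": "search_flights", "error": ""},
    {"api_name": "book_flight", "error": "missing required argument: date"},
    {"api_name": "send_confirmation", "error": ""},
]

failure = first_failure(trace)
if failure is not None:
    index, step = failure
    print(f"step {index} ({step['api_name']}) failed: {step['error']}")
```

Everything before the reported index is the context the model had when it made the failing call, which is usually the part worth reading first.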

Self-hosted research experimentation

Platform teams use ToolBench as a self-hostable research stack for internal experimentation, combining open-source access with Tool Eval to run repeatable tests behind their own infrastructure. That lets them keep benchmark data and evaluation loops inside the organization while iterating on tool-use behavior.

How does ToolBench work?

  1. Clone the open-source repository and load the large-scale benchmark data into your research environment, then review the dataset structure before running any experiments.
  2. Use the Web Demo to browse dialogues, inspect reasoning traces, and understand how solution paths are represented across tool-use tasks.
  3. Connect your model or pipeline to the Tool Eval workflow so you can score tool-use capability against the same evaluation setup used in the benchmark.
  4. Run experiments on your own infrastructure, compare outputs across runs, and use the high-quality annotations to diagnose where tool selection or execution breaks down.
  5. Iterate on prompts, training data, or agent logic, then re-run Tool Eval to track whether your changes improve task completion and reasoning consistency.
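
Steps 4 and 5 amount to a compare-and-repeat loop. As a rough sketch, assuming each evaluation run can be reduced to a mapping from task ID to a pass/fail boolean (a hypothetical format, not Tool Eval's actual output), the comparison between two runs could look like this:

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 when there are no tasks."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Summarize how `after` differs from `before` on the tasks both runs share."""
    shared = sorted(before.keys() & after.keys())
    return {
        "pass_rate_before": pass_rate({t: before[t] for t in shared}),
        "pass_rate_after": pass_rate({t: after[t] for t in shared}),
        "fixed": [t for t in shared if not before[t] and after[t]],
        "regressed": [t for t in shared if before[t] and not after[t]],
    }


# Illustrative results; the task IDs are invented.
before = {"task-1": True, "task-2": False, "task-3": False, "task-4": True}
after = {"task-1": True, "task-2": True, "task-3": False, "task-4": False}

summary = compare_runs(before, after)
print(summary)
```

In this example the aggregate pass rate is 0.5 in both runs even though one task was fixed and another regressed, which is why a per-task diff is worth keeping alongside the headline number.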

How much does ToolBench cost?

Free

$0 USD per user/month
  • Unlimited public/private repositories
  • Dependabot security and version updates
  • 2,000 CI/CD minutes/month
  • 500MB of Packages storage
  • Issues & Projects
  • Community support

Team

$4 USD per user/month
  • Everything included in Free, plus:
  • Access to GitHub Codespaces
  • Repository rules
  • Draft pull requests
  • Code owners
  • Required reviewers
  • Pages and Wikis
  • Environment deployment branches and secrets
  • 3,000 CI/CD minutes/month
  • 2GB of Packages storage
  • Web-based support

Enterprise

$21 USD per user/month
  • Everything included in Team, plus:
  • Data residency
  • Enterprise Managed Users
  • User provisioning through SCIM
  • Enterprise Account to centrally manage multiple organizations
  • Environment protection rules
  • Repository rules
  • Audit Log API
  • SOC1, SOC2, type 2 reports annually
  • FedRAMP Tailored Authority to Operate (ATO)
  • SAML single sign-on
  • Auditing
  • GitHub Connect
  • 50,000 CI/CD minutes/month
  • 50GB of Packages storage

Frequently asked questions

How much does ToolBench cost? Is it free?

ToolBench has a free plan, with paid tiers at $4 USD per user/month (Team) and $21 USD per user/month (Enterprise). A 30-day free trial is available, with no credit card required.

What is ToolBench used for? Who is it for?

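ToolBench is used for benchmarking tool-use models, training tool-learning systems, inspecting reasoning traces, and self-hosted research experimentation. It's built for research engineers, ML teams, applied AI developers, and platform teams.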

Does ToolBench have an API and what does it integrate with?

ToolBench doesn't publish a public API. It integrates with RapidAPI, ChatGPT, Google Drive, Discord, and Hugging Face.
