ToolBench

What is ToolBench?

ToolBench is an open platform for training, serving, and evaluating tool-using language models. It is aimed at ML teams and research engineers who study how models choose and call tools. It combines open-source code and data with a Web Demo, Tool Eval, and solution-path annotations. The project spans 16,464 APIs across 3,451 tools, is Apache-2.0 licensed, and is self-hostable. Plans run Free $0/user/month, Team $4/user/month, and Enterprise $21/user/month.

Last verified: May 17, 2026

At a glance

Best for: ML teams who need a dataset and benchmark for tool-using models.
Pricing: Free $0 USD per user/mo; Team $4 USD per user/mo; Enterprise $21 USD per user/mo
Free trial: 30 days, no credit card

What does ToolBench do?

ToolBench is an open platform for training, serving, and evaluating tool-using language models. It organizes tool-learning data into dialogues, APIs, reasoning traces, and solution-path annotations so teams can study how models choose and call tools. The Web Demo and Tool Eval components let teams inspect model behavior and benchmark tool-use capability without rebuilding the pipeline from scratch. The repository is Apache-2.0 licensed and self-hostable, so teams can run it on their own infrastructure. ToolBench is part of the broader OpenBMB ecosystem, is associated with an ICLR'24 spotlight paper, and has 5.6k stars on its public GitHub repository.
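
To make the data organization described above more concrete, here is a minimal Python sketch of how one tool-use example might be modeled. The class names, field names, and API names (`ToolCall`, `Dialogue`, `geo.lookup`, and so on) are assumptions made for illustration; they are not ToolBench's actual on-disk schema, which should be checked in the repository.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step in a solution path: which API was called and with what arguments."""
    api_name: str
    arguments: dict
    observation: str = ""  # text the API returned (hypothetical field)


@dataclass
class Dialogue:
    """A single tool-use example: the user query plus its annotated solution path."""
    query: str
    available_apis: list[str]
    reasoning_trace: list[str] = field(default_factory=list)
    solution_path: list[ToolCall] = field(default_factory=list)

    def apis_used(self) -> list[str]:
        """Return the distinct APIs called, in first-use order."""
        seen: list[str] = []
        for call in self.solution_path:
            if call.api_name not in seen:
                seen.append(call.api_name)
        return seen


# Illustrative record; the query and API names are made up.
example = Dialogue(
    query="What is the weather in Paris tomorrow?",
    available_apis=["weather.forecast", "geo.lookup", "news.search"],
    reasoning_trace=["Need coordinates first", "Then fetch the forecast"],
    solution_path=[
        ToolCall("geo.lookup", {"city": "Paris"}, '{"lat": 48.86, "lon": 2.35}'),
        ToolCall("weather.forecast", {"lat": 48.86, "lon": 2.35, "days": 1}),
    ],
)
print(example.apis_used())
```

Keeping the reasoning trace separate from the solution path mirrors the distinction the description draws between how a model thinks and which calls it finally makes.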

Why use ToolBench?

Its open-source license lets teams inspect, modify, and run the platform on their own infrastructure.
The dataset scale gives researchers a large corpus of dialogues, APIs, and calls for tool-learning work.
Tool Eval and the Web Demo support faster iteration on evaluation and behavior inspection.

Who is ToolBench for?

Research engineers who need a benchmark for tool-use behavior and evaluation.
ML teams who want training data for tool-learning experiments.
Applied AI developers who need to inspect reasoning traces and solution paths.
Platform teams who prefer a self-hostable research stack for internal experimentation.

What are ToolBench's key features?

Open-source

Open-source code and data let teams inspect, modify, and self-host ToolBench for internal research workflows without depending on a closed platform.

Large-scale

Covers 16,464 APIs across 3,451 tools, giving teams a broad benchmark for testing agent behavior against many real tool-use scenarios.

High-quality

Includes 120,000 solution path annotations and an average of 4.0 reasoning traces, which helps teams train and evaluate agents on more reliable step-by-step tool use.

Web Demo

Provides a web demo for trying ToolBench in the browser, which helps teams review tool-use examples before integrating them into their own workflows.

Tool Eval

Offers evaluation tooling for comparing agent performance on the dataset, so teams can measure tool-use quality against the same 16,464-API benchmark.

What does ToolBench integrate with?

RapidAPI
ChatGPT
Google Drive
Discord
Hugging Face

What are ToolBench's use cases?

Benchmarking tool-use models

Research engineers use ToolBench to benchmark tool-use behavior and compare models on realistic tasks, using Tool Eval to score outcomes against the same evaluation setup. The large-scale dataset and 469K total API calls help them test whether a system can actually choose and use tools reliably.
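
As a rough illustration of comparing models on the same task set, the sketch below computes a simple pass rate per model. The tuple layout, model names, and the pass-rate metric itself are assumptions for this example and do not represent Tool Eval's actual output format or scoring.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the fraction of tasks each model solved.

    `results` is a list of (model_name, task_id, passed) tuples.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, _task_id, passed in results:
        totals[model] += 1
        if passed:
            passes[model] += 1
    return {model: passes[model] / totals[model] for model in totals}


# Hypothetical outcomes for two models on the same four tasks.
results = [
    ("model_a", "t1", True), ("model_a", "t2", False),
    ("model_a", "t3", True), ("model_a", "t4", True),
    ("model_b", "t1", True), ("model_b", "t2", False),
    ("model_b", "t3", False), ("model_b", "t4", False),
]
rates = pass_rates(results)
for model, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {rate:.0%}")
```

Scoring every model against an identical task list is what makes the resulting numbers comparable.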

Training tool-learning systems

ML teams use ToolBench to build training data for tool-learning experiments, using the high-quality dialogues and 120,000 solution path annotations to teach models how to reach the right API call sequence. The open-source stack makes it easier to adapt the data for internal research workflows.
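
One way to picture this step is converting an annotated solution path into a prompt/target pair for fine-tuning. The sketch below assumes a hypothetical step format (`{"api": ..., "args": ...}`) and a made-up prompt template; ToolBench's own preprocessing scripts may structure examples differently.

```python
import json


def to_training_example(query, solution_path):
    """Turn one annotated dialogue into a prompt/target pair.

    `solution_path` is a list of {"api": str, "args": dict} steps.
    The target is the ordered call sequence, one JSON object per line.
    """
    prompt = f"User request: {query}\nCall the tools needed to answer."
    target = "\n".join(
        json.dumps({"api": step["api"], "args": step["args"]}, sort_keys=True)
        for step in solution_path
    )
    return {"prompt": prompt, "target": target}


# Illustrative input; the API names and arguments are invented.
record = to_training_example(
    "Find a vegan recipe and its calorie count.",
    [
        {"api": "recipes.search", "args": {"diet": "vegan"}},
        {"api": "nutrition.lookup", "args": {"recipe_id": 42}},
    ],
)
print(record["target"])
```

Serializing each call on its own line keeps the call order explicit, which matters when the goal is to teach the right sequence rather than just the right set of APIs.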

Inspecting reasoning traces

Applied AI developers use ToolBench to inspect reasoning traces and solution paths when a model fails or succeeds on a task, so they can study the decision points where tool selection or execution went right or wrong. The Web Demo gives them a quick way to review examples before wiring experiments into their own systems.
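
When a task fails, a common first question is which step broke. The following sketch scans a trace for the first step that recorded an error. The trace layout (keys `thought`, `api`, and `error`) is a hypothetical format chosen for illustration, not the format ToolBench stores.

```python
def first_failed_step(trace):
    """Return (index, step) for the first step whose call errored, or None.

    `trace` is a list of dicts with hypothetical keys:
    "thought" (str), "api" (str), and "error" (str or None).
    """
    for index, step in enumerate(trace):
        if step.get("error"):
            return index, step
    return None


# Illustrative trace with one failing call in the middle.
trace = [
    {"thought": "Look up the city", "api": "geo.lookup", "error": None},
    {"thought": "Get the forecast", "api": "weather.forecast",
     "error": "missing required parameter: days"},
    {"thought": "Summarize", "api": "finish", "error": None},
]
failure = first_failed_step(trace)
if failure is not None:
    index, step = failure
    print(f"step {index} ({step['api']}): {step['error']}")
```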

Self-hosted research experimentation

Platform teams use ToolBench as a self-hostable research stack for internal experimentation, combining open-source access with Tool Eval to run repeatable tests behind their own infrastructure. That lets them keep benchmark data and evaluation loops inside the organization while iterating on tool-use behavior.

How does ToolBench work?

Clone the open-source repository and load the large-scale benchmark data into your research environment, then review the dataset structure before running any experiments.
Use the Web Demo to browse dialogues, inspect reasoning traces, and understand how solution paths are represented across tool-use tasks.
Connect your model or pipeline to the Tool Eval workflow so you can score tool-use capability against the same evaluation setup used in the benchmark.
Run experiments on your own infrastructure, compare outputs across runs, and use the high-quality annotations to diagnose where tool selection or execution breaks down.
Iterate on prompts, training data, or agent logic, then re-run Tool Eval to track whether your changes improve task completion and reasoning consistency.
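
The iterate-and-re-run loop in the steps above can be sketched as a comparison between two evaluation runs. This example assumes each run has been reduced to a mapping from task ID to pass/fail; that reduction, along with the function and variable names, is an assumption for illustration and not part of Tool Eval itself.

```python
def compare_runs(baseline, candidate):
    """Classify each task shared by two runs as improved, regressed, or unchanged.

    Both arguments map task_id -> bool (True if the task passed).
    """
    diff = {"improved": [], "regressed": [], "unchanged": []}
    for task_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[task_id], candidate[task_id]
        if after and not before:
            diff["improved"].append(task_id)
        elif before and not after:
            diff["regressed"].append(task_id)
        else:
            diff["unchanged"].append(task_id)
    return diff


# Hypothetical results before and after a prompt change.
baseline = {"t1": True, "t2": False, "t3": False, "t4": True}
candidate = {"t1": True, "t2": True, "t3": False, "t4": False}
diff = compare_runs(baseline, candidate)
print(diff)
```

Tracking regressions separately from improvements guards against a change that raises the overall score while quietly breaking tasks that used to pass.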

How much does ToolBench cost?

Free

$0 USD per user/month

Unlimited public/private repositories
Dependabot security and version updates
2,000 CI/CD minutes/month
500MB of Packages storage
Issues & Projects
Community support

Team

$4 USD per user/month

Everything included in Free, plus:
Access to GitHub Codespaces
Repository rules
Draft pull requests
Code owners
Required reviewers
Pages and Wikis
Environment deployment branches and secrets
3,000 CI/CD minutes/month
2GB of Packages storage
Web-based support

Enterprise

$21 USD per user/month

Everything included in Team, plus:
Data residency
Enterprise Managed Users
User provisioning through SCIM
Enterprise Account to centrally manage multiple organizations
Environment protection rules
Repository rules
Audit Log API
SOC1, SOC2, type 2 reports annually
FedRAMP Tailored Authority to Operate (ATO)
SAML single sign-on
Auditing
GitHub Connect
50,000 CI/CD minutes/month
50GB of Packages storage

Frequently asked questions

How much does ToolBench cost? Is it free?

ToolBench has a free plan, with paid tiers at $4 USD per user/month (Team) and $21 USD per user/month (Enterprise). A 30-day free trial is available, with no credit card required.

What is ToolBench used for? Who is it for?

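ToolBench is used for benchmarking tool-use models, training tool-learning systems, inspecting reasoning traces, and self-hosted research experimentation. It's built for research engineers, ML teams, applied AI developers, and platform teams.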

Does ToolBench have an API and what does it integrate with?

ToolBench doesn't publish a public API. It integrates with RapidAPI, ChatGPT, Google Drive, Discord, and Hugging Face.
