ToolBench
What is ToolBench?
ToolBench is an open platform for training, serving, and evaluating tool-using language models, built for ML teams and research engineers who study how models choose and call tools. It combines open-source code and data with a Web Demo, Tool Eval, and solution-path annotations. The project spans 16,464 APIs across 3,451 tools, is Apache-2.0 licensed, and is self-hostable. Listed plans are Free at $0, Team at $4, and Enterprise at $21 per user per month.
At a glance
- ToolBench is best for ML teams who need a dataset and benchmark for tool-using models.
- Pricing: Free at $0 USD per user/month; Team at $4 USD per user/month; Enterprise at $21 USD per user/month.
- Free trial: 30 days, no credit card required.
What does ToolBench do?
ToolBench is an open platform for training, serving, and evaluating tool-using language models. It organizes tool-learning data into dialogues, APIs, reasoning traces, and solution-path annotations so teams can study how models choose and call tools. The Web Demo and Tool Eval components make it easier to inspect behavior and benchmark tool-use capability without rebuilding the pipeline from scratch. The repository is Apache-2.0 licensed and self-hostable, so teams can run it on their own infrastructure. It is part of the broader OpenBMB ecosystem, its accompanying paper was an ICLR'24 spotlight, and its public GitHub repository has 5.6k stars.
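To make the data organization described above more concrete, here is a minimal Python sketch of how one annotated example might be modeled. The class names, field names, and API names (`ApiCall`, `Dialogue`, `geocode_city`, `get_forecast`) are hypothetical illustrations, not ToolBench's actual schema; check the repository's data files for the real format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ApiCall:
    """One tool invocation inside a solution path (hypothetical schema)."""
    api_name: str
    arguments: dict
    succeeded: bool


@dataclass
class Dialogue:
    """One annotated example: a query, reasoning steps, and the calls made."""
    query: str
    reasoning_trace: list = field(default_factory=list)
    solution_path: list = field(default_factory=list)


def first_failed_call(dialogue: Dialogue) -> Optional[ApiCall]:
    """Return the earliest failed API call, or None if every call succeeded."""
    for call in dialogue.solution_path:
        if not call.succeeded:
            return call
    return None


# Illustrative record; the API names and arguments are invented for the example.
example = Dialogue(
    query="What is the weather in Paris tomorrow?",
    reasoning_trace=["Find a forecast API.", "Call it with the city coordinates."],
    solution_path=[
        ApiCall("geocode_city", {"city": "Paris"}, succeeded=True),
        ApiCall("get_forecast", {"lat": 48.85, "lon": 2.35}, succeeded=False),
    ],
)

failed = first_failed_call(example)
print(failed.api_name if failed else "all calls succeeded")
```

Keeping the reasoning trace and the solution path as separate lists makes it possible to ask where a call failed independently of what the model said it was going to do.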
Why use ToolBench?
- Its open-source license lets teams inspect, modify, and run the platform on their own infrastructure.
- The dataset scale gives researchers a large corpus of dialogues, APIs, and calls for tool-learning work.
- Tool Eval and the Web Demo support faster iteration on evaluation and behavior inspection.
Who is ToolBench for?
- Research engineers who need a benchmark for tool-use behavior and evaluation.
- ML teams who want training data for tool-learning experiments.
- Applied AI developers who need to inspect reasoning traces and solution paths.
- Platform teams who prefer a self-hostable research stack for internal experimentation.
What are ToolBench's key features?
Open-source
Open-source code and data let teams inspect, modify, and self-host ToolBench for internal research workflows without depending on a closed platform.
Large-scale
Covers 16,464 APIs across 3,451 tools, giving teams a broad benchmark for testing agent behavior against many real tool-use scenarios.
High-quality
Includes 120,000 solution-path annotations and 4.0 average reasoning traces, which helps train and evaluate agents on more reliable step-by-step tool use.
Web Demo
Provides a web demo for trying ToolBench in the browser, which helps teams review tool-use examples before integrating them into their own workflows.
Tool Eval
Offers evaluation tooling for comparing agent performance on the dataset, so buyers can measure tool-use quality against the same 16,464-API benchmark.
What does ToolBench integrate with?
- RapidAPI
- ChatGPT
- Google Drive
- Discord
- Hugging Face
What are ToolBench's use cases?
Benchmarking tool-use models
Research engineers use ToolBench to benchmark tool-use behavior and compare models on realistic tasks, using Tool Eval to score outcomes against the same evaluation setup. The large-scale dataset and 469K total API calls help them test whether a system can actually choose and use tools reliably.
Training tool-learning systems
ML teams use ToolBench to build training data for tool-learning experiments, using the high-quality dialogues and 120,000 solution path annotations to teach models how to reach the right API call sequence. The open-source stack makes it easier to adapt the data for internal research workflows.
Inspecting reasoning traces
Applied AI developers use ToolBench to inspect reasoning traces and solution paths when a model fails or succeeds on a task, so they can study the decision points where it selects a tool or chooses its arguments. The Web Demo gives them a quick way to review examples before wiring experiments into their own systems.
Self-hosted research experimentation
Platform teams use ToolBench as a self-hostable research stack for internal experimentation, combining open-source access with Tool Eval to run repeatable tests behind their own infrastructure. That lets them keep benchmark data and evaluation loops inside the organization while iterating on tool-use behavior.
How does ToolBench work?
- Clone the open-source repository and load the large-scale benchmark data into your research environment, then review the dataset structure before running any experiments.
- Use the Web Demo to browse dialogues, inspect reasoning traces, and understand how solution paths are represented across tool-use tasks.
- Connect your model or pipeline to the Tool Eval workflow so you can score tool-use capability against the same evaluation setup used in the benchmark.
- Run experiments on your own infrastructure, compare outputs across runs, and use the high-quality annotations to diagnose where tool selection or execution breaks down.
- Iterate on prompts, training data, or agent logic, then re-run Tool Eval to track whether your changes improve task completion and reasoning consistency.
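The compare-and-re-run loop in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: it treats each run as a mapping from a task ID to a solved/unsolved flag and uses a simple pass rate as the metric. The function names and the result format are hypothetical and are not Tool Eval's actual interface or scoring method.

```python
def pass_rate(outcomes):
    """Fraction of tasks marked as solved; 0.0 for an empty run."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


def compare_runs(baseline, candidate):
    """Compare two runs on the tasks they have in common.

    Both arguments map a task ID to True (solved) or False (not solved).
    """
    shared = sorted(baseline.keys() & candidate.keys())
    fixed = [t for t in shared if candidate[t] and not baseline[t]]
    regressed = [t for t in shared if baseline[t] and not candidate[t]]
    return {
        "baseline_pass_rate": pass_rate([baseline[t] for t in shared]),
        "candidate_pass_rate": pass_rate([candidate[t] for t in shared]),
        "fixed": fixed,
        "regressed": regressed,
    }


# Invented outcomes for two runs of the same four tasks.
baseline = {"task-1": True, "task-2": False, "task-3": False, "task-4": True}
candidate = {"task-1": True, "task-2": True, "task-3": False, "task-4": False}

report = compare_runs(baseline, candidate)
print(report)
```

Listing fixed and regressed tasks separately matters because an unchanged aggregate pass rate can hide a change that repairs some tasks while breaking others, as in this example.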
How much does ToolBench cost?
Free
$0 USD per user/month
- Unlimited public/private repositories
- Dependabot security and version updates
- 2,000 CI/CD minutes/month
- 500MB of Packages storage
- Issues & Projects
- Community support
Team
$4 USD per user/month
- Everything included in Free, plus:
- Access to GitHub Codespaces
- Repository rules
- Draft pull requests
- Code owners
- Required reviewers
- Pages and Wikis
- Environment deployment branches and secrets
- 3,000 CI/CD minutes/month
- 2GB of Packages storage
- Web-based support
Enterprise
$21 USD per user/month
- Everything included in Team, plus:
- Data residency
- Enterprise Managed Users
- User provisioning through SCIM
- Enterprise Account to centrally manage multiple organizations
- Environment protection rules
- Repository rules
- Audit Log API
- SOC1, SOC2, type 2 reports annually
- FedRAMP Tailored Authority to Operate (ATO)
- SAML single sign-on
- Auditing
- GitHub Connect
- 50,000 CI/CD minutes/month
- 50GB of Packages storage
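For budgeting, the per-user prices listed above translate into a monthly total with simple multiplication. The sketch below assumes flat per-user billing with no volume discounts or annual-billing adjustments, which this page does not specify; the function name is illustrative.

```python
# Per-user monthly prices in USD, as listed in the plans above.
PRICE_PER_USER_USD = {"Free": 0, "Team": 4, "Enterprise": 21}


def monthly_cost(plan, users):
    """Return the total monthly cost in USD for a plan and a user count."""
    if users < 0:
        raise ValueError("users must be non-negative")
    return PRICE_PER_USER_USD[plan] * users


# Cost of each plan for a ten-person team.
for plan in PRICE_PER_USER_USD:
    print(plan, monthly_cost(plan, 10))
```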
Frequently asked questions
How much does ToolBench cost? Is it free?
ToolBench has a free plan, with paid tiers at $4 USD per user/month for Team and $21 USD per user/month for Enterprise. A 30-day free trial is available.
What is ToolBench used for? Who is it for?
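ToolBench is used for benchmarking tool-use models, training tool-learning systems, inspecting reasoning traces, and self-hosted research experimentation. It is built for research engineers, ML teams, applied AI developers, and platform teams.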
Does ToolBench have an API and what does it integrate with?
ToolBench doesn't publish a public API. It integrates with RapidAPI, ChatGPT, Google Drive, Discord, and Hugging Face.
