AgentBench
AgentBench tests how well LLMs act as agents on multi-step tasks, tool use, and long interactions in realistic settings.
Reviewed by Mathijs Bronsdijk · Updated Apr 19, 2026

What is AgentBench?
AgentBench is a research benchmark for testing how well large language models behave as agents, not just as chatbots. It was built by researchers from THUDM at Tsinghua University and presented at ICLR 2024. The core idea is simple: a model that writes fluent text is not necessarily good at completing multi-step tasks, using tools, recovering from mistakes, or staying on track over a long interaction. AgentBench was created to measure those harder abilities in a more realistic way.
Instead of giving models one prompt and grading one answer, AgentBench puts them into eight interactive environments. These include operating system tasks, database queries, knowledge graph reasoning, web shopping, web browsing, household task simulations, a digital card game, and lateral thinking puzzles. Across those environments, the benchmark evaluates whether a model can plan, act, interpret feedback, and keep going across multiple turns. In the original paper, the team evaluated 29 models and found a clear gap between top commercial systems like GPT-4 and most open source models.
What we found most important in the research is that AgentBench is less a product and more a measuring instrument for the whole agent field. It has become a reference point for researchers building new agent systems, model providers comparing capabilities, and teams that want to know whether a model can do more than answer questions. It also spawned follow-on work such as VisualAgentBench and General AgentBench, which tells you the original framework landed in the right place, even if it is not the final word on agent evaluation.
Key Features
- Eight interactive environments: AgentBench tests agents across eight different settings: OS operations, SQL databases, knowledge graphs, web shopping, web browsing, house-holding, lateral thinking puzzles, and a digital card game. That range matters because many benchmarks only test one skill, while AgentBench reveals where a model is uneven, for example strong at code tasks but weak at planning.
- Multi-turn evaluation: Tasks unfold over several steps instead of one-shot prompts. This matters because many real agent failures do not happen on the first response; they happen on turn 5 or turn 15, when the model forgets context, repeats itself, or takes the wrong action. A minimal loop sketch after this list shows the pattern.
- 29-model baseline study: The original research compared 29 language models, including GPT-4, Claude-2, GPT-3.5-turbo, and open models such as CodeLLaMA. That gave teams a useful historical snapshot of the market and showed a consistent performance gap between commercial APIs and open models.
- Environment-specific scoring: AgentBench uses success rate, F1 score, and reward functions depending on the task. That sounds technical, but it matters because a web shopping task and a knowledge graph retrieval task need different grading methods if scores are to mean anything. The metric sketch after this list shows the standard definitions.
- Failure mode tracking: The framework does not just record pass or fail. It also tracks reasons such as invalid format, invalid action, context limit exceeded, and task limit exceeded (the loop sketch below records these outcomes). In the paper, task limit exceeded showed up often, which pointed to weak long-horizon reasoning as a central problem for current agents.
- Modular three-part architecture: AgentBench is organized around a Task Server, Agent Server, and Client, all communicating over HTTP (see the interaction sketch after this list). That design matters for research teams because it lets them swap in different models, run evaluations on separate machines, and compare systems without rewriting the whole stack.
- Realistic tool-use scenarios: Database tasks use actual SQL-style interaction, web tasks are grounded in existing datasets like WebShop and Mind2Web, and house-holding uses ALFWorld. The benchmark is still a simulation, but it is closer to real tool use than many toy agent tests.
- Open research foundation: The codebase is available on GitHub, and the framework has been reused in later projects such as VisualAgentBench and healthcare-focused variants. For research groups, that means AgentBench is not just a paper result; it is something they can extend and inspect.
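To make the multi-turn evaluation and failure categories concrete, here is a minimal sketch of the kind of episode loop such a harness runs. This is illustrative structure, not code from the AgentBench repository; the `agent` and `env` objects and their methods are hypothetical stand-ins.

```python
# Sketch of a multi-turn evaluation loop with AgentBench-style failure
# categories. Hypothetical interfaces, not the actual AgentBench code.
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    success: bool
    reason: str   # "completed", "failed", "invalid_action", "task_limit_exceeded"
    turns: int

def run_episode(agent, env, max_turns: int = 30) -> EpisodeResult:
    observation = env.reset()
    for turn in range(1, max_turns + 1):
        action = agent.act(observation)        # one model call per turn
        if not env.is_valid(action):           # e.g. malformed command
            return EpisodeResult(False, "invalid_action", turn)
        observation, done, solved = env.step(action)
        if done:
            return EpisodeResult(solved, "completed" if solved else "failed", turn)
    # The agent kept acting without finishing: the paper's most common failure.
    return EpisodeResult(False, "task_limit_exceeded", max_turns)
```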
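The environment-specific metrics are standard definitions. As a rough sketch, success rate and set-based F1 look like this; these are the textbook formulas, not AgentBench's exact grading code.

```python
# Standard definitions of two metrics used by AgentBench-style harnesses.

def success_rate(outcomes: list[bool]) -> float:
    """Fraction of episodes that ended in task success."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def f1(predicted: set, gold: set) -> float:
    """Set-based F1, the style of grading used for answer-retrieval tasks."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```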
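And to illustrate the three-part architecture, here is a conceptual sketch of one client-driven turn. The endpoint paths, ports, and payload fields are invented for illustration; consult the AgentBench repository for the real interface.

```python
# Conceptual sketch of the Task Server / Agent Server / Client split.
# All endpoints and fields below are hypothetical.
import requests

TASK_SERVER = "http://localhost:5000"    # serves the environments
AGENT_SERVER = "http://localhost:5001"   # wraps a model behind HTTP

def client_step(session_id: str) -> bool:
    """One turn: fetch the observation, ask the agent, apply the action."""
    obs = requests.get(f"{TASK_SERVER}/observe/{session_id}").json()
    action = requests.post(f"{AGENT_SERVER}/act",
                           json={"observation": obs["text"]}).json()
    result = requests.post(f"{TASK_SERVER}/step/{session_id}",
                           json={"action": action["text"]}).json()
    return result["done"]
```

Because each component only speaks HTTP, swapping in a different model means pointing the client at a different Agent Server, with no changes to the task side.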
Use Cases
The clearest use case for AgentBench is model evaluation before building an agent product. In the original study, GPT-4 was the strongest overall model and succeeded in 6 of the 8 environments, including a 78% success rate on House-Holding. That kind of result gave research teams and product builders a grounded way to say, “this model can probably handle longer, tool-based workflows better than the alternatives,” instead of relying on vague impressions from chatbot demos.
AgentBench is also used as a baseline for academic agent research. Because the paper tested 29 models across the same tasks, later research could compare new prompting methods, planning systems, or model fine-tuning approaches against something the field already recognized. That is part of why variants such as General AgentBench and VisualAgentBench appeared. Researchers were not starting from zero; they were building on a benchmark that had already exposed a real problem: many models can sound capable but break down during sustained interaction.
A third use case is failure analysis. The benchmark’s error categories tell a story that raw scores miss. In the paper, “task limit exceeded” was the most common failure mode, which means many agents did not collapse instantly; they wandered, taking action after action without getting closer to the goal. For teams building internal agents, that is useful because it points to where work is needed (planning, memory, tool grounding, or action validation) rather than just saying a model “performed poorly.” The short sketch below shows how easily those categories can be tallied.
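As a small example of this kind of analysis, the snippet below counts failure reasons across a run. It assumes a hypothetical JSONL results file with `success` and `reason` fields per episode; the real AgentBench output format may differ.

```python
# Tally failure modes across a results file (hypothetical record format).
import json
from collections import Counter

def tally_failures(path: str) -> Counter:
    reasons = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if not record.get("success", False):
                reasons[record.get("reason", "unknown")] += 1
    return reasons

# e.g. Counter({'task_limit_exceeded': 41, 'invalid_action': 12, ...})
```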
Strengths and Weaknesses
Strengths:
AgentBench covers more than one kind of agent work. That sounds obvious, but it is a real strength. Many benchmarks go deep in one area, such as coding, browsing, or customer support. AgentBench puts database querying next to web navigation next to game reasoning, which reveals how patchy model capability can be. A model that looks strong in one benchmark may look far less reliable here.
The benchmark helped make the commercial versus open model gap visible with numbers, not anecdotes. In the original evaluation, commercial API models consistently outperformed most open models, and CodeLLaMA-34B, one of the strongest open entries in that size range, still trailed GPT-3.5-turbo in important ways. For decision-makers, that was useful because it turned a vague belief into a measurable tradeoff.
Its failure reporting is unusually practical. Instead of stopping at leaderboard scores, AgentBench records whether the model failed because it formatted output incorrectly, attempted an invalid action, ran out of context, or hit the task limit. That gives research teams something to fix. In our view, this is one of the benchmark’s most valuable contributions.
Weaknesses:
AgentBench is a benchmark, not a production testbed. Some environments are adapted from datasets and simulations such as ALFWorld, WebShop, and Mind2Web. They are useful approximations, but they are still approximations. Strong performance here does not guarantee the same behavior on a live ecommerce site, a messy enterprise database, or a real operating environment.
It focuses more on whether the task gets completed than on how the agent gets there. An agent can take a clumsy, expensive route and still receive the same credit as one that solves the task efficiently. If you care about latency, token costs, or whether an agent behaves in a way a human teammate can trust, AgentBench only answers part of that question.
The benchmark also sits inside a wider debate about evaluation reliability. Separate research from Berkeley showed that several agent benchmarks can be exploited through quirks in scoring or environment design. That study did not single out AgentBench as the main example, but it raised a fair concern for the whole category. Any team using AgentBench should treat it as one signal, not the final truth.
Pricing
AgentBench is not a SaaS product with a pricing page. The framework is open research infrastructure, so the software itself is available through GitHub.
- Open-source framework: $0. You can access the code without paying a license fee. In practice, though, “free” only describes the repository, not the full cost of running evaluations.
- Model API costs: Variable. If you test commercial models such as GPT-4 or Claude through APIs, your real spend comes from prompt and completion usage across many task turns. Multi-step benchmarks can get expensive quickly because each task may require repeated model calls; a rough estimate sketch appears at the end of this section.
- Self-hosted model costs: Variable. If you evaluate open models locally, you avoid API bills but take on GPU and infrastructure costs. For larger models and broad benchmark runs, that can mean dedicated machines and longer experiment times.
- Operational setup costs: Team time. There is also a setup cost in engineering time. You may need to configure task servers, model endpoints, datasets, and environment dependencies before you get a clean run.
For most users, the real comparison is not “free versus paid.” It is “open benchmark with infrastructure work” versus “managed evaluation platform with less setup.” That distinction matters more than the repo price.
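To see why multi-turn evaluation multiplies API spend, here is a back-of-envelope estimate. Every number below is an illustrative placeholder, not a real price or a measured AgentBench figure.

```python
# Back-of-envelope API cost estimate for a multi-turn benchmark run.
# All numbers are hypothetical placeholders.
tasks = 200                 # episodes in a run
turns_per_task = 15         # model calls per episode
tokens_per_call = 3_000     # prompt + completion, grows with history
price_per_1k_tokens = 0.01  # hypothetical blended rate, USD

calls = tasks * turns_per_task
cost = calls / 1_000 * tokens_per_call * price_per_1k_tokens
print(f"{calls} model calls, ~${cost:,.2f}")   # 3000 model calls, ~$90.00
```

The point is not the specific total but the multiplication: every extra turn per task scales the whole bill linearly.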
Alternatives
SWE-bench
SWE-bench is for teams that care specifically about software engineering agents. It evaluates models on real GitHub issues and codebase fixes. If your product lives or dies on code repair and repository reasoning, SWE-bench is often more relevant than AgentBench. AgentBench is broader, but SWE-bench goes deeper in one domain.
WebArena
WebArena is a stronger choice when your main concern is browser-based task completion on realistic web interfaces. It focuses on web workflows in more depth than AgentBench’s broader mix of environments. Teams building browsing agents may prefer WebArena, while teams comparing general-purpose agent behavior may prefer AgentBench.
τ-bench
τ-bench focuses on realistic dialog and tool use with dynamic interactions. Compared with AgentBench, it puts more weight on the messiness of real user conversations and tool-calling situations. If you are evaluating customer-facing assistants or support agents, τ-bench may tell you more about conversational behavior.
General AgentBench
General AgentBench is the closest conceptual alternative because it extends the original idea. Instead of testing each environment separately, it asks agents to handle mixed task types in a more unified setting. That is closer to how real assistants are used. If AgentBench asks, “can this model do these tasks,” General AgentBench asks, “can this model figure out what kind of task it is dealing with in the first place?”
VisualAgentBench
VisualAgentBench matters if your agent needs to interpret screens, GUIs, or visual environments. AgentBench is mainly text and tool interaction. VisualAgentBench expands into multimodal evaluation, so it is a better fit for desktop agents, interface agents, and embodied systems.
FAQ
What is AgentBench used for?
It is used to evaluate how well language models behave as agents across multi-step tasks. Researchers and builders use it to compare models, study failures, and test new agent methods.
Who created AgentBench?
AgentBench was created by researchers from THUDM at Tsinghua University. The work was published at ICLR 2024.
Is AgentBench a product or a benchmark?
It is a benchmark framework, not a commercial agent platform. You use it to measure agent performance rather than to deploy an end-user assistant.
What kinds of tasks does AgentBench include?
It includes 8 environments: operating system tasks, databases, knowledge graphs, web shopping, web browsing, house-holding, lateral thinking puzzles, and a digital card game.
Which models performed best in the original research?
GPT-4 was the strongest overall in the original evaluation. Claude-2 and Claude also performed well, while most open models lagged behind the top commercial systems.
What did AgentBench reveal about current AI agents?
The big finding was that many models struggle with long-horizon reasoning and decision-making. A common failure mode was hitting the task limit without solving the problem, which means the agent kept acting without making real progress.
Is AgentBench open source?
Yes. The framework is available on GitHub, and researchers can run or extend it themselves.
How do I get started?
Start by cloning the AgentBench repository, setting up the required Python environment, and connecting a model endpoint through the Agent Server interface. From there, you choose which environments to run and configure the evaluation client.
How long does setup take?
For a research team already comfortable with Python environments, HTTP services, and model hosting, the first setup can take a few hours to a day. If you want all environments working cleanly, expect more time for dependencies and task-specific configuration.
Does AgentBench test real-world production behavior?
Partly, but not fully. The tasks are more realistic than many toy benchmarks, yet they are still controlled environments and dataset-based simulations.
Is AgentBench enough to choose a production model?
No. It is a useful signal, especially for planning and tool use, but teams should pair it with domain-specific tests, safety checks, and cost analysis before making a production decision.
What are the main limitations?
It does not deeply evaluate safety, efficiency, or trajectory quality. It also depends on benchmark design choices, so scores should be treated as informative comparisons, not absolute proof of real-world reliability.