
AgentBench vs WebArena: Breadth Across Agent Skills or Depth in Web Navigation

Reviewed by Mathijs Bronsdijk · Updated Apr 22, 2026


AgentBench

Benchmarking LLMs as real-world agents, not just chatbots

WebArena

Reproducible benchmark for evaluating web agents in realistic local sites


If you are choosing between AgentBench and WebArena, you are not choosing two versions of the same benchmark. You are choosing between two different ideas of what matters when an agent fails.

AgentBench asks: can this model handle a wide spread of agentic settings - code, games, web, knowledge graphs, household tasks - and keep its reasoning together across multi-turn interaction? WebArena asks a narrower but sharper question: can this agent actually survive realistic web workflows inside a controlled browser environment, with reproducible websites and programmatic task checks?

That is the real axis here. AgentBench is the breadth play. WebArena is the depth play. One is built to reveal how an agent behaves across task families and failure modes. The other is built to stress web navigation until the browser itself becomes the test.

For teams making a benchmark-selection decision, that difference matters more than any surface similarity. Both are evaluation tools for autonomous agents. Both are interactive. Both are used in serious research. But they disagree on what kind of evidence should count.

The decision is not "which benchmark is better"

The wrong way to compare these tools is to ask which one is more advanced. They are advanced in different directions.

AgentBench, from THUDM and presented at ICLR 2024, was designed to close a gap in agent evaluation: the field had plenty of static language benchmarks, but not enough standardized ways to test LLMs as autonomous agents across complex, multi-turn environments. It spans eight environments across code-grounded, game-grounded, and web-grounded tasks, and it evaluated 29 models in one framework. That breadth is the point. It is trying to expose whether a model can reason, plan, recover from mistakes, and sustain coherence across different kinds of interaction.

WebArena, by contrast, was created to solve the environment problem in web agent research. It was introduced as the first large-scale, realistic, and reproducible web environment for evaluating and training web agents. It uses full websites hosted locally in Docker - e-commerce, forum, GitLab, MediaWiki - so researchers can get realism without the chaos of live websites. Its mission is narrower but more demanding: not "can the agent do many kinds of tasks?" but "can it actually navigate realistic web systems well enough to matter?"

So the choice is not about quality. It is about what kind of failure you need to surface.

If you need to know whether your agent is broadly competent across different task settings, AgentBench gives you the bigger map. If you need to know whether your agent can navigate web workflows with enough fidelity to be useful, WebArena gives you the harsher test.

AgentBench is a portfolio benchmark; WebArena is a web-specific stress test

AgentBench's structure makes its philosophy obvious. The benchmark includes eight environments: operating systems, databases, digital card games, lateral thinking puzzles, household tasks, web shopping, web browsing, and knowledge graphs. Five of these were created specifically for the benchmark and three were adapted from published datasets. That hybrid design matters because it is not trying to simulate one domain perfectly. It is trying to compare agent behavior across a portfolio of settings.

That portfolio approach is what makes AgentBench valuable for teams with mixed agent ambitions. The benchmark can tell you whether a model is strong at SQL but weak at long-horizon planning, or good at web shopping but poor at lateral reasoning. It surfaces failure modes like poor long-term reasoning, weak instruction following, invalid actions, and exceeded context or task limits. In other words, it is less interested in one polished score than in the shape of the breakdown.
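One way to use that shape in practice is to treat the per-environment results as a profile rather than collapsing them into a single average. Here is a minimal sketch of that idea; the environment names follow AgentBench's eight task families, but the scores are entirely hypothetical and stand in for whatever results your own runs produce.

```python
# Illustrative only: hypothetical per-environment scores for one model.
# The keys mirror AgentBench's eight task families; the values are made up.
scores = {
    "operating_system": 0.42,
    "database": 0.61,
    "knowledge_graph": 0.28,
    "digital_card_game": 0.35,
    "lateral_thinking_puzzle": 0.12,
    "household": 0.55,
    "web_shopping": 0.48,
    "web_browsing": 0.21,
}

mean_score = sum(scores.values()) / len(scores)
weakest = sorted(scores, key=scores.get)[:3]  # the settings that drag the model down

print(f"headline average: {mean_score:.2f}")
print("weakest settings:", ", ".join(weakest))
```

The point is the profile, not the average: two models with the same headline number can break in very different places.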

WebArena is more focused. It is not trying to compare databases, puzzles, and household tasks. It is trying to make web automation hard in a realistic way. The benchmark includes four website categories - shopping, forum, GitLab, and MediaWiki - plus map and knowledge resources. The tasks are natural-language instructions that require navigation, form filling, search, cross-page reasoning, and sometimes cross-site behavior. The environment is built to resemble real web systems while staying reproducible.
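To make the "programmatic task checks" concrete, here is a sketch of what a WebArena-style task record contains. The field names are illustrative approximations loosely modeled on the JSON task configs the project ships; check the repository for the exact schema before relying on any of them.

```python
# Illustrative sketch of a WebArena-style task record (field names are
# approximations, not a guaranteed match for the project's actual schema).
task = {
    "task_id": 101,                      # hypothetical ID
    "sites": ["shopping"],               # which locally hosted site(s) the task touches
    "intent": "Find the cheapest wireless mouse and add it to the cart.",
    "eval": {
        # Programmatic check: the benchmark scores the final page state or
        # answer rather than trusting the agent's own claim of success.
        "eval_types": ["program_html"],
        "reference_answers": None,
    },
}
```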

That narrower scope is a strength, not a weakness, if your actual question is web-agent readiness. WebArena is not diluted by unrelated task types. It drills into browser interaction, page state, and workflow completion. If your product lives and dies by browser automation, that depth is more actionable than AgentBench's breadth.

Reproducibility and realism are handled differently

This is where the tradeoff gets sharp.

AgentBench tries to preserve reproducibility by separating the task server, agent server, and client, all communicating over HTTP. That architecture is designed to make evaluation scalable and model-agnostic. It also keeps the benchmark modular: you can evaluate specific environments, deploy models through an HTTP interface, and run the benchmark across multiple machines. The result is a framework that is relatively easy to adapt for different model types and infrastructure setups.
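The practical consequence of that separation is that a model only needs to sit behind an HTTP endpoint to be evaluated. A minimal sketch of the pattern, using only Python's standard library; the request and response shapes here are illustrative, not AgentBench's actual wire format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    """Toy agent endpoint: accepts a conversation, returns the next action.

    The JSON shape is illustrative; the real client/agent protocol is
    defined by the benchmark's own configs and may differ.
    """

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        history = payload.get("messages", [])

        # A real deployment would call a model here; this stub just acknowledges.
        reply = {"role": "agent", "content": f"ACK ({len(history)} turns seen)"}

        body = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), AgentHandler).serve_forever()
```

Anything that can answer that kind of request - a hosted API, a local model, a rule-based baseline - can slot into the same evaluation loop.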

But AgentBench's realism is distributed across heterogeneous environments. The web tasks are only part of the picture. Its strength is that it can compare agent behavior across settings that feel different from one another. Its weakness is that no single environment receives the same depth of treatment as a dedicated benchmark.

WebArena goes much harder on environment realism within one domain. It uses locally hosted, fully functional websites so researchers can get both realism and reproducibility. That is a big deal in web evaluation, where live sites change constantly and synthetic tasks often feel too toy-like. WebArena's Docker-based architecture is built to preserve the complexity of real websites while freezing the environment so results can be replayed.

The result is a benchmark that feels more like a browser-native test of competence. It is closer to the actual operational problem. But that realism comes with setup cost and a narrower domain. WebArena is more faithful to web work, less broad in what it covers.

If you care most about comparing agent capability across multiple interaction styles, AgentBench's architecture is the better fit. If you care most about whether a browser agent can handle realistic websites in a controlled environment, WebArena is the sharper instrument.

The failure modes they expose are not the same

This is the part buyers should pay the most attention to.

AgentBench is unusually good at diagnosing why agents fail. It sorts every run into five outcome categories: valid completion, invalid format, invalid action, context limit exceeded, and task limit exceeded. That taxonomy is useful because it separates syntax mistakes from tool misunderstandings from long-horizon reasoning breakdowns. The paper also reports that task limit exceeded is the predominant failure reason across environments, which suggests current LLM agents often get stuck in loops, revisit failed strategies, or fail to recognize that their approach is not working.
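If you log run outcomes in that taxonomy, the diagnostic view is trivial to build. A small sketch, assuming a hypothetical list of per-run outcome labels rather than AgentBench's actual result files:

```python
from collections import Counter

# The five outcome categories the benchmark reports.
CATEGORIES = [
    "completed",
    "invalid_format",
    "invalid_action",
    "context_limit_exceeded",
    "task_limit_exceeded",
]

# Hypothetical per-run outcomes; in practice these would be parsed from
# the benchmark's own output files.
runs = ["completed", "task_limit_exceeded", "invalid_action",
        "task_limit_exceeded", "completed", "context_limit_exceeded"]

counts = Counter(runs)
total = len(runs)
for category in CATEGORIES:
    share = counts.get(category, 0) / total
    print(f"{category:<24} {share:6.1%}")
```

A distribution dominated by task limit exceeded points at looping and failure to replan, not formatting problems.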

That is a very specific kind of signal. AgentBench is good when you want to know whether your agent can plan, recover, and stay coherent across turns. It is less about pixel-perfect interface handling and more about whether the model can keep its head when the interaction gets messy.

WebArena surfaces a different class of failure. It highlights problems with form interactions, dropdown selection, accessibility tree extraction, multi-step planning, and context management in browser workflows. It also notes that validation scripts can be overly restrictive, penalizing semantically correct but textually different outputs. That means WebArena often tells you whether the agent can actually execute a browser task, but it can also punish agents for interface-level mismatches or evaluation rigidity.
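The evaluation-rigidity point is easy to see with a toy example: an exact string check marks a semantically correct answer wrong, while even light normalization rescues it. This is an illustration of the failure pattern, not WebArena's actual scoring code.

```python
import re

def _normalize(s: str) -> str:
    # Lowercase and drop everything except digits, letters, and dots.
    return re.sub(r"[^0-9a-z.]", "", s.lower())

def exact_match(predicted: str, reference: str) -> bool:
    # The brittle check: any formatting difference fails the task.
    return predicted == reference

def normalized_match(predicted: str, reference: str) -> bool:
    # Slightly more forgiving: same answer modulo symbols and case.
    return _normalize(predicted) == _normalize(reference)

reference = "$15.00"   # what the validation script expects
predicted = "15.00"    # a semantically correct agent answer

print(exact_match(predicted, reference))       # False: penalized despite being right
print(normalized_match(predicted, reference))  # True once formatting is ignored
```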

In practice, WebArena is the better benchmark for browser-specific failure analysis. If your agent struggles with dropdowns, page transitions, or form completion, WebArena will show it. AgentBench may reveal broader reasoning issues, but it will not isolate web-navigation pain as cleanly.

So if you are debugging an agent that uses browser automation, WebArena is the more relevant failure lens. If you are trying to understand whether your model has general agentic weakness - especially in long multi-turn loops - AgentBench is more diagnostic.

The performance stories are different too

The headline numbers make clear that these are two different kinds of benchmark.

AgentBench evaluated 29 models and found a large gap between commercial and open-source systems. GPT-4 came out strongest overall, succeeding on 6 of 8 environments and reaching 78 percent success on House-Holding. Claude-2 and Claude followed closely behind. The best open-source model with 70 billion parameters or fewer, CodeLLaMA-34b, still lagged meaningfully behind the top commercial systems. That is a broad capability story: some models generalize better across agentic settings, and current open-source systems still trail at the top end.

WebArena's story is more dramatic. Early GPT-4 performance was around 14 percent on the benchmark, which made clear how hard realistic web automation was. Since then, specialized agent architectures have pushed the ceiling much higher: OpAgent reached 71.6 percent on the leaderboard, and Meka reached 72.7 percent on a cleaned subset. Human performance is around 78 percent. That means the best systems are now in the same rough neighborhood as humans, but the benchmark still exposes a meaningful gap.

Those two performance stories imply different uses.

AgentBench is useful when you want to compare model families and understand how capability transfers across task types. WebArena is useful when you want to measure how close a web agent is to practical competence in a browser. The fact that WebArena's best systems are approaching human performance makes it especially valuable for teams that need a realistic ceiling, not just a general capability score.

Breadth versus depth is also a pricing and effort question

Neither benchmark is a casual download-and-run toy, but they ask for different kinds of effort.

AgentBench's setup is modular and configuration-driven. It describes a three-part architecture - task server, agent server, client - and notes that researchers can test models through HTTP endpoints. That makes it relatively flexible for teams that already have model-serving infrastructure. There is also an Inspect AI integration, which lowers the barrier for teams already using Inspect for evaluation. In practical terms, AgentBench is easier to slot into a research pipeline if you are already comfortable with multi-environment evaluation.

WebArena asks more from the environment side. It describes AWS AMI deployment, Docker-based manual setup, multiple website services, URL configuration, reset workflows, and substantial storage and compute requirements. It is not impossible to run, but it is more operationally involved. That is the cost of realism. You are not evaluating against a simple simulator. You are running a small web stack.
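A minimal sketch of the pre-flight check that operational effort implies: confirm every locally hosted site is reachable before a run. The environment-variable names below follow the convention the project documents (SHOPPING, GITLAB, and so on), but treat them as assumptions and confirm the exact set against the README.

```python
import os
import urllib.request

# Names follow WebArena's documented convention; verify against the README.
SITE_ENV_VARS = ["SHOPPING", "SHOPPING_ADMIN", "REDDIT", "GITLAB", "WIKIPEDIA", "MAP"]

def check_sites() -> None:
    """Print the HTTP status of each configured site, or flag it as missing."""
    for var in SITE_ENV_VARS:
        url = os.environ.get(var)
        if not url:
            print(f"{var:<16} not configured")
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(f"{var:<16} {resp.status} at {url}")
        except Exception as exc:  # any failure means: fix the web stack first
            print(f"{var:<16} unreachable ({exc})")

if __name__ == "__main__":
    check_sites()
```

Running a check like this before every evaluation, and resetting the containers between runs, is part of what keeps WebArena results reproducible.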

So if your team wants a benchmark that is easier to integrate into a broader evaluation suite, AgentBench is usually the lighter lift. If your team is willing to invest in infrastructure because browser realism is the point, WebArena justifies the overhead.

Ecosystem maturity favors both, but in different ways

AgentBench has become a foundation for a family of specialized variants: VisualAgentBench for multimodal agents, FHIR-AgentBench for healthcare, AGENTiGraph for knowledge-graph collaboration, ProAgentBench for proactive agents, and LiveClawBench for complexity-annotated evaluation. That tells you AgentBench is being used as a base layer for broader agent evaluation research, and part of its value is that it can be extended into domain-specific variants.

WebArena has a different kind of ecosystem. It has become the standard web-agent benchmark and spawned BrowserGym, WebArena-Verified, VisualWebArena, ST-WebAgentBench, SecureWebArena, WebChoreArena, and WebForge. That ecosystem is tightly clustered around browser automation and web-agent realism. It is not broad in domain coverage, but it is deep in web-agent research.

This distinction matters if you are choosing a benchmark for a product roadmap. AgentBench plugs into a larger conversation about general agent capability. WebArena plugs into a more focused conversation about browser automation, enterprise workflows, and realistic web interaction. Both are ecosystem-rich, but the surrounding research communities are different.

Where each benchmark breaks

A good compare page should be blunt about the cracks.

AgentBench breaks when you need domain-specific fidelity. Its environments are real enough to be useful, but they are still abstractions. The authors are explicit that the benchmark does not deeply evaluate safety, efficiency, or trajectory quality, and that strong performance does not guarantee production readiness. They also note that general-agent settings are harder than domain-specific ones, which means performance can drop when tasks are merged into more realistic mixed settings. In short: AgentBench is excellent for broad agent evaluation, but it is not a substitute for domain-specific or safety-specific testing.

WebArena breaks when you need breadth beyond web automation. It is a web benchmark, full stop. It can tell you a lot about browser agents, but not much about code execution, knowledge-graph reasoning, or other agentic settings. Some WebArena tasks are infeasible, validation can be too strict, and accessibility issues can distort results. So even within its domain, it is not a perfect mirror of the real web. It is a controlled approximation with high practical value.

If you need a benchmark that covers many agent modes, WebArena is too narrow. If you need a benchmark that faithfully isolates browser workflows, AgentBench is too diffuse.

Which one should you pick?

Pick AgentBench if your real question is: "How does this model behave as an agent across different settings?" It is the better choice for teams comparing models, studying failure modes, or building a broad evaluation portfolio. It exposes long-horizon reasoning problems, instruction-following failures, invalid actions, and context exhaustion across eight environments. If you care about general agent competence more than browser realism, AgentBench is the more informative benchmark.

Pick WebArena if your real question is: "Can this agent actually do realistic web work?" It is the better choice for browser-agent teams, web automation research, and product groups that need a reproducible stand-in for real websites. It is the standard benchmark for realistic web interaction, it has a strong ecosystem, and top agents are now approaching human performance - which makes it especially useful when you need a meaningful ceiling.

If you are deciding between them for a serious evaluation program, the cleanest answer is often not either/or. Use AgentBench when you need breadth and failure diagnosis across agentic settings. Use WebArena when you need depth and realism in web navigation. But if you only get one, let the shape of your product decide: general agent capability points to AgentBench; browser-native automation points to WebArena.

Pick AgentBench if you need breadth across agent settings, multi-turn failure analysis, and a modular benchmark that reveals where models break outside one domain.

Pick WebArena if you need depth in realistic web navigation, reproducible browser environments, and the clearest test of whether an agent can actually complete web workflows.