
WebArena

WebArena is a research benchmark for web agents, offering realistic locally hosted websites for controlled, reproducible evaluation.

Reviewed by Mathijs Bronsdijk · Updated Apr 19, 2026

Tool · Open Source + Paid · Updated 25 days ago
Open Source · Self-Hosted · API Available · Cloud, Self-hosted

  • First large-scale web environment for agents
  • Supports e-commerce, social media, and more
  • Achieved 71.6% success rate with OpAgent
  • Over 800 tasks designed for realistic interactions
  • Open-source with active community contributions
  • Uses Docker for easy deployment and reproducibility
  • Human evaluators achieve 78% success rate
  • Extensions for safety and multimodal evaluation
Screenshot of WebArena website

What is WebArena?

WebArena is a benchmark for web agents, not a production automation tool. It was introduced by researchers led by Shuyan Zhou at Carnegie Mellon University to solve a very specific problem in agent research: most web benchmarks were either too toy-like to matter, or too dependent on live websites to reproduce results. WebArena sits in the middle. It gives agents realistic websites to operate, but hosts them locally in controlled environments so researchers can run the same tasks again and again.

The project packages full web apps inside Docker containers, including an e-commerce store built on Magento, a forum built on Postmill, a GitLab instance, and a MediaWiki site. Instead of asking an agent to click a fake button in a tiny sandbox, WebArena asks it to complete real multi-step tasks across software that looks and behaves like the web tools people already use. That design choice is why it quickly became a standard benchmark in papers on browser agents, planning systems, and multimodal web interaction.
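
If you are curious what that containerized setup looks like in practice, here is a minimal sketch using the Docker SDK for Python. The image names, tags, and host ports are hypothetical placeholders rather than the project's real values; the actual deployment recipe lives in the WebArena documentation.

```python
# Minimal sketch of spinning up locally hosted benchmark sites with the
# Docker SDK for Python. Image names and host ports are hypothetical
# placeholders; the real names, ports, and environment variables come
# from the WebArena setup docs.
import docker

client = docker.from_env()

# Hypothetical (image, host_port) pairs standing in for the four main apps.
SITES = {
    "example/webarena-shopping:latest": 7770,  # Magento store (placeholder)
    "example/webarena-forum:latest": 9999,     # Postmill forum (placeholder)
    "example/webarena-gitlab:latest": 8023,    # GitLab instance (placeholder)
    "example/webarena-wiki:latest": 8888,      # MediaWiki site (placeholder)
}

for image, host_port in SITES.items():
    client.containers.run(
        image,
        detach=True,                   # run in the background
        ports={"80/tcp": host_port},   # expose the app on localhost
        name=image.split("/")[-1].split(":")[0],
    )

print("Containers started; point the benchmark's site URLs at these ports.")
```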

Our research found that WebArena is used mostly by AI researchers, model labs, and agent teams trying to answer a practical question: can this agent actually navigate the web well enough to be trusted on non-trivial tasks? It is also the base for a growing family of related benchmarks, including VisualWebArena, WebArena-Verified, ST-WebAgentBench, and security-focused extensions. In other words, WebArena is less a single repo than a reference point for the whole web-agent field.

Key Features

  • Realistic self-hosted websites: WebArena runs full applications locally, including Magento, GitLab, Postmill, and MediaWiki. That matters because agents face the same messy menus, forms, filters, and workflows people do, instead of simplified mock pages.

  • Reproducible evaluation: Because the sites are containerized and hosted in a controlled setup, researchers can rerun the same task under the same conditions. This is one of WebArena's biggest contributions, since live-site evaluation often breaks when interfaces change.

  • Multi-domain task coverage: The benchmark spans shopping, forums, software development, and knowledge management. That breadth helps teams test whether an agent generalizes across contexts, rather than overfitting to one website type.

  • Large task suite: The original benchmark includes 812 tasks, with later community work highlighting cleaner subsets like WebArena-Lite at 165 tasks and verified variants for more reliable reporting. The numbers matter here because small benchmarks can hide brittle behavior.

  • Natural language task instructions: Tasks are phrased like real user requests, not rigid scripts. This pushes agents to plan, interpret intent, and choose actions instead of replaying a known sequence.

  • Automated task validation: WebArena uses annotated programs to check whether a task was completed correctly. That makes large-scale evaluation possible without judging every run by hand, though some researchers have noted that validation can be stricter than human judgment.

  • Browser automation support: The environment works with tools like Playwright and Selenium, and is often used through BrowserGym. This lowers the barrier for teams already building browser agents in standard automation stacks; a minimal usage sketch follows this list.

  • Multiple observation modes: Researchers can expose agents to HTML, accessibility trees, screenshots, and set-of-mark views. That flexibility is important because text-only and multimodal agents often fail in different ways.

  • Cross-site and multi-step workflows: Some tasks require moving across pages, tabs, or even sites to gather information and finish an objective. That is closer to real work than one-page click tests.
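
To make the automation and observation points above concrete, here is a minimal sketch of driving a WebArena task through BrowserGym. It assumes the self-hosted sites are already deployed and the environment variables pointing at them are configured as the BrowserGym and WebArena docs describe; the task id and the action string are illustrative placeholders, not values from this review.

```python
# A minimal sketch of running one WebArena task through BrowserGym,
# assuming the self-hosted sites are deployed and configured per the docs.
import gymnasium as gym
import browsergym.webarena  # importing registers "browsergym/webarena.*" tasks

env = gym.make("browsergym/webarena.0")  # placeholder task id
obs, info = env.reset()

# Observations bundle several views (HTML/DOM, accessibility tree,
# screenshot); which ones you feed the agent is up to you.
print(obs.keys())

# An agent would normally derive this from the observation; here it is a
# hard-coded placeholder in BrowserGym's high-level action syntax.
action = "click('12')"
obs, reward, terminated, truncated, info = env.step(action)

# Reward is computed by WebArena's annotated validators for the task.
print(reward, terminated)
env.close()
```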

Use Cases

WebArena is most often used to build and test browser agents that can handle realistic workflows. The clearest example is the leaderboard race itself. Early baseline agents, including GPT-4-era systems, reportedly landed around 14% success on the benchmark. That result was sobering. It showed that strong language models still struggled badly once they had to navigate forms, menus, and long action chains on real web apps.

From there, research teams used WebArena to build better agent architectures. OpAgent, a modular framework with reinforcement-learning fine-tuning, reportedly reached 71.6% success on the WebArena leaderboard as of January 2026. Meka reported 72.7% on a cleaned 651-task subset. Those are not just benchmark wins. They show how teams are using WebArena to iterate on memory, planning, tool use, and action policies, then measuring whether those changes survive contact with realistic websites.

Open-source agent builders have used WebArena in a different way, as a proving ground for architectures that can be reproduced outside major labs. AgentWorkflowMemory, described in the research as the top open-source agent on the leaderboard, achieved 35.5% success by emphasizing memory across episodes. That tells a useful story for smaller teams: WebArena is not just for frontier model companies. It is also where open-source projects test whether design choices like memory systems or API access actually move the needle.

The benchmark has also become the foundation for adjacent evaluation projects. VisualWebArena uses the same spirit of realistic web tasks but adds screenshots and visual grounding to test vision-language agents. ST-WebAgentBench builds on it to measure safety and policy compliance, and reported notable policy violation rates in systems like WebVoyager, including cited risk ratios of 17.6% for user consent and 22.1% for strict execution. SecureWebArena extends the idea again, using 330 adversarial scenarios to study whether agents can be manipulated through deceptive interfaces. In practice, that means teams are not just building agents with WebArena. They are building the entire evaluation stack around it.

Strengths and Weaknesses

Strengths:

  • WebArena solved a real research bottleneck. Before it, teams had to choose between toy browser tasks and unstable live websites. WebArena gave them realistic apps with reproducible conditions, which is a big reason it became a default benchmark in papers and leaderboards.

  • It captures more of the messiness of web work than older benchmarks like MiniWoB++. MiniWoB++ is still useful for atomic interactions, but WebArena asks agents to work through longer workflows across shopping, forums, GitLab, and wiki systems. For visitors comparing benchmarks, this is the difference between testing reflexes and testing judgment.

  • The ecosystem around it is unusually strong. BrowserGym supports WebArena alongside benchmarks like WorkArena and MiniWoB++, and follow-on projects such as VisualWebArena and WebArena-Verified build on the same core idea. That gives researchers a shared reference point instead of fragmented one-off environments.

  • It has shown real progress over time. Moving from roughly 14% early success rates to 70%+ in top systems gives teams a way to see whether the field is advancing. Benchmarks matter more when they are hard enough to expose failure, but stable enough to show improvement.

Weaknesses:

  • The benchmark is realistic, but still narrow. Four main site types cannot represent the full web. If your agent needs to handle finance portals, airline booking, healthcare forms, or weird internal enterprise tools, WebArena gives signal, but not certainty.

  • Some tasks appear to be infeasible or poorly specified. Follow-up analyses cited here note that about 22% of the original 812 tasks were found to be infeasible, in some cases even for humans. That is a serious issue for anyone treating raw benchmark scores as ground truth.

  • Validation can be too literal. Researchers reported cases where agents did something a human would accept, but still failed because the checker expected exact strings or a specific output format. That means a low score can reflect benchmark rigidity as much as agent weakness.

  • Setup is not lightweight. Running multiple Dockerized web apps with enough storage and proper URL configuration is a real infrastructure task. Compared with a smaller benchmark you can pip install and run locally in minutes, WebArena asks for more engineering patience.

  • It does not answer safety or production-readiness on its own. That is why projects like ST-WebAgentBench and SecureWebArena had to exist. An agent that scores well on WebArena may still violate user intent, mishandle risky actions, or fall apart on live websites.

Pricing

  • Open source: $0
  • Self-hosted infrastructure: variable
  • AWS deployment: variable, based on instance and storage

WebArena itself is open source, so there is no software license fee. The real cost is infrastructure and setup time. The recommended AWS deployment in the documentation uses a pre-configured AMI on an EC2 instance, with a t3a.xlarge and 1000 GB of EBS storage cited as a typical setup. That means your spend depends on how long you keep the environment running and how much storage you provision.
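
For teams that script their infrastructure, a rough boto3 sketch of the instance described above might look like this. The AMI id, key pair, and region are placeholders; substitute the pre-configured WebArena AMI and your own account settings.

```python
# Rough sketch of provisioning the kind of EC2 instance the docs describe
# (t3a.xlarge with a 1000 GB EBS volume) using boto3. AMI id, key pair,
# and region are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an example

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder for the WebArena AMI
    InstanceType="t3a.xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",             # placeholder key pair
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 1000, "VolumeType": "gp3"},  # 1000 GB storage
    }],
)
print(response["Instances"][0]["InstanceId"])
```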

For many teams, the hidden cost is not cloud compute but maintenance. You are running several full web apps, managing Docker containers, configuring URLs correctly, and resetting state between runs. Compared with lighter benchmarks, WebArena can be expensive in engineering hours even when the software is free.

If you already use BrowserGym or have an internal evaluation cluster, the economics look better because WebArena becomes one benchmark among many in the same workflow. If you just want a quick sanity check on browser agents, a simpler benchmark may be cheaper to operate.

Alternatives

BrowserGym

BrowserGym is often the most practical alternative if you want a unified framework instead of a single benchmark. Built by ServiceNow, it wraps environments like WebArena, WorkArena, and MiniWoB++ under one interface. Someone might choose BrowserGym over raw WebArena because it reduces integration work and gives them a standard API, screenshots, accessibility trees, and higher-level tooling. If your goal is benchmarking across environments, BrowserGym is often the better starting point. If your goal is specifically to understand the benchmark that shaped modern web-agent evaluation, WebArena remains the reference.

WorkArena

WorkArena focuses on enterprise workflows inside ServiceNow. It is narrower than WebArena in domain coverage, but often harder in day-to-day workflow complexity. Teams building agents for internal business software may prefer WorkArena because it reflects the forms, tables, search flows, and operational tasks knowledge workers actually face. WebArena is broader. WorkArena is more specialized.

MiniWoB++

MiniWoB++ is the classic lightweight benchmark for browser interaction. It is much simpler, with short tasks like clicking buttons or filling tiny forms. Researchers still use it for fast iteration and debugging, especially early in development. Someone would choose MiniWoB++ when they need speed and isolation. They would choose WebArena when they need realism and want to know if the agent can survive longer workflows.

VisualWebArena

VisualWebArena is the right comparison if your agent depends on screenshots, visual cues, or multimodal reasoning. It extends the WebArena idea into vision-language evaluation. If your model reads HTML well but struggles with visual layout, or if your production setting depends heavily on rendered pages, VisualWebArena may tell you more than the original benchmark.

WebShop

WebShop is a better fit for teams focused specifically on shopping and product selection. It asks agents to search catalogs and match user preferences, rather than navigate several different web domains. If your product is an e-commerce copilot or shopping assistant, WebShop can be more directly relevant. If you need broader browser competence, WebArena covers more ground.

OSWorld and ScreenSpot

These tools move beyond the browser into desktop and operating system interaction. A team choosing them is usually asking a different question: can the agent operate a whole computer, not just websites? WebArena is still the better choice when browser automation is the core problem.

FAQ

What is WebArena used for?

WebArena is used to evaluate and train AI agents that interact with websites. Most teams use it as a benchmark to measure browser-agent performance, not as an end-user automation product.

Who created WebArena?

It was introduced by researchers led by Shuyan Zhou at Carnegie Mellon University, with collaborators who focused on realistic and reproducible web-agent evaluation.

Is WebArena a product I can use to automate my browser?

Not really. It is an evaluation environment for research and development. If you want production browser automation, you would usually build with other tools and test against WebArena.

How do I get started?

Most teams start by deploying the Dockerized environment or using the recommended AWS AMI from the project docs, then connecting an agent through Playwright, Selenium, or BrowserGym. If you are new to browser agents, BrowserGym often makes the first run easier.
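
Once the sites are up, a quick way to confirm they are reachable before wiring in an agent is a plain Playwright script. The URL below is a placeholder for wherever your deployment exposes the shopping site.

```python
# Minimal Playwright sketch for poking at a locally hosted benchmark site
# outside of any agent framework. The URL is a placeholder; substitute the
# address of your own deployment.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("http://localhost:7770")  # placeholder shopping-site URL
    print(page.title())                  # quick check that the app is up
    browser.close()
```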

How long does it take to set up?

If you use the pre-configured AWS image, setup can be fairly quick. Manual Docker deployment takes longer because URL configuration, storage, and service coordination all need to be right.

Is WebArena free?

Yes. The software itself is open source; you still pay for cloud compute, storage, and the engineering time needed to run it.

What websites are included in WebArena?

The main environment includes a Magento shopping site, a Postmill forum, a GitLab instance, and a MediaWiki site. Some setups also include tools like maps and external knowledge resources.

How hard is WebArena for current agents?

Hard enough that early GPT-4-level baselines were around 14% success in the cited research. Top systems later pushed past 70%, but humans were still reported at about 78%, so the gap has not disappeared.

Is WebArena good for multimodal agents?

Partly. The original benchmark supports different observation modes, but VisualWebArena is the more direct choice if you want to test screenshot-based or vision-language agents.

What are the biggest limitations?

Coverage is limited to a handful of website types, setup can be heavy, and some tasks or validators have been criticized as too strict or even infeasible. It is useful, but not perfect.

Can I trust leaderboard scores?

They are useful, but they need context. Different task subsets, validation quirks, and environment versions can affect results, which is why projects like WebArena-Verified exist.

Should I use WebArena or WorkArena?

Use WebArena if you want broader web-agent evaluation across several site types. Use WorkArena if your main concern is enterprise workflow automation in ServiceNow-style environments.
