WebArena
What is WebArena?
WebArena is a benchmark environment for AI researchers and ML engineers in which browser agents practice long-horizon web work through realistic front-end tasks. It combines generated websites, annotated programs, and programmatic verifiers for functional correctness. The project spans the original WebArena benchmark plus WebArena-Infinity, VisualWebArena, and TheAgentCompany. TheAgentCompany uses Gmail, Superhuman, GitLab, Xero, and PayPal. The project is self-hostable, and no public pricing tiers are listed.
At a glance
- WebArena is best for researchers who need realistic, verifiable web-agent benchmarks.
What does WebArena do?
WebArena runs a set of realistic web environments and verifiable tasks that let browser agents practice long-horizon web work through the front end. It combines generated websites, annotated programs, and programmatic verifiers so tasks can be checked for functional correctness instead of judged by hand. The project spans multiple benchmark variants, including WebArena, WebArena-Infinity, VisualWebArena, and TheAgentCompany, and the environments can embed tools and knowledge resources as separate websites to mirror real workflows. At scale, the Infinity pipeline can generate each environment within ten hours at a cost under $100, and the release includes ten environments, 1,260 verifiable tasks, and 2,070 successful trajectories. The environments are designed for fast resets, easy replication, and scalable deployment, which makes them useful for RL training and evaluation. Strong browser-use agents still land at 69.3% average success, while vision-based approaches in the cited setup score 45.9% and 49.1%.
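To make "checked for functional correctness instead of judged by hand" concrete, here is a minimal sketch of a programmatic verifier. It is not WebArena's actual verifier interface: `AppState`, `Invoice`, and `verify_invoice_sent` are hypothetical names chosen for illustration. The point is that the check inspects the final application state, not the agent's own description of what it did.

```python
from dataclasses import dataclass, field


@dataclass
class Invoice:
    customer: str
    amount_cents: int
    status: str = "draft"


@dataclass
class AppState:
    """Hypothetical snapshot of a site's backend state after an agent run."""
    invoices: list[Invoice] = field(default_factory=list)


def verify_invoice_sent(state: AppState, customer: str, amount_cents: int) -> bool:
    """Return True only if a matching invoice exists and was actually sent."""
    return any(
        inv.customer == customer
        and inv.amount_cents == amount_cents
        and inv.status == "sent"
        for inv in state.invoices
    )


# An agent that only drafted the invoice does not pass the check.
drafted = AppState([Invoice("Acme", 12_500)])
sent = AppState([Invoice("Acme", 12_500, status="sent")])
print(verify_invoice_sent(drafted, "Acme", 12_500))  # False
print(verify_invoice_sent(sent, "Acme", 12_500))     # True
```

Because the verifier is an ordinary function over state, the same check can be rerun on every trajectory without a human reviewer.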
Why use WebArena?
- It pairs realistic web workflows with programmatic verifiers, so teams can measure task completion instead of relying on subjective review.
- The environments are self-hostable, which helps teams run experiments on their own infrastructure.
- Fast resets and easy replication make repeated agent runs practical for training and regression testing.
- The Infinity pipeline can generate each environment within ten hours at a cost under $100, which lowers the cost of expanding a benchmark suite.
- Its tasks remain challenging for strong browser-use agents, so it can surface meaningful gaps in agent capability.
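As a rough planning aid, the per-environment figures quoted above (within ten hours, under $100) can be turned into upper bounds for a whole suite. This is back-of-the-envelope arithmetic, not a pricing tool; the `parallel_workers` parameter assumes that generation runs can proceed in parallel, which the source does not state.

```python
import math

HOURS_PER_ENV_MAX = 10      # "within ten hours" per environment (from the text)
COST_PER_ENV_MAX_USD = 100  # "under $100" per environment (from the text)


def generation_upper_bound(num_envs: int, parallel_workers: int = 1) -> tuple[int, int]:
    """Return (wall-clock hours, USD) upper bounds for generating num_envs environments.

    parallel_workers is an assumption: it models running several
    generation jobs at once, which the source does not confirm.
    """
    if num_envs < 0 or parallel_workers < 1:
        raise ValueError("num_envs must be >= 0 and parallel_workers >= 1")
    batches = math.ceil(num_envs / parallel_workers)
    return batches * HOURS_PER_ENV_MAX, num_envs * COST_PER_ENV_MAX_USD


print(generation_upper_bound(10))                      # (100, 1000)
print(generation_upper_bound(10, parallel_workers=5))  # (20, 1000)
print(1260 / 10)                                       # 126.0 tasks per environment on average
```

For the ten released environments, that bounds generation at 100 sequential hours and $1,000, with an average of 126 verifiable tasks per environment.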
Who is WebArena for?
- AI researchers who need benchmark environments for autonomous web agents.
- ML engineers who want verifiable tasks for browser-use evaluation.
- RL teams who need scalable environments with fast resets and replication.
- Product teams studying long-horizon agent behavior across realistic web workflows.
- Benchmark authors who need programmatic verification for task correctness.
What are WebArena's key features?
WebArena
Benchmarks autonomous web agents across ten environments with 1,260 verifiable tasks and 2,070 successful trajectories, giving teams a repeatable evaluation set.
WebArena-Infinity
Generates each environment within ten hours at a cost under $100, making it practical to create more benchmark scenarios without heavy manual setup.
VisualWebArena
Adds visual task evaluation for web agents, pairing realistic browser interactions with verifiable outcomes so teams can measure performance beyond text-only commands.
TheAgentCompany
Tests agents on workplace-style workflows using integrations like Gmail, Superhuman, GitLab, Xero, and PayPal, helping teams benchmark business automation.
Standalone, self-hostable web environment
Runs as a standalone, self-hostable web environment, which lets teams control deployment and reproduce benchmark conditions across their own infrastructure.
Annotated programs
Uses annotated programs to define tasks and checks, supporting functional correctness verification and clearer comparisons between agent runs.
Self-auditing
Supports self-auditing workflows that verify agent outputs against task rules, improving confidence in reported results and reducing manual review.
Fast resets
Provides fast resets for benchmark environments, so teams can rerun tests quickly and compare agents across many trials with less downtime.
What does WebArena integrate with?
- GitLab
- Wikipedia
- Google Maps
- Handshake
- Xero
- PayPal
- Elation EHR
- Gmail
- Superhuman
What are WebArena's use cases?
Benchmarking browser agents
AI researchers use WebArena to test autonomous web agents on realistic browser workflows, using annotated programs to measure whether an agent completes tasks correctly instead of just appearing to work. They can compare runs across ten environments and use the 69.3% average success that strong browser-use agents reach as a baseline.
Verifiable evals for ML teams
ML engineers use WebArena-Infinity to build verifiable browser-use evaluations, using functional correctness verification and standalone verifiers to score outcomes against task requirements. That makes it easier to separate genuine task completion from partial progress when evaluating models on long-horizon web actions.
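One way to separate genuine completion from partial progress is to express a task as several independent checks and pass it only when all of them hold. The sketch below illustrates that idea under assumed names (`score`, the issue fields, and the three checks are hypothetical); it does not reproduce how WebArena-Infinity scores tasks.

```python
from typing import Callable

Check = Callable[[dict], bool]


def score(final_state: dict, checks: list[Check]) -> tuple[bool, float]:
    """Return (task_passed, fraction_of_checks_met) for one trajectory."""
    if not checks:
        raise ValueError("a task needs at least one check")
    results = [check(final_state) for check in checks]
    return all(results), sum(results) / len(results)


# Hypothetical task: file an issue with a given title, assignee, and label.
checks: list[Check] = [
    lambda s: s.get("title") == "Login fails",
    lambda s: s.get("assignee") == "alice",
    lambda s: "bug" in s.get("labels", []),
]

# The agent set the title and assignee but forgot the label.
final_state = {"title": "Login fails", "assignee": "alice", "labels": []}
passed, progress = score(final_state, checks)
print(passed)              # False
print(round(progress, 2))  # 0.67
```

Reporting both values keeps the headline success rate strict while still showing how close a failed run came.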
Long-horizon studies for product teams
Product teams use VisualWebArena and TheAgentCompany to study how agents behave across realistic web workflows, using realistic natural-language commands and self-auditing to observe where plans break down. The result is clearer insight into failure modes before shipping agentic product experiences.
Scalable task generation for RL
RL teams use the standalone, self-hostable web environment with fast resets and easy replication to run many experiments without rebuilding infrastructure each time. They can generate repeatable tasks quickly, compare policies across environments, and keep training loops moving with less setup overhead.
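The value of a fast reset is that every episode starts from an identical state. A minimal sketch of that property follows, using a toy in-memory environment; `ToyEnvironment` and its methods are hypothetical stand-ins, not WebArena's reset mechanism.

```python
import copy


class ToyEnvironment:
    """Hypothetical stand-in for a self-hosted site with a resettable backend."""

    def __init__(self, initial_state: dict):
        # Keep a private deep copy so agent actions can never alter the snapshot.
        self._snapshot = copy.deepcopy(initial_state)
        self.state = copy.deepcopy(initial_state)

    def step(self, key: str, value) -> None:
        """Apply one state-changing action."""
        self.state[key] = value

    def reset(self) -> dict:
        """Restore the initial snapshot and return the fresh state."""
        self.state = copy.deepcopy(self._snapshot)
        return self.state


env = ToyEnvironment({"cart": [], "logged_in": False})
env.step("logged_in", True)
env.state["cart"].append("book")  # mutate nested data as well
print(env.reset())  # {'cart': [], 'logged_in': False}
```

Deep-copying the snapshot is the design choice that matters here: a shallow copy would let the appended cart item leak into later episodes.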
How does WebArena work?
- Choose a benchmark suite such as WebArena, WebArena-Infinity, or VisualWebArena, then load the standalone, self-hostable web environment so your agent can start interacting with realistic sites.
- Add the first task set and inspect the annotated programs to understand the expected end state, task structure, and what counts as a correct completion.
- Run your agent against the environment and use self-auditing plus standalone verifiers to check whether each trajectory satisfies the task requirements.
- Reset the environment using its fast resets, then rerun experiments to compare models, prompts, or policies under the same conditions.
- Scale the setup across more tasks and environments using easy replication and scalable deployment, so benchmark results stay consistent as your evaluation program grows.
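The steps above can be sketched as a small evaluation loop: start each task from a fresh state, let the agent act, verify the result, and report the success rate. Everything here is illustrative. `Task`, `evaluate`, `toy_agent`, and the dictionary-based state are hypothetical and stand in for a real browser agent and a real hosted environment.

```python
from dataclasses import dataclass
from typing import Callable

State = dict
Agent = Callable[[str, State], None]  # acts on the state in place, like acting on a site
Verifier = Callable[[State], bool]


@dataclass
class Task:
    instruction: str
    verifier: Verifier


def evaluate(agent: Agent, tasks: list[Task], make_state: Callable[[], State]) -> float:
    """Run each task from a fresh state and return the fraction verified as correct."""
    if not tasks:
        raise ValueError("no tasks to evaluate")
    successes = 0
    for task in tasks:
        state = make_state()  # reset: every task starts from the same conditions
        agent(task.instruction, state)
        successes += task.verifier(state)
    return successes / len(tasks)


def toy_agent(instruction: str, state: State) -> None:
    """Hypothetical agent that only knows how to star a repository."""
    if "star" in instruction:
        state["starred"] = True


tasks = [
    Task("star the repository", lambda s: s.get("starred") is True),
    Task("archive the repository", lambda s: s.get("archived") is True),
]
rate = evaluate(toy_agent, tasks, make_state=lambda: {"starred": False, "archived": False})
print(rate)  # 0.5
```

Swapping in a different agent while keeping `tasks` and `make_state` fixed is what makes two runs comparable.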
Frequently asked questions
What is WebArena used for? Who is it for?
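WebArena is used for benchmarking browser agents, building verifiable evaluations, studying long-horizon agent behavior, and generating scalable tasks for RL. It's built for AI researchers, ML engineers, RL teams, product teams, and benchmark authors.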
Does WebArena have an API and what does it integrate with?
WebArena doesn't publish a public API. It integrates with GitLab, Wikipedia, Google Maps, Handshake, Xero, PayPal, Elation EHR, Gmail, and Superhuman.
