SWE-bench
SWE-bench tests AI agents on real GitHub issues. Includes Verified, Lite, Multilingual, and Multimodal variants with public leaderboards.
Reviewed by Mathijs Bronsdijk · Updated Apr 13, 2026

What is SWE-bench?
SWE-bench is a benchmark for evaluating how well AI systems can solve real-world software engineering tasks. It presents models and agents with actual GitHub issues drawn from open-source repositories, then measures whether the system can produce a working fix. Researchers, AI labs, and developers use it to compare coding agents against a shared, standardized set of problems. The benchmark has expanded into a family of variants, including SWE-bench Verified, Lite, Multilingual, and Multimodal, each targeting different evaluation conditions.
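To make the task format concrete, here is a minimal sketch of loading one benchmark instance. It assumes the dataset is published on Hugging Face under princeton-nlp/SWE-bench and that the field names shown match the released schema; both should be verified against the official documentation.

```python
# Sketch: inspect a single SWE-bench task instance.
# Assumes the `datasets` library and the princeton-nlp/SWE-bench dataset on Hugging Face;
# field names reflect the published schema but may differ between dataset versions.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
example = ds[0]

print(example["instance_id"])        # task identifier derived from the repo and pull request
print(example["repo"])               # open-source repository the issue comes from
print(example["problem_statement"])  # the GitHub issue text the agent must resolve
print(example["patch"])              # the gold patch that actually fixed the issue
```

Each instance pairs an issue description with hidden tests, so a system is scored on whether its generated patch makes the failing tests pass.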
Key Features
- SWE-bench Original: The full benchmark dataset of real GitHub issues used to evaluate software engineering agents on end-to-end code repair tasks.
- SWE-bench Verified: A human-filtered subset of 500 instances, designed to provide higher-confidence evaluation results. All models on its leaderboard are evaluated using the same scaffold, mini-SWE-agent.
- SWE-bench Lite: A smaller, focused subset intended for faster or more targeted evaluations.
- SWE-bench Multilingual: A variant that extends evaluation beyond English-language codebases and issues.
- SWE-bench Multimodal: A newer, challenging variant where software issues are described using images rather than text alone.
- Public Leaderboards: Official leaderboards for each benchmark variant allow direct comparison across agents and models, with filters for open scaffold, open weights, and evaluation tags.
- SWE-bench CLI: A command-line tool for running evaluations locally against the benchmark datasets; a predictions-file sketch follows this list.
- SWE-smith and SWE-ReX: Related tools in the SWE-bench family supporting training data generation and execution environments for agents.
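As a rough illustration of the input that local evaluation expects, the sketch below writes a predictions file. The JSON keys follow the open-source swebench package's documented format and may change between versions; the instance ID is just an example task.

```python
# Sketch: write a predictions file in the format the swebench evaluation harness expects.
# Key names are taken from the open-source swebench package and should be checked
# against the current documentation before use.
import json

predictions = [
    {
        "instance_id": "astropy__astropy-12907",  # example benchmark task being answered
        "model_name_or_path": "my-coding-agent",  # label identifying the submitting system
        "model_patch": "diff --git a/... b/...",  # unified diff the agent proposes as the fix
    }
]

with open("preds.json", "w") as f:
    json.dump(predictions, f, indent=2)
```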
Use Cases
- AI research teams: Researchers at AI labs use SWE-bench to measure how their coding models or agents perform against a consistent, reproducible set of real engineering problems.
- Open-source contributors and agent developers: Developers building autonomous coding agents submit results to the public leaderboard to demonstrate capability relative to other systems.
- Model evaluation and comparison: Organizations evaluating which coding agent or model to adopt can consult the leaderboards to compare performance across many systems under identical conditions.
- Benchmark-driven training: Teams developing new models reference SWE-bench results to guide training decisions, using variants like SWE-bench Lite for faster iteration cycles.
Strengths and Weaknesses
Strengths:
- Covers real GitHub issues rather than synthetic problems, making results more meaningful for practical software engineering tasks.
- Multiple variants (Verified, Lite, Multilingual, Multimodal) allow evaluation under different constraints and conditions.
- Public leaderboards with consistent evaluation setups enable direct, apples-to-apples comparisons across agents.
- The benchmark has an associated academic paper and formal submission process, supporting reproducibility.
Weaknesses:
- Little structured user-sentiment data is publicly available, so specific community complaints cannot be listed here.
- Evaluation can be resource-intensive depending on the subset and agent being tested.
Getting Started
SWE-bench is openly accessible. The benchmark datasets, leaderboards, and documentation are available at swebench.com. The associated paper is published on OpenReview. Developers can submit results through the official submission page. The SWE-bench CLI provides tooling for local evaluation runs. Related tools such as mini-SWE-agent, SWE-smith, and SWE-ReX are available as separate projects within the SWE-bench family. Pricing is not publicly listed, and the core benchmark resources appear to be freely available.
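As a minimal sketch of a local evaluation run, assuming the open-source swebench package is installed (for example from PyPI) and Docker is available: the module path and flags below mirror its documented harness entry point and should be checked against the current README.

```python
# Sketch: score preds.json (see the predictions-file example above) against SWE-bench Lite.
# Assumes `pip install swebench` and a running Docker daemon; flag names follow the
# package's documented harness and may differ between versions.
import subprocess

subprocess.run([
    "python", "-m", "swebench.harness.run_evaluation",
    "--dataset_name", "princeton-nlp/SWE-bench_Lite",  # which benchmark variant to score against
    "--predictions_path", "preds.json",                 # model patches in the predictions format
    "--max_workers", "4",                               # number of parallel evaluation containers
    "--run_id", "local-check",                          # label for this evaluation run's output
], check=True)
```

The harness applies each patch in an isolated environment, runs the repository's tests, and reports which instances were resolved.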
FAQ
What does "SWE-bench" mean?
SWE-bench stands for Software Engineering benchmark. It is a benchmark designed to evaluate how well AI systems can solve real-world software engineering tasks drawn from actual GitHub issues.
When was SWE-bench released?
SWE-bench was formally introduced in an academic paper first released in 2023. It has since expanded into a family of variants including SWE-bench Verified, Lite, Multilingual, and Multimodal.
What is SWE-bench used for?
SWE-bench is used by researchers, AI labs, and developers to compare coding agents against a shared, standardized set of real GitHub issues from open-source repositories. Organizations evaluating which coding agent or model to adopt can consult the public leaderboards to compare performance across many systems under identical conditions.
Is SWE-bench reliable?
SWE-bench Verified is a human-filtered subset of 500 instances designed to provide higher-confidence evaluation results. The benchmark uses consistent evaluation methods and a formal submission process to support reproducibility and direct comparisons across agents.
How do you test against SWE-bench?
Evaluations can be run locally using the SWE-bench CLI, a command-line tool for running evaluations against the benchmark datasets. Results can also be submitted to the public leaderboards for comparison against other systems.
Is SWE-bench considered a good benchmark?
SWE-bench covers real GitHub issues rather than synthetic problems, making results more meaningful for practical software engineering tasks. Public leaderboards with consistent evaluation methods enable direct, apples-to-apples comparisons across agents.
What variants of SWE-bench exist?
The SWE-bench family includes SWE-bench Original, Verified, Lite, Multilingual, and Multimodal. Each variant targets different evaluation conditions, such as faster iteration, non-English codebases, or image-based issue descriptions.
What is SWE-bench Lite?
SWE-bench Lite is a smaller, focused subset of the benchmark intended for faster or more targeted evaluations. Teams developing new models use it to guide training decisions with faster iteration cycles.
What is SWE-bench Multimodal?
SWE-bench Multimodal is a newer, challenging variant where software issues are described using images rather than text alone. It extends the benchmark beyond text-based evaluation conditions.
What is SWE-bench Multilingual?
SWE-bench Multilingual is a variant that extends evaluation beyond English-language codebases and issues. It allows assessment of AI systems on software engineering tasks across multiple languages.
Why is SWE-bench Verified no longer the standard evaluation?
SWE-bench Verified is a human-filtered subset of 500 instances used to provide higher-confidence results. All models on its leaderboard are now evaluated with the same scaffold, mini-SWE-agent, a standardization that may have prompted the shift in how evaluations are structured.
Who uses SWE-bench?
SWE-bench is used by AI research teams, open-source contributors, agent developers, and organizations evaluating coding tools. Developers building autonomous coding agents submit results to the public leaderboard to demonstrate capability relative to other systems.
What are SWE-smith and SWE-ReX?
SWE-smith and SWE-ReX are related tools in the SWE-bench family. SWE-smith supports training data generation and SWE-ReX provides execution environments for agents.