SWE-bench

What is SWE-bench?

SWE-bench is a benchmark for ML researchers and AI engineers that evaluates language models on real GitHub issues by checking whether model-generated patches make failing tests pass. It includes official leaderboards, tools to compare and analyze results in detail, and the Verified, Multilingual, and Multimodal datasets, plus Docker-based reproducible evaluation and the related mini-SWE-agent, SWE-agent, and SWE-smith projects. The project acknowledges support from OpenAI, Anthropic, AWS, Modal, Open Philanthropy, and Andreessen Horowitz.

Last verified: May 17, 2026

At a glance

Best for: ML researchers who need reproducible software-engineering benchmarks.

What does SWE-bench do?

SWE-bench evaluates language models on real GitHub issues by pairing a codebase with an issue, then checking whether the model-generated patch makes the failing tests pass. The leaderboards let you compare results across SWE-bench, Verified, Lite, Multilingual, and Multimodal, while the viewer helps analyze results in detail. The benchmark uses a Docker-based harness for reproducible evaluation, and the Verified track runs all models through the same mini-SWE-agent setup for apples-to-apples comparison. The full benchmark covers 2,294 issue-commit pairs from 12 popular Python repositories. There is also a human-filtered Verified subset of 500 instances, a 300-task Multilingual set across 9 programming languages, and a Multimodal set of 517 issues with visual elements. The site also points to related ecosystem projects like SWE-smith and mini-SWE-agent, and acknowledges support from Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.
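
The pass/fail check described above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the function name `is_resolved`, the argument names, and the `"PASSED"` status string are assumptions made for this example. The sketch also assumes that tests which passed before the patch must keep passing, a regression check the description above does not spell out.

```python
from typing import Iterable, Mapping

PASSED = "PASSED"  # hypothetical status label used only in this sketch


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Mapping[str, str],
) -> bool:
    """Return True if the patched repository passes every required test.

    fail_to_pass: tests that failed before the patch and should now pass.
    pass_to_pass: tests that passed before and must not regress (assumed).
    results: test name -> status observed after applying the model patch.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results counts as not passed.
    return all(results.get(name) == PASSED for name in required)


# Hypothetical test names, for illustration only.
results = {"test_parse_empty": "PASSED", "test_parse_basic": "PASSED"}
print(is_resolved(["test_parse_empty"], ["test_parse_basic"], results))
```

Under this sketch an instance is either resolved or not; there is no partial credit for passing some of the required tests.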

Why use SWE-bench?

It evaluates on real GitHub issues, so results reflect patching work on actual software problems rather than synthetic prompts.
The Docker-based harness supports reproducible runs, which makes leaderboard comparisons easier to trust and repeat.
Verified uses a human-filtered 500-instance subset and the same mini-SWE-agent harness for consistent apples-to-apples comparisons.
The benchmark spans full, lite, multilingual, and multimodal tracks, letting teams match evaluation depth to compute budget and task type.
The viewer and compare-results workflow make it easier to inspect resolved rates and drill into model behavior.

Who is SWE-bench for?

ML researchers who need a reproducible benchmark for code-editing models.
AI engineers who want to compare agents on real GitHub issues.
Evaluation teams who need standardized results across multiple benchmark variants.
Research groups studying multimodal or multilingual software-engineering tasks.

What are SWE-bench's key features?

Official Leaderboards

Track model rankings on official SWE-bench leaderboards, including SWE-bench Verified scores up to 74% for direct comparison across submissions.

Compare results

Compare runs side by side across the 500 Verified instances or the 300-task Lite and Multilingual sets, so you can judge performance on the same benchmark set.
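
A side-by-side comparison of two runs on the same instance set comes down to set arithmetic over resolved instance IDs. The sketch below is an illustration under that assumption; `compare_runs` and the instance IDs are hypothetical and are not part of the SWE-bench site or tooling.

```python
def compare_runs(resolved_a: set[str], resolved_b: set[str]) -> dict[str, list[str]]:
    """Split resolved instance IDs into shared and run-specific groups."""
    return {
        "both": sorted(resolved_a & resolved_b),
        "only_a": sorted(resolved_a - resolved_b),
        "only_b": sorted(resolved_b - resolved_a),
    }


# Hypothetical instance IDs, for illustration only.
run_a = {"example__repo-1", "example__repo-2", "example__repo-3"}
run_b = {"example__repo-2", "example__repo-3", "example__repo-4"}

for group, ids in compare_runs(run_a, run_b).items():
    print(group, ids)
```

The comparison is only meaningful when both runs were evaluated on the same set of instances.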

Analyze Results in Detail

Inspect detailed outcomes with the % Resolved metric and issue-commit pairs, helping teams understand where a model succeeds or fails.
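
As a rough sketch, % Resolved is the share of benchmark instances whose patch passes the tests. The helper below assumes the denominator is the full instance set, so that instances with no submitted patch count as unresolved; the function name and the example counts are illustrative, not taken from any leaderboard entry.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Return the percentage of instances resolved out of the full set."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return 100.0 * resolved / total


# Illustrative counts only: 370 resolved out of a 500-instance set.
print(round(percent_resolved(370, 500), 1))  # 74.0
```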

Compare models

Evaluate different models on the same benchmark, including the SWE-Llama 7B and 13B variants, to see which setup performs best.

Verified

Use the human-filtered 500-instance Verified subset to test against engineer-confirmed solvable problems, reducing noise in evaluation results.

Multilingual

Benchmark across 300 tasks in 9 programming languages, giving teams a broader view of code reasoning beyond Python-only tests.

Multimodal

Run evaluations on issues with visual elements, including about 60 images, for tasks that require both code and image context.

Standardized evaluation environment

Evaluate in a standardized environment with Docker and Modal support, which helps keep runs consistent and reproducible across machines.
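
For orientation, here is a sketch of how a harness run might be assembled, locally with Docker or remotely on Modal. The module path and flags follow the SWE-bench README as understood at the time of writing and should be checked against the installed version. The helper `build_eval_command`, the predictions path, and the run ID are hypothetical. The code only builds and prints the command; it does not execute it.

```python
import shlex


def build_eval_command(
    dataset_name: str,
    predictions_path: str,
    run_id: str,
    max_workers: int = 4,
    use_modal: bool = False,
) -> list[str]:
    """Build (but do not run) an evaluation command for the harness."""
    cmd = [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", predictions_path,
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]
    if use_modal:
        # Assumed flag for running on Modal instead of local Docker.
        cmd += ["--modal", "true"]
    return cmd


# Hypothetical predictions file and run ID, for illustration only.
cmd = build_eval_command("princeton-nlp/SWE-bench_Lite", "preds.jsonl", "demo-run")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` would start the evaluation. A local run needs a working Docker installation.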

What does SWE-bench integrate with?

Slack
GitHub
YouTube
X
mini-SWE-agent
SWE-agent
SWE-smith
OpenAI
Hugging Face
Docker
Modal

What are SWE-bench's use cases?

ML researchers benchmark code edits

ML researchers who need a reproducible benchmark for code-editing models use SWE-bench to test agents on real-world GitHub issues and track outcomes on the official leaderboards. They can compare results across runs and across models to see which approach actually resolves more tasks, not just which one looks good in a demo.

AI engineers compare agents

AI engineers who want to compare agents on real GitHub issues use SWE-bench to run the same workload through a standardized evaluation environment. With reproducible evaluation and the detailed results viewer, they can pinpoint where an agent succeeds, fails, or regresses before shipping it into a larger workflow.

Evaluation teams standardize reporting

Evaluation teams who need standardized results across multiple benchmark variants use SWE-bench to keep scoring consistent across datasets. They rely on Verified and the official leaderboards to produce comparable reports that stakeholders can trust when reviewing model performance over time.

Multilingual and multimodal research

Research groups studying multimodal or multilingual software-engineering tasks use SWE-bench to evaluate models on the Multilingual and Multimodal benchmarks. Leaderboard entries marked Open Scaffold or Open Weights make it easier to reproduce experiments and compare how different systems handle diverse task types.

How does SWE-bench work?

Start with a benchmark variant such as Verified, Lite, or Full, then load the corresponding dataset into the standardized evaluation environment so every run begins from the same baseline.
Connect your agent or model through a scaffold such as SWE-agent or mini-SWE-agent, using GitHub, Docker, or Modal as needed, to attempt tasks drawn from real-world GitHub issues.
Run the evaluation and let SWE-bench score outcomes across the selected tasks, including Multilingual or Multimodal sets when your research needs broader coverage.
Review the comparison view and the detailed results viewer to inspect failures, compare models, and see which instances were resolved and which were not.
Publish or share the outcome on the official leaderboards, then rerun the same reproducible evaluation so your team can track progress across datasets over time.
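
Between connecting a model and running the evaluation, the model's patches have to be handed to the harness in some file format. The sketch below assumes a JSON Lines predictions file with the fields `instance_id`, `model_name_or_path`, and `model_patch`, as described in the SWE-bench README at the time of writing. The helper `write_predictions`, the instance ID, the patch text, and the model name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Mapping


def write_predictions(path: Path, patches: Mapping[str, str], model_name: str) -> int:
    """Write one JSON object per line and return the number of records."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count


# Hypothetical instance ID and patch text, for illustration only.
patches = {"example__repo-123": "diff --git a/app.py b/app.py\n"}
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "preds.jsonl"
    written = write_predictions(out, patches, "my-model")
    print(written, out.read_text(encoding="utf-8").strip())
```

Keeping one record per instance makes it easy to evaluate a subset of instances first and extend the file later.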

Frequently asked questions

What is SWE-bench?

SWE-bench is a benchmark for ML researchers and AI engineers that evaluates language models on real GitHub issues by checking whether model-generated patches make failing tests pass. It includes official leaderboards and tools to compare and analyze results in detail, and it supports Docker-based reproducible evaluation alongside the related mini-SWE-agent, SWE-agent, and SWE-smith projects. Its acknowledged supporters include OpenAI, Anthropic, AWS, and Modal.

What is SWE-bench used for? Who is it for?

SWE-bench is used to rank models on official leaderboards, compare results across runs, and analyze results in detail. It's built for ML researchers, AI engineers, and evaluation teams.

Does SWE-bench have an API and what does it integrate with?

SWE-bench doesn't publish a public API. It works with GitHub, Hugging Face, Docker, Modal, OpenAI, mini-SWE-agent, SWE-agent, and SWE-smith, and the project is also present on Slack, YouTube, and X.

Filed under: Research Agents, open-source, self-hosted
