SWE-bench

What is SWE-bench?

SWE-bench is a benchmark for ML researchers and AI engineers that evaluates language models on real GitHub issues by checking whether model-generated patches make failing tests pass. It offers official leaderboards, tools to compare and analyze results in detail, and Verified, Multilingual, and Multimodal variants, plus Docker-based reproducible evaluation and the related mini-SWE-agent, SWE-agent, and SWE-smith projects. The project acknowledges support from OpenAI, Anthropic, AWS, Modal, Open Philanthropy, and Andreessen Horowitz.

At a glance

Best for
SWE-bench is best for ML researchers who need reproducible software-engineering benchmarks.

What does SWE-bench do?

SWE-bench evaluates language models on real GitHub issues by pairing a codebase with an issue, then checking whether the model-generated patch makes the failing tests pass. Its leaderboard flow lets you compare results across SWE-bench, Verified, Lite, Multilingual, and Multimodal, while the viewer helps analyze results in detail. The benchmark uses a Docker-based harness for reproducible evaluation, and the Verified track runs all models through the same mini-SWE-agent setup for apples-to-apples comparison. At scale, SWE-bench covers 2,294 issue-commit pairs from 12 popular Python repositories, plus a human-filtered Verified subset of 500 instances, a 300-task Multilingual set across 9 programming languages, and 517 multimodal issues with visual elements. The site also points to related ecosystem projects like SWE-smith and mini-SWE-agent, and acknowledges support from Open Philanthropy, AWS, Modal, Andreessen Horowitz, OpenAI, and Anthropic.
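The pass/fail check described above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the function name, the status strings, and the test names are hypothetical, and the second list (tests that should keep passing after the patch) is an assumption that goes beyond what the description above states.

```python
from typing import Dict, Iterable


def is_resolved(
    test_status: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str] = (),
) -> bool:
    """Return True only if every required test passed after applying the patch."""
    must_fix = list(fail_to_pass)    # tests that failed before the patch
    must_keep = list(pass_to_pass)   # assumed: tests that must not regress
    if not must_fix:
        # With no target tests there is nothing to resolve.
        return False
    return all(test_status.get(name) == "PASSED" for name in must_fix + must_keep)


# Hypothetical post-patch test results for a single instance.
status = {
    "test_parse_empty_input": "PASSED",
    "test_parse_unicode": "FAILED",
    "test_existing_behavior": "PASSED",
}

print(is_resolved(status, ["test_parse_empty_input"], ["test_existing_behavior"]))  # True
print(is_resolved(status, ["test_parse_empty_input", "test_parse_unicode"]))        # False
```

A missing test name counts as a failure here, which keeps the check conservative when a patch breaks test collection.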

Why use SWE-bench?

  • It evaluates on real GitHub issues, so results reflect patching work on actual software problems rather than synthetic prompts.
  • The Docker-based harness supports reproducible runs, which makes leaderboard comparisons easier to trust and repeat.
  • Verified uses a human-filtered 500-instance subset and the same mini-SWE-agent harness for consistent apples-to-apples comparisons.
  • The benchmark spans full, lite, multilingual, and multimodal tracks, letting teams match evaluation depth to compute budget and task type.
  • The viewer and compare-results workflow make it easier to inspect resolved rates and drill into model behavior.

Who is SWE-bench for?

  • ML researchers who need a reproducible benchmark for code-editing models.
  • AI engineers who want to compare agents on real GitHub issues.
  • Evaluation teams who need standardized results across multiple benchmark variants.
  • Research groups studying multimodal or multilingual software-engineering tasks.

What are SWE-bench's key features?

Official Leaderboards

Track model rankings on official SWE-bench leaderboards, including SWE-bench Verified scores up to 74% for direct comparison across submissions.

Compare results

Compare runs side by side across the 500 Verified instances or the 300-task Lite and Multilingual sets, so teams can judge performance on the same benchmark set.

Analyze Results in Detail

Inspect detailed outcomes with the % Resolved metric and issue-commit pairs, helping teams understand where a model succeeds or fails.
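As a rough sketch of how a % Resolved figure can be computed, the snippet below divides the number of resolved instances by the total number of instances in the benchmark set. The function name and the instance IDs are hypothetical, and counting instances without a prediction as unresolved is an assumption.

```python
from typing import Iterable


def percent_resolved(resolved_ids: Iterable[str], all_ids: Iterable[str]) -> float:
    """Share of benchmark instances that were resolved, as a percentage."""
    total = set(all_ids)
    if not total:
        raise ValueError("benchmark has no instances")
    # Ignore any resolved ID that is not part of this benchmark set.
    solved = set(resolved_ids) & total
    return 100.0 * len(solved) / len(total)


# Hypothetical instance IDs for illustration only.
benchmark_ids = ["demo__repo-1", "demo__repo-2", "demo__repo-3", "demo__repo-4"]
resolved = ["demo__repo-1", "demo__repo-3", "demo__repo-4"]

score = percent_resolved(resolved, benchmark_ids)
print(f"{score:.1f}% resolved")  # 75.0% resolved
```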

Compare models

Evaluate different models on the same benchmark, including SWE-Llama 7b and 13b variants, to see which setup performs best.

Verified

Use the human-filtered 500-instance Verified subset to test against engineer-confirmed solvable problems, reducing noise in evaluation results.

Multilingual

Benchmark across 300 tasks in 9 programming languages, giving teams a broader view of code reasoning beyond Python-only tests.

Multimodal

Run evaluations on issues with visual elements, including about 60 images, for tasks that require both code and image context.

Standardized evaluation environment

Evaluate in a standardized environment with Docker and Modal support, which helps keep runs consistent and reproducible across machines.
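A standardized environment also needs a standardized input. The sketch below writes model patches to a JSON Lines file, one prediction per line. The key names (`instance_id`, `model_name_or_path`, `model_patch`) are an assumption about what the harness expects and should be checked against the official documentation; the instance ID, model name, and patch are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_predictions(path: Path, predictions: List[Dict[str, str]]) -> None:
    """Write one JSON object per line (JSON Lines)."""
    with path.open("w", encoding="utf-8") as handle:
        for prediction in predictions:
            handle.write(json.dumps(prediction) + "\n")


def read_predictions(path: Path) -> List[Dict[str, str]]:
    """Read the file back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical prediction; key names are assumed, not confirmed by this page.
predictions = [
    {
        "instance_id": "demo__repo-1",
        "model_name_or_path": "my-model",
        "model_patch": "--- a/pkg/mod.py\n+++ b/pkg/mod.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n",
    }
]

out_path = Path(tempfile.mkdtemp()) / "predictions.jsonl"
write_predictions(out_path, predictions)
loaded = read_predictions(out_path)
print(len(loaded), loaded[0]["instance_id"])
```

Keeping the patch as a plain unified diff string means the same file can be evaluated on any machine that runs the containerized harness.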

What does SWE-bench integrate with?

  • Slack
  • GitHub
  • YouTube
  • X
  • Hugging Face
  • mini-SWE-agent
  • SWE-agent
  • SWE-smith
  • OpenAI
  • Docker
  • Modal

What are SWE-bench's use cases?

ML researchers benchmark code edits

ML researchers who need a reproducible benchmark for code-editing models use SWE-bench to test agents on real-world GitHub issues and track outcomes on the official leaderboards. They can compare results across runs and across models to see which approach actually resolves more tasks, not just which one looks good in a demo.

AI engineers compare agents

AI engineers who want to compare agents on real GitHub issues use SWE-bench to run the same workload through a standardized evaluation environment. With reproducible evaluation and detailed result analysis, they can pinpoint where an agent succeeds, fails, or regresses before shipping it into a larger workflow.

Evaluation teams standardize reporting

Evaluation teams who need standardized results across multiple benchmark variants use SWE-bench to keep scoring consistent across datasets. They rely on the Verified subset and the official leaderboards to produce comparable reports that stakeholders can trust when reviewing model performance over time.

Multilingual software research studies

Research groups studying multimodal or multilingual software-engineering tasks use SWE-bench to evaluate models on the Multilingual and Multimodal benchmarks. Open scaffolds and open-weight models make it easier to reproduce experiments and compare how different systems handle diverse task types.

How does SWE-bench work?

  1. Start with a benchmark variant such as Verified, Lite, or Full, then load the corresponding dataset into the standardized evaluation environment so every run begins from the same baseline.
  2. Connect your agent or model through an open scaffold, using integrations like GitHub, Docker, Modal, SWE-agent, or mini-SWE-agent to execute tasks against real-world GitHub issues.
  3. Run the evaluation and let SWE-bench score outcomes across the selected tasks, including the Multilingual or Multimodal sets when your research needs broader coverage.
  4. Review the compare-results and detailed-analysis views to inspect failures, compare models, and see which instances were resolved and which were not.
  5. Publish or share the outcome on the official leaderboards, then iterate with reproducible evaluation so your team can track progress across multiple datasets over time.
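The comparison in step 4 can be approximated offline once per-instance outcomes are available. The sketch below is a hypothetical helper, not part of SWE-bench: it assumes each run is a mapping from instance ID to a resolved flag and only compares instances that both runs attempted.

```python
from typing import Dict, List


def compare_runs(run_a: Dict[str, bool], run_b: Dict[str, bool]) -> Dict[str, List[str]]:
    """Bucket shared instances by which run resolved them."""
    shared = sorted(set(run_a) & set(run_b))
    return {
        "both": [i for i in shared if run_a[i] and run_b[i]],
        "only_a": [i for i in shared if run_a[i] and not run_b[i]],
        "only_b": [i for i in shared if run_b[i] and not run_a[i]],
        "neither": [i for i in shared if not run_a[i] and not run_b[i]],
    }


# Hypothetical per-instance outcomes for two runs on the same benchmark variant.
baseline = {"demo__repo-1": True, "demo__repo-2": False, "demo__repo-3": True}
candidate = {"demo__repo-1": True, "demo__repo-2": True, "demo__repo-3": False}

report = compare_runs(baseline, candidate)
for bucket, ids in report.items():
    print(f"{bucket}: {ids}")
```

The `only_a` bucket is the one to watch when iterating: it lists instances the earlier run resolved and the newer run did not.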

Frequently asked questions

What is SWE-bench?

SWE-bench is a benchmark for ML researchers and AI engineers that evaluates language models on real GitHub issues by checking whether model-generated patches make failing tests pass. It offers official leaderboards and tools to compare and analyze results in detail, and supports Docker-based reproducible evaluation plus the related mini-SWE-agent, SWE-agent, and SWE-smith projects. The project acknowledges support from OpenAI, Anthropic, AWS, and Modal, among others.

What is SWE-bench used for? Who is it for?

SWE-bench is used to rank models on official leaderboards, compare results across runs, and analyze results in detail. It's built for ML researchers, AI engineers, and evaluation teams.

Does SWE-bench have an API and what does it integrate with?

SWE-bench doesn't publish a public API. Its listed integrations are Slack, GitHub, YouTube, X, Hugging Face, OpenAI, Docker, and Modal, along with the related mini-SWE-agent, SWE-agent, and SWE-smith projects.
