Inspect AI
What is Inspect AI?
Inspect AI is an open-source evaluation framework that lets AI evaluation engineers benchmark models and agents through a Python API, a CLI, and the web-based Inspect View. It supports Tool calling, Structured Output, Batch Mode, and Model Concurrency for repeatable runs, and works with external agents like Claude Code, Codex CLI, and Gemini CLI. Developed by the UK AI Security Institute and Meridian Labs, it includes over 200 pre-built evaluations.
At a glance
- Inspect AI is best for AI eval engineers who need a flexible framework for benchmarking models and agents.
What does Inspect AI do?
Inspect AI handles large language model evaluation workflows by combining a Python API, a CLI, and the web-based Inspect View for monitoring runs. It lets teams define tasks, scoring, and prompting, then connect models, tools, and agents through built-in and custom interfaces. The framework also supports structured output, batch mode, and concurrency controls so evaluations can run consistently across many samples and models. Inspect AI ships with over 200 pre-built evaluations, and its evals catalog lists 219 entries, including benchmarks like GDPval, SWE-bench Verified, and Humanity's Last Exam. It is built for frontier AI evaluations and can work with external agents such as Claude Code, Codex CLI, and Gemini CLI, while sandboxing untrusted code in Docker, Kubernetes, Modal, Proxmox, and other systems. The documentation also covers logs, traces, and data frames for analysis. The project is developed by the UK AI Security Institute and Meridian Labs.
Why use Inspect AI?
- Its open-source design gives teams a framework they can inspect, extend, and adapt without being locked into a closed evaluation stack.
- The combination of Python APIs, CLI commands, and a web viewer supports both programmatic workflows and hands-on debugging.
- Built-in support for external agents and custom tools makes it easier to evaluate real agent behavior instead of isolated prompts.
- Sandboxing across Docker, Kubernetes, Modal, and Proxmox helps teams test untrusted code with tighter execution control.
- The evals catalog includes over 200 pre-built evaluations, so teams can start from existing benchmarks instead of assembling everything from scratch.
Who is Inspect AI for?
- AI evaluation engineers who need reusable components for benchmarking models and agents.
- Research teams who want to run frontier-model tests across reasoning, coding, and multimodal tasks.
- Platform engineers who need sandboxed execution and controlled tool use during evaluations.
- Applied ML teams who want to inspect logs, traces, and metrics from evaluation runs.
- Developers building custom agents who need a Python-first evaluation framework.
What are Inspect AI's key features?
Inspect View
Review evaluation runs in Inspect View with execution traces, transcript events, and logs, so teams can inspect failures and compare model behavior.
VS Code Extension
Run Inspect from VS Code to read execution traces, logs, and task results without leaving the editor, which speeds up debugging and prompt iteration.
Tool calling
Test built-in and custom tool functions with tool calling, then inspect transcript events and logs to verify agent actions and outputs.
Agent evaluations
Evaluate agents with tasks, scoring, and metrics across over 200 pre-built evaluations, helping teams compare models and prompts on repeatable benchmarks.
Sandboxing system
Manage sandbox environments for safe execution of agent runs, which matters when testing tool use, code, or external actions in controlled setups.
Structured Output
Validate structured output from models and providers such as OpenAI, Anthropic, and Google, reducing parsing errors in downstream workflows.
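To make the idea of validating structured output concrete, here is a minimal plain-Python sketch. It does not use Inspect AI's own API. The schema in REQUIRED_FIELDS and the function name validate_output are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "confidence": float}


def validate_output(raw: str) -> dict:
    """Parse a model's raw text as JSON and check required fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"field {name!r} should be {expected.__name__}")
    return data
```

Rejecting malformed output at this boundary means downstream scoring code only ever sees well-formed records.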
Batch Mode
Run batch evaluations over datasets and samples, including test sets on the order of 1,000 samples, to process larger test suites efficiently.
Model Concurrency
Execute multiple model runs in parallel across providers like AWS Bedrock, Azure AI, and vLLM, which shortens evaluation cycles.
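The general technique behind bounded parallel model calls can be sketched with Python's standard asyncio library. This is an illustration of the concept only, not Inspect AI's implementation. fake_model_call, run_all, and max_parallel are hypothetical names, and the stand-in model makes no network requests.

```python
import asyncio


async def fake_model_call(prompt: str) -> str:
    """Stand-in for a real provider call (hypothetical; no network access)."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def run_all(prompts: list[str], max_parallel: int = 2) -> list[str]:
    """Run every prompt while allowing at most max_parallel calls in flight."""
    limiter = asyncio.Semaphore(max_parallel)

    async def guarded(prompt: str) -> str:
        # The semaphore blocks here once max_parallel calls are active.
        async with limiter:
            return await fake_model_call(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(guarded(p) for p in prompts))


if __name__ == "__main__":
    print(asyncio.run(run_all(["a", "b", "c"])))
```

Capping the number of in-flight calls is a common way to stay within provider rate limits while still shortening total run time.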
What does Inspect AI integrate with?
- OpenAI
- Anthropic
- Grok
- Mistral
- Hugging Face
- AWS Bedrock
- Azure AI
- TogetherAI
- Groq
- Cloudflare
- Goodfire
- vLLM
- Ollama
- llama-cpp-python
- TransformerLens
- nnterp
- Claude Code
- Codex CLI
- Gemini CLI
- Docker
- Kubernetes
- Modal
- Proxmox
What are Inspect AI's use cases?
Benchmarking for evaluation engineers
AI evaluation engineers use Inspect AI to assemble reusable benchmarks for models and agents, using Agent evaluations and over 200 pre-built evaluations to compare runs consistently. They can inspect failures in Inspect View and reuse the same setup across experiments without rebuilding the evaluation harness each time.
Frontier tests for research teams
Research teams use Inspect AI to run frontier-model tests across reasoning, coding, and multimodal tasks, using Batch Mode to execute large suites and Model Concurrency to keep throughput high. They can pair those runs with Structured Output to make results easier to compare and analyze.
Sandboxed agents for platform engineers
Platform engineers use Inspect AI to evaluate agents that call tools in controlled environments, relying on the Sandboxing system and Tool calling to keep execution contained. They can manage sandbox environments while still observing execution traces and logs for each run.
Trace analysis for applied ML teams
Applied ML teams use Inspect AI to inspect logs, traces, and metrics from evaluation runs, using Inspect View and the log tooling (listing, reading, writing, and analysing logs) to spot regressions quickly. They can drill into execution traces and transcript events to understand why a model scored poorly.
How does Inspect AI work?
- Connect your first model provider (see "Model interface and providers"), then define the task you want to evaluate (see "Tasks, evaluation, and scoring") so Inspect AI knows what success looks like.
- Add datasets or samples (see "Reading samples from datasets"), then shape outputs with Structured Output and prompting and elicitation to keep runs comparable across models and agents.
- Run evaluations in Batch Mode or with Model Concurrency, and use Agent evaluations plus Tool calling when your workflow includes multi-step agent behavior.
- Open Inspect View or the Inspect log viewer to review transcripts, execution traces, and metrics, then list, read, write, and analyse logs to isolate failures.
- Refine your setup with Agent scaffolds, Built-in and custom tool functions, and the Sandboxing system, then rerun tests to track improvements over time.
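The steps above follow a common evaluation loop: take samples, call a model, score each output, and aggregate a metric. The sketch below shows that loop in plain Python. It is not Inspect AI's API. EvalSample, stub_model, exact_match, and evaluate are hypothetical names, and the stub model simply looks answers up in a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalSample:
    """One evaluation item: a prompt and the expected answer (hypothetical type)."""
    input: str
    target: str


def exact_match(output: str, target: str) -> bool:
    """Score an output as correct if it equals the target, ignoring case and padding."""
    return output.strip().lower() == target.strip().lower()


def evaluate(
    samples: Sequence[EvalSample],
    model: Callable[[str], str],
    scorer: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return accuracy: the fraction of samples the scorer marks correct."""
    if not samples:
        return 0.0
    correct = sum(scorer(model(s.input), s.target) for s in samples)
    return correct / len(samples)


def stub_model(prompt: str) -> str:
    """Stand-in model for illustration; a real run would call a provider."""
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")
```

Keeping the model and scorer as interchangeable callables is what lets the same samples be rerun against different models or scoring rules.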
Frequently asked questions
What is Inspect AI used for? Who is it for?
Inspect AI is used for benchmarking models and agents, with features such as Inspect View, the VS Code Extension, and Tool calling. It's built for AI evaluation engineers, research teams, and platform engineers.
Does Inspect AI have an API and what does it integrate with?
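Inspect AI provides a Python API and a CLI. It integrates with model providers such as OpenAI, Anthropic, Mistral, AWS Bedrock, Azure AI, vLLM, and Ollama; external agents such as Claude Code, Codex CLI, and Gemini CLI; and sandbox environments including Docker, Kubernetes, Modal, and Proxmox.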
