LangWatch

LangWatch is an open-source LLMOps platform for testing, evaluation, and observability of AI agents in development and production.

Reviewed by Mathijs Bronsdijk · Updated Apr 13, 2026

Tool · Free + Paid Plans · Updated 1 month ago

What is LangWatch?

LangWatch is an open-source LLMOps platform for monitoring, evaluating, and testing AI agents in development and production. It captures full traces of LLM calls, tool usage, and conversation flows, giving teams visibility into what their agents are actually doing. LangWatch is built for AI engineers and product teams at growth-stage companies shipping multi-step agent workflows. Where most observability tools stop at logging, LangWatch adds agent simulations, automated evaluations, prompt versioning, and cost tracking in a single platform.

Key Features

  • Full Trace Observability: Captures end-to-end traces across multi-step agent workflows, logging inputs, outputs, metadata, tokens, and costs with real-time drill-down for debugging failures (see the sketch after this list)
  • Agent Simulation Testing (LangWatch Scenario): Creates and runs simulation tests directly in the platform without code, defining user personas and pass/fail criteria against HTTP endpoints or voice agents
  • Prompt Playground: Write, test, and version prompts connected to real trace data and evaluations, linking prompt changes directly to observed performance impacts
  • Graph Alerts: Triggers Slack notifications when custom thresholds are met on metrics like latency or error rates, catching production issues before users notice
  • Annotations and Queues: Domain experts manually label edge cases in traces, combining human oversight with automated evals to target improvements on rare failure modes
  • GitHub Integration: Links prompt versions to Git commits and tags traces for GitOps workflows, treating prompts as versioned code with reproducible debugging
  • Self-Hosting via Docker: Deploys with a single Docker Compose command for data residency in VPCs, suitable for regulated sectors needing full infrastructure control
  • RBAC Custom Roles: Enforces fine-grained access control with customizable role sets, limiting eval or production workflow access by team domain
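
To make the trace capture described in the first bullet concrete, here is a minimal Python sketch. It assumes the decorator-style SDK this review mentions; the exact names (`langwatch.setup`, `langwatch.trace`, `autotrack_openai_calls`) may differ between SDK versions, so treat it as illustrative rather than canonical usage.

```python
import os

import langwatch  # pip install langwatch
from openai import OpenAI

# Assumed setup call; the SDK may also read LANGWATCH_API_KEY from the environment.
langwatch.setup(api_key=os.environ["LANGWATCH_API_KEY"])

client = OpenAI()


@langwatch.trace()  # wraps the whole request in a single trace
def answer_question(question: str) -> str:
    # Assumed auto-instrumentation hook: LLM calls made through this client
    # are then recorded as spans with token counts and cost attached.
    langwatch.get_current_trace().autotrack_openai_calls(client)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


print(answer_question("Summarize today's open support tickets."))
```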

Use Cases

  • AI engineering teams shipping multi-step agents: Run daily scenario tests in CI/CD to catch prompt regressions and hallucinations before they reach production
  • B2B platform builders: Track per-customer AI performance with customer_id filtering and provide custom observability dashboards to clients running LLM apps (see the sketch after this list)
  • QA leads debugging agent failures: Use trace filtering, labels, and cross-team annotations to isolate root causes in black-box agent workflows
  • MLOps engineers in production: Monitor cost attribution, quality signals, and user event feedback across OpenTelemetry-instrumented agent pipelines
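
As a sketch of the per-customer tagging mentioned in the B2B bullet above: attaching a customer_id to the current trace lets dashboards and filters group results by tenant. The `update(metadata=...)` call follows the same assumed SDK pattern as the earlier sketch and may differ by version.

```python
import langwatch


@langwatch.trace()
def run_agent(prompt: str, customer_id: str) -> str:
    # Tag the trace with the tenant so per-customer dashboards and filters
    # can pick it up. Method and argument names are assumptions; check the
    # SDK docs for your version.
    langwatch.get_current_trace().update(metadata={"customer_id": customer_id})
    ...  # multi-step agent workflow goes here
    return "agent output"
```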

Strengths and Weaknesses

Strengths:

  • Combines monitoring, evaluations, and experimentation in one platform instead of stitching together separate tools
  • Python SDK is lightweight with good async support; TypeScript SDK is type-safe and simple for Node apps
  • Scenario testing with simulated users catches regressions that unit tests miss in non-deterministic agent behavior
  • ISO 27001 certified and SOC 2 Type II compliant with self-hosting option for data sovereignty
  • Free tier includes all platform features with 50,000 events per month

Weaknesses:

  • Free-tier rate limits can kick in during heavy testing cycles
  • Error messages for authentication and missing spans could be more specific
  • Some SDK breaking changes (v1.2) required decorator rewrites with limited migration guidance
  • Documentation is practical for setup but thin on advanced evaluation configurations
  • Community presence is limited compared to alternatives like Langfuse

Pricing

  • Developer (Free): $0 forever, all platform features, 50,000 events/month, 14-day data retention, 2 users, 1 GB storage, 3 scenarios/simulations/custom evaluations
  • Growth: ~$34/core seat/month (EUR 29), 200,000 events included, then EUR 0.0005 per extra event, 30-day retention, unlimited lite-users, unlimited evals/simulations/prompts, private Slack/Teams support, 14-day free trial (see the worked example after this list)
  • Enterprise: Custom pricing, on-prem or hosted deployment, OpenTelemetry support for Java/Go/custom, custom metrics and dashboards, premium support

Volume discounts are available above 20 Growth seats.
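
A quick worked example of the Growth-tier math, using only the numbers listed above (EUR 29 per core seat per month and EUR 0.0005 per event beyond the included 200,000):

```python
# Hypothetical team: 3 core seats, 500,000 events in a month.
seats, events = 3, 500_000

seat_cost = seats * 29                        # EUR 87
overage = max(0, events - 200_000) * 0.0005   # 300,000 extra events -> EUR 150

print(f"EUR {seat_cost + overage:.2f}/month")  # EUR 237.00/month
```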

FAQ

Is LangWatch open source?

Yes. LangWatch is fully open-source and available on GitHub. Teams can self-host it locally or in their own VPC with no data lock-in, and export all data at any time.

How does LangWatch work?

LangWatch automatically tracks every LLM call, tool usage, and user interaction with detailed traces, spans, and metadata. It layers evaluations, agent simulations, prompt experiments, and dashboards for cost and quality metrics on top of that observability data.
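
Individual steps such as tool calls can be captured as child spans inside a trace. A hedged sketch, assuming a span decorator alongside the trace decorator used in the sketches above (names and span types vary by SDK version):

```python
import langwatch


@langwatch.span(type="tool")  # assumed: records this step as a span on the current trace
def search_docs(query: str) -> list[str]:
    return ["result 1", "result 2"]  # stand-in for a real retrieval call


@langwatch.trace()
def agent_turn(user_message: str) -> str:
    docs = search_docs(user_message)  # appears as a nested span in the trace
    return f"Found {len(docs)} documents"
```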

Is LangWatch safe for regulated industries?

LangWatch holds ISO 27001 and SOC 2 Type II certifications. It supports VPC self-hosting to keep sensitive traces and datasets private, offers data residency in EU and US regions, and encrypts data at rest (AES-256) and in transit (TLS 1.2+).

How does LangWatch compare to Langfuse?

Both are open-source LLM observability platforms. LangWatch bundles monitoring, evaluations, and experimentation into one tool. Langfuse focuses more on self-managed observability with a strong open-source community. Choose LangWatch if you need integrated scenario testing and prompt experiments alongside tracing.

How long does it take to set up LangWatch?

Most teams report logging their first traces within 10 to 20 minutes. Setup requires adding an API key and installing the Python or TypeScript SDK. No credit card is needed for the free tier.
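
For reference, that setup amounts to installing the SDK and exporting an API key; the package and environment-variable names below are the same assumptions used in the sketches above:

```python
# pip install langwatch
# export LANGWATCH_API_KEY="..."  (copied from the project settings page)
import langwatch

langwatch.setup()  # assumed to read LANGWATCH_API_KEY from the environment
```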
