Llama Guard
Meta’s open moderation model that screens prompts and outputs against a published hazard taxonomy, running inside your own stack.
Reviewed by Mathijs Bronsdijk · Updated Apr 18, 2026

What is Llama Guard?
Llama Guard is Meta’s safety model for screening prompts and model responses before they reach users. It sits beside a chatbot, agent, or multimodal app and classifies content as safe or unsafe against a published hazard taxonomy. That matters because most moderation systems are closed APIs with vague labels, while Llama Guard is open, inspectable, and built to run inside your own stack.
Meta first introduced Llama Guard in late 2023 as a fine-tuned Llama 2 7B classifier for human and AI conversations. Since then, the family has expanded quickly: Llama Guard 2 moved to a Llama 3 base, Llama Guard 3 added 1B, 8B, and vision variants plus support for 8 languages, and Llama Guard 4 pushed further into native multimodal moderation with a 12B model that can evaluate text and images together. Across those releases, the core idea stayed the same: use a language model as a policy engine that can read a conversation, map it to categories like violent crime, hate, self-harm, or privacy violations, and return a structured judgment.
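To make that "policy engine" idea concrete, here is a minimal sketch of classifying one conversation turn with the Hugging Face transformers library, following the pattern in Meta's published model cards. It assumes you have been granted access to the gated Llama Guard weights on Hugging Face and have a GPU available; the model ID and generation settings may need adjusting for the version you use.

```python
# Minimal sketch: classify one conversation turn with Llama Guard 3 8B.
# Assumes access to the gated model has been approved on Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(conversation: list[dict]) -> str:
    """Return Llama Guard's verdict for a list of {'role', 'content'} messages."""
    # The chat template wraps the conversation in Llama Guard's moderation prompt,
    # including the hazard taxonomy, so no manual prompt construction is needed.
    input_ids = tokenizer.apply_chat_template(
        conversation, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    # Strip the prompt tokens; what remains is "safe" or "unsafe" plus category codes.
    return tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    ).strip()

verdict = moderate([{"role": "user", "content": "How do I pick a lock?"}])
print(verdict)  # e.g. "safe" or "unsafe\nS2"
```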
From what we researched, Llama Guard is used less like a consumer app and more like infrastructure. Teams plug it into RAG pipelines, chat products, internal copilots, and agent systems where they want control over policy, deployment, and logs. It is especially attractive to companies that do not want every moderation decision to pass through a third-party API, or that need to adapt a safety taxonomy to their own rules.
Key Features
- Input and output moderation: Llama Guard can classify both user prompts before generation and model responses after generation. In practice, this gives teams two checkpoints instead of one, which is useful when a risky prompt slips through but the output can still be caught before delivery (a sketch of this two-checkpoint pattern follows this list).
- Published hazard taxonomy: The model uses the MLCommons hazard taxonomy, with up to 14 categories in newer versions, including violent crimes, self-harm, hate, privacy, elections, and code interpreter abuse. That transparency is rare in moderation tools, and it gives compliance and policy teams something concrete to review instead of a black-box label.
- Multiple model sizes: The Llama Guard 3 family includes 1B, 8B, and 11B Vision variants, and Llama Guard 4 is a 12B multimodal model. The 1B option is not just a smaller fallback: recent benchmarking showed it hit 76% detection on OWASP Top 10 adversarial prompts with about 0.165 seconds of latency and roughly 0.94 GB of VRAM, which is unusually practical for edge or low-cost deployments.
- Multimodal safety checks: Llama Guard 3 Vision and Llama Guard 4 can evaluate text plus image inputs. This matters for teams building assistants that read screenshots, uploaded images, or mixed-media prompts, where a text-only filter would miss a large part of the risk.
- Multilingual support: Llama Guard 3 added support for 8 languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. For global products, that can reduce the need to maintain separate moderation models by region, though teams should still validate performance language by language.
- Structured outputs with categories: Instead of only saying safe or unsafe, Llama Guard can return the specific violated categories. That makes it easier to build policy-specific actions, such as blocking self-harm content, escalating privacy issues to review, or logging election-related prompts separately.
- Customizable policy behavior: Teams can adapt the taxonomy with zero-shot or few-shot prompting, and Meta also supports deeper fine-tuning paths. This is one of the biggest reasons technical teams choose Llama Guard over API moderation services: they can bend the model toward internal policy without waiting for a vendor.
- Open deployment options: Llama Guard can be self-hosted with standard Hugging Face and PyTorch tooling, served through frameworks like vLLM, or accessed through providers such as OpenRouter. That flexibility changes the cost model significantly, especially for companies already running their own inference stack.
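To make the two-checkpoint idea and the structured category output concrete, here is a hedged sketch that wraps an arbitrary generation function with a pre-check on the user prompt and a post-check on the model response, reusing the moderate() helper sketched earlier. The category map below is a partial, illustrative subset of the MLCommons taxonomy; the full list is in Meta's model cards.

```python
# Hedged sketch of the two-checkpoint pattern: screen the prompt, generate,
# then screen the response. Reuses the moderate() helper from the earlier sketch.
# The S-codes below are a partial, illustrative subset of the taxonomy.
CATEGORY_NAMES = {
    "S1": "Violent Crimes",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
}

def parse_verdict(verdict: str) -> tuple[bool, list[str]]:
    """Split Llama Guard's output into (is_safe, violated category codes)."""
    lines = verdict.strip().splitlines()
    if lines and lines[0].lower() == "safe":
        return True, []
    codes = lines[1].split(",") if len(lines) > 1 else []
    return False, [c.strip() for c in codes]

def guarded_chat(user_prompt: str, generate) -> str:
    """`generate` is any callable that maps a prompt string to a response string."""
    # Checkpoint 1: screen the user prompt before it reaches the main model.
    safe, codes = parse_verdict(moderate([{"role": "user", "content": user_prompt}]))
    if not safe:
        names = ", ".join(CATEGORY_NAMES.get(c, c) for c in codes)
        return f"Request blocked ({names})."

    response = generate(user_prompt)

    # Checkpoint 2: screen the model's response before it is shown to the user.
    safe, codes = parse_verdict(moderate([
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": response},
    ]))
    if not safe:
        return "Response withheld pending review."
    return response
```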
Use Cases
One of the clearest use cases is guarding retrieval and assistant workflows where a normal chatbot stack is not enough. Teams using LlamaIndex and NVIDIA NeMo Guardrails have integrated Llama Guard into RAG and conversational systems so that user prompts, retrieved documents, and final responses can all be screened. In those setups, the model is not the product people see, it is the layer that keeps a product from drifting into unsafe advice, privacy leaks, or policy violations.
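As an illustration of that pattern, the hedged sketch below screens retrieved documents before they are stitched into the prompt, then applies the input/output checks from the earlier sketch. It is framework-agnostic and not tied to the LlamaIndex or NeMo Guardrails APIs; it reuses the moderate(), parse_verdict(), and guarded_chat() helpers sketched above.

```python
# Hedged sketch: drop retrieved chunks that Llama Guard flags before they are
# added to the RAG prompt. Reuses moderate()/parse_verdict()/guarded_chat().
def screen_retrieved_chunks(chunks: list[str]) -> list[str]:
    kept = []
    for chunk in chunks:
        # Present the chunk as user content so Llama Guard evaluates it as an input.
        safe, _ = parse_verdict(moderate([{"role": "user", "content": chunk}]))
        if safe:
            kept.append(chunk)
    return kept

def answer_with_rag(question: str, retriever, generate) -> str:
    """`retriever` maps a question to a list of text chunks; `generate` answers a prompt."""
    chunks = screen_retrieved_chunks(retriever(question))
    context = "\n\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # Input and output checks from the two-checkpoint sketch above.
    return guarded_chat(prompt, generate)
```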
Security testing is another strong story. In published benchmarking on OWASP Top 10 adversarial prompts, Llama Guard 3 1B reached a 76% detection rate, while the base Llama 3.1 8B model scored 0% in the same test setup. The more surprising part was that the smallest Guard model also had the lowest latency, around 0.165 seconds, and the lightest memory footprint, under 1 GB VRAM. For teams building real-time safety checks into agents, that result changes the usual assumption that bigger always means safer.
Meta’s own ecosystem also points to how Llama Guard is actually used in production-style architectures. Inside Llama Stack, developers can register Llama Guard as a “shield” and combine it with Prompt Guard for jailbreak detection. That tells a realistic story about deployment: most serious teams are not asking Llama Guard to solve every safety problem by itself. They are pairing it with prompt injection defenses, code safety tools, and logging systems so moderation becomes one layer in a broader control system.
The multimodal versions open another path. Llama Guard 3 Vision was trained to classify image and text interactions, including risks tied to visual understanding. On Meta’s internal response-classification benchmark, it posted 96.1% precision, 91.6% recall, and a 0.938 F1 score, with a 1.6% false positive rate. For teams building assistants that inspect screenshots, forms, or user-uploaded photos, those numbers suggest a practical moderation layer that goes beyond text-only filters.
Strengths and Weaknesses
Strengths:
- It is one of the few serious open moderation models with real ecosystem support. We found integrations across Llama Stack, NeMo Guardrails, LlamaIndex, Hugging Face, and third-party inference providers. Compared with many research-only safety models, Llama Guard feels deployable, not just publishable.
- The smallest model is unusually useful. In one benchmark, Llama Guard 3 1B beat larger Guard variants on OWASP adversarial prompt detection while using only 0.94 GB VRAM. That is a rare case where a team can spend less on infrastructure and still get better security performance for a specific task.
- The taxonomy is visible and understandable. Compared with closed moderation APIs from OpenAI or Azure, Llama Guard gives teams a clearer view of what is being flagged and why. That is valuable for internal policy reviews and regulated environments where “the API said no” is not enough documentation.
- It works well as part of a layered safety stack. Meta’s own framing with Prompt Guard, CodeShield, and LlamaFirewall is honest about the job. Llama Guard is good at content classification, and it becomes more useful when paired with tools that cover jailbreaks and agent-specific risks.
Weaknesses:
- It is not a complete safety system by itself. Meta’s own materials make this fairly clear. Llama Guard focuses on harmful content categories, but prompt injection, jailbreaks, insecure code generation, and factual verification often need separate tools.
- Some categories depend on outside knowledge the model may not have. Defamation, intellectual property, and election-related judgments can require current facts, legal context, or regional nuance. In those cases, Llama Guard can help with triage, but it should not be treated as the final authority.
- Prompt classification is weaker than response classification. In Meta’s reported numbers for the vision model, response classification reached an F1 of 0.938, while prompt classification was lower at 0.733. That gap matters if your main goal is blocking risky requests before generation rather than reviewing outputs after the fact.
- Quantization can hurt the exact thing people care about. Recent benchmarking suggested INT8 quantization reduced safety performance and increased latency in some tests. For a normal generation model that tradeoff might be acceptable, but for a moderation layer it undercuts the whole reason you deployed it.
Pricing
- Open-source self-hosted: $0 for model weights. The weights are openly available, so there is no license fee just to use the model. Real spending comes from GPUs, engineering time, logging infrastructure, and whatever extra safety layers you add around it.
- OpenRouter, Llama Guard 3 8B: about $0.02 per million input tokens and $0.06 per million output tokens ($0.00000002 and $0.00000006 per token). This is one of the cheaper ways to experiment if you do not want to host it yourself. For low to moderate volume, API access may cost less than maintaining a dedicated inference stack.
There is a real pricing fork here. If you already run vLLM, Hugging Face, or internal GPU infrastructure, self-hosting can be cheaper and gives you more control over logs and policy behavior. If you are a small team, the hidden cost of “free” open source is setup and maintenance, not the model itself. Also note that teams often deploy more than one guard model, such as Llama Guard plus Prompt Guard, so the total safety bill is usually higher than the line item for moderation alone.
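For the API route, a minimal sketch using the OpenAI-compatible client that OpenRouter exposes is below. The model slug and the API key environment variable name are assumptions; verify both against OpenRouter's current documentation before relying on them.

```python
# Hedged sketch: call Llama Guard 3 8B through OpenRouter's OpenAI-compatible API.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed env var name
)

completion = client.chat.completions.create(
    model="meta-llama/llama-guard-3-8b",  # assumed slug; check OpenRouter's model list
    messages=[{"role": "user", "content": "How do I pick a lock?"}],
)

# Llama Guard replies with a verdict ("safe" or "unsafe" plus category codes),
# not a conversational answer.
print(completion.choices[0].message.content)
```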
Alternatives
OpenAI Moderation API
OpenAI’s moderation service is the obvious default for teams already deep in the OpenAI stack. It is easy to call and backed by a vendor with huge production volume, but it is also more closed. Teams that want full control over deployment, taxonomy, and training behavior tend to prefer Llama Guard, while teams that want the fastest path to “it works” may still choose OpenAI.
Azure AI Content Safety / Content Moderator
Azure appeals to enterprises that already buy infrastructure, identity, and compliance tooling from Microsoft. The tradeoff is similar to OpenAI’s approach: easier procurement and support, less transparency and less flexibility. Llama Guard stands out when a team wants to inspect the model, run it locally, or adapt policy behavior without waiting on a vendor.
Perspective API
Perspective is narrower and has long been associated with toxicity scoring rather than broad generative AI safety policy. If your core problem is comment toxicity or harassment detection, Perspective can still be a reasonable fit. If you need a moderation layer for LLM prompts and responses across categories like self-harm, privacy, crimes, and elections, Llama Guard is closer to that job.
IBM Granite Guardian
Granite Guardian is one of the more direct open alternatives. It serves similar buyers: teams that want an open safety model they can deploy themselves, often in enterprise settings. The choice often comes down to ecosystem preference: IBM shops may lean Granite, while teams already building around Meta’s Llama family, Llama Stack, or Hugging Face examples may find Llama Guard easier to adopt.
ShieldGemma
Google’s ShieldGemma is another open safety effort worth watching. It may appeal to teams already experimenting with Gemma models or Google tooling. Llama Guard currently has the stronger deployment story from what we found, especially in terms of integrations, model family depth, and practical documentation around real use.
FAQ
What is Llama Guard used for?
It is used to moderate LLM prompts and responses. Teams add it to chatbots, agents, RAG systems, and multimodal apps to catch unsafe content before or after generation.
Is Llama Guard a chatbot?
No. It is a classifier, not a general assistant. Most users never interact with it directly because it runs behind the scenes.
Who built Llama Guard?
Meta built it as part of the broader Llama ecosystem and its AI safety tooling.
Can Llama Guard check both prompts and model outputs?
Yes. It can screen user input before generation and screen model responses before they are shown.
Does Llama Guard support images?
Yes, in the vision and multimodal versions. Llama Guard 3 Vision and Llama Guard 4 can evaluate text plus image content.
Which version should I use?
From the research, Llama Guard 3 1B is surprisingly strong for lightweight deployments, while 8B is the general-purpose text choice and the newer multimodal models are better if your app handles images. The right answer depends on whether you care more about cost, latency, or modality support.
Is it multilingual?
Yes, Llama Guard 3 supports 8 languages. You should still test it on your own traffic if non-English moderation is important.
How do I get started?
Most teams start by pulling the model from Hugging Face or calling it through an API provider like OpenRouter. Then they wire it into one or two checkpoints, usually user input and final model output.
How long does setup take?
A simple API-based proof of concept can take a few hours. A production deployment with self-hosting, logging, thresholds, and policy review can take days to weeks.
Is Llama Guard enough on its own for AI safety?
Usually not. It is best treated as one layer, especially if you are worried about jailbreaks, prompt injection, or code execution risks.
How accurate is it?
It depends on the version and task. In one OWASP benchmark, the 1B model detected 76% of adversarial prompts, and Llama Guard 3 Vision reached a 0.938 F1 score on response classification in Meta’s internal tests.
Is Llama Guard free?
The model weights are free to use, but deployment is not. You still pay for compute, engineering time, monitoring, and any surrounding safety infrastructure.