Groq
What is Groq?
Groq is an AI inference platform for developers that serves text, audio, vision, and image-to-text models with consistent latency and predictable throughput. It combines GroqCloud, OpenAI-compatible APIs, popular models on demand, prompt caching, and compound tools. The platform is used by Dropbox, Vercel, Canva, and Robinhood. Pricing includes usage tiers such as GPT OSS 20B 128k at $0.075, GPT OSS 120B 128k at $0.15, and Llama 4 Scout 128k at $0.11.
Last verifiedHow we evaluate
At a glance
- Groq is best for developers who need fast, predictable inference for production AI apps.
- GPT OSS 20B 128k $0.075; GPT OSS Safeguard 20B $0.075; GPT OSS 120B 128k $0.15; Llama 4 Scout (17Bx16E) 128k $0.11; Qwen3 32B 131k $0.29…
- Yes — GroqCloud is presented as an AI inference platform for developers with APIs available on Free, Developer, and Enterprise plans.
What does Groq do?
GroqCloud runs inference on Groq's purpose-built LPU so teams can serve text, audio, vision, and image-to-text models with consistent latency and predictable throughput. The platform exposes popular models on demand, plus OpenAI-compatible access and a free API key path that makes it easy to start in a few lines of code. Groq also supports built-in prompt caching and compound tools, so developers can wire model calls into real workflows without rebuilding their stack. At scale, Groq shows millisecond inference, 99.5% platform availability, and record-setting performance across production workloads. The company says it was established in 2016 for inference, and customer stories point to real deployments from Solomei AI, GPTZero, StackAI, Fintool, and Opennote. GroqCloud is available in public, private, or co-cloud instances, while enterprise plans add regional endpoint selection, scalable capacity, custom models, and LoRA fine-tunes for larger deployments.
Why use Groq?
- Groq's LPU architecture is built for inference, so teams get deterministic execution instead of adapting a general-purpose chip.
- OpenAI-compatible access lets developers switch with just a few lines of code instead of rewriting their application layer.
- Public, private, and co-cloud instance options give buyers more deployment flexibility than a single shared endpoint.
- Customer stories show production gains like 7.41x faster chat speed and 89% lower costs for Fintool.
- GroqCloud supports text, audio, vision, and image-to-text models in one platform, reducing tool sprawl for multimodal apps.
Who is Groq for?
- Application developers who need low-latency model access with simple API integration.
- Startup teams who want to scale AI features without unpredictable inference costs.
- Enterprise AI teams who need custom deployment options and dedicated support.
- Product builders who ship voice, text, and vision experiences from one platform.
- Data and automation teams who need OpenAI-compatible model access and prompt caching.
What are Groq's key features?
GroqCloud
Run AI inference through GroqCloud APIs with Free, Developer, and Enterprise access, so teams can ship production workloads without changing their stack.
OpenAI compatible
Use OpenAI-compatible APIs to swap in GroqCloud with less code churn, while keeping support for OpenAI-style request patterns and common tooling.
Support for LLMs, STT, TTS, and image-to-text models
Serve LLMs, whisper v3 large speech-to-text, whisper-large-v3-turbo, and image-to-text workloads from one platform, reducing vendor sprawl for multimodal apps.
Popular models on-demand
Access models like llama-4-scout, llama-3.1-8b-instant, GPT-OSS-120B, and qwen3-32b on demand, which helps teams test and deploy faster.
LPU Architecture
Groq's LPU architecture is built for millisecond inference and 7, 10X faster inference, helping latency-sensitive products respond faster under load.
Secure by Default
Deploy on infrastructure designed for secure by default operation, with GroqRack and Groq Data Center Deployments for teams that need controlled environments.
Power Efficient
Use a power-efficient inference stack with single-core and on-chip SRAM design, which can lower operating cost while keeping throughput high.
What does Groq integrate with?
- OpenAI
- whisper v3 large
- whisper-large-v3-turbo
- llama-4-scout
- llama-3.1-8b-instant
- llama-3.3-70b-versatile
- GPT-OSS-120B
- GPT-OSS-20B
- compound
- compound-mini
- qwen3-32b
What are Groq's use cases?
App developers ship faster
Application developers use GroqCloud to add low-latency model calls to product features without rewriting their stack, using OpenAI compatible and Popular models on-demand to move from prototype to production quickly. They can keep existing request patterns while getting faster responses for chat, extraction, and agent workflows.
Voice and vision products
Product builders use GroqCloud to power voice assistants, transcription, and image understanding from one platform, using Support for LLMs, STT, TTS, and image-to-text models to keep the experience unified. That lets them launch multimodal features without stitching together separate vendors.
Cost control for startups
Startup teams use GroqCloud to scale AI features while keeping inference spend predictable, using LPU Architecture and Popular models on-demand to handle traffic spikes without surprise bills. They can ship customer-facing AI with faster responses and a clearer cost model.
Enterprise deployment options
Enterprise AI teams use GroqCloud and GroqRack to roll out production inference with dedicated support and secure deployment choices, using Secure by Default and Groq Data Center Deployments to meet internal requirements. That helps them standardize model access across teams without sacrificing control.
How does Groq work?
- Connect your first model endpoint in GroqCloud and choose a model from Popular models on-demand, then send a test request through the OpenAI compatible API to verify your integration.
- Map your existing prompts and tools to Industry standard frameworks and integrations, so your app, automation, or agent workflow can call Groq without changing your core architecture.
- Add voice, text, or vision workloads by selecting Support for LLMs, STT, TTS, and image-to-text models, then route each request to the right model for the task.
- Monitor latency and throughput as traffic grows, using LPU Architecture and Build Fast to keep responses quick while your team ships new features.
- Move higher-volume or enterprise workloads onto GroqRack or Groq Data Center Deployments, and rely on Secure by Default and Power Efficient operation for steady production use.
How much does Groq cost?
GPT OSS 20B 128k
$0.075- AI Model GPT OSS 20B 128k
- Current Speed 1,000 TPS
- Input Token Price(Per Million Tokens) $0.075(13.3M / $1)*
- Try Now
- Model Card
GPT OSS Safeguard 20B
$0.075- AI Model GPT OSS Safeguard 20B
- Current Speed 1,000 TPS
- Input Token Price(Per Million Tokens) $0.075(13.3M / $1)*
- Try Now
- Model Card
GPT OSS 120B 128k
$0.15- AI Model GPT OSS 120B 128k
- Current Speed 500 TPS
- Input Token Price(Per Million Tokens) $0.15(6.67M / $1)*
- Try Now
- Model Card
Llama 4 Scout (17Bx16E) 128k
$0.11- AI Model Llama 4 Scout (17Bx16E) 128k
- Current Speed 594 TPS
- Input Token Price(Per Million Tokens) $0.11(9.09M / $1)*
- Try Now
- Model Card
Qwen3 32B 131k
$0.29- AI Model Qwen3 32B 131k
- Current Speed 662 TPS
- Input Token Price(Per Million Tokens) $0.29(3.44M / $1)*
- Try Now
- Model Card
Llama 3.3 70B Versatile 128k
$0.59- AI Model Llama 3.3 70B Versatile 128k
- Current Speed 394 TPS
- Input Token Price(Per Million Tokens) $0.59(1.69M / $1)*
- Try Now
- Model Card
Llama 3.1 8B Instant 128k
$0.05- Llama 3.1 8B Instant 128k
- Current Speed 840 TPS
- Input Token Price(Per Million Tokens) $0.05(20M / $1)*
- Try Now
- Model Card
Minimax M2.5
Custom- AI Model Minimax M2.5
Qwen3-VL 32B
Custom- AI Model Qwen3-VL 32B
Canopy Labs Orpheus English
$22.00- AI Model Canopy Labs Orpheus English
- Characters /s 100
- Price $22.00
- Try Now
- Model Card
Canopy Labs Orpheus Arabic Saudi
$40.00- Canopy Labs Orpheus Arabic Saudi
- Characters /s 100
- Price $40.00
- Try Now
- Model Card
Whisper V3 Large
$0.111- AI Model Whisper V3 Large
- Speed Factor 217x
- Price $0.111*
- Try Now
- Model Card
Whisper Large v3 Turbo
$0.04- Whisper Large v3 Turbo
- Speed Factor 228x
- Price $0.04*
- Try Now
- Model Card
moonshotai/kimi-k2-instruct-0905
$1.00- Model moonshotai/kimi-k2-instruct-0905
- Cached Input Tokens (Per M Tokens) $0.50
- Output Tokens (Per M Tokens) $3.00
openai/gpt-oss-120b
$0.15- Model openai/gpt-oss-120b
- Cached Input Tokens (Per M Tokens) $0.075
- Output Tokens (Per M Tokens) $0.60
openai/gpt-oss-20b
$0.075- Model openai/gpt-oss-20b
- Cached Input Tokens (Per M Tokens) $0.0375
- Output Tokens (Per M Tokens) $0.30
Basic Search
$5 / 1000 requests- Web_search
Advanced Search
$8 / 1000 requests- Search
Visit Website
$1 / 1000 requests- Tool Visit Website
Code Execution
$0.18 / hour- Tool Code Execution
Browser Automation
$0.08 / hour- Price $0.08 / hour
Browser Search - Basic Search
$5 / 1000 requests- Price $5 / 1000 requests
Browser Search - Visit Website
$1 / 1000 requests- Price $1 / 1000 requests
- Tool Browser Search - Visit Website
Code Execution - Python
$0.18 / hour- Tool Code Execution - Python
Frequently asked questions
What is Groq?
Groq is an AI inference platform for developers that serves text, audio, vision, and image-to-text models with consistent latency and predictable throughput. It combines GroqCloud, OpenAI-compatible APIs, popular models on demand, prompt caching, and compound tools. The platform is used by Dropbox, Vercel, Canva, and Robinhood. Pricing includes usage tiers such as GPT OSS 20B 128k at $0.075, GPT OSS 120B 128k at $0.15, and Llama 4 Scout 128k at $0.11.
How much does Groq cost? Is it free?
Groq has 24 paid plans: GPT OSS 20B 128k at $0.075, GPT OSS Safeguard 20B at $0.075, GPT OSS 120B 128k at $0.15.
What is Groq used for? Who is it for?
Groq is used for GroqCloud, OpenAI compatible, and Support for LLMs, STT, TTS, and image-to-text models. It's built for Application developers, Startup teams, and Enterprise AI teams.
Does Groq have an API and what does it integrate with?
GroqCloud is presented as an AI inference platform for developers with APIs available on Free, Developer, and Enterprise plans. It integrates with OpenAI, whisper v3 large, whisper-large-v3-turbo, llama-4-scout, llama-3.1-8b-instant, and 6 more.
Editor's read
Check whether your workload needs regional endpoint selection, private or co-cloud deployment, or LoRA fine-tunes, since those are tied to enterprise plans. If you need those controls, verify the plan and deployment model before building around the shared API path.
