Groq

What is Groq?

Groq is an AI inference platform for developers that serves text, audio, vision, and image-to-text models with consistent latency and predictable throughput. It combines GroqCloud, OpenAI-compatible APIs, popular models on demand, prompt caching, and compound tools. The platform is used by Dropbox, Vercel, Canva, and Robinhood. Pricing includes usage tiers such as GPT OSS 20B 128k at $0.075, GPT OSS 120B 128k at $0.15, and Llama 4 Scout 128k at $0.11.

Last verifiedMay 17, 2026How we evaluate

Visit Groq

At a glance

Best for: Groq is best for developers who need fast, predictable inference for production AI apps.
Pricing: GPT OSS 20B 128k $0.075; GPT OSS Safeguard 20B $0.075; GPT OSS 120B 128k $0.15; Llama 4 Scout (17Bx16E) 128k $0.11; Qwen3 32B 131k $0.29…
API: Yes — GroqCloud is presented as an AI inference platform for developers with APIs available on Free, Developer, and Enterprise plans.

What does Groq do?

GroqCloud runs inference on Groq's purpose-built LPU so teams can serve text, audio, vision, and image-to-text models with consistent latency and predictable throughput. The platform exposes popular models on demand, plus OpenAI-compatible access and a free API key path that makes it easy to start in a few lines of code. Groq also supports built-in prompt caching and compound tools, so developers can wire model calls into real workflows without rebuilding their stack. At scale, Groq shows millisecond inference, 99.5% platform availability, and record-setting performance across production workloads. The company says it was established in 2016 for inference, and customer stories point to real deployments from Solomei AI, GPTZero, StackAI, Fintool, and Opennote. GroqCloud is available in public, private, or co-cloud instances, while enterprise plans add regional endpoint selection, scalable capacity, custom models, and LoRA fine-tunes for larger deployments.

Why use Groq?

Groq's LPU architecture is built for inference, so teams get deterministic execution instead of adapting a general-purpose chip.
OpenAI-compatible access lets developers switch with just a few lines of code instead of rewriting their application layer.
Public, private, and co-cloud instance options give buyers more deployment flexibility than a single shared endpoint.
Customer stories show production gains like 7.41x faster chat speed and 89% lower costs for Fintool.
GroqCloud supports text, audio, vision, and image-to-text models in one platform, reducing tool sprawl for multimodal apps.

Who is Groq for?

Application developers who need low-latency model access with simple API integration.
Startup teams who want to scale AI features without unpredictable inference costs.
Enterprise AI teams who need custom deployment options and dedicated support.
Product builders who ship voice, text, and vision experiences from one platform.
Data and automation teams who need OpenAI-compatible model access and prompt caching.

What are Groq's key features?

GroqCloud

Run AI inference through GroqCloud APIs with Free, Developer, and Enterprise access, so teams can ship production workloads without changing their stack.

OpenAI compatible

Use OpenAI-compatible APIs to swap in GroqCloud with less code churn, while keeping support for OpenAI-style request patterns and common tooling.

Support for LLMs, STT, TTS, and image-to-text models

Serve LLMs, whisper v3 large speech-to-text, whisper-large-v3-turbo, and image-to-text workloads from one platform, reducing vendor sprawl for multimodal apps.

Popular models on-demand

Access models like llama-4-scout, llama-3.1-8b-instant, GPT-OSS-120B, and qwen3-32b on demand, which helps teams test and deploy faster.

LPU Architecture

Groq's LPU architecture is built for millisecond inference and 7, 10X faster inference, helping latency-sensitive products respond faster under load.

Secure by Default

Deploy on infrastructure designed for secure by default operation, with GroqRack and Groq Data Center Deployments for teams that need controlled environments.

Power Efficient

Use a power-efficient inference stack with single-core and on-chip SRAM design, which can lower operating cost while keeping throughput high.

What does Groq integrate with?

OpenAI
whisper v3 large
whisper-large-v3-turbo
llama-4-scout
llama-3.1-8b-instant
llama-3.3-70b-versatile
GPT-OSS-120B
GPT-OSS-20B
compound
compound-mini
qwen3-32b

What are Groq's use cases?

App developers ship faster

Application developers use GroqCloud to add low-latency model calls to product features without rewriting their stack, using OpenAI compatible and Popular models on-demand to move from prototype to production quickly. They can keep existing request patterns while getting faster responses for chat, extraction, and agent workflows.

Voice and vision products

Product builders use GroqCloud to power voice assistants, transcription, and image understanding from one platform, using Support for LLMs, STT, TTS, and image-to-text models to keep the experience unified. That lets them launch multimodal features without stitching together separate vendors.

Cost control for startups

Startup teams use GroqCloud to scale AI features while keeping inference spend predictable, using LPU Architecture and Popular models on-demand to handle traffic spikes without surprise bills. They can ship customer-facing AI with faster responses and a clearer cost model.

Enterprise deployment options

Enterprise AI teams use GroqCloud and GroqRack to roll out production inference with dedicated support and secure deployment choices, using Secure by Default and Groq Data Center Deployments to meet internal requirements. That helps them standardize model access across teams without sacrificing control.

How does Groq work?

Connect your first model endpoint in GroqCloud and choose a model from Popular models on-demand, then send a test request through the OpenAI compatible API to verify your integration.
Map your existing prompts and tools to Industry standard frameworks and integrations, so your app, automation, or agent workflow can call Groq without changing your core architecture.
Add voice, text, or vision workloads by selecting Support for LLMs, STT, TTS, and image-to-text models, then route each request to the right model for the task.
Monitor latency and throughput as traffic grows, using LPU Architecture and Build Fast to keep responses quick while your team ships new features.
Move higher-volume or enterprise workloads onto GroqRack or Groq Data Center Deployments, and rely on Secure by Default and Power Efficient operation for steady production use.

How much does Groq cost?

GPT OSS 20B 128k

$0.075

AI Model GPT OSS 20B 128k
Current Speed 1,000 TPS
Input Token Price(Per Million Tokens) $0.075(13.3M / $1)*
Try Now
Model Card

GPT OSS Safeguard 20B

$0.075

AI Model GPT OSS Safeguard 20B
Current Speed 1,000 TPS
Input Token Price(Per Million Tokens) $0.075(13.3M / $1)*
Try Now
Model Card

GPT OSS 120B 128k

$0.15

AI Model GPT OSS 120B 128k
Current Speed 500 TPS
Input Token Price(Per Million Tokens) $0.15(6.67M / $1)*
Try Now
Model Card

Llama 4 Scout (17Bx16E) 128k

$0.11

AI Model Llama 4 Scout (17Bx16E) 128k
Current Speed 594 TPS
Input Token Price(Per Million Tokens) $0.11(9.09M / $1)*
Try Now
Model Card

Qwen3 32B 131k

$0.29

AI Model Qwen3 32B 131k
Current Speed 662 TPS
Input Token Price(Per Million Tokens) $0.29(3.44M / $1)*
Try Now
Model Card

Llama 3.3 70B Versatile 128k

$0.59

AI Model Llama 3.3 70B Versatile 128k
Current Speed 394 TPS
Input Token Price(Per Million Tokens) $0.59(1.69M / $1)*
Try Now
Model Card

Llama 3.1 8B Instant 128k

$0.05

Llama 3.1 8B Instant 128k
Current Speed 840 TPS
Input Token Price(Per Million Tokens) $0.05(20M / $1)*
Try Now
Model Card

Minimax M2.5

Custom

AI Model Minimax M2.5

Qwen3-VL 32B

Custom

AI Model Qwen3-VL 32B

Canopy Labs Orpheus English

$22.00

AI Model Canopy Labs Orpheus English
Characters /s 100
Price $22.00
Try Now
Model Card

Canopy Labs Orpheus Arabic Saudi

$40.00

Canopy Labs Orpheus Arabic Saudi
Characters /s 100
Price $40.00
Try Now
Model Card

Whisper V3 Large

$0.111

AI Model Whisper V3 Large
Speed Factor 217x
Price $0.111*
Try Now
Model Card

Whisper Large v3 Turbo

$0.04

Whisper Large v3 Turbo
Speed Factor 228x
Price $0.04*
Try Now
Model Card

moonshotai/kimi-k2-instruct-0905

$1.00

Model moonshotai/kimi-k2-instruct-0905
Cached Input Tokens (Per M Tokens) $0.50
Output Tokens (Per M Tokens) $3.00

openai/gpt-oss-120b

$0.15

Model openai/gpt-oss-120b
Cached Input Tokens (Per M Tokens) $0.075
Output Tokens (Per M Tokens) $0.60

openai/gpt-oss-20b

$0.075

Model openai/gpt-oss-20b
Cached Input Tokens (Per M Tokens) $0.0375
Output Tokens (Per M Tokens) $0.30

Basic Search

$5 / 1000 requests

Web_search

Advanced Search

$8 / 1000 requests

Visit Website

$1 / 1000 requests

Tool Visit Website

Code Execution

$0.18 / hour

Tool Code Execution

Browser Automation

$0.08 / hour

Price $0.08 / hour

Browser Search - Basic Search

$5 / 1000 requests

Price $5 / 1000 requests

Browser Search - Visit Website

$1 / 1000 requests

Price $1 / 1000 requests
Tool Browser Search - Visit Website

Code Execution - Python

$0.18 / hour

Tool Code Execution - Python

Frequently asked questions

What is Groq?

How much does Groq cost? Is it free?

Groq has 24 paid plans: GPT OSS 20B 128k at $0.075, GPT OSS Safeguard 20B at $0.075, GPT OSS 120B 128k at $0.15.

What is Groq used for? Who is it for?

Groq is used for GroqCloud, OpenAI compatible, and Support for LLMs, STT, TTS, and image-to-text models. It's built for Application developers, Startup teams, and Enterprise AI teams.

Does Groq have an API and what does it integrate with?

GroqCloud is presented as an AI inference platform for developers with APIs available on Free, Developer, and Enterprise plans. It integrates with OpenAI, whisper v3 large, whisper-large-v3-turbo, llama-4-scout, llama-3.1-8b-instant, and 6 more.

Editor's read

Check whether your workload needs regional endpoint selection, private or co-cloud deployment, or LoRA fine-tunes, since those are tied to enterprise plans. If you need those controls, verify the plan and deployment model before building around the shared API path.

Filed under:AI Model Providers byok free

Explore other AI Model Providers

Browse AI Model Providers

Luminal

Open-source ML compiler for faster PyTorch inference with optimized CUDA kernels

AI Model Providers

Luminal is open-source ai inference software that compiles PyTorch models into optimized CUDA kernels for faster inference on any hardware.

Microsoft Phi

Model access, routing, and governance inside Microsoft Foundry.

AI Model Providers

Microsoft Phi gives teams model routing, fine-tuning, and agent orchestration in Foundry. Plans start at $19,000/year.

MiniMax

Multimodal AI for text, voice, agents, and deployable infrastructure.

AI Model Providers

MiniMax turns text, audio, video, and music into outputs with Agent Harness, coding APIs, and private cluster deployment.

Mistral AI

Enterprise AI platform for building, deploying, and operating tailored systems.

AI Model Providers

Mistral AI combines Agent Runtime, observability, and codebase-aware coding with GitHub, Jira, and Snowflake integrations.

OpenAI

AI platform for chat, voice, and API workflows.

AI Model Providers

OpenAI combines ChatGPT, voice, Web search, and Containers. Pricing starts at GPT-5.5 $5.00 / 1M tokens.