Modal

Modal is a serverless compute platform for AI, data, and batch jobs needing CPUs, GPUs, long runtimes, and easy scaling.

Reviewed by Mathijs Bronsdijk · Updated Apr 18, 2026

Tool · Free + Paid Plans

API Available · Free Tier · $30/mo in free compute credits · SDK: Python, JavaScript, Go · HIPAA · Cloud · $111M Raised

Sub-second cold starts in many cases
Supports A100, H100, H200 GPUs
$87M Series B at $1.1B valuation
Python-first platform with JavaScript/Go in alpha
Fast autoscaling from zero to hundreds of GPUs
Sandboxes for secure code execution
Batch processing for millions of jobs
Integrates with FastAPI and Docker



Modal is a serverless compute platform built for teams that need serious CPU and GPU power without running their own cloud infrastructure. It started with a simple premise: a lot of AI, data, and batch workloads do not fit neatly into older serverless products like AWS Lambda. They run too long, need GPUs, pull large model weights, or spike unpredictably. Modal was built by Erik Bernhardsson and Akshat Bubna to handle that kind of work, while keeping the developer experience closer to writing Python than managing clusters.

The company has become one of the more closely watched names in AI infrastructure. It has raised $111 million in total, including an $87 million Series B at a $1.1 billion valuation. That funding matters because Modal is not a thin wrapper over existing cloud primitives. From what we researched, the company built its own scheduler, container runtime, storage layer, and orchestration system to get faster startup times and better GPU utilization than general-purpose serverless platforms usually offer.

In practice, people use Modal to deploy model inference endpoints, run batch jobs across huge datasets, fine-tune models, launch Jupyter notebooks with GPUs, and execute untrusted code in isolated sandboxes. Its core audience is Python-heavy AI and data teams, though JavaScript and Go SDKs are now in alpha. For our visitors, the simplest way to think about Modal is this: it is what many developers wanted AWS Lambda to be for modern AI workloads.

Key Features

Serverless GPU and CPU compute: Modal lets you run functions on CPUs or NVIDIA GPUs including A100, H100, and H200. That matters because teams can move from small experiments to expensive inference or training jobs without provisioning dedicated machines that sit idle between runs.
Fast cold starts: Modal says it delivers sub-second cold starts in many cases, and GPU-backed functions often start in roughly 2 to 4 seconds. For model APIs and agent backends, that is a meaningful difference from older serverless systems where cold starts can stretch into 10 to 30 seconds and make interactive apps feel broken.
Autoscaling to hundreds of workers: The platform can scale from zero to large fleets of containers and GPUs in seconds. If your traffic is bursty, or your batch pipeline suddenly needs 500 workers, you pay for active compute instead of pre-warming a cluster all day.
Python-first deployment model: Modal’s main interface is a Python SDK built around decorators like @app.function. This matters less as a syntax preference and more as an operational shortcut: developers can turn local Python code into cloud jobs, APIs, and scheduled tasks without building a full Kubernetes workflow around it.
Volumes for model weights and datasets: Modal Volumes act as a distributed file system for ML workloads, with support for files up to 1 TB and directories with millions of files. For teams serving large models, this cuts down the repeated pain of downloading weights into every new container.
Sandboxes for isolated code execution: Sandboxes let you run arbitrary code in isolated environments powered by gVisor. This is especially relevant for agent builders, code-generation tools, and research systems where you need to execute generated code without trusting it.
Batch processing: Modal Batch is designed for processing very large numbers of independent jobs. Instead of wiring up your own queueing and worker orchestration, teams can fan out work across many containers with a few lines of code.
Long-running jobs: Unlike AWS Lambda’s 15-minute ceiling, Modal supports jobs that run for hours, with per-call timeouts configurable up to 24 hours. That opens the door to training, fine-tuning, large ETL jobs, and long media processing tasks that simply do not fit traditional serverless limits.
Web endpoints and streaming: You can expose Modal functions as web endpoints and support streaming responses. For LLM apps, that means users can see tokens or partial results as they are generated, which often matters more than shaving a few milliseconds off backend time.
Jupyter notebooks on demand: Modal can launch GPU-backed Jupyter notebooks quickly for research and prototyping. Teams that do not want to maintain notebook infrastructure get a faster path from idea to experiment.
Secrets and enterprise controls: Modal includes secret management and offers HIPAA support with a BAA on Enterprise plans. That is important for teams handling regulated data or connecting production systems to external APIs and databases.
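The scale-from-zero behavior described under autoscaling comes down to simple arithmetic. The sketch below is a hypothetical model, not Modal's actual autoscaler: it assumes each container handles a fixed number of inputs at once (`per_container`), that the fleet is capped by a hypothetical `max_containers` limit, and that the target drops to zero when nothing is queued.

```python
import math


def target_containers(queued_inputs: int, per_container: int, max_containers: int) -> int:
    """Hypothetical scaling rule: enough containers to drain the queue, capped, zero when idle."""
    if per_container <= 0 or max_containers < 0:
        raise ValueError("per_container must be positive and max_containers non-negative")
    if queued_inputs <= 0:
        return 0  # scale to zero: nothing queued, so no containers are kept running
    # Round up so a partial batch still gets a container, then apply the cap.
    return min(math.ceil(queued_inputs / per_container), max_containers)


if __name__ == "__main__":
    # A quiet period, a small burst, and a spike large enough to hit the cap.
    for queued in (0, 7, 5_000):
        print(queued, "->", target_containers(queued, per_container=4, max_containers=500))
```

The point of the model is the first branch: an idle queue means no running containers, which is why bursty workloads pay only for the spikes.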

Use Cases

One of the clearest Modal stories is high-throughput inference. Modal published an example with Tokasaurus, an inference framework for LLM serving, that reached more than 80,000 tokens per second on its infrastructure. That number matters less as a benchmark trophy and more because it shows where Modal is strongest: teams serving open models at scale, where startup time, GPU scheduling, and weight loading all affect real costs.

Another recurring use case is fine-tuning and training. Modal has documented workflows for LoRA fine-tuning and DeepSpeed-based distributed training across multiple GPUs. For smaller teams, this is often the difference between shipping a domain-tuned model and abandoning the idea because nobody wants to spend a week setting up distributed infrastructure. Modal’s value here is not that it invented training; it is that it removes a lot of the plumbing that normally surrounds it.

There is also a strong story around code execution and agent systems. Modal Sandboxes give developers a way to run generated or user-submitted code in isolated environments with stricter boundaries than a normal container setup. If you are building coding agents, notebook-style assistants, or evaluation harnesses that need to execute arbitrary Python or shell commands, this is one of Modal’s more distinctive capabilities. We see this as one of the platform’s most relevant features for AI agent builders, because safe execution is usually where these systems stop being demos and start becoming infrastructure problems.

Modal also shows up in media and data pipelines. Teams use it for Whisper transcription, image generation, video processing, and large-scale ETL-style jobs where each task is independent and parallelizable. This is less flashy than LLM serving, but often a better fit. A company processing millions of audio or image files does not need an always-on cluster. It needs a platform that can wake up, fan out work, and disappear when the queue is empty.
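The "wake up, fan out, disappear" shape is a map over independent tasks. The stdlib sketch below shows that shape on a single machine with `concurrent.futures`; on Modal, the comparable step is calling `.map()` on a Modal function so inputs are spread across containers. `transcribe` is a hypothetical stand-in for real work such as a Whisper call.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe(path: str) -> str:
    """Hypothetical per-file task; a real pipeline would run a model here."""
    return f"transcript of {path}"


def fan_out(paths: list[str], workers: int = 8) -> list[str]:
    # Each input is independent, so tasks can run in any order;
    # Executor.map still returns results in input order.
    # Leaving the with-block shuts the pool down, the local analogue of scaling back to zero.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcribe, paths))


if __name__ == "__main__":
    print(fan_out([f"clip_{i}.wav" for i in range(3)]))
```

Because no task depends on another, the same function works whether the pool has 8 local threads or hundreds of remote containers; only the executor changes.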

Strengths and Weaknesses

Strengths:

Modal feels purpose-built for AI workloads in a way most serverless tools do not. AWS Lambda is mature and widely adopted, but it does not offer native GPU support and caps execution at 15 minutes. Modal was designed for the opposite problem set: long jobs, heavy models, and bursty demand. If your workload looks like modern inference or batch ML, that difference is not subtle.

The developer experience is one of the platform’s biggest advantages. A lot of users do not want another infrastructure product that promises flexibility but quietly hands them a cluster to manage. Modal’s Python-first model reduces the distance between experiment and deployment. Compared with RunPod or raw cloud VMs, you give up some low-level control, but you also skip a lot of DevOps work.

Its scaling model is well matched to unpredictable usage. Teams with occasional GPU-heavy jobs can avoid paying for idle hardware, and that can change the economics of prototyping. Modal also avoids some of the billing confusion people run into on larger clouds, where storage, networking, and ancillary services can quietly dominate the compute bill.

Sandboxes are a real differentiator. Plenty of platforms can host an API. Fewer give you a practical way to execute arbitrary code safely enough for production use. For agent products and coding tools, that feature is not a side note; it can be the deciding factor.

Weaknesses:

Modal is still most comfortable for Python teams. JavaScript and Go support exists in alpha, which is promising, but the center of gravity is still Python. If your team builds everything in Rust, Java, or TypeScript and wants first-class tooling throughout, Modal may feel like a platform you adapt to rather than one that fits naturally.

It is not always the cheapest option. For bursty workloads, pay-per-use pricing is great. For steady, always-on GPU usage, dedicated instances or reserved cloud capacity can come out cheaper. RunPod in particular is often mentioned as a lower-cost path if your team is willing to do more infrastructure work.

Cold starts are good by serverless standards, but they are still cold starts. If your application needs ultra-tight latency every time, a persistent service on dedicated infrastructure may still be the better answer. Modal has improved this with memory snapshots and endpoint optimizations, but physics and container startup still exist.
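Whether cold starts matter for a given app can be estimated by blending warm and cold requests. The sketch uses the 2 to 4 second GPU cold-start range cited earlier; the warm latency and the share of requests that land on a cold container are hypothetical inputs you would measure on your own traffic.

```python
def expected_latency(warm_s: float, cold_start_s: float, cold_fraction: float) -> float:
    """Mean latency when a fraction of requests also wait for a container to start."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    return warm_s + cold_fraction * cold_start_s


def worst_case_latency(warm_s: float, cold_start_s: float) -> float:
    """What an unlucky request sees when it lands on a cold container."""
    return warm_s + cold_start_s


if __name__ == "__main__":
    warm, cold = 0.5, 3.0  # hypothetical warm latency; cold start in the 2 to 4 s range
    for share in (0.01, 0.2):
        print(f"{share:.0%} cold -> mean {expected_latency(warm, cold, share):.2f} s")
    print(f"worst case -> {worst_case_latency(warm, cold):.2f} s")
```

The mean can look fine while the worst case does not: every cold request still waits the full warm plus cold time, which is why strict per-request latency targets push toward persistent infrastructure.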

The ecosystem is smaller than AWS or GCP. That shows up in community size, enterprise familiarity, and observability depth. You can integrate third-party tools like Datadog or LangSmith, but you do not get the same built-in monitoring universe that comes with a major cloud provider.

Pricing

Starter: $0/month, includes $30 in monthly free compute credits
Usage-based compute: Pay per second for CPU, memory, and GPU usage
Enterprise: Custom pricing

Modal’s pricing is fundamentally usage-based. You are billed for the compute you actually consume, down to fractional core-seconds and memory usage, and the company says it does not charge for idle time or scaling overhead. For early-stage teams, the free Starter credit is enough to test deployments, run small jobs, and understand the product without talking to sales.

What users actually spend depends heavily on workload shape. If you are running occasional batch jobs or traffic that comes in bursts, Modal can be much cheaper than keeping GPU instances warm all month. If you are serving a model 24/7 with predictable demand, the math changes. In that case, alternatives like reserved cloud instances or lower-level GPU providers can beat Modal on raw cost.
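The workload-shape argument reduces to utilization. The sketch below compares usage-based billing, where only busy hours are charged, with a dedicated GPU billed for every hour of the month. Both hourly rates are hypothetical placeholders, not Modal's or any provider's published prices; substitute current figures from the relevant pricing pages.

```python
HOURS_PER_MONTH = 730  # common approximation: 24 * 365 / 12, rounded


def serverless_cost(busy_hours: float, rate_per_hour: float) -> float:
    """Usage-based billing: only the hours a GPU is actually busy are charged."""
    return busy_hours * rate_per_hour


def dedicated_cost(rate_per_hour: float, hours: float = HOURS_PER_MONTH) -> float:
    """Always-on billing: every hour of the month is charged, busy or idle."""
    return hours * rate_per_hour


def break_even_utilization(serverless_rate: float, dedicated_rate: float) -> float:
    """Share of the month a GPU must be busy before always-on becomes cheaper."""
    if serverless_rate <= 0:
        raise ValueError("serverless_rate must be positive")
    return dedicated_rate / serverless_rate


if __name__ == "__main__":
    # Hypothetical rates: usage-based GPU time priced higher per hour than a reserved GPU.
    usage_rate, reserved_rate = 4.00, 2.50
    print(serverless_cost(60, usage_rate))  # bursty month, 60 busy hours -> 240.0
    print(dedicated_cost(reserved_rate))  # always-on month -> 1825.0
    print(break_even_utilization(usage_rate, reserved_rate))  # -> 0.625
```

At these placeholder rates, the GPU would need to be busy about 62.5% of the month (roughly 456 hours) before the always-on option wins; below that, paying per second is cheaper.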

The main gotcha is not hidden fees so much as underestimating how expensive GPUs are in any environment. Modal makes access easier, not magically cheap. H100 and H200 class workloads can add up quickly if you leave endpoints active or run large experiments repeatedly. The good news is that the billing model is easier to reason about than many cloud stacks where networking and supporting services muddy the picture.

One useful detail from our research: startups and researchers may be eligible for up to $25,000 in free compute credits. If you qualify, that can materially change the early cost story.

Alternatives

AWS Lambda

AWS Lambda is the obvious reference point because it defined serverless for many developers. It is strong for event-driven apps, internal automation, and simple APIs, especially if your company already lives inside AWS. But it is a poor match for many AI workloads because there is no native GPU support and functions time out after 15 minutes. Teams choose Lambda when they want enterprise familiarity and deep AWS integration. They choose Modal when they need GPUs, longer jobs, and a platform that treats ML workloads as normal rather than edge cases.

Google Cloud Functions and Cloud Run

Google’s serverless products are attractive if you are already on GCP, especially for teams that want to stay close to BigQuery, Vertex AI, and the rest of that stack. Cloud Run in particular gives more flexibility than classic functions. But Modal still has a clearer identity around AI-native compute, especially around GPU workflows and Python ergonomics. If your team wants one cloud vendor for everything, GCP may win. If you want a platform optimized around model serving and batch ML, Modal is usually easier to justify.

RunPod

RunPod is one of the closest alternatives for GPU-heavy teams. It is often cheaper, and many users like it for direct access to GPU infrastructure without too much ceremony. The trade-off is that you generally do more setup and more ops work yourself. Teams choose RunPod when cost is the top priority and they are comfortable managing more of the stack. They choose Modal when developer speed and abstraction are worth paying for.

Replicate

Replicate is a good fit for teams that want to run models quickly without building much infrastructure at all. It is particularly friendly for product teams consuming model APIs rather than building deployment systems. Compared with Modal, it offers less flexibility and less control over custom workloads. If you mostly want to plug models into an app, Replicate can be faster. If you want to build your own inference services, pipelines, or training jobs, Modal gives you more room.

Kubernetes on AWS, GCP, or Azure

Self-managed or semi-managed Kubernetes remains the choice for teams that need maximum control. You can tune everything, reserve capacity, and fit the platform to your exact architecture. The cost is complexity: you need people who can operate it well. Modal is the alternative for teams that want many of the same outcomes (scaling, custom containers, GPU jobs, APIs) but do not want to become a platform engineering shop along the way.

FAQ

What is Modal used for?

Modal is used to run AI inference, batch jobs, training workloads, data pipelines, notebooks, and isolated code execution. Most teams adopt it when they need cloud compute that scales quickly without managing servers.

Is Modal good for AI agents?

Yes, especially if your agents need to run code, call GPU-backed models, or scale up and down with demand. Its Sandbox feature is particularly relevant for coding agents and systems that execute generated code.

How do I get started?

You install the Python package, authenticate once, and deploy a simple function from your local codebase. Modal’s setup is lighter than most cloud platforms because you do not need to provision infrastructure first.

How long does it take to set up?

For a basic Python function or API, many developers can get something running in under an hour. More complex setups, like custom containers or multi-GPU training, take longer but still avoid a lot of the usual cloud setup work.

Does Modal support GPUs?

Yes. Modal supports NVIDIA GPUs including A100, H100, and H200. That is one of its main reasons for existing, since traditional serverless platforms usually do not support GPUs well or at all.

Does Modal only work with Python?

Python is the main experience today. JavaScript and Go SDKs are in alpha, but if your team wants the smoothest path, Python is still the center of the platform.

Is Modal cheaper than AWS?

Sometimes. For bursty or intermittent workloads, it can be cheaper because you only pay for active compute. For steady, always-on workloads, especially on GPUs, reserved infrastructure on AWS or another provider may cost less.

Can I run long jobs on Modal?

Yes. This is one of the big differences from AWS Lambda. Modal supports jobs that run for hours, with per-call timeouts configurable up to 24 hours, which is important for training, fine-tuning, and large media or data processing tasks.

Does Modal have cold starts?

Yes, but they are faster than what many developers expect from serverless. Our research found GPU-backed functions often start in around 2 to 4 seconds, and the platform uses techniques like memory snapshots to reduce startup overhead.

Can I deploy APIs on Modal?

Yes. Modal supports web endpoints and streaming responses, so you can expose functions as APIs and return output incrementally for LLM or media workloads.

Is Modal secure enough for production?

For many teams, yes. It includes secrets management, isolated execution, and enterprise options including HIPAA support with a BAA. As always, the right answer depends on your compliance requirements and architecture.

When should I not use Modal?

If you need maximum control over infrastructure, if your workloads run continuously and predictably, or if your stack is far from Python, another option may fit better. Modal is strongest when convenience, elasticity, and AI-focused compute matter more than owning every layer.

Categories:

Agent Hosting

Tags:

api batch-processing hipaa-compliant nvidia python sdk serverless-deployment

Similar to Modal


Fly.io

Deploy apps to global servers in minutes, not days.

Agent Hosting

Fly.io lets developers deploy and host applications worldwide with minimal setup. Explore features, Fly.io pricing, and how it compares to alternatives.

HuggingFace Spaces

Host and share interactive AI demos with HuggingFace Spaces

Agent Hosting

HuggingFace Spaces lets developers host interactive AI/ML demos online with free CPU or paid GPU support for fast testing and sharing.

LangGraph Platform

Deploy and orchestrate stateful AI agents for production

Agent Hosting

LangGraph Platform helps teams build, deploy, and run stateful AI agents reliably in production with flexible model support.

Northflank

Run production workloads without becoming a Kubernetes expert

Agent Hosting

Northflank lets teams deploy services, jobs, databases, cron tasks, and AI workloads via UI, API, CLI, or Git—without managing Kubernetes.

Railway

Deploy apps, databases, and workers without infrastructure headaches

Agent Hosting

Railway is a developer-first cloud platform for deploying apps, databases, workers, and internal services fast.
