
GPU inference cost

Cost of running open-source LLMs on A100, H100, or L4 — per call vs. API pricing.

Results

Monthly GPU cost: $1,302.08
Cost per request: $0.0087
Effective $/1M tokens: $14.47
GPU-hours/day: 17.36
Insight: Self-hosting only beats API pricing at sustained high utilization — at 20% util, APIs are almost always cheaper.
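The calculator's arithmetic can be reproduced in a few lines. Only the GPU-hours/day figure comes from the results panel itself; the hourly rate, request count, and token count below are assumptions back-solved to roughly match the numbers shown:

```python
# Sketch of the calculator's arithmetic. Inputs marked "assumed" are
# illustrative choices, not values stated on this page.
GPU_HOURLY = 2.50              # $/hr, low-end on-demand H100 (assumed)
GPU_HOURS_PER_DAY = 17.36      # from the results panel
DAYS = 30
REQUESTS_PER_MONTH = 150_000   # assumed
TOKENS_PER_MONTH = 90_000_000  # assumed

monthly_cost = GPU_HOURLY * GPU_HOURS_PER_DAY * DAYS
cost_per_request = monthly_cost / REQUESTS_PER_MONTH
cost_per_million_tokens = monthly_cost / (TOKENS_PER_MONTH / 1e6)

print(f"${monthly_cost:,.2f}/mo, ${cost_per_request:.4f}/req, "
      f"${cost_per_million_tokens:.2f}/1M tok")
```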


Frequently asked questions

1. A100 vs. H100 vs. L4?

L4 for small (≤13B) models, A100 for 13–70B, H100 for 70B+ or long-context throughput. H100 is ~2× faster than A100 at ~1.5× the price — usually the right choice for 70B models.

2. Does quantization help?

Yes — INT8 or FP8 quantization roughly doubles throughput with minimal quality loss on most models. AWQ and GPTQ are the common open-source paths.

3. What about cold starts?

Loading a 70B model takes 30–90 seconds. For spiky traffic, keep warm replicas — which erodes the cost advantage of self-hosting.

4. Should I use a serverless GPU service?

Modal, Replicate, Baseten, and Together AI give per-second GPU billing. For under ~4 hours/day of actual compute, they beat renting a dedicated GPU.

5. What about Vercel?

Vercel Fluid Compute runs CPU workloads, not GPU inference. Use the AI Gateway with a provider (OpenAI, Anthropic, Together) for LLM inference and keep your Vercel functions for API orchestration.

When self-hosting an open model actually beats the API bill

The economics of self-hosting Llama 4 70B, Qwen 3 72B, or Mixtral 8x22B vs. paying Anthropic or OpenAI are the most debated cost question in AI infrastructure in 2026. The short answer: self-hosting wins on bulk, predictable, high-throughput workloads with engineering capacity to run them. It loses on spiky traffic, low volume, frontier capability requirements, or teams without dedicated ML infra.

The three questions that determine the answer

Before you open the calculator, three yes/no questions will tell you whether self-hosting is even a live option. First: do you have a platform or infrastructure engineer with GPU-serving experience, or budget to hire one? Second: is your traffic predictable enough that batch-size-16+ is achievable during most hours? Third: is your workload tolerant of the 5–12pp quality gap to frontier APIs? If any answer is no, stop — self-host is not the right call at your current stage. If all three are yes, the calculator and the break-even math below decide whether it is time.
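Those three gates reduce to a tiny decision helper. This is a sketch of the logic in the paragraph above; the function name and return strings are illustrative, not from any real library:

```python
def self_host_verdict(has_gpu_infra_engineer: bool,
                      batch16_achievable: bool,
                      quality_gap_tolerable: bool) -> str:
    """Apply the three yes/no gates: any 'no' means self-hosting
    is off the table at the current stage."""
    if has_gpu_infra_engineer and batch16_achievable and quality_gap_tolerable:
        return "run the break-even math"
    return "stay on API"
```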

2026 GPU pricing reality check

GPU | Hourly (on-demand) | Hourly (reserved 1yr) | Best for
NVIDIA H100 80GB | $2.50–$4.50/hr | $1.80/hr | 70B+ models, high throughput
NVIDIA H200 141GB | $3.80–$6.00/hr | $2.90/hr | Long context, large batch
NVIDIA A100 80GB | $1.30–$2.20/hr | $0.95/hr | 70B quantized, price/perf sweet spot
NVIDIA L40S 48GB | $0.85–$1.50/hr | $0.65/hr | 8B–34B models
NVIDIA L4 24GB | $0.50–$0.80/hr | $0.35/hr | Small models, embeddings
AMD MI300X 192GB | $2.20–$3.50/hr | $1.60/hr | Large-context inference on ROCm
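As a worked example from the table, reserved pricing only pays off if you actually use the hours. The mid-point of the H100 on-demand range is an assumption for illustration:

```python
# H100 figures from the pricing table above.
HOURS_PER_MONTH = 720
on_demand_rate = (2.50 + 4.50) / 2   # $3.50/hr, mid-point of the range
reserved_rate = 1.80                 # $/hr, 1-yr reserved

on_demand_monthly = on_demand_rate * HOURS_PER_MONTH  # running 24/7 on-demand
reserved_monthly = reserved_rate * HOURS_PER_MONTH    # running 24/7 reserved

# Reserved is cheaper only once monthly usage exceeds this many hours
# (about half the month at the mid-point rate):
break_even_hours = reserved_monthly / on_demand_rate
```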

The honest case for staying on API

Most teams that seriously price self-hosting end up staying on API for another year, and this is usually the correct answer. The break-even analysis is sensitive to three numbers: real batched throughput (hard to hit in production), reserved-vs-on-demand GPU pricing (it fluctuates), and the quality tax from running a non-frontier open model (task-specific). A TCO that assumes best-case values for all three will tell you self-hosting wins; one that uses realistic values usually tells you it does not, at least until your API bill has passed $15k/month on stable workloads.

Throughput numbers that matter

Using vLLM + Llama 4 70B on a single H100 at FP8, you can expect roughly 40–70 tokens/sec per concurrent stream and ~1,500–3,000 tokens/sec aggregate throughput at 32-way batching. At $3/hr and 2,000 tok/sec aggregate, cost per million output tokens is $3 / (2,000 × 3,600 / 1e6) ≈ $0.42/M at full utilization, or roughly $1.50/M at a more realistic ~30% utilization — vs. Sonnet 4.5 at $15/M output. That is 10× cheaper raw even at modest utilization, but the advantage shrinks fast as the GPU sits idle.
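A minimal sketch of that per-token arithmetic. The utilization parameter is an addition to the formula in the text, since idle hours still bill:

```python
def cost_per_million_output(gpu_hourly: float, agg_tok_per_sec: float,
                            utilization: float = 1.0) -> float:
    """$ per 1M output tokens: hourly GPU rate divided by the millions
    of tokens actually produced per hour."""
    useful_tokens_per_hour_m = agg_tok_per_sec * 3600 * utilization / 1e6
    return gpu_hourly / useful_tokens_per_hour_m
```

At $3/hr and 2,000 tok/sec aggregate this gives about $0.42/M fully utilized, and about $1.39/M at 30% utilization.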

What utilization looks like in practice

Utilization is the single number that makes or breaks self-hosted economics, and the gap between stated and actual utilization is where most projects lose money. A team quotes "we will run our GPUs at 80% utilization" in the planning doc and actually runs them at 35% after three months in production because traffic is spiky, batching windows are smaller than planned, and model reloads happen more often than expected. Build your break-even math with 40% utilization as the base case, and treat anything higher as a pleasant surprise. Engineers used to CPU-service utilization tend to underestimate how badly GPU utilization is affected by batch dynamics.

Break-even heuristic

For Llama 4 70B-class quality (roughly competitive with GPT-4o/Sonnet for many tasks), the rough break-even vs. Sonnet 4.5 API ($3/$15) is ~40M output tokens/month at 50%+ GPU utilization. Below that volume, API is cheaper after accounting for the engineer-hours of ops work. Above 100M output tokens/month, self-hosting is a clear win for teams with infra muscle.

For frontier capability (Sonnet 4.5 / GPT-5 quality on hard reasoning, agentic tool use, complex coding), no open model is equivalent as of April 2026. Self-hosting trades 5–15pp of benchmark quality for cost savings — sometimes worth it for bulk tasks, rarely worth it for the agent that runs your core product.

The full TCO checklist most people skip

  • GPU compute (the headline number).
  • Storage for model weights (~140GB for 70B FP8; cheap but real).
  • Egress bandwidth (cross-region traffic to your app).
  • Load balancer + autoscaler (EKS/GKE node pool; Ray Serve or vLLM server).
  • Observability (Prometheus + Grafana, plus token-accounting shim).
  • Engineer time: ~1.5 FTE-equivalent for a production self-hosted deployment.
  • On-call rotation for GPU incidents (less frequent but more painful than CPU ones).
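A back-of-envelope sum over that checklist, under loudly hypothetical numbers: only the reserved H100 rate and the 1.5 FTE estimate come from this page; the loaded FTE cost and the minor line items are invented for illustration.

```python
# Hypothetical monthly TCO for one self-hosted 70B deployment (USD).
tco = {
    "gpu_compute": 2 * 1.80 * 720,  # 2x H100 reserved, 24/7
    "weight_storage": 15,           # ~140GB of FP8 weights (assumed rate)
    "egress": 200,                  # assumed
    "lb_autoscaler": 300,           # assumed node-pool overhead
    "observability": 150,           # assumed
    "engineer_time": 1.5 * 18_000,  # 1.5 FTE at an assumed $18k/mo loaded cost
}
total = sum(tco.values())
```

Under these assumptions, engineer time dwarfs every other line item, which is why the checklist exists.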

Modal, Replicate, Together: the middle ground

Platforms like Modal ($0.00004/s for H100 cold-boot billing), Replicate, and Together AI give you open-model inference priced per-token like an API, but at 3–10× lower rates than the frontier labs. For a team that wants Llama 4 pricing without the ops burden, this is usually the right answer. As of April 2026, Together is pricing Llama 4 70B around $0.60/M input and $0.80/M output — a 5–10× saving on bulk extraction and summarization workloads with none of the GPU headaches.

Three real workloads we priced self-host vs API

  • Bulk classification, 20M calls/month, 800 in / 80 out per call: API (Haiku 4) = 16B in × $0.80/M + 1.6B out × $4/M = $12.8k + $6.4k = $19.2k/mo. Self-hosted Llama 4 70B on 4× H100 reserved at $1.80/hr × 720 hr = $5.2k/mo, utilization 50%. Batched throughput handles the load. Self-host wins by 3.7×, but only if the team can run it.
  • General chatbot, 2M calls/month, 2.8k in / 400 out per call: API (cached Sonnet 4.5) = $2k/mo. Self-hosted Llama 4 on 2× H100 reserved = $2.6k/mo at 40% utilization, and quality is noticeably lower on tricky responses. API wins; stop debating.
  • Coding agent, 500k calls/month, 6k in / 900 out per call: Sonnet 4.5 = $1.5k/mo. Self-hosted Qwen 2.5 Coder 32B on 1 H100 reserved = $1.3k/mo at steady throughput. Close, but Sonnet quality on agentic coding is materially better; the right call is API until volume triples.
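The first workload's math, reproduced as a helper (prices and token counts are the ones quoted in the bullet above):

```python
def api_monthly_cost(calls, in_tok, out_tok, in_price_m, out_price_m):
    """Monthly API bill: per-call token counts times per-million-token prices."""
    return calls * (in_tok * in_price_m + out_tok * out_price_m) / 1e6

# Bulk classification: 20M calls/mo, 800 in / 80 out, Haiku 4 at $0.80/$4.00.
bulk_api = api_monthly_cost(20_000_000, 800, 80, 0.80, 4.00)
self_hosted = 4 * 1.80 * 720   # 4x H100 reserved at $1.80/hr, 720 hr/mo
advantage = bulk_api / self_hosted
```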

Throughput-per-dollar at different batch sizes

vLLM throughput scales sub-linearly with batch size. For Llama 4 70B FP8 on an H100:

Batch size | Tokens/sec aggregate | $ per 1M output
1 | ~60 | $13.9
4 | ~220 | $3.8
16 | ~780 | $1.07
32 | ~1,600 | $0.52
64 | ~2,400 | $0.35
128 | ~3,000 | $0.28

The implication: self-hosting only beats API pricing at batch sizes ≥16, which means your traffic must be consistent enough to fill batches. A spiky workload that averages 2 concurrent requests loses money self-hosting. This is why bulk offline jobs are the canonical self-host win.
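The $/1M column in that table is just the hourly rate divided by millions of tokens produced per hour. Recomputing it at $3/hr confirms the table within rounding (the table shows one decimal place for the smallest batches):

```python
def dollars_per_million(gpu_hourly: float, agg_tok_per_sec: float) -> float:
    """Hourly GPU rate divided by millions of tokens produced per hour."""
    return gpu_hourly / (agg_tok_per_sec * 3600 / 1e6)

# Batch size -> aggregate tokens/sec, from the table above.
table = {1: 60, 4: 220, 16: 780, 32: 1600, 64: 2400, 128: 3000}
costs = {b: round(dollars_per_million(3.0, tps), 2) for b, tps in table.items()}
```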

What quality actually costs you

Honest 2026 benchmark deltas: Llama 4 70B is roughly GPT-4o-class on general reasoning (within 3–5pp on MMLU-Pro, AIME, SWE-bench Verified). It trails Sonnet 4.5 and GPT-5 by 5–12pp on agentic tool use and hard coding. Qwen 3 72B is competitive on math and Chinese; DeepSeek V3.1 is strong on code. None of them match frontier capability on open-ended research-style tasks where tail behavior matters.

Translation: for bulk classification, summarization, structured extraction, and simple Q&A, open models are fine. For the agent that runs your core product, they are usually not. Build the architecture so you can use both — open for high-volume cheap work, frontier for the critical path.

Operational realities most teams underestimate

  • Cold starts are brutal. Loading a 70B model takes anywhere from ~30 seconds to several minutes depending on where the weights live. Scale-from-zero is effectively off the table; you must pre-warm.
  • Quantization is not free. FP8 costs 1–3pp of quality; INT4 can cost 5–8pp. Quantize for throughput, but measure quality first.
  • Batching windows are a product decision. To fill batches, you add queuing latency. A 50ms batching window is often the right default; at 200ms users start to feel it.
  • GPU supply is still tight. Reserved-instance availability for H100 and H200 varies weekly. Plan capacity 4–8 weeks in advance.
  • Observability is your own problem. No provider dashboard, no built-in rate limits — you're building the whole telemetry stack.

Frequently asked questions

Is Groq really 10× faster? On small models and steady-state throughput, yes — LPU inference on Llama and similar models hits 400–800 tok/sec per stream. Pricing is competitive with frontier APIs. Quality is the underlying model's quality, so treat it as a speed upgrade, not a capability upgrade.

What about Cerebras? Similar story to Groq: specialized silicon delivering large throughput wins on specific models. Pricing has come down but still commands a premium over commodity GPU hosting.

Does spot pricing help? Sometimes. Spot H100s can be 60–70% of on-demand, but interruptions can be frequent and painful for a serving workload. Best for offline indexing or batch jobs.

Is TPU cheaper than GPU? On Google Cloud with TPU v5 for inference, yes, for Gemma and other TPU-optimized models — roughly 20–40% cheaper than equivalent H100 capacity. Requires model-specific tuning.

Should I worry about GPU supply for a startup? Below ~500 GPU-hours/month, no — on-demand capacity from Together, Modal, or Fireworks covers you. Above that, lock reserved capacity.

How do I know when to self-host? Three gates: (1) monthly API spend >$15k, (2) traffic consistent enough for batch ≥16, (3) you have or will hire an infra engineer. All three — self-host. Two — maybe. One or zero — stay on API.

What about serverless GPU like Modal for burst workloads? Excellent for irregular traffic. Modal bills per-second with <30-second cold starts for 7B models. For sporadic heavy workloads, it often beats both API and reserved GPU.

Does self-hosting help with data privacy? Only if you actually control the hardware. Using a third-party hosted open model (Together, Fireworks) has the same data-handling posture as using a frontier API — you're still sending content to a provider.
