# When to pick GPT-5 vs o4
OpenAI maintains two top-tier families in April 2026: GPT-5 (the fast, general-purpose model) and o4 (the reasoning model that thinks before it answers). They are priced differently, scored differently on benchmarks, and suited to different jobs. Confusing them is the single most common mistake we see on OpenAI deployments.
| Model | Input $/MTok | Output $/MTok | P50 latency | Strengths |
|---|---|---|---|---|
| GPT-5 | $5.00 | $20.00 | 2-6s | General chat, tool use, structured output |
| GPT-5 mini | $0.40 | $1.60 | 1-3s | Bulk classification, cheap router tier |
| o4 | $12.00 | $48.00 | 10-40s | Hard math, proofs, deep debugging |
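At these list prices, per-call cost is simple arithmetic. A minimal sketch, assuming illustrative token counts (the prices are the $/MTok figures from the table above):

```python
# Estimate per-call cost from the pricing table above.
PRICES = {               # (input $/MTok, output $/MTok)
    "gpt-5":      (5.00, 20.00),
    "gpt-5-mini": (0.40, 1.60),
    "o4":         (12.00, 48.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the table's list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 2,000-token prompt with a 500-token reply.
print(round(call_cost("gpt-5", 2_000, 500), 4))   # → 0.02
print(round(call_cost("o4", 2_000, 500), 4))      # → 0.048
```

At identical token counts, o4 runs roughly 2.4x the price of GPT-5, and reasoning models also tend to emit more output tokens per answer, so the real-world gap is usually wider.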
## GPT-5: default pick for agents and chat
GPT-5 is the model you pick when you need high-quality responses in the 2-6 second latency band. It's also OpenAI's best on function calling and structured outputs — grammar-constrained decoding means malformed JSON is essentially eliminated. On tau-bench retail it sits at 82.5% tool-use accuracy, ahead of both o4 and the Claude tiers.
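Constrained decoding guarantees syntactically valid JSON, but production code should still validate that the payload has the fields and types the schema promised. A minimal sketch, using a hypothetical order-extraction schema (the field names are illustrative, not part of any OpenAI API):

```python
import json

# Hypothetical required fields for a tool-call result; names are
# illustrative assumptions, not an OpenAI-defined schema.
REQUIRED_FIELDS = {"order_id": str, "quantity": int}

def parse_tool_output(raw: str) -> dict:
    """Parse model output and check required fields and their types."""
    data = json.loads(raw)          # raises ValueError on malformed JSON
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    return data

print(parse_tool_output('{"order_id": "A-17", "quantity": 3}'))
```

Keeping this check cheap and explicit means a schema drift or a provider regression surfaces as a clean exception rather than a silent downstream bug.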
Where GPT-5 is not the pick: tasks where getting the right answer matters more than getting it fast. Hard math, formal proofs, competition code — all are o4 territory.
## o4: reasoning over speed
o4 is what OpenAI calls a "reasoning model" — it generates an extended internal chain of thought before producing a final answer. On AIME 2025 it hits 96% accuracy. On Codeforces problems it matches or beats top human competitors. On hard scientific questions (GPQA Diamond) it leads GPT-5 by about 6 points.
The cost is latency. Responses take 10-40 seconds routinely, occasionally longer on hard prompts. It's also meaningfully more expensive: $12 input / $48 output. The right workloads for o4 are deliberately small: hard technical questions where a human would pay $50-500 to an expert for the answer. At that value per question, o4's unit economics are excellent. At the value of "what's the capital of Peru," they are absurd.
## A tiered OpenAI architecture
Most production OpenAI deployments benefit from a three-tier split:
- GPT-5 mini for routing and easy classification (~$0.0005/call).
- GPT-5 for the main task (~$0.01-0.05/call).
- o4 only when a low-confidence signal or an explicit "think harder" tag fires (~$0.20-1.00/call).
This architecture covers everything from chatbots to hard-reasoning backends at a blended cost 50-70% below "GPT-5 everywhere" or 80-90% below "o4 everywhere."
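The tier above can be sketched as a small routing function. The confidence signal, the `think_harder` tag, and the traffic mix below are assumptions about your pipeline; the per-call costs are rough midpoints of the figures in the bullet list:

```python
# Sketch of the three-tier routing policy described above.
COST = {"gpt-5-mini": 0.0005, "gpt-5": 0.03, "o4": 0.60}  # $/call, assumed midpoints

def route(task: dict) -> str:
    """Pick a model tier for one request."""
    if task.get("think_harder"):           # explicit user escalation
        return "o4"
    if task.get("confidence", 1.0) < 0.5:  # low-confidence signal fires
        return "o4"
    if task.get("kind") == "classify":     # easy bulk work stays on the cheap tier
        return "gpt-5-mini"
    return "gpt-5"                         # default

# Blended cost for an assumed traffic mix:
# 70% easy, 29.5% main task, 0.5% escalated to o4.
blended = 0.70 * COST["gpt-5-mini"] + 0.295 * COST["gpt-5"] + 0.005 * COST["o4"]
print(round(blended, 4))   # → 0.0122, vs 0.03 for "GPT-5 everywhere"
```

Under this (assumed) mix the blended cost lands around 60% below running GPT-5 on every request, consistent with the 50-70% range above; the exact savings depend entirely on how much traffic the cheap tier absorbs.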
## Benchmarks, April 2026
| Benchmark | GPT-5 | o4 |
|---|---|---|
| MMLU-Pro | 87.4 | 88.2 |
| SWE-bench Verified | 68.1 | 74.5 |
| AIME 2025 (math) | 91 | 96 |
| Codeforces Elo | 2150 | 2380 |
| GPQA Diamond (science) | 68 | 74 |
| tau-bench retail (tool use) | 82.5 | 76.0 |
## Migration tip: don't move to o4 because of a demo
The most common mistake: a team sees a hard-problem demo where o4 nails it and GPT-5 misses, and decides to swap every request. Three weeks later their latency SLO is broken, users complain, and spend has doubled. Keep GPT-5 as the default. Route to o4 only on measurable difficulty triggers — low confidence, known-hard domain, or user-initiated "think harder" button.
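One concrete difficulty trigger is token-level confidence: escalate only when the GPT-5 draft's mean log-probability drops below a calibrated threshold. A sketch, assuming your API layer already collects per-token logprobs (the threshold value is illustrative and should be tuned on labeled traffic):

```python
# Escalate to o4 only when the GPT-5 draft looks uncertain.
THRESHOLD = -1.0   # mean token logprob; an assumed value — calibrate on a validation set

def should_escalate(logprobs: list[float]) -> bool:
    """True when the draft is uncertain enough to justify o4's cost."""
    if not logprobs:
        return False
    return sum(logprobs) / len(logprobs) < THRESHOLD

confident = [-0.05, -0.10, -0.02]   # model was sure of its tokens
shaky     = [-2.30, -1.70, -2.90]   # model was guessing
print(should_escalate(confident))   # → False
print(should_escalate(shaky))       # → True
```

A gate like this keeps the o4 share of traffic measurable: you can tune the threshold against your latency SLO and budget instead of swapping models wholesale.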
## Related
- ChatGPT vs Claude vs Gemini — Cross-vendor comparison of the three leading chat models.
- Which AI model? — 6-question recommender for your use case.
- LLM API cost calculator — Plug in tokens and volume for either model.
- Prompt performance tracker — Score prompt versions on pass rate, cost, latency.