How to swap LLMs without breaking production
Model swaps break things. Not every time — but often enough that you need a migration process, not a prayer. This planner is the process: 4 weeks, 12 items, 3 phases. Used as written, it takes a model swap from "we launched and hope" to "we shadowed, evaluated, canaried, and kept rollback capability the whole way."
Why migrations fail
- No baseline — you don't know if the new model is better or worse than the old one.
- No eval set — "it looks the same in testing" is a vibe, not a measurement.
- Hard cutover — when users hit a regression, you can't roll back fast enough.
- No shadow period — you never saw the new model under real traffic before shipping.
The 4-week plan
Week 1: scope + baseline
- Inventory: which workloads use the old model today and at what volume.
- Baseline: pass rate on a golden eval set, cost, and P95 latency per workload.
- Quality gate: non-regression threshold per workload (e.g. pass rate within 2 points).
- Go/no-go criteria: document what would trigger a rollback.
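The quality gate can be as simple as a per-workload lookup plus a tolerance check. A minimal sketch — the workload names, metric values, and 2-point tolerance below are illustrative, not from any specific tool:

```python
# Hypothetical baselines captured in week 1; values are illustrative.
BASELINES = {
    "summarize": {"pass_rate": 0.92, "p95_latency_ms": 1800},
    "classify":  {"pass_rate": 0.97, "p95_latency_ms": 400},
}

PASS_RATE_TOLERANCE = 0.02  # "pass rate within 2 points" non-regression threshold

def quality_gate(workload: str, new_pass_rate: float) -> bool:
    """Return True if the new model is within tolerance of the old baseline."""
    baseline = BASELINES[workload]["pass_rate"]
    return new_pass_rate >= baseline - PASS_RATE_TOLERANCE

print(quality_gate("summarize", 0.91))  # within 2 points of 0.92 -> passes
print(quality_gate("classify", 0.93))   # 4 points below 0.97 -> fails
```

Documenting the gate as code, not just prose, makes the week-4 go/no-go decision mechanical instead of a debate.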
Weeks 2-3: shadow eval
- Deploy the new model in shadow mode: mirror 100% of traffic to both models, but only the old one serves users.
- Score both against the golden eval set nightly.
- Diff outputs on 200 real requests with human review.
- Red-team the new model for injection, jailbreak, and output leak.
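Shadow mode reduces to one invariant: the new model's call can fail, time out, or produce garbage without ever touching the served response. A sketch of that request path, assuming stubbed model calls (`call_old_model` / `call_new_model` are placeholders for your real endpoints):

```python
import concurrent.futures

# Stub model calls; in production these hit your actual model endpoints.
def call_old_model(prompt: str) -> str:
    return f"old:{prompt}"

def call_new_model(prompt: str) -> str:
    return f"new:{prompt}"

shadow_log: list[dict] = []  # diffed nightly against the golden eval set

def handle_request(prompt: str) -> str:
    """Serve the old model; run the new one in shadow and log both outputs."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        old_future = pool.submit(call_old_model, prompt)
        new_future = pool.submit(call_new_model, prompt)
        response = old_future.result()  # only the old model serves traffic
        try:
            shadow_log.append({"prompt": prompt, "old": response,
                               "new": new_future.result()})
        except Exception as exc:
            # A shadow-side failure must never affect the served response.
            shadow_log.append({"prompt": prompt, "old": response,
                               "new": None, "error": str(exc)})
    return response

print(handle_request("hello"))  # old:hello — users still get the old model
```

The logged pairs are exactly the inputs you need for the 200-request human diff review.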
Week 4: cutover
- Canary 1% for 48 hours. If metrics stable, promote.
- Promote to 10% → 50% → 100% with 24-hour pauses.
- Rollback plan rehearsed and wired (one-click feature flag).
- Old model stays deployable for 30 days after 100% rollout.
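The canary-and-kill-switch combination can be sketched as deterministic per-user bucketing behind a flag. This is a hypothetical illustration — the parameter names are made up, and a real deployment would read the percentage and kill switch from your feature-flag service rather than function arguments:

```python
import hashlib

def use_new_model(user_id: str, rollout_percent: int,
                  kill_switch: bool = False) -> bool:
    """Deterministic bucketing: a given user always sees the same model,
    so sessions don't flip mid-conversation as the percentage ramps up."""
    if kill_switch:  # one-click rollback: everyone back on the old model
        return False
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# Canary at 1%, then promote: 10 -> 50 -> 100 with 24-hour pauses.
print(use_new_model("user-42", rollout_percent=100))  # True at full rollout
print(use_new_model("user-42", rollout_percent=100, kill_switch=True))  # False
```

Hash-based bucketing also means promoting from 1% to 10% only adds users — no one who already saw the new model gets moved back unless you flip the kill switch.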
Common migration scenarios
| From | To | Typical risk | Typical saving |
|---|---|---|---|
| Opus 4.1 | Opus 4.7 | Low — same family, newer version | Similar cost, ~10-15% quality lift |
| Opus 4.x | Sonnet 4.5 | Medium — quality regression possible | 80% cost cut |
| GPT-4o | GPT-5 | Low — OpenAI minimizes API breaks | Similar cost, quality lift |
| GPT-5 | Sonnet 4.5 | High — cross-vendor, different tool-use behavior | 40-60% cost cut |
| Sonnet 4.5 | Haiku 4 (routing) | Medium — needs confidence gate | 60-85% cost cut |
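The confidence gate in the last row can be sketched as an escalation router: the cheap model answers first and self-reports confidence, and low-confidence requests escalate to the stronger model. Everything here is illustrative — the threshold, the stubbed models, and the idea that your cheap model returns a usable confidence score are all assumptions you'd need to validate:

```python
CONFIDENCE_THRESHOLD = 0.8  # tuned on the golden eval set, not guessed

def cheap_model(prompt: str) -> tuple[str, float]:
    # Stub standing in for the cheap model; returns (answer, confidence).
    return ("short answer", 0.65 if "tricky" in prompt else 0.95)

def strong_model(prompt: str) -> str:
    # Stub standing in for the expensive fallback model.
    return "thorough answer"

def route(prompt: str) -> tuple[str, str]:
    """Return (model_used, answer); escalate when confidence is low."""
    answer, confidence = cheap_model(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("cheap", answer)
    return ("strong", strong_model(prompt))

print(route("easy question"))   # served by the cheap model
print(route("tricky question")) # escalated to the strong model
```

The realized saving depends entirely on what fraction of traffic clears the gate, which is exactly what the shadow period measures.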
What the planner gives you
The interactive plan above tracks each item per phase and produces a downloadable markdown plan you can drop into your project tracker. Tick items as you complete them; the progress bar updates and the final export is a timestamped checklist of what was done. Pair it with these companion tools:
- Prompt Performance Tracker — A/B old vs new on pass rate + cost + latency.
- AI Spend Tracker — Quantify the cost saving post-migration.
- Which AI model? — Pick the migration target.
- Enterprise AI Security Checklist — Red-team the new model before cutover.