LLM migration planner

Plan a safe migration from one LLM to another — eval set, shadow traffic, rollback, deprecation.

Frequently asked questions

1. How long should a migration take?

4 weeks is the minimum for a production-grade workload. Week 1 baseline, weeks 2-3 shadow eval, week 4 cutover with canary. Shorter timelines skip the shadow period and land in pain.

2. Can I skip the shadow phase if both models are on the same vendor?

No. Same-vendor model updates (e.g. Opus 4.1 → 4.7) still have behavior changes. Shadow eval catches regressions before users do.

3. What's a good quality gate?

Pass rate within 2 percentage points of baseline, P95 latency within 20%, no Sev1 incidents during canary. Tighten for user-facing surfaces.

4. How long should I keep the old model deployable after cutover?

30 days minimum. Rolling back fast is cheaper than debugging in production.

5. When should I NOT migrate?

If sticker price is your only reason and quality hasn't been measured. A 30% cost cut bought with an 8-point pass-rate drop is usually net negative once you count the cost of errors.

How to swap LLMs without breaking production

Model swaps break things. Not every time, but often enough that you need a migration process, not a prayer. This planner is the process: 4 weeks, 12 items, 3 phases. Used as written, it takes a model swap from "we launched and hope" to "we shadowed, evaluated, canaried, and kept rollback capability in reserve."

Why migrations fail

  • No baseline — you don't know if the new model is better or worse than the old one.
  • No eval set — "it looks the same in testing" is a vibe, not a measurement.
  • Hard cutover — when users hit regression, you can't roll back fast enough.
  • No shadow period — you ship without ever having seen the new model under real traffic.

The 4-week plan

Week 1: scope + baseline

  • Inventory: which workloads use the old model today and at what volume.
  • Baseline: pass rate, cost, and P95 latency per workload.
  • Quality gate: non-regression threshold per workload (e.g. pass rate within 2 points).
  • Go/no-go criteria: document what would trigger a rollback.
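The baseline and gate from Week 1 can be sketched as a small record per workload. This is a minimal illustration, not the planner's actual data model; the field names and default thresholds (2 percentage points of pass rate, 20% P95 growth, per the quality-gate FAQ above) are assumptions you would tune per workload.

```python
from dataclasses import dataclass

@dataclass
class WorkloadBaseline:
    """Week 1 snapshot for one workload, plus its non-regression gate."""
    workload: str
    pass_rate_pct: float            # golden-set pass rate of the old model
    cost_per_1k_requests: float     # blended cost at current volume
    p95_latency_ms: float
    gate_pass_rate_delta: float = 2.0   # allowed pass-rate drop, in points
    gate_p95_multiplier: float = 1.20   # allowed P95 latency growth

    def go(self, new_pass_rate_pct: float, new_p95_ms: float) -> bool:
        """Go/no-go: False means the documented rollback trigger fires."""
        return (new_pass_rate_pct >= self.pass_rate_pct - self.gate_pass_rate_delta
                and new_p95_ms <= self.p95_latency_ms * self.gate_p95_multiplier)
```

Writing the gate down as code (or config) in Week 1 is the point: the cutover decision in Week 4 becomes a lookup, not a debate.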

Weeks 2-3: shadow eval

  • Deploy the new model in shadow mode: 100% of traffic, both models run, only the old one serves.
  • Score both against the golden eval set nightly.
  • Diff outputs on 200 real requests with human review.
  • Red-team the new model for injection, jailbreak, and output leak.
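The shadow-mode item above boils down to one invariant: both models run on every request, users only ever see the old model's output, and a shadow failure must never break serving. A minimal sketch, assuming the two models are plain callables and `log` is any list-like sink you later score from:

```python
import concurrent.futures

def shadow_call(old_model, new_model, request, log):
    """Run both models on the same request; serve the old, record the new."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        old_future = pool.submit(old_model, request)
        new_future = pool.submit(new_model, request)
        served = old_future.result()  # users only ever see this
        try:
            # A shadow timeout or error is logged, never surfaced.
            shadow = new_future.result(timeout=10)
        except Exception as exc:
            shadow = f"<shadow error: {exc}>"
    log.append({"request": request, "served": served, "shadow": shadow})
    return served
```

The logged pairs feed the nightly golden-set scoring and the 200-request human diff.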

Week 4: cutover

  • Canary 1% for 48 hours. If metrics stable, promote.
  • Promote to 10% → 50% → 100% with 24-hour pauses.
  • Rollback plan rehearsed and wired (one-click feature flag).
  • Old model stays deployable for 30 days after 100% rollout.
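The cutover ladder above is simple enough to encode directly. A sketch of the promotion logic, with the ramp steps from this plan hard-coded as an assumption; any unstable reading drops straight back to 0%, i.e. the one-click feature flag flips and the old model takes all traffic again:

```python
# (% of traffic on the new model, hold time in hours) per this plan
RAMP = [(1, 48), (10, 24), (50, 24), (100, 0)]

def next_step(current_pct: int, metrics_stable: bool) -> int:
    """Return the next traffic split for the new model."""
    if not metrics_stable:
        return 0  # rollback: old model serves 100% immediately
    for pct, _hold in RAMP:
        if pct > current_pct:
            return pct
    return 100  # already fully rolled out
```

Rehearse the `metrics_stable == False` path before Week 4; a rollback mechanism that has never fired is a rollback mechanism you don't have.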

Common migration scenarios

| From | To | Typical risk | Typical saving |
| --- | --- | --- | --- |
| Opus 4.1 | Opus 4.7 | Low — same family, newer version | Similar cost, ~10-15% quality lift |
| Opus 4.x | Sonnet 4.5 | Medium — quality regression possible | 80% cost cut |
| GPT-4o | GPT-5 | Low — OpenAI minimizes API breaks | Similar cost, quality lift |
| GPT-5 | Sonnet 4.5 | High — cross-vendor, different tool-use behavior | 40-60% cost cut |
| Sonnet 4.5 | Haiku 4 (routing) | Medium — needs confidence gate | 60-85% cost cut |
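The last row's "confidence gate" deserves a sketch. The idea: try the cheap model first and escalate to the strong model only when a confidence score falls below a threshold. The scorer here is a hypothetical callable, not a real API; in practice it might be a small classifier, a logprob heuristic, or a rubric check, and the 0.8 threshold is an assumption to tune against your eval set.

```python
def route(request, cheap_model, strong_model, confidence, threshold=0.8):
    """Confidence-gated routing: cheap first, escalate on low confidence.

    `confidence(request, answer)` returns a score in [0, 1]; both models
    are plain callables. Returns (answer, tier) so you can track the mix.
    """
    answer = cheap_model(request)
    if confidence(request, answer) >= threshold:
        return answer, "cheap"
    return strong_model(request), "strong"
```

The cost saving then depends on the escalation rate, which is exactly what the shadow period measures before you commit.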

What the planner gives you

The interactive plan above tracks each item per phase and produces a downloadable markdown plan you can drop into your project tracker. Tick items as you complete them; the progress bar updates and the final export is a timestamped checklist of what was done.
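For readers building their own tracker rather than using the tool, the export described above is easy to reproduce. A minimal sketch, assuming items are `(phase, done, text)` tuples; the function name and layout are illustrative, not the planner's actual format:

```python
from datetime import datetime, timezone

def export_plan(items):
    """Render checklist items as a timestamped markdown document."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    lines = [f"# LLM migration plan ({stamp})", ""]
    for phase, done, text in items:
        box = "x" if done else " "  # GitHub-style task-list checkbox
        lines.append(f"- [{box}] **{phase}**: {text}")
    return "\n".join(lines)
```

The output pastes cleanly into any tracker that renders markdown task lists.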
