AI Economy Hub

LLM migration planner

Plan a safe migration from one LLM to another — eval set, shadow traffic, rollback, deprecation.

Frequently asked questions

1. How long should a migration take?

4 weeks is the minimum for a production-grade workload. Week 1 baseline, weeks 2-3 shadow eval, week 4 cutover with canary. Shorter timelines skip the shadow period and land in pain.

2. Can I skip the shadow phase if both models are on the same vendor?

No. Same-vendor model updates (e.g. Opus 4.1 → 4.7) still have behavior changes. Shadow eval catches regressions before users do.

3. What's a good quality gate?

Pass rate within 2 percentage points of baseline, P95 latency within 20%, no Sev1 incidents during canary. Tighten for user-facing surfaces.
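As a minimal sketch, the gate above can be written as a single check. The `Metrics` class and `check_gate` function are hypothetical names chosen for illustration, and "within 2 points" is read here as "no more than 2 points below baseline", so an improvement always passes.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    pass_rate_pct: float     # eval pass rate, 0-100
    p95_latency_ms: float    # 95th-percentile latency
    sev1_incidents: int = 0  # Sev1 incidents observed during canary


def check_gate(baseline, candidate, max_drop_pts=2.0, max_latency_increase=0.20):
    """Return (passed, reasons) for the candidate model against the baseline."""
    reasons = []

    # Pass rate may not fall more than max_drop_pts percentage points.
    drop = baseline.pass_rate_pct - candidate.pass_rate_pct
    if drop > max_drop_pts:
        reasons.append(f"pass rate down {drop:.1f} points (limit {max_drop_pts})")

    # P95 latency may not exceed baseline by more than the allowed fraction.
    latency_limit = baseline.p95_latency_ms * (1 + max_latency_increase)
    if candidate.p95_latency_ms > latency_limit:
        reasons.append(
            f"P95 latency {candidate.p95_latency_ms:.0f} ms exceeds {latency_limit:.0f} ms"
        )

    # Any Sev1 incident during canary fails the gate.
    if candidate.sev1_incidents > 0:
        reasons.append(f"{candidate.sev1_incidents} Sev1 incident(s) during canary")

    return (not reasons, reasons)
```

For a user-facing surface you would pass tighter limits, for example `check_gate(baseline, candidate, max_drop_pts=1.0, max_latency_increase=0.10)`.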

4. How long should I keep the old model deployable after cutover?

30 days minimum. Rolling back fast is cheaper than debugging.

5. When should I NOT migrate?

If sticker price is your only reason and quality hasn't been measured. A 30% cost cut at the expense of an 8-point pass-rate drop is usually net negative once you count error cost.
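To see why, it helps to put a price on failures. The sketch below is illustrative arithmetic only: the request volume, per-request prices, and the $0.50 cost per failed request are assumed numbers, not measurements, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(requests, cost_per_request, pass_rate_pct, cost_per_failure):
    """Model spend plus the downstream cost of failed requests."""
    failures = requests * (100 - pass_rate_pct) / 100
    return requests * cost_per_request + failures * cost_per_failure


# Hypothetical workload: 100,000 requests/month, and each failure costs $0.50
# in retries, support time, or lost conversions.
old = monthly_cost(100_000, 0.0100, pass_rate_pct=92, cost_per_failure=0.50)
new = monthly_cost(100_000, 0.0070, pass_rate_pct=84, cost_per_failure=0.50)  # 30% cheaper

print(f"old model: ${old:,.0f}/month")
print(f"new model: ${new:,.0f}/month")
```

Under these assumptions the cheaper model saves $300 a month in model spend but adds 8,000 failures, so the total rises from about $5,000 to about $8,700. Plug in your own failure cost before deciding.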

How to swap LLMs without breaking production

Model swaps break things. Not every time, but often enough that you need a migration process, not a prayer. This planner is the process: 4 weeks, 12 items, 3 phases. Used as written, it takes a model swap from "we launched and hoped" to "we shadowed, evaluated, canaried, and kept the ability to roll back."

Why migrations fail

  • No baseline: you don't know if the new model is better or worse than the old one.
  • No eval set: "it looks the same in testing" is a vibe, not a measurement.
  • Hard cutover: when users hit a regression, you can't roll back fast enough.
  • No shadow period: you haven't seen the new model under real traffic before you ship.

The 4-week plan

Week 1: scope + baseline

  • Inventory: which workloads use the old model today and at what volume.
  • Baseline: pass rate, cost, and P95 latency per workload.
  • Quality gate: non-regression threshold per workload (e.g. pass rate within 2 points).
  • Go/no-go criteria: document what would trigger a rollback.
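The baseline step can be sketched as a small aggregation over request logs. The record fields (`workload`, `passed`, `cost_usd`, `latency_ms`) are an assumed log schema, and P95 is computed with the nearest-rank method; your observability stack may use a different percentile definition.

```python
from collections import defaultdict


def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def baseline(records):
    """Aggregate pass rate, cost, and P95 latency per workload."""
    by_workload = defaultdict(list)
    for record in records:
        by_workload[record["workload"]].append(record)

    summary = {}
    for name, rows in by_workload.items():
        summary[name] = {
            "requests": len(rows),
            "pass_rate_pct": 100.0 * sum(1 for r in rows if r["passed"]) / len(rows),
            "cost_usd": sum(r["cost_usd"] for r in rows),
            "p95_latency_ms": p95([r["latency_ms"] for r in rows]),
        }
    return summary
```

Run it once on old-model traffic in week 1 and store the result; every later comparison is made against those numbers.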

Weeks 2-3: shadow eval

  • Deploy the new model in shadow mode: 100% of traffic, both models run, only the old one serves.
  • Score both against the golden eval set nightly.
  • Diff outputs on 200 real requests with human review.
  • Red-team the new model for injection, jailbreak, and output leak.
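A rough sketch of shadow mode follows. Here `old_model` and `new_model` are stand-in callables that take a prompt and return text, and `shadow_log` is a plain list; both are assumptions for illustration. In a real service the shadow call would normally run off the request path (a queue or background task) so it cannot add latency for users.

```python
import random


def serve_with_shadow(prompt, old_model, new_model, shadow_log):
    """Serve the old model's answer; record the new model's answer for offline review."""
    served = old_model(prompt)
    entry = {"prompt": prompt, "old": served, "new": None, "error": None}
    try:
        entry["new"] = new_model(prompt)
    except Exception as exc:  # a shadow failure must never reach the user
        entry["error"] = repr(exc)
    shadow_log.append(entry)
    return served  # only the old model's output is ever returned


def sample_for_review(shadow_log, k=200, seed=0):
    """Pick up to k logged request pairs for human diffing."""
    rng = random.Random(seed)  # fixed seed keeps the review set reproducible
    return rng.sample(shadow_log, min(k, len(shadow_log)))
```

Logging errors alongside outputs matters: a new model that times out on 3% of real traffic is a finding the golden eval set may never surface.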

Week 4: cutover

  • Canary 1% for 48 hours. If metrics are stable, promote.
  • Promote to 10% → 50% → 100% with 24-hour pauses.
  • Rollback plan rehearsed and wired (one-click feature flag).
  • Old model stays deployable for 30 days after 100% rollout.
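The cutover steps above could be wired roughly like this. It is a sketch under assumptions: `pick_model`, `next_stage`, and the `kill_switch` flag are hypothetical names, and the hash-based bucketing is one common way to keep each user on the same model across requests, not the only one.

```python
import hashlib

STAGES = [1, 10, 50, 100]  # rollout percentages from the plan


def bucket(user_id):
    """Map a user to a stable bucket in 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_model(user_id, rollout_pct, kill_switch=False):
    """Route a user to 'new' or 'old'; the kill switch is the one-click rollback."""
    if kill_switch:
        return "old"
    return "new" if bucket(user_id) < rollout_pct else "old"


def next_stage(current_pct, metrics_stable):
    """Advance one stage when metrics are stable, otherwise roll back to 0%."""
    if not metrics_stable:
        return 0
    later = [stage for stage in STAGES if stage > current_pct]
    return later[0] if later else current_pct
```

Because a user's bucket never changes, anyone who saw the new model at 10% still sees it at 50%, which keeps the experience consistent during promotion. The pauses (48 hours at 1%, 24 hours between later stages) live in whatever scheduler calls `next_stage`.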

Common migration scenarios

From | To | Typical risk | Typical saving
Opus 4.1 | Opus 4.7 | Low: same family, newer version | Similar cost, ~10-15% quality lift
Opus 4.x | Sonnet 4.5 | Medium: quality regression possible | 80% cost cut
GPT-4o | GPT-5 | Low: OpenAI minimizes API breaks | Similar cost, quality lift
GPT-5 | Sonnet 4.5 | High: cross-vendor, different tool-use behavior | 40-60% cost cut
Sonnet 4.5 | Haiku 4 (routing) | Medium: needs confidence gate | 60-85% cost cut

What the planner gives you

The interactive plan above tracks each item per phase and produces a downloadable markdown plan you can drop into your project tracker. Tick items as you complete them; the progress bar updates and the final export is a timestamped checklist of what was done.
