A scoreboard for prompts
Prompts degrade in production and new versions regress silently. The only defense is a scoreboard: for each prompt version, track pass rate, cost per call, and P95 latency, and compare head-to-head. Teams that do this ship with data; teams that don't ship with vibes.
What to score
- Pass rate %: Percent of eval-set examples where the output meets the quality bar, scored by humans or an LLM judge.
- Avg $/call: Input, output, and amortized cache costs per call. Tracks the cost impact of a prompt change.
- P95 latency ms: 95th percentile response time. Catches regressions from longer prompts, more tool calls, or CoT.
- Calls/day: Volume. A prompt change's total impact scales with how often it runs.
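The four metrics above fall out of raw call logs. A minimal aggregation sketch in Python; the field names (`passed`, `cost_usd`, `latency_ms`, `day`) are illustrative, not any particular provider's log schema:

```python
def summarize(calls: list[dict]) -> dict:
    """Aggregate raw call logs into the four scoreboard metrics.
    Each call dict is assumed to carry: passed (bool), cost_usd (float),
    latency_ms (float), and day (str, e.g. '2024-06-01')."""
    n = len(calls)
    latencies = sorted(c["latency_ms"] for c in calls)
    # P95 via nearest-rank: the value below which 95% of latencies fall
    p95 = latencies[max(0, int(0.95 * n) - 1)]
    return {
        "pass_rate_pct": 100.0 * sum(c["passed"] for c in calls) / n,
        "avg_cost_usd": sum(c["cost_usd"] for c in calls) / n,
        "p95_latency_ms": p95,
        "calls_per_day": n / len({c["day"] for c in calls}),
    }
```

Feed it one prompt version's calls at a time and you have a scoreboard row.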
The score the tracker computes
This tracker computes a simple weighted score: monthly spend plus a failure penalty ($5 per percentage point below 100% pass rate, per 30 days). The failure penalty is configurable: use a dollar figure that reflects what a failed answer costs your business (a deflected ticket that bounces back to a human = $10-25; a wrong refund decision = $50-500; a code review that misses a bug = variable).
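The score can be written down directly; a sketch, with the $5-per-point penalty exposed as the configurable parameter described above:

```python
def monthly_score(pass_rate_pct: float, avg_cost_usd: float,
                  calls_per_day: float,
                  penalty_per_point_usd: float = 5.0) -> float:
    """Lower is better: 30-day spend plus a dollar penalty for each
    percentage point of eval failures."""
    spend = avg_cost_usd * calls_per_day * 30
    penalty = (100.0 - pass_rate_pct) * penalty_per_point_usd
    return spend + penalty
```

For example, 92% pass rate at $0.01/call and 1,000 calls/day scores about $300 spend + $40 penalty = $340/month.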
How to run an A/B
- Create an eval set of 50-200 golden examples. Keep it versioned.
- For each prompt version, run the eval set and score pass rate.
- Measure cost from provider usage logs (OpenAI, Anthropic, Google all expose it).
- Measure latency from client-side logs.
- Populate this tracker. Compare.
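The steps above can be sketched as a small harness. `call_model` and `judge` are hypothetical stand-ins for your provider client and scoring function; nothing here assumes a specific API:

```python
def run_eval(prompt_version: str, eval_set: list, call_model, judge) -> float:
    """Run one prompt version over the eval set and return pass rate %.
    call_model(prompt_version, example) -> output string;
    judge(example, output) -> bool. Both are supplied by the caller."""
    passed = sum(judge(ex, call_model(prompt_version, ex)) for ex in eval_set)
    return 100.0 * passed / len(eval_set)

def ab_compare(version_a: str, version_b: str, eval_set: list,
               call_model, judge) -> dict:
    """Head-to-head pass rates for two prompt versions on the same eval set."""
    return {v: run_eval(v, eval_set, call_model, judge)
            for v in (version_a, version_b)}
```

Cost and latency come from logs, as noted above; this harness covers only the pass-rate column.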
Common findings
- Few-shot examples help medium tasks, hurt easy tasks. On easy tasks they roughly 2× cost with no pass-rate lift.
- CoT helps hard tasks, kills latency on easy tasks. Add a complexity gate.
- Longer prompts have diminishing returns. Past ~2,000 tokens of system prompt, lift per token is near zero.
- Cache-first redesigns save 60-85%. Restructure to put static content at the top of the prompt.
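One way to apply the cache-first rule: assemble the prompt so everything static (system rules, few-shot examples) forms a byte-identical prefix and only the user query varies at the end. A minimal sketch; the segment names are illustrative:

```python
def build_prompt(system_rules: str, few_shot_examples: list[str],
                 user_query: str) -> str:
    """Order segments so the static prefix is identical across calls:
    prefix caching matches from the start of the prompt, so any dynamic
    content placed early invalidates the cache for everything after it."""
    static_prefix = system_rules + "\n\n" + "\n\n".join(few_shot_examples)
    return static_prefix + "\n\nUser: " + user_query
```

Two calls with different queries then share the entire static prefix, which is the part the cache can reuse.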
Keeping the scoreboard honest
Judges drift. If you use an LLM-as-judge for scoring, version the judge prompt and re-grade baselines when you change it. Calibrate against human scoring monthly.
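A low-tech way to enforce this is to stamp every score with the judge version and refuse head-to-head comparisons until stale baselines are re-graded. A sketch, assuming scores are stored as plain dicts:

```python
def grade(examples: list[dict], judge_fn, judge_version: str) -> list[dict]:
    """Tag every score with the judge version so results stay comparable."""
    return [{"example_id": ex["id"],
             "passed": judge_fn(ex),
             "judge_version": judge_version} for ex in examples]

def needs_regrade(baseline_scores: list[dict],
                  current_judge_version: str) -> bool:
    """Baselines graded under an older judge must be re-graded before
    any comparison, or you are comparing apples to oranges."""
    return any(s["judge_version"] != current_judge_version
               for s in baseline_scores)
```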
Related tools
- RGCF Prompt Template: Build the prompt you're about to A/B.
- Chain-of-Thought Prompt Builder: Stack reasoning into a prompt, measure the lift.
- AI Spend Tracker: Roll prompt-level cost into workload-level spend.
- LLM Migration Planner: Swap models without regressing quality.