AI Economy Hub

Prompt performance tracker

Score prompt versions on pass rate, cost, and latency. Keep the winner. Ship with data, not vibes.


Frequently asked questions

1. How big should my eval set be?

Start with 50 golden examples. Grow to 200-500 as you accumulate edge cases. Version it in git so you can diff eval results across prompt changes.
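With the eval set versioned in git, the useful diff is per-example, not just the headline pass rate. A minimal sketch, with illustrative names, that compares two eval runs and lists exactly which examples regressed or got fixed:

```python
# Sketch: diff per-example results between two eval runs so a review
# shows exactly which golden examples regressed. Names are illustrative.

def diff_runs(baseline: dict, candidate: dict) -> dict:
    """Each run maps example id -> True (pass) / False (fail)."""
    regressed = [k for k in baseline if baseline[k] and not candidate.get(k, False)]
    fixed = [k for k in baseline if not baseline[k] and candidate.get(k, False)]
    return {"regressed": sorted(regressed), "fixed": sorted(fixed)}

baseline = {"ex1": True, "ex2": True, "ex3": False}
candidate = {"ex1": True, "ex2": False, "ex3": True}
print(diff_runs(baseline, candidate))
# {'regressed': ['ex2'], 'fixed': ['ex3']}
```

Committing both the eval set and these run results makes `git diff` a regression report.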

2. Human scoring vs. LLM-as-judge?

LLM-as-judge is ~10× cheaper and correlates well when calibrated. Validate against human scoring monthly. Version the judge prompt so you can re-grade baselines if it changes.
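The monthly validation step can be as simple as an agreement rate between judge verdicts and human labels on the same sample. A sketch, with an assumed 90% threshold that you should tune to your own tolerance:

```python
# Sketch: monthly calibration check of an LLM judge against human labels.
# Flags drift when agreement falls below a threshold (0.9 is an assumption).

def judge_agreement(human: list, judge: list) -> float:
    assert len(human) == len(judge)
    return sum(h == j for h, j in zip(human, judge)) / len(human)

human = [True, True, False, True, False, True, True, False]
judge = [True, True, False, False, False, True, True, False]
rate = judge_agreement(human, judge)
print(f"agreement: {rate:.2f}")  # 7/8 agree
if rate < 0.9:
    print("recalibrate: judge drifted, re-grade baselines")
```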

3. How often should I re-run the scoreboard?

Nightly for production prompts. On every change for new versions. Weekly review of deltas catches silent regressions before users do.

4. What's a meaningful pass-rate delta?

Below 2 points is noise. 3-5 points is a real effect but might not be worth cost trade-offs. Above 5 points is usually a clear winner.
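Whether a delta clears the noise floor depends on eval-set size. A rough two-proportion z-test is a quick sanity check (a sketch, not a substitute for a proper power analysis):

```python
import math

# Rough two-proportion z-test: is a pass-rate delta distinguishable from
# noise given the eval-set size? Illustrative sketch only.

def delta_z(p1: float, n1: int, p2: float, n2: int) -> float:
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# A 4-point lift on a 100-example eval set:
z = delta_z(0.80, 100, 0.84, 100)
print(f"z = {z:.2f}")  # |z| < 1.96 -> could be noise; grow the eval set
```

On 100 examples even a 4-point lift does not reach significance, which is one reason to grow the set toward 200-500.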

5. Should I track latency alongside cost?

Yes. This tracker includes P95 latency. A prompt change that lifts pass rate 3% but doubles latency can still be a net loss if the product is latency-sensitive.

A scoreboard for prompts

Prompts degrade in production and new versions regress silently. The only defense is a scoreboard: for each prompt version, track pass rate, cost per call, and P95 latency, and compare head-to-head. Teams that do this ship with data; teams that don't ship with vibes.

What to score

  • Pass rate %: Percent of eval set examples where the output meets the quality bar. Scored by humans or an LLM judge.
  • Avg $/call: Input + output + cache amortized per call. Tracks the cost impact of a prompt change.
  • P95 latency ms: 95th percentile response time. Catches regressions from longer prompts, more tool calls, or chain-of-thought (CoT).
  • Calls/day: Volume, which determines total impact.
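Rolling per-call logs up into these four metrics is a few lines. A minimal sketch; the log field names are assumptions, so adapt them to your own logging schema:

```python
# Sketch: aggregate per-call logs into the four scoreboard metrics.
# Field names (passed, cost_usd, latency_ms) are assumptions.

def scoreboard(calls: list) -> dict:
    n = len(calls)
    lat = sorted(c["latency_ms"] for c in calls)
    p95 = lat[min(n - 1, int(0.95 * n))]  # nearest-rank P95
    return {
        "pass_rate_pct": 100 * sum(c["passed"] for c in calls) / n,
        "avg_cost_usd": sum(c["cost_usd"] for c in calls) / n,
        "p95_latency_ms": p95,
        "calls": n,
    }

calls = [
    {"passed": True, "cost_usd": 0.004, "latency_ms": 800},
    {"passed": True, "cost_usd": 0.005, "latency_ms": 950},
    {"passed": False, "cost_usd": 0.006, "latency_ms": 2100},
    {"passed": True, "cost_usd": 0.004, "latency_ms": 870},
]
print(scoreboard(calls))
# pass rate 75.0%, P95 catches the 2100 ms outlier
```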

The score the tracker computes

This tracker computes a simple weighted score: monthly spend plus a failure penalty ($5 per percentage point failed per 30 days). The failure penalty is configurable: use a dollar figure that reflects what a failed answer costs your business (a deflected ticket that bounces back to a human is $10-25; a wrong refund decision is $50-500; a code review that misses a bug is variable).
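One plausible reading of that formula, as a sketch (the tracker's exact internals may differ):

```python
# Sketch of the score described above: 30-day spend plus a configurable
# penalty per percentage point of failed calls. Lower is better.

def monthly_score(avg_cost_usd: float, calls_per_day: int,
                  pass_rate_pct: float, penalty_per_pt: float = 5.0) -> float:
    spend = avg_cost_usd * calls_per_day * 30
    penalty = (100 - pass_rate_pct) * penalty_per_pt
    return spend + penalty

# v1: cheaper per call but fails more; v2: pricier but passes more.
v1 = monthly_score(0.004, 1000, 88)  # 120 spend + 60 penalty = 180
v2 = monthly_score(0.006, 1000, 95)  # 180 spend + 25 penalty = 205
print(v1, v2)
```

Note how the default $5 penalty keeps the cheap-but-flaky version ahead here; raise `penalty_per_pt` toward your real cost-of-failure and the ranking can flip.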

How to run an A/B

  1. Create an eval set of 50-200 golden examples. Keep it versioned.
  2. For each prompt version, run the eval set and score pass rate.
  3. Measure cost from provider usage logs (OpenAI, Anthropic, Google all expose it).
  4. Measure latency from client-side logs.
  5. Populate this tracker. Compare.
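Step 2 above can be sketched as a small harness. `call_model` and the scoring function are stand-ins for your own stack, and the stub model here exists only to make the sketch runnable:

```python
# Sketch of step 2: run a prompt version over the golden set and score
# pass rate. `call_model` and `passes` are stand-ins for your own stack.

def run_eval(prompt: str, golden: list, call_model, passes) -> float:
    hits = sum(passes(call_model(prompt, ex["input"]), ex["expected"])
               for ex in golden)
    return 100 * hits / len(golden)

# Stub model + exact-match scorer, just to make the sketch runnable:
golden = [{"input": "2+2", "expected": "4"}, {"input": "3+3", "expected": "6"}]
fake_model = lambda prompt, x: str(eval(x))  # toy stand-in for an API call
exact = lambda out, want: out == want
print(run_eval("You are a calculator.", golden, fake_model, exact))  # 100.0
```

Run it once per prompt version against the same golden set, then feed the pass rates, provider-reported costs, and client-side latencies into the tracker.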

Common findings

  • Few-shot examples help medium tasks, hurt easy tasks. On easy tasks they just double cost with no pass-rate lift.
  • CoT helps hard tasks, kills latency on easy tasks. Add a complexity gate.
  • Longer prompts have diminishing returns. Past ~2,000 tokens of system prompt, lift per token is near zero.
  • Cache-first redesigns save 60-85%. Restructure to put static content at the top of the prompt.
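The cache-first restructuring comes down to prompt assembly order: providers with prefix caching only reuse an identical leading prefix, so anything per-request must come last. A minimal sketch:

```python
# Sketch: cache-first prompt assembly. Prefix caching reuses the longest
# identical leading prefix, so static content goes first, dynamic last.

def build_prompt(system_rules: str, few_shot: str, user_query: str) -> str:
    # static, cacheable prefix first; per-request content last
    return "\n\n".join([system_rules, few_shot, user_query])

SYSTEM = "You are a support agent. Follow policy X."  # stable across calls
SHOTS = "Example Q/A pairs go here."                  # stable across calls
prompt = build_prompt(SYSTEM, SHOTS, "My order is late.")
print(prompt.startswith(SYSTEM))  # True: identical prefix on every call
```

Putting the user query or a timestamp first would change the prefix on every call and defeat the cache entirely.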

Keeping the scoreboard honest

Judges drift. If you use an LLM-as-judge for scoring, version the judge prompt and re-grade baselines when you change it. Calibrate against human scoring monthly.
