AI Economy Hub

Prompt performance tracker

Score prompt versions on pass rate, cost, and latency. Keep the winner. Ship with data, not vibes.


Frequently asked questions

1. How big should my eval set be?

Start with 50 golden examples. Grow to 200-500 as you accumulate edge cases. Version it in git so you can diff eval results across prompt changes.
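With the eval set versioned in git, the useful diff is per-example, not just the headline pass rate. A minimal sketch, with illustrative names, that compares two eval runs and lists exactly which examples regressed or got fixed:

```python
# Sketch: diff per-example results between two eval runs so a review
# shows exactly which golden examples regressed. Names are illustrative.

def diff_runs(baseline: dict, candidate: dict) -> dict:
    """Each run maps example id -> True (pass) / False (fail)."""
    regressed = [k for k in baseline if baseline[k] and not candidate.get(k, False)]
    fixed = [k for k in baseline if not baseline[k] and candidate.get(k, False)]
    return {"regressed": sorted(regressed), "fixed": sorted(fixed)}

baseline = {"ex1": True, "ex2": True, "ex3": False}
candidate = {"ex1": True, "ex2": False, "ex3": True}
print(diff_runs(baseline, candidate))
# {'regressed': ['ex2'], 'fixed': ['ex3']}
```

Committing both the eval set and these run results makes `git diff` a regression report.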

2. Human scoring vs. LLM-as-judge?

LLM-as-judge is ~10× cheaper and correlates well when calibrated. Validate against human scoring monthly. Version the judge prompt so you can re-grade baselines if it changes.
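The monthly validation step can be as simple as an agreement rate between judge verdicts and human labels on the same sample. A sketch, with an assumed 90% threshold that you should tune to your own tolerance:

```python
# Sketch: monthly calibration check of an LLM judge against human labels.
# Flags drift when agreement falls below a threshold (0.9 is an assumption).

def judge_agreement(human: list, judge: list) -> float:
    assert len(human) == len(judge)
    return sum(h == j for h, j in zip(human, judge)) / len(human)

human = [True, True, False, True, False, True, True, False]
judge = [True, True, False, False, False, True, True, False]
rate = judge_agreement(human, judge)
print(f"agreement: {rate:.2f}")  # 7/8 agree
if rate < 0.9:
    print("recalibrate: judge drifted, re-grade baselines")
```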

3. How often should I re-run the scoreboard?

Nightly for production prompts. On every change for new versions. Weekly review of deltas catches silent regressions before users do.

4. What's a meaningful pass-rate delta?

Below 2 points is noise. 3-5 points is a real effect but might not be worth cost trade-offs. Above 5 points is usually a clear winner.
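Whether a delta clears the noise floor depends on eval-set size. A rough two-proportion z-test is a quick sanity check (a sketch, not a substitute for a proper power analysis):

```python
import math

# Rough two-proportion z-test: is a pass-rate delta distinguishable from
# noise given the eval-set size? Illustrative sketch only.

def delta_z(p1: float, n1: int, p2: float, n2: int) -> float:
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# A 4-point lift on a 100-example eval set:
z = delta_z(0.80, 100, 0.84, 100)
print(f"z = {z:.2f}")  # |z| < 1.96 -> could be noise; grow the eval set
```

On 100 examples even a 4-point lift does not reach significance, which is one reason to grow the set toward 200-500.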

5. Should I track latency alongside cost?

Yes. This tracker includes P95 latency. A prompt change that lifts pass rate 3% but doubles latency can still be a net loss if the product is latency-sensitive.

A scoreboard for prompts

Prompts degrade in production and new versions regress silently. The only defense is a scoreboard: for each prompt version, track pass rate, cost per call, and P95 latency, and compare head-to-head. Teams that do this ship with data; teams that don't ship with vibes.

What to score

  • Pass rate %: Percent of eval set examples where the output meets the quality bar. Scored by humans or an LLM judge.
  • Avg $/call: Input + output + cache amortized per call. Tracks the cost impact of a prompt change.
  • P95 latency ms: 95th percentile response time. Catches regressions from longer prompts, more tool calls, or chain-of-thought (CoT).
  • Calls/day: Volume, which determines total impact.
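Rolling per-call logs up into these four metrics is a few lines. A minimal sketch; the log field names are assumptions, so adapt them to your own logging schema:

```python
# Sketch: aggregate per-call logs into the four scoreboard metrics.
# Field names (passed, cost_usd, latency_ms) are assumptions.

def scoreboard(calls: list) -> dict:
    n = len(calls)
    lat = sorted(c["latency_ms"] for c in calls)
    p95 = lat[min(n - 1, int(0.95 * n))]  # nearest-rank P95
    return {
        "pass_rate_pct": 100 * sum(c["passed"] for c in calls) / n,
        "avg_cost_usd": sum(c["cost_usd"] for c in calls) / n,
        "p95_latency_ms": p95,
        "calls": n,
    }

calls = [
    {"passed": True, "cost_usd": 0.004, "latency_ms": 800},
    {"passed": True, "cost_usd": 0.005, "latency_ms": 950},
    {"passed": False, "cost_usd": 0.006, "latency_ms": 2100},
    {"passed": True, "cost_usd": 0.004, "latency_ms": 870},
]
print(scoreboard(calls))
# pass rate 75.0%, P95 catches the 2100 ms outlier
```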

The score the tracker computes

This tracker computes a simple weighted score: monthly spend plus a failure penalty ($5 per percentage point failed per 30 days). The failure penalty is configurable: use a dollar figure that reflects what a failed answer costs your business (a deflected ticket that bounces back to a human is $10-25; a wrong refund decision is $50-500; a code review that misses a bug is variable).
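One plausible reading of that formula, as a sketch (the tracker's exact internals may differ):

```python
# Sketch of the score described above: 30-day spend plus a configurable
# penalty per percentage point of failed calls. Lower is better.

def monthly_score(avg_cost_usd: float, calls_per_day: int,
                  pass_rate_pct: float, penalty_per_pt: float = 5.0) -> float:
    spend = avg_cost_usd * calls_per_day * 30
    penalty = (100 - pass_rate_pct) * penalty_per_pt
    return spend + penalty

# v1: cheaper per call but fails more; v2: pricier but passes more.
v1 = monthly_score(0.004, 1000, 88)  # 120 spend + 60 penalty = 180
v2 = monthly_score(0.006, 1000, 95)  # 180 spend + 25 penalty = 205
print(v1, v2)
```

Note how the default $5 penalty keeps the cheap-but-flaky version ahead here; raise `penalty_per_pt` toward your real cost-of-failure and the ranking can flip.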

How to run an A/B

  1. Create an eval set of 50-200 golden examples. Keep it versioned.
  2. For each prompt version, run the eval set and score pass rate.
  3. Measure cost from provider usage logs (OpenAI, Anthropic, Google all expose it).
  4. Measure latency from client-side logs.
  5. Populate this tracker. Compare.
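Step 2 above can be sketched as a small harness. `call_model` and the scoring function are stand-ins for your own stack, and the stub model here exists only to make the sketch runnable:

```python
# Sketch of step 2: run a prompt version over the golden set and score
# pass rate. `call_model` and `passes` are stand-ins for your own stack.

def run_eval(prompt: str, golden: list, call_model, passes) -> float:
    hits = sum(passes(call_model(prompt, ex["input"]), ex["expected"])
               for ex in golden)
    return 100 * hits / len(golden)

# Stub model + exact-match scorer, just to make the sketch runnable:
golden = [{"input": "2+2", "expected": "4"}, {"input": "3+3", "expected": "6"}]
fake_model = lambda prompt, x: str(eval(x))  # toy stand-in for an API call
exact = lambda out, want: out == want
print(run_eval("You are a calculator.", golden, fake_model, exact))  # 100.0
```

Run it once per prompt version against the same golden set, then feed the pass rates, provider-reported costs, and client-side latencies into the tracker.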

Common findings

  • Few-shot examples help medium tasks, hurt easy tasks. On easy tasks they just double cost with no pass-rate lift.
  • CoT helps hard tasks, kills latency on easy tasks. Add a complexity gate.
  • Longer prompts have diminishing returns. Past ~2,000 tokens of system prompt, lift per token is near zero.
  • Cache-first redesigns save 60-85%. Restructure to put static content at the top of the prompt.
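The cache-first restructuring comes down to prompt assembly order: providers with prefix caching only reuse an identical leading prefix, so anything per-request must come last. A minimal sketch:

```python
# Sketch: cache-first prompt assembly. Prefix caching reuses the longest
# identical leading prefix, so static content goes first, dynamic last.

def build_prompt(system_rules: str, few_shot: str, user_query: str) -> str:
    # static, cacheable prefix first; per-request content last
    return "\n\n".join([system_rules, few_shot, user_query])

SYSTEM = "You are a support agent. Follow policy X."  # stable across calls
SHOTS = "Example Q/A pairs go here."                  # stable across calls
prompt = build_prompt(SYSTEM, SHOTS, "My order is late.")
print(prompt.startswith(SYSTEM))  # True: identical prefix on every call
```

Putting the user query or a timestamp first would change the prefix on every call and defeat the cache entirely.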

Keeping the scoreboard honest

Judges drift. If you use an LLM-as-judge for scoring, version the judge prompt and re-grade baselines when you change it. Calibrate against human scoring monthly.
