Chain-of-thought prompting that actually works
"Think step by step" is a meme at this point. It helps β sometimes β but it's a blunt tool. The version that reliably lifts pass rate on hard reasoning tasks is a three-part structure: decomposition, rationale, self-check. Stack them in the prompt and the model reasons instead of pattern-matching.
The three-part structure
1. Decomposition
Before asking for an answer, ask for a list. "List the top 5 hypotheses." "List every assumption in the problem." "List the relevant data points from the context." This forces the model to surface structure it would otherwise skip.
2. Rationale per option
Ask for evidence for each item in the list. "For each hypothesis, list the evidence for and against." "Score each 1-5 on plausibility." This is where chain-of-thought actually happens: not in a vague "think step by step" wrapper, but in structured per-item reasoning.
3. Self-check
Before the final answer, ask the model to check its own work from a different perspective. "What would a skeptical VP of Product say is missing?" "Name one way this answer could be wrong." Self-check prompts reliably catch 20-40% of errors that would otherwise ship.
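The three parts above can be stacked into a single prompt. A minimal sketch in Python; the function name, the hypothesis count, and the `<answer>` tag convention are illustrative choices, not an API from any library:

```python
def build_cot_prompt(task: str, n_items: int = 5) -> str:
    """Assemble the three-part structure (decomposition, per-item
    rationale, self-check) plus a bounded final-answer block."""
    return "\n\n".join([
        task,
        # 1. Decomposition: force a structured list before any answer.
        f"First, list the top {n_items} hypotheses, one per line.",
        # 2. Rationale: structured per-item reasoning, not a vague wrapper.
        "For each hypothesis, list the evidence for and against, "
        "and score it 1-5 on plausibility.",
        # 3. Self-check: a different perspective before committing.
        "Before answering, name one way your leading hypothesis "
        "could be wrong.",
        # Bounded final answer so the model stops reasoning.
        "Then give your final answer inside <answer></answer> tags.",
    ])
```

The bounded `<answer>` block at the end is what lets you strip the reasoning before showing anything to a user.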
When to use CoT vs a reasoning model
For hard math, proofs, or code requiring 10+ steps of logic, skip CoT and use a reasoning model (OpenAI o4, for example). They generate an extended internal chain of thought automatically and land 5-15 points higher on hard benchmarks.
For medium-hard tasks, such as diagnosis, analysis, or multi-factor trade-offs, CoT on a non-reasoning model (Sonnet 4.5 or GPT-5) usually hits 90%+ of the quality of a reasoning model at 1/5 the cost and 1/10 the latency. This is the sweet spot.
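The cost gap comes from two multipliers: reasoning models charge more per token and emit far more output tokens. A back-of-envelope sketch; all prices and token counts here are made-up illustrative numbers, not real quotes:

```python
def request_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one call; prices are per million tokens."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Hypothetical: a cheap non-reasoning model with a CoT prompt emits
# ~2k output tokens; a pricier reasoning model emits ~5k (its hidden
# chain of thought is billed as output).
cot_cost = request_cost(1_000, 2_000, in_price=3.0, out_price=15.0)
reasoning_cost = request_cost(1_000, 5_000, in_price=10.0, out_price=30.0)
# Under these assumptions the reasoning model is roughly 5x the cost.
```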
Anti-patterns
- "Think step by step" and nothing else. Vague. Ask for specific structure.
- Chain-of-thought on simple tasks. You are paying for unnecessary output tokens: 2-5× the cost for zero quality lift.
- No final-answer section. Without a bounded final-answer block, the model keeps reasoning indefinitely.
- Leaking reasoning to users. Chain-of-thought is for the model, not the user. Strip it before display or ask the model to hide it in a reasoning-only section.
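The last two anti-patterns share one fix: have the prompt demand a bounded final-answer block, then extract only that block before display. A sketch assuming the prompt asked for `<answer></answer>` tags (the tag name is a convention, not a standard):

```python
import re

def extract_final_answer(completion: str) -> str:
    """Return only the bounded final-answer block; the chain of
    thought before it is for the model, not the user."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    # Fall back to the raw completion if the model skipped the tags.
    return match.group(1).strip() if match else completion.strip()
```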
Measurement
Score CoT prompts on pass rate, cost, and latency against a baseline on your eval set. If the pass-rate lift is under 5% but cost is 3× or more, drop it.
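That keep/drop rule of thumb is mechanical enough to encode directly in an eval harness. A sketch; the function name and thresholds mirror the rule stated above:

```python
def keep_cot(base_pass: float, cot_pass: float,
             base_cost: float, cot_cost: float) -> bool:
    """Drop CoT when the pass-rate lift is under 5 points while
    per-request cost has multiplied by 3x or more."""
    lift = cot_pass - base_pass          # percentage points
    cost_multiple = cot_cost / base_cost
    return not (lift < 5.0 and cost_multiple >= 3.0)
```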
- RGCF Prompt Template: the baseline template to layer CoT on top of.
- Prompt Performance Tracker: A/B CoT vs non-CoT and measure the lift.
- GPT-5 vs o4: when to use CoT vs a reasoning model.
- LLM API Cost Calculator: model the output-token cost of a CoT prompt.