§ I
Switch to cheaper models where quality allows
Typical ROI: 15–17x on affected workloads
The single biggest cost lever. GPT-4o costs $2.50 input / $10 output per million tokens; GPT-4o mini is $0.15 / $0.60 — roughly 17x cheaper on both. For classification, extraction, routing, content moderation, simple summarization, and many agentic sub-tasks, GPT-4o mini often produces indistinguishable results. Audit every prompt in production and ask: does this actually need frontier capability?
Steps
- List every distinct prompt in production by model.
- For each, run 20 samples through GPT-4o and GPT-4o mini and compare outputs blindly.
- Switch to GPT-4o mini any prompt where the quality is comparable.
- Benchmark cost reduction after 7 days of traffic.
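The blind comparison in the steps above can be sketched as follows. `call_model` is a stand-in for your actual API call (e.g. a wrapper around the OpenAI SDK); the randomized pair order hides which model produced which output until after grading.

```python
import random

def blind_compare(prompts, call_model, model_a="gpt-4o", model_b="gpt-4o-mini"):
    """Collect paired outputs with randomized left/right order so the
    reviewer cannot tell which model produced which column."""
    trials = []
    for prompt in prompts:
        pair = [(model_a, call_model(model_a, prompt)),
                (model_b, call_model(model_b, prompt))]
        random.shuffle(pair)  # hide which side is which
        trials.append({
            "prompt": prompt,
            "left": pair[0][1],
            "right": pair[1][1],
            "key": (pair[0][0], pair[1][0]),  # reveal only after grading
        })
    return trials
```

Grade `left` vs `right` without looking at `key`, then unblind to see how often the cheaper model held up.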
§ II
Cache deterministic responses
Typical ROI: 30–80% reduction on cached traffic
A surprising fraction of API traffic is deterministic — same input, same output. Classification prompts, schema-conforming extractions, and repeat-question chatbots all benefit. Hash the prompt (+ model + temperature) and cache the response for a TTL matched to your freshness requirement. Even a 30% cache hit rate trims roughly $300 off a $1,000/month bill.
Steps
- Identify deterministic or near-deterministic prompts (temperature ≤ 0.2, no timestamp in prompt).
- Hash prompt + model + any relevant parameters as the cache key.
- Store responses in Redis, Vercel KV, or a simple SQLite table, with a TTL.
- Instrument cache hit rate and cost delta in your observability stack.
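A minimal sketch of the cache key and lookup logic, using an in-process dict for illustration — swap the dict for Redis or SQLite in production. `call` is a stand-in for your actual API wrapper.

```python
import hashlib
import json
import time

class ResponseCache:
    """TTL cache keyed on a hash of prompt + model + parameters."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, response)
        self.hits = self.misses = 0

    @staticmethod
    def key(prompt, model, **params):
        # sort_keys makes the hash stable across param ordering
        raw = json.dumps({"p": prompt, "m": model, **params}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_call(self, prompt, model, call, **params):
        k = self.key(prompt, model, **params)
        entry = self.store.get(k)
        if entry and entry[0] > time.time():
            self.hits += 1
            return entry[1]
        self.misses += 1
        resp = call(prompt, model, **params)
        self.store[k] = (time.time() + self.ttl, resp)
        return resp
```

The `hits`/`misses` counters feed directly into the instrumentation step: hit rate × monthly bill is your savings estimate.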
§ III
Reduce RAG retrieval size
Typical ROI: $300–$500/month on typical RAG apps
Retrieval-augmented apps routinely retrieve 10–20 chunks of 1,000–2,000 tokens each. At GPT-4o's $2.50 input rate, a 20K-token context costs $0.05 per request in input alone. Halving retrieval K often has negligible quality impact when chunks are well-ranked. Measure before and after.
Steps
- Log your current average context-token count per request.
- A/B test quality at K=10, K=5, K=3 on a fixed eval set.
- Pick the smallest K that preserves quality within your threshold.
- Add embedding-based re-ranking to send better chunks instead of more.
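The savings from the steps above are straightforward arithmetic. A hedged helper, using the document's $2.50/M GPT-4o input rate as the default:

```python
def estimate_monthly_savings(chunk_tokens, k_before, k_after,
                             requests_per_month, rate_per_million=2.50):
    """Monthly input-cost delta from reducing retrieval K.

    Assumes roughly uniform chunk size; defaults to GPT-4o's
    $2.50/M input list rate.
    """
    delta_tokens = chunk_tokens * (k_before - k_after) * requests_per_month
    return delta_tokens * rate_per_million / 1_000_000
```

For example, dropping from K=20 to K=10 with 1,500-token chunks at 10,000 requests/month frees 150M input tokens — $375/month, squarely in the ROI range quoted above.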
§ IV
Use the Batch API for asynchronous workloads
Typical ROI: 50% off list price
OpenAI's Batch API processes requests asynchronously (up to 24-hour turnaround) at 50% of standard pricing. Any workload that doesn't need a real-time response — nightly enrichment jobs, bulk embeddings, evaluations, data labeling, periodic summarization — is a Batch candidate. Among users who migrate, typical savings are $500–$5,000/month depending on volume.
Steps
- Identify workloads where latency tolerance is 12+ hours.
- Rewrite those calls to use the /v1/batches endpoint.
- Chunk submissions to stay within per-batch limits (50,000 requests or 100 MB).
- Poll for completion or use webhooks to process results.
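Preparing a batch submission is mostly file plumbing: each request becomes one JSONL line with a `custom_id`, and large jobs get split to respect the per-batch request cap. A sketch of that prep step (the upload/poll calls via the SDK are omitted):

```python
import json

MAX_REQUESTS_PER_BATCH = 50_000  # per-batch request limit noted above

def to_batch_line(custom_id, model, messages):
    """One JSONL line in the Batch API input format."""
    return json.dumps({
        "custom_id": custom_id,          # your key for matching results back
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": model, "messages": messages},
    })

def chunk_requests(lines, max_per_batch=MAX_REQUESTS_PER_BATCH):
    """Split a large job into submissions under the per-batch cap."""
    return [lines[i:i + max_per_batch]
            for i in range(0, len(lines), max_per_batch)]
```

Remember the 100 MB file-size limit also applies, so very long prompts may force smaller chunks than the request cap alone suggests.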
§ V
Compress prompts and system messages
Typical ROI: 10–30% on input-heavy workloads
Production system prompts drift toward bloat — added instructions, stale examples, defensive 'do not do X' clauses. A 500-token system prompt multiplied by 10,000 requests/day is 5M input tokens — $12.50/day on GPT-4o in system-prompt cost alone. Audit quarterly. Remove any instruction that cannot be traced to a specific failure it prevents.
Steps
- Dump every system prompt in production into a single file.
- For each clause, trace it to the specific failure mode it prevents.
- Delete clauses that cannot be traced.
- Consolidate redundant examples. Prefer one well-chosen example over three mediocre ones.
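To make the audit concrete, the arithmetic above generalizes to a one-liner (defaulting to GPT-4o's $2.50/M input rate):

```python
def system_prompt_cost_per_day(prompt_tokens, requests_per_day,
                               rate_per_million=2.50):
    """Daily input cost attributable to the system prompt alone."""
    return prompt_tokens * requests_per_day * rate_per_million / 1_000_000
```

Run it before and after the trimming pass: every 100 tokens cut from a 10,000-request/day prompt is about $2.50/day back.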
§ VI
Use structured outputs carefully
Typical ROI: 5–15% on tool-heavy workloads
Structured outputs (JSON mode with schema, strict function calling) make code more reliable but add token overhead — the model spends tokens conforming to the schema and generating the JSON boilerplate. For high-volume tool use where each request produces a structured output, this overhead compounds. Use structured outputs only where a parsing failure would be costly; free-text + regex parsing is fine for 80% of cases.
Steps
- Audit every function/tool call in production.
- For each, ask: would a JSON parse failure cost more than the token overhead?
- Downgrade low-stakes tool calls to free-text + regex.
- Measure cost delta on downgraded workloads.
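For the low-stakes downgrades, free-text + regex can be as simple as matching against an allowed label set, with `None` as an explicit "retry or flag" signal. A minimal sketch (the label set here is illustrative):

```python
import re

def parse_label(text, allowed=("spam", "ham")):
    """Pull a classification label out of free-form model output.

    Returns None when nothing matches, so callers can retry or
    route the response to review instead of crashing on bad JSON.
    """
    pattern = r"\b(" + "|".join(map(re.escape, allowed)) + r")\b"
    match = re.search(pattern, text.lower())
    return match.group(1) if match else None
```

The failure mode is graceful: a missed parse costs one retry, not a broken pipeline — which is the trade the audit step above is weighing.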
§ VII
Set project-scoped hard caps
Typical ROI: Caps maximum exposure, not average
The OpenAI dashboard's 'usage limit' is a soft alert, not an enforced cap. Project-scoped API keys with per-project limits are different: they stop processing requests when the project hits its cap. Create one project per service/feature, set a per-project cap at 150% of expected spend, and deploy with project-scoped keys. A single bug can't blow out your entire account.
Steps
- Create a new project in platform.openai.com for each isolated workload.
- Set the project's monthly spending limit (Settings → Limits).
- Generate a project-scoped API key (sk-proj-...).
- Deploy with the project key. Rotate your organization-wide key afterward to prevent accidental use.
§ VIII
Monitor spend hourly, not monthly
Typical ROI: Limits any one incident to hours, not weeks
Most OpenAI bill shocks show up in the usage API within an hour of the first bad request. But the default feedback loop is the monthly invoice — by which time the damage may have compounded for weeks. A real-time monitor catches bad retry loops, misconfigured prompts, and unexpected model switches within an hour.
Steps
- Install Capped or an equivalent tool that polls the provider usage API hourly.
- Set notifications at 80%, 100%, and 150% of your monthly cap.
- Pair with provider-level hard caps (tactic VII) for defense in depth.
- Review alert frequency quarterly. Tighten caps as usage stabilizes.