§ I
Output tokens cost 3–5x input tokens
Every OpenAI model charges more for output than input. On GPT-4o: $2.50 per million input, $10.00 per million output. A cost estimate that averages the two rates underweights generation-heavy workloads by 3–10x. Summarization, reasoning, code generation, and long-form writing are all output-dominated — your 1,000-token prompt producing a 4,000-token answer is 94% output cost, not 50/50.
Fix: Split your usage model between input-heavy (retrieval, classification) and output-heavy (generation) workloads. Weight forecasts by actual input/output ratio, not averages.
Typical impact: Underestimating this alone produces 3–10x cost shocks.
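A minimal sketch of the split cost model, using the GPT-4o list prices quoted above (USD per million tokens). The rates are illustrative; substitute current prices for your model.

```python
# GPT-4o list prices from the text, converted to USD per token.
INPUT_RATE = 2.50 / 1_000_000
OUTPUT_RATE = 10.00 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Price each token direction separately instead of averaging the rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# The 1,000-token prompt / 4,000-token answer example:
cost = request_cost(1_000, 4_000)
output_share = (4_000 * OUTPUT_RATE) / cost  # fraction of spend that is output
```

Running this reproduces the 94% figure: $0.0425 total, of which $0.04 is output.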
§ II
Reasoning models bill for hidden 'thinking' tokens
o1 and o3 generate internal reasoning before producing the visible answer. These reasoning tokens are billed at the output rate ($60/million on o1, $12/million on o1-mini). A visible 500-token response can be backed by 5,000 tokens of reasoning you never see but always pay for. Reasoning-token count is not predictable from the prompt alone.
Fix: Track reasoning tokens explicitly; the API's usage object reports them under completion_tokens_details.reasoning_tokens. Benchmark reasoning overhead per prompt type before scaling. For tasks that don't need multi-step reasoning, use GPT-4o instead of o1 — 5–10x cheaper for comparable quality on most problems.
Typical impact: A reasoning-heavy agent at 200 runs/day on o1 costs ~$1,700/month. GPT-4o on the same workload is ~$240/month.
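A hedged sketch of pulling reasoning-token overhead out of a usage payload. The field names follow the shape OpenAI documents for reasoning models (completion_tokens_details.reasoning_tokens); treat them as an assumption and verify against your SDK version.

```python
O1_OUTPUT_RATE = 60.00 / 1_000_000  # all output tokens, visible or hidden, bill at this rate

def reasoning_overhead(usage: dict) -> dict:
    """Split a usage payload into visible vs hidden (reasoning) output cost."""
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    total_out = usage["completion_tokens"]
    return {
        "visible_tokens": total_out - reasoning,
        "reasoning_tokens": reasoning,
        "hidden_cost": reasoning * O1_OUTPUT_RATE,
        "overhead_ratio": reasoning / total_out if total_out else 0.0,
    }

# The 500-visible / 5,000-reasoning example from the text:
stats = reasoning_overhead({
    "completion_tokens": 5_500,
    "completion_tokens_details": {"reasoning_tokens": 5_000},
})
```

Here over 90% of the billed output is reasoning you never see: $0.30 of hidden cost on a 500-token visible answer.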
§ III
Retrieval-augmented applications blow up input cost
RAG pipelines routinely pass 20,000+ tokens of retrieved context per request. At GPT-4o's $2.50 input rate, 20K tokens is $0.05 per request in input alone. At 300 queries/day, that's $450/month before you generate a single token of output. Many RAG systems are designed with top-K retrieval tuned for quality, not cost — and the cost compounds linearly with traffic.
Fix: Reduce retrieval K. Benchmark quality at K=3 vs K=10. Use embedding-based re-ranking to send fewer but better chunks. Consider summary chunks over raw passages.
Typical impact: Cutting average context from 20K to 5K tokens on a 300 queries/day RAG app saves ~$340/month.
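The back-of-envelope behind that savings number can be sketched as a one-line cost function. The traffic and token figures are the ones from the text, not measurements.

```python
INPUT_RATE = 2.50 / 1_000_000  # GPT-4o input, USD per token
QUERIES_PER_DAY = 300

def monthly_input_cost(context_tokens_per_request: int) -> float:
    """Monthly input spend from retrieved context alone, at fixed traffic."""
    return context_tokens_per_request * INPUT_RATE * QUERIES_PER_DAY * 30

full = monthly_input_cost(20_000)   # untuned top-K retrieval
trimmed = monthly_input_cost(5_000) # after cutting K / re-ranking
savings = full - trimmed
```

This reproduces the numbers above: $450/month at 20K tokens of context, ~$340/month saved by trimming to 5K.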
§ IV
OpenAI's 'usage limit' is a soft alert, not a hard cap
The 'usage limit' setting in your billing dashboard looks like a spending cap. It is not. It is a notification threshold: OpenAI emails you when you cross it and keeps processing your requests. The only provider-side hard cap is a project-scoped spending limit (each project in the dashboard can enforce a limit that stops traffic); everything else requires external tooling that actively monitors and alerts before you overshoot.
Fix: Create a separate project in the OpenAI dashboard with a hard spending limit. Use a project-scoped API key in production. Layer real-time external monitoring on top — Capped alerts at 80% of your cap so you have time to throttle before the provider-level cap fires.
Typical impact: Users who rely only on the dashboard 'usage limit' regularly overshoot by 50–200%.
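The "alert before the cap fires" layer can be as simple as a counter fed by the per-request costs you already collect. A minimal sketch, assuming you wire the alert hook to Slack, PagerDuty, or similar; the 80% threshold mirrors the recommendation above.

```python
class SpendGuard:
    """Tracks cumulative spend and fires one alert at a fraction of the cap."""

    def __init__(self, monthly_cap_usd: float, alert_at: float = 0.80):
        self.cap = monthly_cap_usd
        self.alert_at = alert_at
        self.spent = 0.0
        self.alerted = False

    def record(self, cost_usd: float) -> None:
        self.spent += cost_usd
        if not self.alerted and self.spent >= self.cap * self.alert_at:
            self.alerted = True
            self.alert()

    def alert(self) -> None:
        # Placeholder: replace with your paging/notification integration.
        print(f"spend at {self.spent / self.cap:.0%} of ${self.cap:.0f} cap")

guard = SpendGuard(monthly_cap_usd=500)
```

Firing at 80% rather than 100% is the point: it leaves you time to throttle traffic before the provider-level cap cuts it off.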
§ V
Retry loops silently multiply cost
Production code retries failed requests. On 429 rate-limit errors or transient 5xx failures, a poorly configured retry loop can fire 3–10 retries within seconds — each billed separately if the model processed any tokens before failing. A single upstream incident can turn a $50/day workload into $500 in an hour.
Fix: Use exponential backoff with jitter. Cap retries at 3. Log retry counts to your observability platform. Track retries-per-request as a first-class metric alongside latency and error rate.
Typical impact: Observed in the wild — a single bad retry loop during an OpenAI 500-error incident cost one team $2,400 in 90 minutes.
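The fix above can be sketched in a few lines: exponential backoff with full jitter, a hard cap of 3 retries, and the retry count surfaced so it can be logged as a first-class metric. The `call` argument stands in for your actual API request; the bare `Exception` catch is a placeholder for your client's rate-limit and server-error types.

```python
import random
import time

def call_with_retries(call, max_retries: int = 3, base_delay: float = 0.5):
    """Run `call`, retrying at most `max_retries` times with jittered backoff.

    Returns (result, retries) so the retry count can be emitted as a metric.
    """
    retries = 0
    while True:
        try:
            return call(), retries
        except Exception:
            if retries >= max_retries:
                raise  # give up: never retry unboundedly
            retries += 1
            # Full jitter: sleep a random fraction of the exponential bound.
            time.sleep(random.uniform(0, base_delay * 2 ** retries))

# Illustrative usage: a request that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient 500")
    return "ok"

result, retries = call_with_retries(flaky, base_delay=0.01)
```

The jitter matters as much as the backoff: without it, every client in a fleet retries in lockstep and hammers the API at the same instants.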
§ VI
Vision, tools, and structured outputs add surcharges
Image inputs have a per-image cost calculated from detail level and resolution. Function/tool calling adds ~10–20% output-token overhead from the generated call schemas. Structured outputs (JSON mode with schema) can add 5–15% from the schema-conforming generation. None of these appear as line items — they roll up into the base model's token cost.
Fix: Separate metered usage by feature: vision requests in one project, tool-heavy agents in another, plain text in a third. Track per-feature cost via usage API's model-level breakdown.
Typical impact: A vision-enabled agent processing 1,000 images/day at medium detail adds roughly $200–$400/month over text-only equivalents.
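Per-feature attribution can be sketched as a small meter: tag every request with the feature that produced it and aggregate, so vision, tool, and structured-output surcharges stop hiding inside one blended number. Feature names and costs below are illustrative.

```python
from collections import defaultdict

class FeatureMeter:
    """Aggregates spend by feature tag and reports each feature's share."""

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, feature: str, cost_usd: float) -> None:
        self.spend[feature] += cost_usd

    def breakdown(self) -> dict:
        total = sum(self.spend.values()) or 1.0
        return {f: (cost, cost / total) for f, cost in self.spend.items()}

meter = FeatureMeter()
meter.record("vision", 0.04)  # image-bearing requests
meter.record("tools", 0.01)   # tool-calling agent traffic
meter.record("text", 0.01)    # plain completions
b = meter.breakdown()
```

In this toy example two-thirds of spend is vision, which is exactly the kind of skew a single blended cost number conceals.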
§ VII
Unexpected model upgrades change the pricing surface
When you call a model by its alias ('gpt-4' or 'gpt-4-turbo'), OpenAI may silently upgrade the underlying version — and the pricing can shift with it. Even pinned model names have deprecation timelines that force migrations to differently-priced alternatives. A team upgrading from gpt-4-turbo ($10/$30) to gpt-4o ($2.50/$10) sees cost drop; moving to o1 ($15/$60) sees it spike. Neither happens with a dashboard warning.
Fix: Pin to specific model versions (gpt-4o-2024-08-06 rather than gpt-4o) in production. Subscribe to OpenAI's changelog. Reconcile your provider's cost API with your monthly estimates at least weekly.
Typical impact: One unannounced model switch in early 2025 moved a production workload from $4,000/month to $12,000/month before the team noticed.
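One way to enforce the pinning rule is at the call site: refuse any model name that lacks a date suffix. A hedged sketch — the regex assumes OpenAI's snapshot naming scheme (a trailing YYYY-MM-DD, as in gpt-4o-2024-08-06), which you should re-verify against the current model list.

```python
import re

# Snapshot names end in a date suffix, e.g. gpt-4o-2024-08-06 (assumption
# about OpenAI's naming scheme; check the current model documentation).
PINNED = re.compile(r".+-\d{4}-\d{2}-\d{2}$")

def require_pinned(model: str) -> str:
    """Reject floating aliases like 'gpt-4o' before a request is sent."""
    if not PINNED.match(model):
        raise ValueError(f"unpinned model alias: {model!r}")
    return model

snapshot = require_pinned("gpt-4o-2024-08-06")  # passes
```

A guard like this turns a silent repricing into a loud deploy-time failure, which is the cheap place to catch it.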