Every AI interaction costs tokens. Every token carries a price. This guide breaks down how token economics work — and gives you the exact levers to cut spend by 40–80% without sacrificing quality.
Tokens are the atomic unit of AI computation — not characters, not words, but sub-word fragments that every LLM uses to read input and write output. Understanding them is understanding the bill.
Common words are 1 token. Complex or rare words split into multiple tokens. Code tokenises densely: unusual identifiers, whitespace, and punctuation can approach one token per character.
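Before reaching for a real tokenizer, a character-ratio heuristic is often enough for budgeting. This is a sketch using the common rule of thumb of roughly 4 characters per token for English prose; the function name and ratios are illustrative, not from any provider SDK.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough pre-flight token estimate. English prose averages roughly
    4 characters per token; code and rare words tokenise more densely,
    so pass a lower ratio for those."""
    return max(1, round(len(text) / chars_per_token))

# Prose at the default ratio, code at an assumed denser ratio:
prose_estimate = estimate_tokens("Every AI interaction costs tokens, and every token carries a price.")
code_estimate = estimate_tokens("df.groupby('id').agg({'x': 'sum'})", chars_per_token=2.5)
```

For exact counts, use the provider's own tokenizer; the heuristic is for quick spend modelling only.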
GPT-5: $1.25 input vs $10.00 output per MTok. A model that writes a lot costs far more than one that reads a lot.
Models like o1 or Claude Extended Thinking generate internal chain-of-thought tokens that are billed but never shown to you.
All prior messages + system prompt + new message are sent on every API call. A 40-turn chat can carry 25,000+ input tokens silently.
Your instructions to the model. 200–2,000 tokens. Sent on every single call — prime candidate for prompt caching.
All prior messages in the conversation. Grows every turn. The silent multiplier — often the largest input cost driver.
The actual query plus any retrieved documents (RAG). Usually the smallest component but grows with retrieval.
What the model writes back. Billed at 4–10× the input rate, which makes it the most expensive component per token; every extra output token adds cost at that premium rate.
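Putting the four components together, per-call cost is a simple weighted sum. A minimal sketch, using the GPT-5 rates quoted above ($1.25 input, $10.00 output per MTok) as defaults:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_per_mtok: float = 1.25, out_per_mtok: float = 10.00) -> float:
    """USD cost of a single API call. Defaults are the GPT-5 rates above;
    input covers system prompt + history + current turn, output the reply."""
    return (input_tokens * in_per_mtok + output_tokens * out_per_mtok) / 1_000_000

# A long chat carrying 25,000 input tokens that produces a 500-token reply:
cost = call_cost(25_000, 500)   # 0.03625 USD, i.e. ~3.6 cents for one turn
```

Run this over your expected input:output ratio before comparing models; a cheap-input model can still lose on an output-heavy workload.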
USD per million tokens (MTok) — April 2026. Output tokens cost significantly more than input. Always model the input:output ratio for your specific workload before comparing.
| Model | Provider | Tier | Input $/MTok | Output $/MTok | Cached Input | Context | Best For |
|---|---|---|---|---|---|---|---|
| GPT-5.2 Pro | OpenAI | Flagship | $21.00 | $168.00 | $2.10 | 200K | Hardest reasoning, executive tasks |
| Claude Opus 4.6 | Anthropic | Flagship | $5.00 | $25.00 | $0.30 | 200K | Risk reviews, high-stakes analysis |
| o1 | OpenAI | Reasoning | $15.00 | $60.00 | $7.50 | 200K | Complex multi-step logic, planning |
| GPT-5.2 | OpenAI | Mid-Tier | $1.75 | $14.00 | $0.175 | 200K | Coding, agentic workflows |
| GPT-5 | OpenAI | Mid-Tier | $1.25 | $10.00 | $0.125 | 128K | General flagship, copilots |
| Claude Sonnet 4.6 | Anthropic | Mid-Tier | $3.00 | $15.00 | $0.30 | 1M | Enterprise copilots, knowledge workflows |
| Gemini 2.5 Pro | Google | Mid-Tier | $1.25 | $10.00 | $0.31 | 1M | Multimodal, long-context analysis |
| GPT-4.1 | OpenAI | Mid-Tier | $2.00 | $8.00 | $0.20 | 1M | Product UIs, multi-turn workflows |
| GPT-5 Mini | OpenAI | Fast & Cheap | $0.25 | $2.00 | $0.025 | 200K | Automation, high-volume batch |
| Claude Haiku 4.5 | Anthropic | Fast & Cheap | $0.80 | $4.00 | $0.08 | 200K | Realtime copilots, support bots |
| Gemini 2.5 Flash | Google | Fast & Cheap | $0.30 | $2.50 | $0.03 | 1M | High-volume summarisation, triggered automations |
| DeepSeek V3.2 | DeepSeek | Fast & Cheap | $0.28 | $0.42 | $0.028 | 128K | Best value per token, 90% cache discounts |
| Gemini 2.0 Flash-Lite | Google | Fast & Cheap | $0.075 | $0.30 | — | 1M | Cheapest mainstream option, simple tasks |
| Llama 4 Maverick | Meta | Open Source | $0.15 | $0.60 | — | 1M | Open weights, fine-tuning, sovereignty |
| Llama 3.3 70B | Meta | Open Source | $0.10 | $0.32 | — | 131K | 5–14× cheaper than GPT-4o at comparable quality |
| Mistral Small 3.2 | Mistral | Open Source | $0.07 | $0.20 | — | 128K | Ultra-low cost European open model |
| Mistral Nemo | Mistral | Open Source | $0.02 | $0.04 | — | 131K | Absolute lowest API cost, simple extraction tasks |
These are proven, production-tested techniques. Combined, organisations routinely achieve 40–80% cost reduction. Start with the high-impact ones — they require minimal engineering effort.
Providers cache the attention key-value (KV) states computed for repeated prompt prefixes. This is the single highest-impact optimisation: up to 90% cheaper on cached input tokens with minimal code changes.
Use the `cache_control` parameter to set explicit cache breakpoints. Target a 70%+ cache hit rate. Cached reads do not count against rate limits (Claude 3.7+).
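On Anthropic's Messages API, the breakpoint is expressed with `cache_control` on a content block; everything up to the breakpoint is cached and re-read at the discounted rate. A minimal request sketch, where the model name and system text are placeholders:

```python
# Sketch of an Anthropic Messages API payload with a prompt-cache breakpoint.
# The large, stable system prompt is marked cacheable; only the short user
# turn is billed at the full input rate on subsequent calls.
request = {
    "model": "claude-sonnet-4-5",          # placeholder model ID
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<large, stable system prompt goes here>",
            # Everything up to this breakpoint is cached:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "What is our refund policy?"}],
}
```

Keep the cached prefix byte-stable between calls: any edit above the breakpoint invalidates the cache.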
Route tasks to the cheapest model that meets the quality bar. A cascade strategy sends simple tasks to Flash/Haiku/Mini and escalates to Sonnet/GPT-5 only when required. FAQ bots do not need frontier models.
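A cascade can be as simple as a lookup from task tier to model, with an explicit escalation path for failures. The tier names and model IDs below are illustrative, not a fixed mapping:

```python
TIER_MODELS = {
    "simple": "gemini-2.5-flash",     # FAQ answers, classification, extraction
    "standard": "claude-sonnet-4.6",  # copilot-grade drafting and analysis
    "hard": "gpt-5.2-pro",            # frontier reasoning, high-stakes review
}

def route(task_tier: str) -> str:
    """Send each task to the cheapest model that meets its quality bar."""
    return TIER_MODELS.get(task_tier, TIER_MODELS["standard"])

def escalate(task_tier: str) -> str:
    """Retry one tier up when the cheap model's answer fails validation."""
    order = ["simple", "standard", "hard"]
    idx = min(order.index(task_tier) + 1, len(order) - 1)
    return TIER_MODELS[order[idx]]
```

The validation step (schema checks, confidence heuristics, spot evals) is what makes escalation safe; without it the cascade just moves errors downstream.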
Every message in a multi-turn chat is resent on every call. A 40-turn conversation carries 25,000+ input tokens — the largest invisible cost driver. Sliding window truncation + summarisation eliminates this.
Instead of caching exact strings, semantic caching uses vector embeddings to serve cached responses when queries are semantically similar (≥0.85–0.95 cosine similarity). Avoids the API call entirely for repetitive queries.
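The core loop is small: embed the query, scan stored entries, and serve any response whose similarity clears the threshold. A toy sketch where `embed` stands in for your real embedding model, and the linear scan stands in for a vector index:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Toy semantic cache. embed() is assumed to be supplied by an
    embedding model; production systems use a vector index, not a list."""
    def __init__(self, embed, threshold=0.90):
        self.embed, self.threshold, self.store = embed, threshold, []

    def get(self, query):
        q = self.embed(query)
        for vec, response in self.store:
            if cosine(q, vec) >= self.threshold:
                return response   # cache hit: the API call is skipped entirely
        return None

    def put(self, query, response):
        self.store.append((self.embed(query), response))
```

Tune the threshold per workload: too low serves wrong answers, too high misses paraphrases.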
Verbose prompts don't produce better outputs — they just cost more. Filler phrases like "You are a helpful, professional, knowledgeable assistant" add tokens with zero information gain. Telling a model to "be concise" reduces output tokens by 57–59%.
Output tokens cost 4–10× more than input. Setting max_tokens limits and requesting structured JSON instead of prose cuts the most expensive part of every call.
Set a `max_tokens` cap per call type, treating it as a hard upper bound for each use-case tier.
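In practice this is a small per-tier config consulted before every call. The tier names and caps below are illustrative; tune them against your own output distributions:

```python
# Hard output caps per use-case tier (illustrative numbers).
MAX_TOKENS = {
    "classification": 16,      # a label, not an essay
    "faq_answer": 300,
    "summarisation": 600,
    "code_generation": 2000,
}

def cap_for(call_type: str) -> int:
    """Hard upper bound on output tokens for a given call type."""
    return MAX_TOKENS.get(call_type, 512)   # conservative default
```

Pair the cap with a structured-output instruction (e.g. "respond with JSON only") so the model does not spend its budget on preamble.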
RAG pipelines often dump entire document chunks into context — including sentences irrelevant to the query. Pre-filtering retrieved chunks to only query-relevant sentences cuts RAG input tokens by 50–80% with maintained accuracy.
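A minimal pre-filter keeps only sentences that share content words with the query. The keyword-overlap heuristic below stands in for the embedding rerankers real pipelines use; everything here is illustrative:

```python
def filter_chunk(chunk: str, query: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least min_overlap content words
    with the query. A toy stand-in for an embedding-based reranker."""
    query_words = {w.lower().strip(".,?") for w in query.split() if len(w) > 3}
    kept = []
    for sentence in chunk.split(". "):
        words = {w.lower().strip(".,?") for w in sentence.split()}
        if len(words & query_words) >= min_overlap:
            kept.append(sentence)
    return ". ".join(kept)
```

Applied before assembling the prompt, irrelevant sentences in each retrieved chunk never reach the model at all.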
Most providers offer 50% discounts for asynchronous batch API calls. Non-real-time workloads — nightly reports, document analysis, bulk classification — are ideal candidates with zero quality trade-off.
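The savings arithmetic is direct: it scales with the share of traffic you can defer. A sketch, assuming the standard 50% batch discount:

```python
def batch_savings(monthly_spend: float, batchable_fraction: float,
                  discount: float = 0.50) -> float:
    """USD saved per month by routing deferrable work through the batch API."""
    return monthly_spend * batchable_fraction * discount

saved = batch_savings(10_000, 0.40)   # $2,000/month if 40% of spend can wait
```

The only engineering cost is tolerating asynchronous completion, typically within a 24-hour window, so audit which workloads genuinely need a realtime response.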
Reasoning models (o1, Claude Extended Thinking) generate internal chain-of-thought tokens before responding — billed but invisible. For a FAQ bot or classifier, you may be paying 5–20× more than necessary.
Fine-tuning embeds task knowledge directly into the model, eliminating the need for lengthy few-shot examples in every prompt. For stable, high-volume tasks, fine-tuned smaller models can outperform generic larger ones at a fraction of the cost.
* Savings are on applicable token spend per technique — not additive across all calls. Most production systems achieve 70–80% total spend reduction by combining prompt caching + model routing + context management. Source: Obvious Works 2026, Redis LLMOps Guide.
Beyond prompt-level changes, these infrastructure techniques address token costs at the compute layer — particularly relevant for self-hosted deployments and high-scale agentic systems.
For self-hosted deployments, the KV (key-value) cache is the dominant memory bottleneck. Recent research shows 70–90% memory reduction is achievable with minimal accuracy loss — enabling longer contexts on the same hardware.
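To see why the KV cache dominates, compute its size: two tensors (keys and values) per layer, each shaped by KV heads, sequence length, and head dimension. The model config below is an assumed 70B-class example, not a figure from this article:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence: 2 tensors (K and V) per layer,
    each [kv_heads, seq_len, head_dim], at the given precision (FP16 = 2)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Assumed config: 80 layers, 8 KV heads (GQA), head_dim 128, 32K context, FP16
gib = kv_cache_bytes(80, 8, 128, 32_768) / 2**30   # 10.0 GiB per sequence
```

At 10 GiB per concurrent 32K-context sequence, compressing or quantising the KV cache directly translates into more sequences per GPU.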
Pair the target model with a lightweight draft model that proposes multiple tokens simultaneously. The target model verifies the batch in a single forward pass — achieving 2–5× faster generation with identical output quality.
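The accept/reject logic can be illustrated with a toy step: the draft proposes a batch, and the target accepts the longest prefix it agrees with. Real implementations verify the whole batch in one forward pass; this sketch only shows the control flow:

```python
def speculative_step(draft_tokens, verify):
    """One speculative-decoding step (toy): a draft model proposes a batch
    of tokens; verify() plays the target model, accepting tokens until the
    first disagreement. Output quality is identical because every accepted
    token is one the target model endorses."""
    accepted = []
    for tok in draft_tokens:
        if verify(accepted, tok):
            accepted.append(tok)
        else:
            break
    return accepted

# Toy target model: rejects only the token "wrong"
verify = lambda prefix, tok: tok != "wrong"
result = speculative_step(["the", "cat", "wrong", "sat"], verify)  # ['the', 'cat']
```

Speedup comes from amortisation: when the draft is usually right, several tokens are committed per target-model pass instead of one.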
Multi-agent systems can consume 4–15× more tokens than single-agent calls if not carefully orchestrated. Parallel execution, tool fusion, and model tiering within agent graphs dramatically reduce token overhead.
Quantisation reduces model weight precision from FP16 to INT8 or INT4, cutting memory requirements by 2–4× with minimal quality loss. Enables running larger models on cheaper hardware.
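The idea in miniature: map floating-point weights onto the integer range [-127, 127] with a per-tensor scale, halving (INT8) or quartering (INT4) the bytes stored. A pure-Python sketch of symmetric INT8 quantisation; real frameworks do this per-channel with calibration:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantisation: scale FP weights into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate FP values from INT8 codes."""
    return [v * scale for v in quantized]

w = [0.5, -1.0, 0.25]
q, s = quantize_int8(w)
restored = dequantize(q, s)   # close to the originals at half the FP16 bytes
```

The quality loss is the rounding error visible in `restored`, which is why INT8 is usually near-lossless while INT4 needs more careful calibration.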
Where your model runs shapes per-token economics as much as which model you choose. At low volumes, API access wins. At scale, the economics flip decisively. Two-thirds of enterprises are now repatriating AI workloads on-premise (Finout 2026).
As monthly token volume grows, the break-even points arrive in sequence:

- NCP becomes cheaper per token than API access
- AI factory per-token cost drops below API access
- AI factory beats both API and NCP on per-token TCO
- AI factory delivers 50%+ savings vs API at equivalent token volumes
Source: Deloitte "The Pivot to Tokenomics" — AI Economics Report 2025
- **Compute hardware:** Largest direct cost. NVIDIA HGX B200 GPUs, high-bandwidth memory, accelerators. Dominant cost at 10B token scale.
- **Power and cooling:** AI GPU racks draw 250–300kW vs 10–15kW for standard servers. Liquid cooling required at scale.
- **Software and support:** AI frameworks, orchestration tools, MLOps platforms, compliance tooling, enterprise support.
- **Networking:** InfiniBand/NVLink GPU interconnects. High-bandwidth switches. Typically contributes 10–20% of TCO.
You cannot optimise what you cannot see. AI FinOps is the emerging practice of applying cloud financial governance discipline to token-based spending — and it's now the #1 priority for FinOps teams in 2026.
State of FinOps 2026 (FinOps Foundation): 98% of respondents now use FinOps to manage AI spend (up from 31% in 2024). 58% cite AI cost management as their most desired skill addition. 33% named FinOps for AI as their top current or future priority — ahead of all others. Yet 80% of companies still miss AI infrastructure cost forecasts by more than 25%.