Every AI interaction costs tokens. Every token carries a price. And in the agentic era, a single user request can fan out into 5–50 LLM calls. This guide breaks down 2026 token economics — and gives you the exact levers to cut spend by 70–85% without sacrificing quality.
Tokens are the atomic unit of AI computation — not characters, not words, but sub-word fragments that every LLM uses to read input and write output. Understanding them is understanding the bill.
Common words are 1 token. Complex or rare words split into multiple tokens. Code is densest — often 1 token per character.
GPT-5.5: $5 input vs $30 output per MTok. Claude Opus 4.7: $5 vs $25. A model that writes a lot costs far more than one that reads a lot.
Models like GPT-5.5 (reasoning mode), Claude Opus 4.7 Extended Thinking, and Gemini 3.1 Pro Deep Think generate internal chain-of-thought tokens that are billed but never shown to you — often 3–10× the visible output.
One user request can trigger 5–50 LLM + MCP tool calls. Simple agents use 5K–15K tokens per task; complex multi-agent systems consume 200K–1M+ tokens per task. This is the dominant cost story of 2026.
Your instructions to the model. 200–2,000 tokens. Sent on every single call — prime candidate for prompt caching.
All prior messages in the conversation. Grows every turn. The silent multiplier — often the largest input cost driver.
The actual query plus any retrieved documents (RAG). Usually the smallest component but grows with retrieval.
What the model writes back. Billed at 4–10× the input rate. Longer responses = exponentially higher cost. The most expensive component per token.
USD per million tokens (MTok) — May 2026. The frontier has bifurcated: Western flagships sit at $5–$30 (input) / $25–$180 (output), while Chinese open-weight models like DeepSeek V4 Flash undercut at $0.14/$0.28 with parity quality. Output still costs 4–10× input — always model the ratio for your workload.
| Model | Provider | Tier | Input $/MTok | Output $/MTok | Cached Input | Context | Best For |
|---|---|---|---|---|---|---|---|
| GPT-5.5 Pro | OpenAI | Flagship | $30.00 | $180.00 | $3.00 | 1M | Hardest reasoning, executive tasks |
| GPT-5.5 | OpenAI | Flagship | $5.00 | $30.00 | $0.50 | 1M | Agentic-first default, 78.7% OSWorld-V |
| Claude Opus 4.7 | Anthropic | Flagship | $5.00 | $25.00 | $0.30 | 1M | #1 on SWE-bench Pro (64.3%), high-stakes coding & analysis |
| Gemini 3.1 Pro | Flagship | $2.00 | $12.00 | $0.20 | 1M | Price-perf winner. Leads 13/16 benchmarks at 1/2 the cost | |
| GPT-5.5 (Reasoning Mode) | OpenAI | Reasoning | $5.00 | $30.00 | $0.50 | 1M | Multi-step logic — internal CoT tokens billed |
| Claude Opus 4.7 Extended Thinking | Anthropic | Reasoning | $5.00 | $25.00 | $0.30 | 1M | Planning, long-horizon analysis. CoT counted as output |
| Claude Sonnet 4.6 | Anthropic | Mid-Tier | $3.00 | $15.00 | $0.30 | 1M | Enterprise copilots, the default mid-tier for coding agents |
| Mistral Medium 3.5 | Mistral | Mid-Tier | $1.50 | $7.50 | $0.15 | 256K | Open-weight (mod. MIT). 77.6% SWE-bench. EU-sovereign default |
| DeepSeek V4 Pro | DeepSeek | Mid-Tier | $1.74 | $3.48 | $0.035 | 1M | Best dollar-per-quality at frontier-adjacent tier |
| Qwen 3.6 Plus | Alibaba | Mid-Tier | $0.40 | $1.20 | $0.04 | 1M | Beats Opus 4.6 on Terminal-Bench 2.0. Free on OpenRouter |
| Claude Haiku 4.5 | Anthropic | Fast & Cheap | $1.00 | $5.00 | $0.10 | 200K | Realtime copilots, support bots, sub-agent tier |
| Gemini 3.1 Flash | Fast & Cheap | $0.30 | $2.50 | $0.03 | 1M | High-volume summarisation, long-context cheap | |
| GPT-5 Mini | OpenAI | Fast & Cheap | $0.25 | $2.00 | $0.025 | 200K | Automation, high-volume batch |
| DeepSeek V4 Flash | DeepSeek | Fast & Cheap | $0.14 | $0.28 | $0.003 | 1M | The new 2026 nano-tier default. 30–70% routing cost cut |
| GPT-5 Nano | OpenAI | Fast & Cheap | $0.05 | $0.40 | $0.005 | 128K | Cheapest OpenAI option, simple classification + extraction |
| Qwen 3.6 (open weights) | Alibaba | Open Source | Free* | Free* | — | 1M | Apache 2.0. *Self-host or via OpenRouter, sovereignty default |
| Llama 4 Maverick | Open Source | $0.15 | $0.60 | — | 1M | Open weights, fine-tuning, sovereignty | |
| Gemma 4 (31B Dense) | Open Source | Free* | Free* | — | 128K | Apache 2.0. 400M+ downloads. #3 globally on Arena | |
| MiniMax M2.7 | MiniMax | Open Source | $0.40 | $1.50 | — | 256K | 229B MoE. Best OSS at ELO 1495 (GDPval-AA) |
These are proven, production-tested techniques. Combined, organisations routinely achieve 40–80% cost reduction. Start with the high-impact ones — they require minimal engineering effort.
Providers cache the key-value matrices of repeated prompt prefixes. This is the single highest-impact optimisation — up to 90% cheaper on cached input tokens with minimal code changes.
cache_control parameter for explicit breakpoints. Target 80–95% hit rate. At 95% hit rate, Anthropic caching delivers a 82% effective input-cost reduction versus OpenAI's 55% — a key reason cache-adjusted pricing often favours Claude despite higher list prices.
The 2026 default architecture. Send 70% of traffic to nano-tier (DeepSeek V4 Flash / GPT-5 Nano / Gemini 3.1 Flash) for classification + simple tool calls, 25% to mid-tier (Sonnet 4.6 / Qwen 3.6 / GLM-5.1) for structured reasoning, and only 5% to frontier (Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro) for ambiguous planning. Cost runs at ~15% of all-frontier, with statistically indistinguishable quality on 200-pair golden sets.
gen_ai spans BEFORE turning on routing.
Every message in a multi-turn chat is resent on every call. A 40-turn conversation carries 25,000+ input tokens — the largest invisible cost driver. Sliding window truncation + summarisation eliminates this.
Instead of caching exact strings, semantic caching uses vector embeddings to serve cached responses when queries are semantically similar (≥0.85–0.95 cosine similarity). Avoids the API call entirely for repetitive queries.
Verbose prompts don't produce better outputs — they just cost more. Filler phrases like "You are a helpful, professional, knowledgeable assistant" add tokens with zero information gain. Telling a model to "be concise" reduces output tokens by 57–59%.
Output tokens cost 4–10× more than input. Setting max_tokens limits and requesting structured JSON instead of prose cuts the most expensive part of every call.
max_tokens parameter per call typemax_tokens as a hard upper bound for each use-case tier.
RAG pipelines often dump entire document chunks into context — including sentences irrelevant to the query. Pre-filtering retrieved chunks to only query-relevant sentences cuts RAG input tokens by 50–80% with maintained accuracy.
Most providers offer 50% discounts for asynchronous batch API calls. Non-real-time workloads — nightly reports, document analysis, bulk classification — are ideal candidates with zero quality trade-off.
Reasoning models (GPT-5.5 Reasoning, Claude Opus 4.7 Extended Thinking, Gemini 3.1 Pro Deep Think) generate internal chain-of-thought tokens before responding — billed but invisible. CoT can be 3–10× the visible output. For a FAQ bot or classifier, you may be paying 5–20× more than necessary.
Fine-tuning embeds task knowledge directly into the model, eliminating the need for lengthy few-shot examples in every prompt. For stable, high-volume tasks, fine-tuned smaller models can outperform generic larger ones at a fraction of the cost.
* Savings are on applicable token spend per technique — not additive across all calls. Most production systems achieve 70–85% total spend reduction by combining prompt caching + 70/25/5 routing + context management + agentic budget caps. Source: Obvious Works 2026, Redis LLMOps Guide, FinOps Foundation State of FinOps 2026.
In the chatbot era, one user message = one API call. In the agentic era, one user message can fan out into 5–50 LLM + MCP tool calls — and a Reflexion loop running 10 cycles consumes 50× the tokens of a linear pass. This is the dominant tokenomics story of 2026, and it's why 73% of enterprises blew past their AI budget.
1 call · 500–2K tokens · $0.001–$0.02 per task
3–8 calls · 5K–15K tokens · $0.05–$0.50 per task
15–30 calls · 50K–200K tokens · $1–$8 per task
50+ calls · 200K–1M+ tokens · $5–$50 per task
Anyone who modelled costs on chatbot benchmarks is now 5–30× over budget. Microsoft Research: an unconstrained coding agent costs $5–$8 to resolve a single SWE-bench issue.
A self-critique loop running 10 cycles consumes 50× the tokens of a single linear pass. Without circuit breakers, a stuck ReAct loop can burn $50/min on frontier models.
Episodic memory writes (MemOS, Mem0, Zep) and skill-doc generation (Hermes Agent pattern) consume output tokens. Budget for them in OTEL gen_ai.tokens.write spans.
The 2026 SLO unit: cost-per-resolved-task, not cost-per-token. Agent SREs alarm when OTEL cost.per_task p95 exceeds 2× baseline, auto-rolling back via Argo Rollouts.
FinOps Foundation State of FinOps 2026: Average enterprise AI budget grew from $1.2M (2024) → $7M (2026). 73% of enterprises report AI costs exceeded original budget projections. Fortune 500 firms report monthly inference bills in the tens of millions. The root cause in nearly every case: chatbot-era cost models applied to agentic-era workloads.
Beyond prompt-level changes, these infrastructure techniques address token costs at the compute layer — particularly relevant for self-hosted deployments and high-scale agentic systems.
For self-hosted deployments, the KV (key-value) cache is the dominant memory bottleneck. Recent research shows 70–90% memory reduction is achievable with minimal accuracy loss — enabling longer contexts on the same hardware.
Pair the target model with a lightweight draft model that proposes multiple tokens simultaneously. The target model verifies the batch in a single forward pass — achieving 2–5× faster generation with identical output quality.
Multi-agent systems can consume 4–15× more tokens than single-agent calls if not carefully orchestrated. Parallel execution, tool fusion, and model tiering within agent graphs dramatically reduce token overhead.
Quantisation reduces model weight precision from FP16 to INT8 or INT4, cutting memory requirements by 2–4× with minimal quality loss. Enables running larger models on cheaper hardware.
Where your model runs shapes per-token economics as much as which model you choose. At low volumes, API access wins. At scale, the economics flip decisively. Two-thirds of enterprises are now repatriating AI workloads on-premise (Finout 2026).
NCP becomes cheaper per-token than API access
AI factory per-token cost drops below API access
AI factory beats both API and NCP on per-token TCO
AI factory delivers 50%+ savings vs API at equivalent token volumes
Source: Deloitte "The Pivot to Tokenomics" — AI Economics Report 2025
Largest direct cost. NVIDIA HGX B200 GPUs, high-bandwidth memory, accelerators. Dominant cost at 10B token scale.
AI GPU racks draw 250–300kW vs 10–15kW for standard servers. Liquid cooling required at scale.
AI frameworks, orchestration tools, MLOps platforms, compliance tooling, enterprise support.
InfiniBand/NVLink GPU interconnects. High-bandwidth switches. Contributes 10–20% of TCO typically.
You cannot optimise what you cannot see. AI FinOps is the emerging practice of applying cloud financial governance discipline to token-based spending — and it's now the #1 priority for FinOps teams in 2026.
State of FinOps 2026 (FinOps Foundation): AI is now the fastest-growing new category of enterprise spend — average AI budget jumped from $1.2M (2024) to $7M (2026), and 73% of enterprises report AI costs exceeded original budget projections. 98% of respondents now use FinOps to manage AI spend (up from 31% in 2024). 33% named FinOps for AI as their #1 current or future priority — ahead of all others. New 2026 metric: cost per thought and token budget per project have replaced raw token counts as the unit economics of choice.
gen_ai spans to CFO-grade unit economics. Enterprise FinOps standard.Estimate your monthly API spend and see how the optimisation levers reduce it in real time. Toggle each lever to build your optimisation roadmap.