AI Tokenomics — Understand & Optimise Token Spend (May 2026 Update)

01 — Fundamentals

What is a Token?

Tokens are the atomic unit of AI computation — not characters, not words, but sub-word fragments that every LLM uses to read input and write output. Understanding them is understanding the bill.

Type to see live tokenisation:

≈ 0 tokens · 0 chars · ratio: 0 chars/token

1 token ≈ ¾ of an English word

Common words are 1 token. Complex or rare words split into multiple tokens. Code is densest — often 1 token per character.

Output tokens cost 4–10× more

GPT-5.5: $5 input vs $30 output per MTok. Claude Opus 4.7: $5 vs $25. A model that writes a lot costs far more than one that reads a lot.

Reasoning models have hidden token cost

Models like GPT-5.5 (reasoning mode), Claude Opus 4.7 Extended Thinking, and Gemini 3.1 Pro Deep Think generate internal chain-of-thought tokens that are billed but never shown to you — often 3–10× the visible output.

Agentic loops fan-out 5–50×

One user request can trigger 5–50 LLM + MCP tool calls. Simple agents use 5K–15K tokens per task; complex multi-agent systems consume 200K–1M+ tokens per task. This is the dominant cost story of 2026.

How One API Call Consumes Tokens

INPUT

System Prompt

Your instructions to the model. 200–2,000 tokens. Sent on every single call — prime candidate for prompt caching.

→

INPUT

Chat History

All prior messages in the conversation. Grows every turn. The silent multiplier — often the largest input cost driver.

→

INPUT

User Message + Context

The actual query plus any retrieved documents (RAG). Usually the smallest component but grows with retrieval.

→

OUTPUT 4–10×

Model Response

What the model writes back. Billed at 4–10× the input rate. Longer responses = exponentially higher cost. The most expensive component per token.

Three Ways to Buy AI Tokens

📦

Packaged SaaS

Per-seat subscription (e.g. Microsoft Copilot ~$30/user/mo). Tokens are invisible — bundled into the vendor's price. Low control, high simplicity. Risk: you cannot optimise what you cannot see.

Low visibility · Predictable cost

🔌

API Access

Pay per token, metered in real time. Full visibility, full volatility. Costs scale with every prompt length decision and model choice you make. Best for builders who want control.

Full transparency · Variable cost

🏭

AI Factory (Self-hosted)

Own or co-locate the GPUs. Tokens emerge from capex decisions. Maximum sovereignty, lowest per-token cost at scale (84B+ tokens/year). Requires MLOps capability and significant upfront investment.

Maximum control · High capex

02 — Market Pricing

Token Prices Across Providers

USD per million tokens (MTok) — May 2026. The frontier has bifurcated: Western flagships sit at $5–$30 (input) / $25–$180 (output), while Chinese open-weight models like DeepSeek V4 Flash undercut at $0.14/$0.28 with parity quality. Output still costs 4–10× input — always model the ratio for your workload.

Model	Provider	Tier	Input $/MTok	Output $/MTok	Cached Input	Context	Best For
GPT-5.5 Pro	OpenAI	Flagship	$30.00	$180.00	$3.00	1M	Hardest reasoning, executive tasks
GPT-5.5	OpenAI	Flagship	$5.00	$30.00	$0.50	1M	Agentic-first default, 78.7% OSWorld-V
Claude Opus 4.7	Anthropic	Flagship	$5.00	$25.00	$0.30	1M	#1 on SWE-bench Pro (64.3%), high-stakes coding & analysis
Gemini 3.1 Pro	Google	Flagship	$2.00	$12.00	$0.20	1M	Price-perf winner. Leads 13/16 benchmarks at 1/2 the cost
GPT-5.5 (Reasoning Mode)	OpenAI	Reasoning	$5.00	$30.00	$0.50	1M	Multi-step logic — internal CoT tokens billed
Claude Opus 4.7 Extended Thinking	Anthropic	Reasoning	$5.00	$25.00	$0.30	1M	Planning, long-horizon analysis. CoT counted as output
Claude Sonnet 4.6	Anthropic	Mid-Tier	$3.00	$15.00	$0.30	1M	Enterprise copilots, the default mid-tier for coding agents
Mistral Medium 3.5	Mistral	Mid-Tier	$1.50	$7.50	$0.15	256K	Open-weight (mod. MIT). 77.6% SWE-bench. EU-sovereign default
DeepSeek V4 Pro	DeepSeek	Mid-Tier	$1.74	$3.48	$0.035	1M	Best dollar-per-quality at frontier-adjacent tier
Qwen 3.6 Plus	Alibaba	Mid-Tier	$0.40	$1.20	$0.04	1M	Beats Opus 4.6 on Terminal-Bench 2.0. Free on OpenRouter
Claude Haiku 4.5	Anthropic	Fast & Cheap	$1.00	$5.00	$0.10	200K	Realtime copilots, support bots, sub-agent tier
Gemini 3.1 Flash	Google	Fast & Cheap	$0.30	$2.50	$0.03	1M	High-volume summarisation, long-context cheap
GPT-5 Mini	OpenAI	Fast & Cheap	$0.25	$2.00	$0.025	200K	Automation, high-volume batch
DeepSeek V4 Flash	DeepSeek	Fast & Cheap	$0.14	$0.28	$0.003	1M	The new 2026 nano-tier default. 30–70% routing cost cut
GPT-5 Nano	OpenAI	Fast & Cheap	$0.05	$0.40	$0.005	128K	Cheapest OpenAI option, simple classification + extraction
Qwen 3.6 (open weights)	Alibaba	Open Source	Free*	Free*	—	1M	Apache 2.0. *Self-host or via OpenRouter, sovereignty default
Llama 4 Maverick	Meta	Open Source	$0.15	$0.60	—	1M	Open weights, fine-tuning, sovereignty
Gemma 4 (31B Dense)	Google	Open Source	Free*	Free*	—	128K	Apache 2.0. 400M+ downloads. #3 globally on Arena
MiniMax M2.7	MiniMax	Open Source	$0.40	$1.50	—	256K	229B MoE. Best OSS at ELO 1495 (GDPval-AA)

Prices from Anthropic Pricing, OpenAI API Pricing, DeepSeek, Gemini 3.1 Pro Guide, AI Pricing Guru 2026. Prices change frequently — verify directly with each provider before budgeting.

Real-World Cost Scenarios

Chatbot — 800 in / 400 out tokens/turn · 10K users · 20 turns/day

GPT-5 Nano$13/mo

DeepSeek V4 Flash$15/mo

Gemini 3.1 Flash$78/mo

Claude Haiku 4.5$192/mo

Claude Sonnet 4.6$504/mo

RAG Pipeline — 8,000 in / 800 out per query · 50K queries/month

DeepSeek V4 Flash$67/mo

GPT-5 Mini$180/mo

Gemini 3.1 Flash$220/mo

Gemini 3.1 Pro$1,280/mo

Claude Sonnet 4.6$1,800/mo

Code Generation — 2,000 in / 1,500 out per request · 500 req/day

DeepSeek V4 Flash$8/mo

Qwen 3.6 Plus$66/mo

DeepSeek V4 Pro$130/mo

Claude Sonnet 4.6$427/mo

Claude Opus 4.7$712/mo

GPT-5.5$825/mo

Agentic Workflow — 1 user task = 25 LLM+MCP calls · 500K tokens/task · 1K tasks/day

DeepSeek V4 Flash (all calls)$1,260/mo

70/25/5 routing (V4 Flash + Sonnet + Opus)$5,400/mo

All Sonnet 4.6$22,500/mo

All Opus 4.7 (worst case)$37,500/mo

All GPT-5.5$45,000/mo

03 — Core Optimisation

10 Levers to Cut Token Spend

These are proven, production-tested techniques. Combined, organisations routinely achieve 40–80% cost reduction. Start with the high-impact ones — they require minimal engineering effort.

Biggest Lever

Prompt Caching (KV Cache)

Providers cache the key-value matrices of repeated prompt prefixes. This is the single highest-impact optimisation — up to 90% cheaper on cached input tokens with minimal code changes.

May 2026 Provider Cache Pricing

Anthropic Opus 4.7 / Sonnet 4.6: $3.00 → $0.30/MTok cached (90% off)
OpenAI GPT-5.5 (auto-caching): 50% off input, no code change
DeepSeek V4 Flash: $0.14 → $0.003/MTok cached (98% off) — most aggressive in market
Gemini 3.1 Pro: $2.00 → $0.20/MTok cached (90% off)
AWS Bedrock AgentCore: 90% cost cut, 85% latency cut, MCP-aware
Min 1,024 tokens; TTL 5 min standard, 1 hour Anthropic extended

Implementation: Place stable content first — system prompt, docs, tool definitions, MCP tool schemas. Place dynamic content last — user queries, session data. Anthropic: use cache_control parameter for explicit breakpoints. Target 80–95% hit rate. At 95% hit rate, Anthropic caching delivers a 82% effective input-cost reduction versus OpenAI's 55% — a key reason cache-adjusted pricing often favours Claude despite higher list prices.

Up to 90–98% off input tokens

High Impact

Multi-Model Routing — The 70/25/5 Rule

The 2026 default architecture. Send 70% of traffic to nano-tier (DeepSeek V4 Flash / GPT-5 Nano / Gemini 3.1 Flash) for classification + simple tool calls, 25% to mid-tier (Sonnet 4.6 / Qwen 3.6 / GLM-5.1) for structured reasoning, and only 5% to frontier (Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro) for ambiguous planning. Cost runs at ~15% of all-frontier, with statistically indistinguishable quality on 200-pair golden sets.

2026 routing strategies (production-validated)

Static rules (intent → model map): 30–40% cost reduction, <5 min to ship
Embedding-based (k-means buckets): 50–70% reduction
Trained classifier (RouteLLM-style): 65–85% reduction
Cascade with confidence threshold: 70–85% reduction — 2026 default
Architecture lifts (42%→78%) outperform model upgrades alone

Implementation: LiteLLM (33K+ stars, OpenAI-compatible proxy, 100+ providers) is the safest first move. Portkey for enterprise gateway features. Bifrost (Rust) for sub-ms latency. Watch for: cascade thrash (threshold too high → escalates everything), tool-schema mismatch across providers, context-window asymmetry. Wire OTEL gen_ai spans BEFORE turning on routing.

70–85% reduction at frontier-equivalent quality

High Impact

Truncate & Summarise Conversation History

Every message in a multi-turn chat is resent on every call. A 40-turn conversation carries 25,000+ input tokens — the largest invisible cost driver. Sliding window truncation + summarisation eliminates this.

Implementation patterns

Sliding window: Keep only last N turns
Summarisation: After 10 turns, compress older history to 1 paragraph
Structured state: Extract facts into a key-value store, not raw chat
Topic clear: Reset context on new user intent/topic
Max messages limit: Hard-cap turns per session in agent platforms

Implementation: Use a cheap model (Haiku, Flash-Lite) to summarise older context before the main call. The summarisation cost is negligible vs the savings. Context engines with this pattern achieve 40–60% input reduction.

40–70% context token reduction

High Impact

Semantic Caching

Instead of caching exact strings, semantic caching uses vector embeddings to serve cached responses when queries are semantically similar (≥0.85–0.95 cosine similarity). Avoids the API call entirely for repetitive queries.

2026 production benchmarks

Cache hit rates: 60–85% in support/FAQ workloads
API call reduction: up to 68.8% fewer calls
Latency: 1.67s → 0.052s per cache hit (96.9% faster)
Cost reduction: up to 73% on conversational workloads
31% of LLM queries show semantic similarity — often untapped

Implementation: Redis with LangCache, Portkey, Helicone, or Bifrost all support semantic caching in 2026 via one-line gateway integration. Namespace by model + provider to avoid cross-contamination. Skip caching for conversations exceeding ~10 turns to reduce false positives.

50–73% on semantically repetitive workloads

Medium Impact

Concise Prompt Engineering

Verbose prompts don't produce better outputs — they just cost more. Filler phrases like "You are a helpful, professional, knowledgeable assistant" add tokens with zero information gain. Telling a model to "be concise" reduces output tokens by 57–59%.

Before → After

You are a helpful, knowledgeable, friendly, professional customer support agent who always responds in a polite and courteous manner and ensures customer satisfaction...
↓ 55% fewer tokens, same output quality
You are a customer support agent. Be concise and accurate.

Implementation: Audit every system prompt. Remove filler, pleasantries, and redundant instructions. LLMLingua prompt compression achieves 20× token reduction with only 1.5% quality loss — available as a plug-and-play LangChain/LlamaIndex integration.

30–55% prompt size reduction

Medium Impact

Constrain Output Length & Format

Output tokens cost 4–10× more than input. Setting max_tokens limits and requesting structured JSON instead of prose cuts the most expensive part of every call.

Control techniques

Set explicit max_tokens parameter per call type
Request JSON/structured output schemas: ~15% token reduction vs prose
Use stop sequences to prevent unnecessary continuation
Request bullet responses: "Answer in 3 bullet points"
Use "be concise" instruction: 57–59% output reduction (OPSDC research)

Implementation: Map output length requirements to use-case type. FAQ = 1–2 sentences. Summary = 3–5 bullets. Analysis = structured JSON. Set max_tokens as a hard upper bound for each use-case tier.

15–60% output token reduction

Medium Impact

RAG Context Compression

RAG pipelines often dump entire document chunks into context — including sentences irrelevant to the query. Pre-filtering retrieved chunks to only query-relevant sentences cuts RAG input tokens by 50–80% with maintained accuracy.

Optimised RAG pipeline

Retrieve top-K chunks (e.g. 10 chunks)
Score each sentence for query relevance (cheap model)
Pass only high-relevance sentences (~20% of retrieved text)
Reduce Top-K from 10 to 3–5 via hybrid search
Result: same answer quality, 50–80% fewer tokens

Implementation: Use a small model (Flash-Lite, Haiku) or a BM25/reranker to score and filter context before passing to the main LLM. One production case study: cost per contract dropped from expensive to $0.91, with 40% latency reduction.

25–80% RAG context reduction

Medium Impact

Batch Processing

Most providers offer 50% discounts for asynchronous batch API calls. Non-real-time workloads — nightly reports, document analysis, bulk classification — are ideal candidates with zero quality trade-off.

Batch API discount (2026)

OpenAI Batch API: 50% off all token costs
Anthropic Message Batches: 50% off standard pricing
Use cases: doc processing, bulk tagging, analytics, embeddings
Processing time: minutes to 24 hours (vs milliseconds sync)

Implementation: Audit all AI calls for real-time necessity. Most analytics, classification, and reporting workflows can be shifted to batch. No engineering complexity — just use the batch endpoint.

50% off all eligible non-real-time workloads

Targeted Impact

Avoid Reasoning Models for Simple Tasks

Reasoning models (GPT-5.5 Reasoning, Claude Opus 4.7 Extended Thinking, Gemini 3.1 Pro Deep Think) generate internal chain-of-thought tokens before responding — billed but invisible. CoT can be 3–10× the visible output. For a FAQ bot or classifier, you may be paying 5–20× more than necessary.

When to use reasoning models

✓ Multi-step mathematical/logical deduction
✓ Code debugging with complex dependency chains
✓ Long-horizon agentic planning (orchestrator role only)
✓ Legal/financial analysis requiring step-by-step reasoning
✗ Summarisation, classification, translation, FAQ
✗ Sub-agent tool calls — use Sonnet 4.6 / Qwen 3.6 instead

Implementation: Implement a reasoning-model gate: only route to o1/extended-thinking when a classifier scores the task as requiring multi-step inference. Default to standard models.

5–20× cost reduction vs always-on reasoning

Enabler

Fine-Tuning for Repetitive High-Volume Tasks

Fine-tuning embeds task knowledge directly into the model, eliminating the need for lengthy few-shot examples in every prompt. For stable, high-volume tasks, fine-tuned smaller models can outperform generic larger ones at a fraction of the cost.

Economics (2026)

GPT-4.1 fine-tuning: $25/MTok training tokens
Inference savings: 40–60% fewer tokens per call
Self-hosted Llama fine-tuned: savings reach 20–25× vs proprietary API
Break-even: typically ~1M calls on the specific task
Only viable for stable tasks with 10K+ calls/month

Implementation: Only invest in fine-tuning for tasks where: (1) instructions are stable, (2) volume is high, (3) few-shot examples are large and repetitive. Avoid for evolving tasks or low-volume use cases.

40–75% long-term inference reduction

Combined Optimisation Potential

Prompt caching (KV cache)

90%

Semantic caching

73%

Model routing / right-sizing

60–80%

History truncation & summarisation

40–70%

Concise prompt engineering

30–55%

RAG context compression

25–80%

Batch processing

50%

Output length constraints

15–60%

* Savings are on applicable token spend per technique — not additive across all calls. Most production systems achieve 70–85% total spend reduction by combining prompt caching + 70/25/5 routing + context management + agentic budget caps. Source: Obvious Works 2026, Redis LLMOps Guide, FinOps Foundation State of FinOps 2026.

03B — The 2026 Cost Story

Agentic Token Bloat: When One Request Becomes 50

In the chatbot era, one user message = one API call. In the agentic era, one user message can fan out into 5–50 LLM + MCP tool calls — and a Reflexion loop running 10 cycles consumes 50× the tokens of a linear pass. This is the dominant tokenomics story of 2026, and it's why 73% of enterprises blew past their AI budget.

Token Consumption by Agent Complexity

SIMPLE

Single-Shot Chatbot

1 call · 500–2K tokens · $0.001–$0.02 per task

TOOL-USE

Tool-Calling Agent (ReAct)

3–8 calls · 5K–15K tokens · $0.05–$0.50 per task

AGENTIC

Multi-Step Agent (Three-Agent Harness)

15–30 calls · 50K–200K tokens · $1–$8 per task

RUNAWAY 50×

Multi-Agent / Reflexion Loops

50+ calls · 200K–1M+ tokens · $5–$50 per task

5–30× cost multiplier vs chatbot baseline

Anyone who modelled costs on chatbot benchmarks is now 5–30× over budget. Microsoft Research: an unconstrained coding agent costs $5–$8 to resolve a single SWE-bench issue.

Reflexion loops compound exponentially

A self-critique loop running 10 cycles consumes 50× the tokens of a single linear pass. Without circuit breakers, a stuck ReAct loop can burn $50/min on frontier models.

Memory writes are silent token costs

Episodic memory writes (MemOS, Mem0, Zep) and skill-doc generation (Hermes Agent pattern) consume output tokens. Budget for them in OTEL gen_ai.tokens.write spans.

Cost-per-Task replaces cost-per-token

The 2026 SLO unit: cost-per-resolved-task, not cost-per-token. Agent SREs alarm when OTEL cost.per_task p95 exceeds 2× baseline, auto-rolling back via Argo Rollouts.

7 Patterns That Tame Agentic Token Bloat

🚦

Circuit Breakers

Open after 5 consecutive failures or 25% error rate over 30s. LangGraph 2.0 supports them at graph-edge level. Stops $50/min ReAct death-spirals before the budget alert fires.

Mandatory · Production default

🪢

Context Auto-Reset

Cosine similarity <0.7 between goal and recent actions = context drift. Save state snapshot → fresh context → resume from checkpoint. Cuts silent failure >50%. Salesforce Agentforce 3.0 native.

High impact · Easy retrofit

💰

Budget Hard Caps

$ per agent per session. $ per task. Calls/minute throttling. Hard-fail on breach. Microsoft Agent 365 ships this as default policy. Treats token spend as a first-class SLO.

Mandatory · CFO-friendly

🔁

Loop Detection (Rolling Hash)

Detects >3× same tool call with same args = stuck loop. Rolling-hash 30s window. Auto-restart from last good checkpoint. Cheapest single change that prevents catastrophic bills.

Quick win · 1-hour ship

🧠

MemRL + Skill Docs

Memory-as-RL: agent writes skill-doc Markdown after every >5-tool-call task (Hermes Agent pattern, 65K+ stars). Next session starts past prior failure mode = 40% task-time reduction.

Compound savings

🗜️

MemOS Token Compression

MemOS v1.1 (MemTensor) ships modular MemCube architecture with 72% token reduction via OpenClaw plugin. Apache Cassandra + Valkey backends scale to billions of rows p99 <50ms.

2026 OSS standard

📞

Idempotency Keys

Every MCP tool call gets a deterministic key. Voice/barge-in retry? Duplicate dedup'd. Stops one ambiguous user turn becoming N billed bookings. Mandatory for any payment-touching agent.

Mandatory · Safety + cost

🔪

Kill Switch T1–T4 (<1s)

App (OAuth revoke) → Network (egress block) → Infra (container down) → Memory (DB freeze). EU AI Act Annex III evidence requirement. Also caps maximum financial blast radius per agent run.

Compliance + cost ceiling

FinOps Foundation State of FinOps 2026: Average enterprise AI budget grew from $1.2M (2024) → $7M (2026). 73% of enterprises report AI costs exceeded original budget projections. Fortune 500 firms report monthly inference bills in the tens of millions. The root cause in nearly every case: chatbot-era cost models applied to agentic-era workloads.

04 — Advanced Techniques

Infrastructure-Level Optimisation

Beyond prompt-level changes, these infrastructure techniques address token costs at the compute layer — particularly relevant for self-hosted deployments and high-scale agentic systems.

Self-Hosted

KV Cache Compression

For self-hosted deployments, the KV (key-value) cache is the dominant memory bottleneck. Recent research shows 70–90% memory reduction is achievable with minimal accuracy loss — enabling longer contexts on the same hardware.

2026 state-of-the-art (ICLR 2026)

Google TurboQuant: 6× memory reduction, zero accuracy loss, no calibration
NVIDIA KVTC: Up to 20× compression via PCA + entropy coding
FastKV: 1.82× faster prefill + 2.87× faster decoding vs baseline
FP8/INT4 quantisation: 2–4× memory reduction, supported in vLLM natively

Implementation: Use vLLM with PagedAttention for 14–24× higher throughput vs naive implementations. Apply FP8 KV quantisation on NVIDIA Hopper/Blackwell GPUs. For cutting-edge: TurboQuant/KVTC land at ICLR April 2026.

6–20× memory reduction → lower hardware cost

Self-Hosted

Speculative Decoding

Pair the target model with a lightweight draft model that proposes multiple tokens simultaneously. The target model verifies the batch in a single forward pass — achieving 2–5× faster generation with identical output quality.

2026 benchmarks

Standard speculative decoding: 2–4× faster inference
Speculative Speculative Decoding (Saguaro): 5× vs autoregressive
14–17% throughput gain on Oracle OCI with A100 GPUs
Zero accuracy degradation — output is mathematically equivalent

Implementation: Supported natively in vLLM, SGLang, and TRT-LLM. DeepSeek uses Multi-Token Prediction (MTP) heads as a draft mechanism. EAGLE/EAGLE-2 are the most widely deployed speculative decoding variants as of 2026.

2–5× throughput increase → lower GPU hours per token

Agentic Systems

Agentic Workflow Optimisation

Multi-agent systems can consume 4–15× more tokens than single-agent calls if not carefully orchestrated. Parallel execution, tool fusion, and model tiering within agent graphs dramatically reduce token overhead.

Key agentic cost patterns

DAG-based topologies: Parallel instead of sequential tool calls
Tool fusion: Combine related tool calls → 12–40% token reduction
Model tiering: Haiku for sub-tasks, Sonnet for orchestration, Opus for core reasoning
Agent cost pre-estimation: Use LLM to evaluate plan cost before execution
Token quotas: Hard-cap tokens per agent per session to prevent runaway costs

Implementation: Set monthly quota limits ($) in agent platforms. Set execution limits per minute to prevent runaway loops. Set time limits per conversation. Track cost per agent per task. Source: Tonic3 Agentic Budget Framework 2025.

12–40% token reduction in multi-agent systems

Infrastructure

Model Quantisation (Self-hosted)

Quantisation reduces model weight precision from FP16 to INT8 or INT4, cutting memory requirements by 2–4× with minimal quality loss. Enables running larger models on cheaper hardware.

Quantisation options

FP8 (8-bit): 2× memory savings, near-zero quality loss
INT4 (4-bit): 4× memory savings, <5% accuracy delta on most tasks
Llama 70B at INT4: runs on 2× A100 80GB vs 4× in FP16
Libraries: bitsandbytes, AutoGPTQ, GGUF (llama.cpp)
H100 GPUs: 80% more cost-efficient per token vs older hardware

Implementation: Use INT8 for production serving as the conservative default. Use INT4 for less critical inference workloads where you've tested quality. Always benchmark quality on your specific task before deploying quantised models.

2–4× memory reduction → lower infrastructure cost

05 — Infrastructure Decision

Build vs Buy: The Hosting Decision

Where your model runs shapes per-token economics as much as which model you choose. At low volumes, API access wins. At scale, the economics flip decisively. Two-thirds of enterprises are now repatriating AI workloads on-premise (Finout 2026).

☁️

API Access

Pure OpexInstant setup

✓ No upfront investment
✓ Instant start, infinite scale
✓ Latest models available immediately
✓ Best for spiky / exploratory workloads

✗ Highest per-token cost
✗ Costs scale linearly, unpredictably
✗ No data sovereignty control
✗ Vendor lock-in risk

$0.075–$168 per MTok output

Best below ~7B tokens/month

⚡

Neocloud (NCP)

Pure OpexInstant

✓ Purpose-built for AI workloads
✓ Lower latency than hyperscalers
✓ Dynamic GPU provisioning
✓ Good mid-point before full ownership

✗ No control over physical layer
✗ High on-demand price variability
✗ External data residency risk

~$1–$4/GPU hour

Cheaper than API at 49B+ tokens/year

🏭

AI Factory (On-Prem)

Capex ModelHigh control

✓ Lowest per-token cost at scale
✓ Full data sovereignty
✓ Open-source models (free inference)
✓ Custom fine-tuning, no vendor lock-in

✗ Large upfront capex
✗ Multi-month procurement
✗ MLOps expertise required
✗ GPU obsolescence risk (annual release cycles)

~$1–$2/GPU hour amortised

Wins decisively at 84B+ tokens/year

TCO Inflection Points (Deloitte Simulation)

49B tokens/yr

NCP becomes cheaper per-token than API access

67B tokens/yr

AI factory per-token cost drops below API access

84B tokens/yr

AI factory beats both API and NCP on per-token TCO

3-year horizon

AI factory delivers 50%+ savings vs API at equivalent token volumes

Year 1
10B tokens

API $1.06M

NCP $0.97M

Factory $0.49M

Year 2
300B tokens

API $3.50M

NCP $2.72M

Factory $1.45M

API Neocloud AI Factory

Source: Deloitte "The Pivot to Tokenomics" — AI Economics Report 2025

AI Factory TCO Breakdown (10B tokens/year = $1.45M)

Compute (GPUs) 53% · $125,080

Largest direct cost. NVIDIA HGX B200 GPUs, high-bandwidth memory, accelerators. Dominant cost at 10B token scale.

Facilities, Power & Cooling 17% · $40,670

AI GPU racks draw 250–300kW vs 10–15kW for standard servers. Liquid cooling required at scale.

Software & Licensing 17% · $39,646

AI frameworks, orchestration tools, MLOps platforms, compliance tooling, enterprise support.

Networking 13% · $31,302

InfiniBand/NVLink GPU interconnects. High-bandwidth switches. Contributes 10–20% of TCO typically.

06 — AI FinOps

Token Governance & FinOps

You cannot optimise what you cannot see. AI FinOps is the emerging practice of applying cloud financial governance discipline to token-based spending — and it's now the #1 priority for FinOps teams in 2026.

State of FinOps 2026 (FinOps Foundation): AI is now the fastest-growing new category of enterprise spend — average AI budget jumped from $1.2M (2024) to $7M (2026), and 73% of enterprises report AI costs exceeded original budget projections. 98% of respondents now use FinOps to manage AI spend (up from 31% in 2024). 33% named FinOps for AI as their #1 current or future priority — ahead of all others. New 2026 metric: cost per thought and token budget per project have replaced raw token counts as the unit economics of choice.

FinOps Maturity Model for Token Spend

Level 1 — Inform

Observability

Log input + output tokens per API call
Tag every request with model name, team, use-case
Build dashboards: tokens/user, cost/app, cost/department
Set monthly budget alerts by business unit
Surface prompt efficiency metrics

Level 2 — Optimise

Cost Reduction

Activate prompt caching on all eligible system prompts
Implement model routing rules per use-case type
Apply concise prompt engineering across all templates
Shift non-real-time workloads to batch endpoints
Compress RAG context before sending to LLM

Level 3 — Operate

Governance

Track cost-per-unit-of-value (cost per resolved ticket, etc.)
Monthly review: top 10 token-consuming call types
Set hard quotas: $ per agent/month, calls/minute throttling
Establish model tier policies per use-case type
Auto-routing layer for least-cost capable model selection

Monitoring & Observability Tools (May 2026)

Microsoft Agent 365

Enterprise control plane · GA May 1 2026

Central agent registry, Entra workload identity, per-agent SLOs + burn-rate alerts, kill switches, auto EU AI Act Annex III audit trail. $15/user standalone or M365 E7. First new MS tier since E5 (2015).

LiteLLM

OSS routing proxy · 33K+ stars

OpenAI-compatible proxy across 100+ providers. The safest first move for multi-model routing. Sub-ms overhead, supports 70/25/5 + cascade strategies.

Langfuse

OSS tracing · MIT · 21K+ stars

OTEL-native, self-hostable. Token cost dashboards, prompt versioning, online evals on sampled prod traffic. Best general-purpose OSS LLM observability.

Arize Phoenix

OSS · OTEL-native · Agent Evals MCP

$70M Series C. Ships Agent Evals MCP — runs production evals as IDE actions inside Cursor / Claude Code. Drift detection on cost-per-task spans.

Braintrust

Eval-as-code · $80M Series

Same metric definition runs in CI gate (>2% regression blocks merge), shadow on mirrored prod traffic, and online prod sampled traffic. Argo Rollouts auto-rollback on quality drop.

LangSmith

LangGraph-native · Insights Agent

First-class threads + Insights Agent auto-clusters failure modes and drafts golden-set additions. Closes the self-healing dataset loop. Best for LangGraph pipelines.

Galileo Signals

Live traffic eval · Prescriptive remediation

Lightweight LLM-as-Judge on sampled prod. Signals auto-clusters millions of traces and prescribes fixes ("swap to GPT-5 Nano on this intent class → 38% cost cut"). Agent SRE favourite.

Finout / Vantage

Enterprise AI FinOps · Chargeback

Token-level attribution, virtual tagging, anomaly detection, chargeback/showback by BU. Connects OTEL gen_ai spans to CFO-grade unit economics. Enterprise FinOps standard.

The New Currencyof Enterprise AI

What is a Token?

How One API Call Consumes Tokens

Three Ways to Buy AI Tokens

Packaged SaaS

API Access

AI Factory (Self-hosted)

Token Prices Across Providers

Real-World Cost Scenarios

10 Levers to Cut Token Spend

Prompt Caching (KV Cache)

Multi-Model Routing — The 70/25/5 Rule

Truncate & Summarise Conversation History

Semantic Caching

Concise Prompt Engineering

Constrain Output Length & Format

RAG Context Compression

Batch Processing

Avoid Reasoning Models for Simple Tasks

Fine-Tuning for Repetitive High-Volume Tasks

Combined Optimisation Potential

Agentic Token Bloat: When One Request Becomes 50

Token Consumption by Agent Complexity

7 Patterns That Tame Agentic Token Bloat

Circuit Breakers

Context Auto-Reset

Budget Hard Caps

Loop Detection (Rolling Hash)

MemRL + Skill Docs

MemOS Token Compression

Idempotency Keys

Kill Switch T1–T4 (<1s)

Infrastructure-Level Optimisation

KV Cache Compression

Speculative Decoding

Agentic Workflow Optimisation

Model Quantisation (Self-hosted)

Build vs Buy: The Hosting Decision

API Access

Neocloud (NCP)

AI Factory (On-Prem)

TCO Inflection Points (Deloitte Simulation)

AI Factory TCO Breakdown (10B tokens/year = $1.45M)

Token Governance & FinOps

FinOps Maturity Model for Token Spend

Observability

Cost Reduction

Governance

Monitoring & Observability Tools (May 2026)

Token Cost Calculator

Your Workload

Apply Optimisation Levers

The New Currency
of Enterprise AI