AI FinOps · Precision Economics · May 2026 Update

The New Currency
of Enterprise AI

Every AI interaction costs tokens. Every token carries a price. And in the agentic era, a single user request can fan out into 5–50 LLM calls. This guide breaks down 2026 token economics — and gives you the exact levers to cut spend by 70–85% without sacrificing quality.

$1.2M → $7M Average enterprise AI budget growth, 2024 → 2026 (FinOps Foundation State of FinOps 2026)
73% of enterprises report AI costs exceeded original budget projections in 2026 (FinOps Foundation)
5–30× Token cost multiplier when moving from chatbot to multi-step agentic workflows
98% of enterprises now use FinOps to manage AI spend, up from 31% in 2024 (FinOps Foundation 2026)
1M+Tokens consumed per task by complex multi-agent systems
90%Input cost reduction via prompt caching (Anthropic / Bedrock)
70/25/5The 2026 model-routing rule: 70% nano · 25% mid · 5% frontier
84BAnnual tokens where AI factory beats API (Deloitte TCO)

What is a Token?

Tokens are the atomic unit of AI computation — not characters, not words, but sub-word fragments that every LLM uses to read input and write output. Understanding them is understanding the bill.

0 tokens  ·  0 chars  ·  ratio: 0 chars/token
1 token ≈ ¾ of an English word

Common words are 1 token. Complex or rare words split into multiple tokens. Code is densest — often 1 token per character.

Output tokens cost 4–10× more

GPT-5.5: $5 input vs $30 output per MTok. Claude Opus 4.7: $5 vs $25. A model that writes a lot costs far more than one that reads a lot.

Reasoning models have hidden token cost

Models like GPT-5.5 (reasoning mode), Claude Opus 4.7 Extended Thinking, and Gemini 3.1 Pro Deep Think generate internal chain-of-thought tokens that are billed but never shown to you — often 3–10× the visible output.

Agentic loops fan-out 5–50×

One user request can trigger 5–50 LLM + MCP tool calls. Simple agents use 5K–15K tokens per task; complex multi-agent systems consume 200K–1M+ tokens per task. This is the dominant cost story of 2026.

How One API Call Consumes Tokens

1
INPUT
System Prompt

Your instructions to the model. 200–2,000 tokens. Sent on every single call — prime candidate for prompt caching.

2
INPUT
Chat History

All prior messages in the conversation. Grows every turn. The silent multiplier — often the largest input cost driver.

3
INPUT
User Message + Context

The actual query plus any retrieved documents (RAG). Usually the smallest component but grows with retrieval.

4
OUTPUT 4–10×
Model Response

What the model writes back. Billed at 4–10× the input rate. Longer responses = exponentially higher cost. The most expensive component per token.

Three Ways to Buy AI Tokens

📦

Packaged SaaS

Per-seat subscription (e.g. Microsoft Copilot ~$30/user/mo). Tokens are invisible — bundled into the vendor's price. Low control, high simplicity. Risk: you cannot optimise what you cannot see.

Low visibility · Predictable cost
🔌

API Access

Pay per token, metered in real time. Full visibility, full volatility. Costs scale with every prompt length decision and model choice you make. Best for builders who want control.

Full transparency · Variable cost
🏭

AI Factory (Self-hosted)

Own or co-locate the GPUs. Tokens emerge from capex decisions. Maximum sovereignty, lowest per-token cost at scale (84B+ tokens/year). Requires MLOps capability and significant upfront investment.

Maximum control · High capex

Token Prices Across Providers

USD per million tokens (MTok) — May 2026. The frontier has bifurcated: Western flagships sit at $5–$30 (input) / $25–$180 (output), while Chinese open-weight models like DeepSeek V4 Flash undercut at $0.14/$0.28 with parity quality. Output still costs 4–10× input — always model the ratio for your workload.

Model Provider Tier Input $/MTok Output $/MTok Cached Input Context Best For
GPT-5.5 Pro OpenAI Flagship $30.00 $180.00 $3.00 1M Hardest reasoning, executive tasks
GPT-5.5 OpenAI Flagship $5.00 $30.00 $0.50 1M Agentic-first default, 78.7% OSWorld-V
Claude Opus 4.7 Anthropic Flagship $5.00 $25.00 $0.30 1M #1 on SWE-bench Pro (64.3%), high-stakes coding & analysis
Gemini 3.1 Pro Google Flagship $2.00 $12.00 $0.20 1M Price-perf winner. Leads 13/16 benchmarks at 1/2 the cost
GPT-5.5 (Reasoning Mode) OpenAI Reasoning $5.00 $30.00 $0.50 1M Multi-step logic — internal CoT tokens billed
Claude Opus 4.7 Extended Thinking Anthropic Reasoning $5.00 $25.00 $0.30 1M Planning, long-horizon analysis. CoT counted as output
Claude Sonnet 4.6 Anthropic Mid-Tier $3.00 $15.00 $0.30 1M Enterprise copilots, the default mid-tier for coding agents
Mistral Medium 3.5 Mistral Mid-Tier $1.50 $7.50 $0.15 256K Open-weight (mod. MIT). 77.6% SWE-bench. EU-sovereign default
DeepSeek V4 Pro DeepSeek Mid-Tier $1.74 $3.48 $0.035 1M Best dollar-per-quality at frontier-adjacent tier
Qwen 3.6 Plus Alibaba Mid-Tier $0.40 $1.20 $0.04 1M Beats Opus 4.6 on Terminal-Bench 2.0. Free on OpenRouter
Claude Haiku 4.5 Anthropic Fast & Cheap $1.00 $5.00 $0.10 200K Realtime copilots, support bots, sub-agent tier
Gemini 3.1 Flash Google Fast & Cheap $0.30 $2.50 $0.03 1M High-volume summarisation, long-context cheap
GPT-5 Mini OpenAI Fast & Cheap $0.25 $2.00 $0.025 200K Automation, high-volume batch
DeepSeek V4 Flash DeepSeek Fast & Cheap $0.14 $0.28 $0.003 1M The new 2026 nano-tier default. 30–70% routing cost cut
GPT-5 Nano OpenAI Fast & Cheap $0.05 $0.40 $0.005 128K Cheapest OpenAI option, simple classification + extraction
Qwen 3.6 (open weights) Alibaba Open Source Free* Free* 1M Apache 2.0. *Self-host or via OpenRouter, sovereignty default
Llama 4 Maverick Meta Open Source $0.15 $0.60 1M Open weights, fine-tuning, sovereignty
Gemma 4 (31B Dense) Google Open Source Free* Free* 128K Apache 2.0. 400M+ downloads. #3 globally on Arena
MiniMax M2.7 MiniMax Open Source $0.40 $1.50 256K 229B MoE. Best OSS at ELO 1495 (GDPval-AA)
Prices from Anthropic Pricing, OpenAI API Pricing, DeepSeek, Gemini 3.1 Pro Guide, AI Pricing Guru 2026. Prices change frequently — verify directly with each provider before budgeting.

Real-World Cost Scenarios

Chatbot — 800 in / 400 out tokens/turn · 10K users · 20 turns/day
GPT-5 Nano$13/mo
DeepSeek V4 Flash$15/mo
Gemini 3.1 Flash$78/mo
Claude Haiku 4.5$192/mo
Claude Sonnet 4.6$504/mo
RAG Pipeline — 8,000 in / 800 out per query · 50K queries/month
DeepSeek V4 Flash$67/mo
GPT-5 Mini$180/mo
Gemini 3.1 Flash$220/mo
Gemini 3.1 Pro$1,280/mo
Claude Sonnet 4.6$1,800/mo
Code Generation — 2,000 in / 1,500 out per request · 500 req/day
DeepSeek V4 Flash$8/mo
Qwen 3.6 Plus$66/mo
DeepSeek V4 Pro$130/mo
Claude Sonnet 4.6$427/mo
Claude Opus 4.7$712/mo
GPT-5.5$825/mo
Agentic Workflow — 1 user task = 25 LLM+MCP calls · 500K tokens/task · 1K tasks/day
DeepSeek V4 Flash (all calls)$1,260/mo
70/25/5 routing (V4 Flash + Sonnet + Opus)$5,400/mo
All Sonnet 4.6$22,500/mo
All Opus 4.7 (worst case)$37,500/mo
All GPT-5.5$45,000/mo

10 Levers to Cut Token Spend

These are proven, production-tested techniques. Combined, organisations routinely achieve 40–80% cost reduction. Start with the high-impact ones — they require minimal engineering effort.

01
Biggest Lever

Prompt Caching (KV Cache)

Providers cache the key-value matrices of repeated prompt prefixes. This is the single highest-impact optimisation — up to 90% cheaper on cached input tokens with minimal code changes.

May 2026 Provider Cache Pricing
  • Anthropic Opus 4.7 / Sonnet 4.6: $3.00 → $0.30/MTok cached (90% off)
  • OpenAI GPT-5.5 (auto-caching): 50% off input, no code change
  • DeepSeek V4 Flash: $0.14 → $0.003/MTok cached (98% off) — most aggressive in market
  • Gemini 3.1 Pro: $2.00 → $0.20/MTok cached (90% off)
  • AWS Bedrock AgentCore: 90% cost cut, 85% latency cut, MCP-aware
  • Min 1,024 tokens; TTL 5 min standard, 1 hour Anthropic extended
Implementation: Place stable content first — system prompt, docs, tool definitions, MCP tool schemas. Place dynamic content last — user queries, session data. Anthropic: use cache_control parameter for explicit breakpoints. Target 80–95% hit rate. At 95% hit rate, Anthropic caching delivers a 82% effective input-cost reduction versus OpenAI's 55% — a key reason cache-adjusted pricing often favours Claude despite higher list prices.
Up to 90–98% off input tokens
02
High Impact

Multi-Model Routing — The 70/25/5 Rule

The 2026 default architecture. Send 70% of traffic to nano-tier (DeepSeek V4 Flash / GPT-5 Nano / Gemini 3.1 Flash) for classification + simple tool calls, 25% to mid-tier (Sonnet 4.6 / Qwen 3.6 / GLM-5.1) for structured reasoning, and only 5% to frontier (Opus 4.7 / GPT-5.5 / Gemini 3.1 Pro) for ambiguous planning. Cost runs at ~15% of all-frontier, with statistically indistinguishable quality on 200-pair golden sets.

2026 routing strategies (production-validated)
  • Static rules (intent → model map): 30–40% cost reduction, <5 min to ship
  • Embedding-based (k-means buckets): 50–70% reduction
  • Trained classifier (RouteLLM-style): 65–85% reduction
  • Cascade with confidence threshold: 70–85% reduction — 2026 default
  • Architecture lifts (42%→78%) outperform model upgrades alone
Implementation: LiteLLM (33K+ stars, OpenAI-compatible proxy, 100+ providers) is the safest first move. Portkey for enterprise gateway features. Bifrost (Rust) for sub-ms latency. Watch for: cascade thrash (threshold too high → escalates everything), tool-schema mismatch across providers, context-window asymmetry. Wire OTEL gen_ai spans BEFORE turning on routing.
70–85% reduction at frontier-equivalent quality
03
High Impact

Truncate & Summarise Conversation History

Every message in a multi-turn chat is resent on every call. A 40-turn conversation carries 25,000+ input tokens — the largest invisible cost driver. Sliding window truncation + summarisation eliminates this.

Implementation patterns
  • Sliding window: Keep only last N turns
  • Summarisation: After 10 turns, compress older history to 1 paragraph
  • Structured state: Extract facts into a key-value store, not raw chat
  • Topic clear: Reset context on new user intent/topic
  • Max messages limit: Hard-cap turns per session in agent platforms
Implementation: Use a cheap model (Haiku, Flash-Lite) to summarise older context before the main call. The summarisation cost is negligible vs the savings. Context engines with this pattern achieve 40–60% input reduction.
40–70% context token reduction
04
High Impact

Semantic Caching

Instead of caching exact strings, semantic caching uses vector embeddings to serve cached responses when queries are semantically similar (≥0.85–0.95 cosine similarity). Avoids the API call entirely for repetitive queries.

2026 production benchmarks
  • Cache hit rates: 60–85% in support/FAQ workloads
  • API call reduction: up to 68.8% fewer calls
  • Latency: 1.67s → 0.052s per cache hit (96.9% faster)
  • Cost reduction: up to 73% on conversational workloads
  • 31% of LLM queries show semantic similarity — often untapped
Implementation: Redis with LangCache, Portkey, Helicone, or Bifrost all support semantic caching in 2026 via one-line gateway integration. Namespace by model + provider to avoid cross-contamination. Skip caching for conversations exceeding ~10 turns to reduce false positives.
50–73% on semantically repetitive workloads
05
Medium Impact

Concise Prompt Engineering

Verbose prompts don't produce better outputs — they just cost more. Filler phrases like "You are a helpful, professional, knowledgeable assistant" add tokens with zero information gain. Telling a model to "be concise" reduces output tokens by 57–59%.

Before → After
You are a helpful, knowledgeable, friendly, professional customer support agent who always responds in a polite and courteous manner and ensures customer satisfaction...
↓ 55% fewer tokens, same output quality
You are a customer support agent. Be concise and accurate.
Implementation: Audit every system prompt. Remove filler, pleasantries, and redundant instructions. LLMLingua prompt compression achieves 20× token reduction with only 1.5% quality loss — available as a plug-and-play LangChain/LlamaIndex integration.
30–55% prompt size reduction
06
Medium Impact

Constrain Output Length & Format

Output tokens cost 4–10× more than input. Setting max_tokens limits and requesting structured JSON instead of prose cuts the most expensive part of every call.

Control techniques
  • Set explicit max_tokens parameter per call type
  • Request JSON/structured output schemas: ~15% token reduction vs prose
  • Use stop sequences to prevent unnecessary continuation
  • Request bullet responses: "Answer in 3 bullet points"
  • Use "be concise" instruction: 57–59% output reduction (OPSDC research)
Implementation: Map output length requirements to use-case type. FAQ = 1–2 sentences. Summary = 3–5 bullets. Analysis = structured JSON. Set max_tokens as a hard upper bound for each use-case tier.
15–60% output token reduction
07
Medium Impact

RAG Context Compression

RAG pipelines often dump entire document chunks into context — including sentences irrelevant to the query. Pre-filtering retrieved chunks to only query-relevant sentences cuts RAG input tokens by 50–80% with maintained accuracy.

Optimised RAG pipeline
  • Retrieve top-K chunks (e.g. 10 chunks)
  • Score each sentence for query relevance (cheap model)
  • Pass only high-relevance sentences (~20% of retrieved text)
  • Reduce Top-K from 10 to 3–5 via hybrid search
  • Result: same answer quality, 50–80% fewer tokens
Implementation: Use a small model (Flash-Lite, Haiku) or a BM25/reranker to score and filter context before passing to the main LLM. One production case study: cost per contract dropped from expensive to $0.91, with 40% latency reduction.
25–80% RAG context reduction
08
Medium Impact

Batch Processing

Most providers offer 50% discounts for asynchronous batch API calls. Non-real-time workloads — nightly reports, document analysis, bulk classification — are ideal candidates with zero quality trade-off.

Batch API discount (2026)
  • OpenAI Batch API: 50% off all token costs
  • Anthropic Message Batches: 50% off standard pricing
  • Use cases: doc processing, bulk tagging, analytics, embeddings
  • Processing time: minutes to 24 hours (vs milliseconds sync)
Implementation: Audit all AI calls for real-time necessity. Most analytics, classification, and reporting workflows can be shifted to batch. No engineering complexity — just use the batch endpoint.
50% off all eligible non-real-time workloads
09
Targeted Impact

Avoid Reasoning Models for Simple Tasks

Reasoning models (GPT-5.5 Reasoning, Claude Opus 4.7 Extended Thinking, Gemini 3.1 Pro Deep Think) generate internal chain-of-thought tokens before responding — billed but invisible. CoT can be 3–10× the visible output. For a FAQ bot or classifier, you may be paying 5–20× more than necessary.

When to use reasoning models
  • ✓ Multi-step mathematical/logical deduction
  • ✓ Code debugging with complex dependency chains
  • ✓ Long-horizon agentic planning (orchestrator role only)
  • ✓ Legal/financial analysis requiring step-by-step reasoning
  • ✗ Summarisation, classification, translation, FAQ
  • ✗ Sub-agent tool calls — use Sonnet 4.6 / Qwen 3.6 instead
Implementation: Implement a reasoning-model gate: only route to o1/extended-thinking when a classifier scores the task as requiring multi-step inference. Default to standard models.
5–20× cost reduction vs always-on reasoning
10
Enabler

Fine-Tuning for Repetitive High-Volume Tasks

Fine-tuning embeds task knowledge directly into the model, eliminating the need for lengthy few-shot examples in every prompt. For stable, high-volume tasks, fine-tuned smaller models can outperform generic larger ones at a fraction of the cost.

Economics (2026)
  • GPT-4.1 fine-tuning: $25/MTok training tokens
  • Inference savings: 40–60% fewer tokens per call
  • Self-hosted Llama fine-tuned: savings reach 20–25× vs proprietary API
  • Break-even: typically ~1M calls on the specific task
  • Only viable for stable tasks with 10K+ calls/month
Implementation: Only invest in fine-tuning for tasks where: (1) instructions are stable, (2) volume is high, (3) few-shot examples are large and repetitive. Avoid for evolving tasks or low-volume use cases.
40–75% long-term inference reduction

Combined Optimisation Potential

Prompt caching (KV cache)
90%
Semantic caching
73%
Model routing / right-sizing
60–80%
History truncation & summarisation
40–70%
Concise prompt engineering
30–55%
RAG context compression
25–80%
Batch processing
50%
Output length constraints
15–60%

* Savings are on applicable token spend per technique — not additive across all calls. Most production systems achieve 70–85% total spend reduction by combining prompt caching + 70/25/5 routing + context management + agentic budget caps. Source: Obvious Works 2026, Redis LLMOps Guide, FinOps Foundation State of FinOps 2026.

Agentic Token Bloat: When One Request Becomes 50

In the chatbot era, one user message = one API call. In the agentic era, one user message can fan out into 5–50 LLM + MCP tool calls — and a Reflexion loop running 10 cycles consumes 50× the tokens of a linear pass. This is the dominant tokenomics story of 2026, and it's why 73% of enterprises blew past their AI budget.

Token Consumption by Agent Complexity

1
SIMPLE
Single-Shot Chatbot

1 call · 500–2K tokens · $0.001–$0.02 per task

2
TOOL-USE
Tool-Calling Agent (ReAct)

3–8 calls · 5K–15K tokens · $0.05–$0.50 per task

3
AGENTIC
Multi-Step Agent (Three-Agent Harness)

15–30 calls · 50K–200K tokens · $1–$8 per task

4
RUNAWAY 50×
Multi-Agent / Reflexion Loops

50+ calls · 200K–1M+ tokens · $5–$50 per task

5–30× cost multiplier vs chatbot baseline

Anyone who modelled costs on chatbot benchmarks is now 5–30× over budget. Microsoft Research: an unconstrained coding agent costs $5–$8 to resolve a single SWE-bench issue.

Reflexion loops compound exponentially

A self-critique loop running 10 cycles consumes 50× the tokens of a single linear pass. Without circuit breakers, a stuck ReAct loop can burn $50/min on frontier models.

Memory writes are silent token costs

Episodic memory writes (MemOS, Mem0, Zep) and skill-doc generation (Hermes Agent pattern) consume output tokens. Budget for them in OTEL gen_ai.tokens.write spans.

Cost-per-Task replaces cost-per-token

The 2026 SLO unit: cost-per-resolved-task, not cost-per-token. Agent SREs alarm when OTEL cost.per_task p95 exceeds 2× baseline, auto-rolling back via Argo Rollouts.

7 Patterns That Tame Agentic Token Bloat

🚦

Circuit Breakers

Open after 5 consecutive failures or 25% error rate over 30s. LangGraph 2.0 supports them at graph-edge level. Stops $50/min ReAct death-spirals before the budget alert fires.

Mandatory · Production default
🪢

Context Auto-Reset

Cosine similarity <0.7 between goal and recent actions = context drift. Save state snapshot → fresh context → resume from checkpoint. Cuts silent failure >50%. Salesforce Agentforce 3.0 native.

High impact · Easy retrofit
💰

Budget Hard Caps

$ per agent per session. $ per task. Calls/minute throttling. Hard-fail on breach. Microsoft Agent 365 ships this as default policy. Treats token spend as a first-class SLO.

Mandatory · CFO-friendly
🔁

Loop Detection (Rolling Hash)

Detects >3× same tool call with same args = stuck loop. Rolling-hash 30s window. Auto-restart from last good checkpoint. Cheapest single change that prevents catastrophic bills.

Quick win · 1-hour ship
🧠

MemRL + Skill Docs

Memory-as-RL: agent writes skill-doc Markdown after every >5-tool-call task (Hermes Agent pattern, 65K+ stars). Next session starts past prior failure mode = 40% task-time reduction.

Compound savings
🗜️

MemOS Token Compression

MemOS v1.1 (MemTensor) ships modular MemCube architecture with 72% token reduction via OpenClaw plugin. Apache Cassandra + Valkey backends scale to billions of rows p99 <50ms.

2026 OSS standard
📞

Idempotency Keys

Every MCP tool call gets a deterministic key. Voice/barge-in retry? Duplicate dedup'd. Stops one ambiguous user turn becoming N billed bookings. Mandatory for any payment-touching agent.

Mandatory · Safety + cost
🔪

Kill Switch T1–T4 (<1s)

App (OAuth revoke) → Network (egress block) → Infra (container down) → Memory (DB freeze). EU AI Act Annex III evidence requirement. Also caps maximum financial blast radius per agent run.

Compliance + cost ceiling

FinOps Foundation State of FinOps 2026: Average enterprise AI budget grew from $1.2M (2024) → $7M (2026). 73% of enterprises report AI costs exceeded original budget projections. Fortune 500 firms report monthly inference bills in the tens of millions. The root cause in nearly every case: chatbot-era cost models applied to agentic-era workloads.

Infrastructure-Level Optimisation

Beyond prompt-level changes, these infrastructure techniques address token costs at the compute layer — particularly relevant for self-hosted deployments and high-scale agentic systems.

A
Self-Hosted

KV Cache Compression

For self-hosted deployments, the KV (key-value) cache is the dominant memory bottleneck. Recent research shows 70–90% memory reduction is achievable with minimal accuracy loss — enabling longer contexts on the same hardware.

2026 state-of-the-art (ICLR 2026)
  • Google TurboQuant: 6× memory reduction, zero accuracy loss, no calibration
  • NVIDIA KVTC: Up to 20× compression via PCA + entropy coding
  • FastKV: 1.82× faster prefill + 2.87× faster decoding vs baseline
  • FP8/INT4 quantisation: 2–4× memory reduction, supported in vLLM natively
Implementation: Use vLLM with PagedAttention for 14–24× higher throughput vs naive implementations. Apply FP8 KV quantisation on NVIDIA Hopper/Blackwell GPUs. For cutting-edge: TurboQuant/KVTC land at ICLR April 2026.
6–20× memory reduction → lower hardware cost
B
Self-Hosted

Speculative Decoding

Pair the target model with a lightweight draft model that proposes multiple tokens simultaneously. The target model verifies the batch in a single forward pass — achieving 2–5× faster generation with identical output quality.

2026 benchmarks
  • Standard speculative decoding: 2–4× faster inference
  • Speculative Speculative Decoding (Saguaro): 5× vs autoregressive
  • 14–17% throughput gain on Oracle OCI with A100 GPUs
  • Zero accuracy degradation — output is mathematically equivalent
Implementation: Supported natively in vLLM, SGLang, and TRT-LLM. DeepSeek uses Multi-Token Prediction (MTP) heads as a draft mechanism. EAGLE/EAGLE-2 are the most widely deployed speculative decoding variants as of 2026.
2–5× throughput increase → lower GPU hours per token
C
Agentic Systems

Agentic Workflow Optimisation

Multi-agent systems can consume 4–15× more tokens than single-agent calls if not carefully orchestrated. Parallel execution, tool fusion, and model tiering within agent graphs dramatically reduce token overhead.

Key agentic cost patterns
  • DAG-based topologies: Parallel instead of sequential tool calls
  • Tool fusion: Combine related tool calls → 12–40% token reduction
  • Model tiering: Haiku for sub-tasks, Sonnet for orchestration, Opus for core reasoning
  • Agent cost pre-estimation: Use LLM to evaluate plan cost before execution
  • Token quotas: Hard-cap tokens per agent per session to prevent runaway costs
Implementation: Set monthly quota limits ($) in agent platforms. Set execution limits per minute to prevent runaway loops. Set time limits per conversation. Track cost per agent per task. Source: Tonic3 Agentic Budget Framework 2025.
12–40% token reduction in multi-agent systems
D
Infrastructure

Model Quantisation (Self-hosted)

Quantisation reduces model weight precision from FP16 to INT8 or INT4, cutting memory requirements by 2–4× with minimal quality loss. Enables running larger models on cheaper hardware.

Quantisation options
  • FP8 (8-bit): 2× memory savings, near-zero quality loss
  • INT4 (4-bit): 4× memory savings, <5% accuracy delta on most tasks
  • Llama 70B at INT4: runs on 2× A100 80GB vs 4× in FP16
  • Libraries: bitsandbytes, AutoGPTQ, GGUF (llama.cpp)
  • H100 GPUs: 80% more cost-efficient per token vs older hardware
Implementation: Use INT8 for production serving as the conservative default. Use INT4 for less critical inference workloads where you've tested quality. Always benchmark quality on your specific task before deploying quantised models.
2–4× memory reduction → lower infrastructure cost

Build vs Buy: The Hosting Decision

Where your model runs shapes per-token economics as much as which model you choose. At low volumes, API access wins. At scale, the economics flip decisively. Two-thirds of enterprises are now repatriating AI workloads on-premise (Finout 2026).

☁️

API Access

Pure OpexInstant setup
  • ✓ No upfront investment
  • ✓ Instant start, infinite scale
  • ✓ Latest models available immediately
  • ✓ Best for spiky / exploratory workloads
  • ✗ Highest per-token cost
  • ✗ Costs scale linearly, unpredictably
  • ✗ No data sovereignty control
  • ✗ Vendor lock-in risk
$0.075–$168 per MTok output
Best below ~7B tokens/month

Neocloud (NCP)

Pure OpexInstant
  • ✓ Purpose-built for AI workloads
  • ✓ Lower latency than hyperscalers
  • ✓ Dynamic GPU provisioning
  • ✓ Good mid-point before full ownership
  • ✗ No control over physical layer
  • ✗ High on-demand price variability
  • ✗ External data residency risk
~$1–$4/GPU hour
Cheaper than API at 49B+ tokens/year
🏭

AI Factory (On-Prem)

Capex ModelHigh control
  • ✓ Lowest per-token cost at scale
  • ✓ Full data sovereignty
  • ✓ Open-source models (free inference)
  • ✓ Custom fine-tuning, no vendor lock-in
  • ✗ Large upfront capex
  • ✗ Multi-month procurement
  • ✗ MLOps expertise required
  • ✗ GPU obsolescence risk (annual release cycles)
~$1–$2/GPU hour amortised
Wins decisively at 84B+ tokens/year

TCO Inflection Points (Deloitte Simulation)

49B tokens/yr

NCP becomes cheaper per-token than API access

67B tokens/yr

AI factory per-token cost drops below API access

84B tokens/yr

AI factory beats both API and NCP on per-token TCO

3-year horizon

AI factory delivers 50%+ savings vs API at equivalent token volumes

Year 1
10B tokens
API $1.06M
NCP $0.97M
Factory $0.49M
Year 2
300B tokens
API $3.50M
NCP $2.72M
Factory $1.45M
API    Neocloud    AI Factory

Source: Deloitte "The Pivot to Tokenomics" — AI Economics Report 2025

AI Factory TCO Breakdown (10B tokens/year = $1.45M)

$1.45M Annual TCO
Compute (GPUs) 53% · $125,080

Largest direct cost. NVIDIA HGX B200 GPUs, high-bandwidth memory, accelerators. Dominant cost at 10B token scale.

Facilities, Power & Cooling 17% · $40,670

AI GPU racks draw 250–300kW vs 10–15kW for standard servers. Liquid cooling required at scale.

Software & Licensing 17% · $39,646

AI frameworks, orchestration tools, MLOps platforms, compliance tooling, enterprise support.

Networking 13% · $31,302

InfiniBand/NVLink GPU interconnects. High-bandwidth switches. Contributes 10–20% of TCO typically.

Token Governance & FinOps

You cannot optimise what you cannot see. AI FinOps is the emerging practice of applying cloud financial governance discipline to token-based spending — and it's now the #1 priority for FinOps teams in 2026.

State of FinOps 2026 (FinOps Foundation): AI is now the fastest-growing new category of enterprise spend — average AI budget jumped from $1.2M (2024) to $7M (2026), and 73% of enterprises report AI costs exceeded original budget projections. 98% of respondents now use FinOps to manage AI spend (up from 31% in 2024). 33% named FinOps for AI as their #1 current or future priority — ahead of all others. New 2026 metric: cost per thought and token budget per project have replaced raw token counts as the unit economics of choice.

FinOps Maturity Model for Token Spend

Level 1 — Inform

Observability

  • Log input + output tokens per API call
  • Tag every request with model name, team, use-case
  • Build dashboards: tokens/user, cost/app, cost/department
  • Set monthly budget alerts by business unit
  • Surface prompt efficiency metrics
Level 2 — Optimise

Cost Reduction

  • Activate prompt caching on all eligible system prompts
  • Implement model routing rules per use-case type
  • Apply concise prompt engineering across all templates
  • Shift non-real-time workloads to batch endpoints
  • Compress RAG context before sending to LLM
Level 3 — Operate

Governance

  • Track cost-per-unit-of-value (cost per resolved ticket, etc.)
  • Monthly review: top 10 token-consuming call types
  • Set hard quotas: $ per agent/month, calls/minute throttling
  • Establish model tier policies per use-case type
  • Auto-routing layer for least-cost capable model selection

Monitoring & Observability Tools (May 2026)

Microsoft Agent 365
Enterprise control plane · GA May 1 2026
Central agent registry, Entra workload identity, per-agent SLOs + burn-rate alerts, kill switches, auto EU AI Act Annex III audit trail. $15/user standalone or M365 E7. First new MS tier since E5 (2015).
LiteLLM
OSS routing proxy · 33K+ stars
OpenAI-compatible proxy across 100+ providers. The safest first move for multi-model routing. Sub-ms overhead, supports 70/25/5 + cascade strategies.
Langfuse
OSS tracing · MIT · 21K+ stars
OTEL-native, self-hostable. Token cost dashboards, prompt versioning, online evals on sampled prod traffic. Best general-purpose OSS LLM observability.
Arize Phoenix
OSS · OTEL-native · Agent Evals MCP
$70M Series C. Ships Agent Evals MCP — runs production evals as IDE actions inside Cursor / Claude Code. Drift detection on cost-per-task spans.
Braintrust
Eval-as-code · $80M Series
Same metric definition runs in CI gate (>2% regression blocks merge), shadow on mirrored prod traffic, and online prod sampled traffic. Argo Rollouts auto-rollback on quality drop.
LangSmith
LangGraph-native · Insights Agent
First-class threads + Insights Agent auto-clusters failure modes and drafts golden-set additions. Closes the self-healing dataset loop. Best for LangGraph pipelines.
Galileo Signals
Live traffic eval · Prescriptive remediation
Lightweight LLM-as-Judge on sampled prod. Signals auto-clusters millions of traces and prescribes fixes ("swap to GPT-5 Nano on this intent class → 38% cost cut"). Agent SRE favourite.
Finout / Vantage
Enterprise AI FinOps · Chargeback
Token-level attribution, virtual tagging, anomaly detection, chargeback/showback by BU. Connects OTEL gen_ai spans to CFO-grade unit economics. Enterprise FinOps standard.

Token Cost Calculator

Estimate your monthly API spend and see how the optimisation levers reduce it in real time. Toggle each lever to build your optimisation roadmap.

Your Workload

Apply Optimisation Levers

Monthly Cost (Baseline)
$0
Monthly Cost (Optimised)
$0
0% saved per month
Daily input tokens
Daily output tokens
Monthly total tokens
Monthly input cost
Monthly output cost
Annual projection
Estimates are illustrative. Real costs depend on actual token counts, cache hit rates, and provider-specific pricing rules. Always prototype and measure before budgeting.