Agent Cost Observability: Tracking Tokens, Tool Calls, and Retry Costs
30-Second Takeaway
- The Core Problem: When agents run in production, API bills climb but no one can answer: “How much did this task cost?”, “Where did the money go — tokens, tool calls, or retries?”, “Which user burned the most?” Without cost observability, you’re spending blind.
- Four Tracking Dimensions: Token costs (per provider/model with exact pricing) → Tool call costs (LLM function definition overhead + third-party API fees) → Retry waste (15–30% of total cost — illustrative scenario, the most hidden drain) → Per-user/tenant attribution (who’s burning the budget in multi-tenant systems).
- Key Design Insight: Cost tracking granularity must match agent loop granularity — every LLM call, every tool execution, every retry is an independent cost record. The Provider Pricing Registry pattern means one table changes when model prices change.
- What You’ll Build: A complete cost observability layer for your agent system — from token pricing to tool cost registries, from retry waste quantification to per-user cost attribution, from budget hard stops to cost-aware model routing, all unified with OpenTelemetry + Prometheus into a single cost dashboard.
1. Why Cost Observability Is Different for Agents
June 2026. A production code review agent handles 800 PRs per day. Its monthly bill is $4,200. The team knows the number, but can't answer three questions: How much of that $4,200 went toward valuable reviews versus being consumed by retries? How is cost distributed across different customers? If traffic doubles next month, should the budget be $8,000 or $12,000?
These aren't "dashboard prettiness" problems — they're cost visibility gaps. Traditional observability's three pillars (metrics/traces/logs) tell you about latency and error rates, but they can't tell you where the dollars flow. Cost is the systematically neglected fourth dimension of observability.
Why Traditional Cost Tracking Breaks for Agents
If you only run single-shot LLM calls ("summarize this document," "translate this text"), cost tracking is straightforward: one cost record per request, sum at month-end. Agents are fundamentally different — they're multi-step loop bodies: think → call tool → observe result → think again → call another tool → ... → final output. A single task can trigger 5–30 LLM calls, 3–15 tool invocations, and 0–5 retries.
This means agent cost isn't "how much was one request" — it's a multi-dimensional cost vector:
- Token costs: Every LLM call's input/output token spend, varying wildly by provider and model — the same prompt costs ~$0.0003 on DeepSeek-V4-Flash versus ~$0.015 on GPT-5.5 (a 50× difference).
- Tool call costs: Agents call search APIs, run database queries, execute code in containers — all with independent pricing models unrelated to the LLM provider.
- Retry costs: Agents invoke wrong tools, hit timeouts, suffer hallucinations — these failed steps consume tokens and tool fees but generate zero value.
- Attribution costs: In multi-tenant SaaS, the same agent pipeline serves requests from 200 customers — each customer's cost must be independently tracked and billed.
Agent cost observability and traditional observability are complementary: latency tells you "which step was slow," cost tells you "which step was expensive" — cross-analyze both to answer "was the spend worth it?" For the complete observability framework (metrics/traces/logs), see Agent Observability. Cost is the fourth dimension — its data source is still traces and metrics, but its output is dollars, not milliseconds.
The Agent Loop Cost Structure: A Concrete Example
Here's a real cost breakdown for a single code review agent task (using GPT-5.4):
| Step | Operation | Input Tokens | Output Tokens | Token Cost | Tool Cost | Cumulative |
|---|---|---|---|---|---|---|
| 1 | LLM reasoning (analyze PR diff) | 8,200 | 450 | $0.0273 | — | $0.0273 |
| 2 | Tool: git blame (check author) | 10,500 | 180 | $0.0290 | $0.0001 | $0.0564 |
| 3 | LLM reasoning (analyze blame results) | 11,200 | 320 | $0.0328 | — | $0.0892 |
| 4 | Tool: search related issues | 12,800 | 200 | $0.0350 | $0.003 | $0.1272 |
| 5 | Retry (Step 4 timed out) | 12,800 | 200 | $0.0350 | $0.003 | $0.1652 |
| 6 | LLM reasoning (synthesize information) | 15,000 | 500 | $0.0450 | — | $0.2102 |
| 7 | LLM output review comments | 15,800 | 1,200 | $0.0575 | — | $0.2677 |
Key finding: In a single $0.27 task, $0.038 (14.2%) was spent on Step 5's retry — the retry succeeded, so it "looks fine." But at 800 tasks/day, this 14.2% amounts to $30/day of waste, $900/month. Without step-level cost tracking, you'd only see "each task costs ~$0.27" and never know this money exists.
Core insight: Agent cost tracking must operate at step-level granularity. Treating the entire task as a single cost unit masks three critical questions: which steps burn the most money, what fraction goes to retry waste, and whether tool calls cost more than LLM reasoning. None of these can be answered without step-level data, and cost optimization without answers is guesswork.
The Cost Observability Maturity Model
Don't build all dimensions at once — start with a minimal viable version and add incrementally. The table below defines four maturity levels:
| Level | What You Track | Questions You Can Answer | Engineering Effort |
|---|---|---|---|
| L1: Token Counting | Input/output tokens per LLM call | "How many tokens did I use this month?" | Extract usage from API response — 10 lines of code |
| L2: Dollar Pricing | L1 + Provider Pricing Registry → USD | "Is GPT-5.4 cheaper than DeepSeek-V4 for my workload?" | Maintain model pricing table + cost calculator function |
| L3: Multi-Dimensional Attribution | L2 + tool call costs + retry waste + per-user breakdown | "How much did Customer A spend last month? How much was wasted?" | CostRecord data model + trace context propagation + SQL aggregation |
| L4: Cost Control Loop | L3 + budget alerts + cost-aware routing + automated cost reduction | "Budget is nearly exhausted — auto-switch to cheaper model" | BudgetController + CostAwareRouter + Prometheus alerts |
This article covers L2–L4, assuming you already have L1 (if not, Section 2's code will get you there in one shot).
2. Token Cost Tracking Per Provider
Token cost is the starting point of observability — if your agent can't track how many tokens each call consumes, tool costs, retry analysis, and budget controls have nothing to build on. This section's goal: track token costs across OpenAI, Anthropic, DeepSeek, and any other provider through a single, unified interface.
Provider Pricing Comparison (June 2026)
Before choosing models, look at the numbers. Here's the current per-million-token pricing for major models:
| Provider | Model | Input ($/MTok) | Output ($/MTok) | Cache Read ($/MTok) | Batch Discount | Notes |
|---|---|---|---|---|---|---|
| OpenAI | GPT-5.5 | $5.00 | $30.00 | $0.50 | 50% | Best reasoning, most expensive |
| OpenAI | GPT-5.4 | $2.50 | $15.00 | $0.25 | 50% | Production workhorse |
| OpenAI | GPT-5.4-mini | $0.75 | $4.50 | $0.075 | 50% | Simple routing |
| OpenAI | GPT-5.4-nano | $0.20 | $1.25 | $0.02 | 50% | Cheapest GPT (account-specific pricing) |
| Anthropic | Claude Opus 4.8 | $5.00 | $25.00 | $0.50 | 50% | Complex agents |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 50% | Recommended production |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 50% | Lightweight tasks |
| DeepSeek | DeepSeek-V4-Flash | $0.14 | $0.28 | $0.0028 | — | Extreme value |
| DeepSeek | DeepSeek-V4-Pro | $0.435 | $0.87 | $0.003625 | — | DeepSeek flagship |
Key takeaways from the table:
- Massive price spread: GPT-5.5's input price is 35.7× DeepSeek-V4-Flash; output is 107×. A 10K-token prompt costs $0.05 on GPT-5.5 versus $0.0014 on DeepSeek-V4-Flash.
- The power of prompt caching: Most models price cache reads at 1/10 of regular input (OpenAI/Anthropic) or as low as 1/50 (DeepSeek). If your agent has a fixed system prompt (most agents do), enabling prompt caching can instantly halve your input costs.
- Batch API 50% discount: Tasks that don't require real-time responses (offline evaluation, batch analysis, overnight reports) should use batch APIs — this is the most overlooked cost reduction lever.
The Provider Pricing Registry Pattern
Each provider's pricing interface is incompatible. OpenAI returns usage.prompt_tokens, Anthropic returns usage.input_tokens, DeepSeek follows OpenAI format but with different cache hit field names. Worse, prompt caching details live in completely different field paths across providers.
The solution: a Provider Pricing Registry — a centralized price table paired with a standardized cost calculator. Its core design principle: when vendors change prices, you change one dictionary and all cost records automatically update.
from __future__ import annotations
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Tuple, Optional
from enum import Enum
# ---------------------------------------------------------------------------
# Provider Pricing Registry — single source of truth for all model costs.
# Updated June 2026 with current pricing from OpenAI, Anthropic, DeepSeek.
#
# PRICING CONSTANTS USE Decimal: binary float cannot represent most dollar
# amounts precisely. A single DeepSeek cache-hit token costs $0.0000000028
# and rounds to zero in float. Decimal preserves exact values.
# ---------------------------------------------------------------------------
class CostCategory(str, Enum):
TOKEN_INPUT = "token_input"
TOKEN_OUTPUT = "token_output"
TOKEN_CACHE_READ = "token_cache_read"
TOOL_CALL = "tool_call"
RETRY_WASTE = "retry_waste"
# Prices in USD per 1M tokens — Decimal from string literals (June 2026)
MODEL_PRICING: Dict[Tuple[str, str], Dict[str, Decimal]] = {
# OpenAI
("openai", "gpt-5.5"): {"input": Decimal("5.00"), "output": Decimal("30.00"), "cache_read": Decimal("0.50")},
("openai", "gpt-5.4"): {"input": Decimal("2.50"), "output": Decimal("15.00"), "cache_read": Decimal("0.25")},
("openai", "gpt-5.4-mini"): {"input": Decimal("0.75"), "output": Decimal("4.50"), "cache_read": Decimal("0.075")},
("openai", "gpt-5.4-nano"): {"input": Decimal("0.20"), "output": Decimal("1.25"), "cache_read": Decimal("0.02")}, # varies by account
# Anthropic
("anthropic", "claude-opus-4-8"): {"input": Decimal("5.00"), "output": Decimal("25.00"), "cache_read": Decimal("0.50")},
("anthropic", "claude-sonnet-4-6"): {"input": Decimal("3.00"), "output": Decimal("15.00"), "cache_read": Decimal("0.30")},
("anthropic", "claude-haiku-4-5"): {"input": Decimal("1.00"), "output": Decimal("5.00"), "cache_read": Decimal("0.10")},
# DeepSeek — roughly 18–36× cheaper than GPT-5.5 on input
("deepseek", "deepseek-v4-flash"): {"input": Decimal("0.14"), "output": Decimal("0.28"), "cache_read": Decimal("0.0028")},
("deepseek", "deepseek-v4-pro"): {"input": Decimal("0.435"), "output": Decimal("0.87"), "cache_read": Decimal("0.003625")},
}
# Batch processing gives 50% discount for OpenAI and Anthropic
BATCH_DISCOUNT = Decimal("0.5")
# 1M token divisor
_ONE_MILLION = Decimal("1_000_000")
def _cost_to_display(c: Decimal) -> str:
"""Format a Decimal cost for display. No rounding until display time."""
return f"${float(c):.8f}"
@dataclass
class TokenCost:
"""Result of a token cost calculation.
Token counts are integers (not money). All cost fields use Decimal
to avoid binary float precision loss on small dollar amounts.
"""
provider: str
model: str
input_tokens: int
output_tokens: int
cache_read_tokens: int
input_cost_usd: Decimal
output_cost_usd: Decimal
total_cost_usd: Decimal
def __repr__(self) -> str:
return (f"TokenCost({self.provider}/{self.model}: "
f"in={self.input_tokens} out={self.output_tokens} cache={self.cache_read_tokens} "
f"→ {_cost_to_display(self.total_cost_usd)})")
def calculate_token_cost(
provider: str,
model: str,
input_tokens: int,
output_tokens: int,
cache_read_tokens: int = 0,
*,
is_batch: bool = False,
) -> TokenCost:
"""
Calculate dollar cost from token usage using the central pricing registry.
All monetary arithmetic uses Decimal — aggregate unrounded, round only at
display time. Token counts remain plain ints (they're counts, not money).
"""
prices = MODEL_PRICING.get((provider.lower(), model.lower()))
if prices is None:
return TokenCost(
provider=provider, model=model,
input_tokens=input_tokens, output_tokens=output_tokens,
cache_read_tokens=cache_read_tokens,
input_cost_usd=Decimal("0"), output_cost_usd=Decimal("0"),
total_cost_usd=Decimal("0"),
)
cache_read = min(cache_read_tokens, input_tokens)
uncached_input = max(0, input_tokens - cache_read)
input_cost = (
uncached_input * prices["input"] + cache_read * prices["cache_read"]
) / _ONE_MILLION
output_cost = output_tokens * prices["output"] / _ONE_MILLION
discount = BATCH_DISCOUNT if is_batch else Decimal("1")
total = (input_cost + output_cost) * discount
return TokenCost(
provider=provider, model=model,
input_tokens=input_tokens, output_tokens=output_tokens,
cache_read_tokens=cache_read,
input_cost_usd=input_cost * discount,
output_cost_usd=output_cost * discount,
total_cost_usd=total,
)
# ---------------------------------------------------------------------------
# Usage — the calculator works with all providers through a single interface
# ---------------------------------------------------------------------------
# Example: GPT-5.4 call with 8000 input, 2000 output, 3000 cache hits
cost_gpt = calculate_token_cost("openai", "gpt-5.4", 8000, 2000, 3000)
print(cost_gpt)
# TokenCost(openai/gpt-5.4: in=8000 out=2000 cache=3000 → $0.04325000)
# Example: Same tokens on DeepSeek-V4-Flash — 15× cheaper
cost_ds = calculate_token_cost("deepseek", "deepseek-v4-flash", 8000, 2000, 3000)
print(cost_ds)
# TokenCost(deepseek/deepseek-v4-flash: in=8000 out=2000 cache=3000 → $0.00126840)
# Batch mode — additional 50% off for OpenAI/Anthropic
cost_gpt_batch = calculate_token_cost("openai", "gpt-5.4", 8000, 2000, 3000, is_batch=True)
print(f"Batch: {_cost_to_display(cost_gpt_batch.total_cost_usd)}")
# Batch: $0.02162500
This Registry design has several deliberate choices:
- The lookup key is a (provider, model) tuple: Model names can collide across providers ("mini" means different things for OpenAI and Anthropic). Tuple keys eliminate ambiguity.
- Cache read token handling: Subtract cache tokens first, then price the remainder at the regular rate — prevents double-counting. Both Anthropic and OpenAI stash cache_read in different response fields, but once extracted they flow into the same function parameter.
- Unknown model → zero cost: This is intentional — in production, you should alert on this. Silently recording zero cost is better than throwing an exception, because cost tracking is a non-functional concern that should never block the agent's functional path.
Parsing Usage Responses Across Providers
The calculator above assumes you already have a unified (provider, model, input_tokens, output_tokens, cache_read_tokens) tuple. But each provider's API response structure is different, so you need a thin adapter layer:
def extract_usage_openai(response: dict) -> tuple[int, int, int]:
"""Extract (input_tokens, output_tokens, cache_read_tokens) from OpenAI response."""
usage = response.get("usage", {})
details = usage.get("input_token_details", {})
return (
usage.get("prompt_tokens", 0),
usage.get("completion_tokens", 0),
details.get("cached_tokens", 0),
)
def extract_usage_anthropic(response: dict) -> tuple[int, int, int]:
"""Extract (input_tokens, output_tokens, cache_read_tokens) from Anthropic response."""
usage = response.get("usage", {})
return (
usage.get("input_tokens", 0),
usage.get("output_tokens", 0),
usage.get("cache_read_input_tokens", 0),
)
def extract_usage_deepseek(response: dict) -> tuple[int, int, int]:
"""Extract (input_tokens, output_tokens, cache_read_tokens) from DeepSeek response."""
usage = response.get("usage", {})
return (
usage.get("prompt_tokens", 0),
usage.get("completion_tokens", 0),
usage.get("prompt_cache_hit_tokens", 0),
)
# Provider-specific extractors registry
USAGE_EXTRACTORS = {
"openai": extract_usage_openai,
"anthropic": extract_usage_anthropic,
"deepseek": extract_usage_deepseek,
}
def track_llm_call(
provider: str, model: str, response: dict, *, is_batch: bool = False
) -> TokenCost:
"""Unified entry point: extract usage from any provider response and calculate cost."""
extractor = USAGE_EXTRACTORS.get(provider.lower())
if extractor is None:
raise ValueError(f"Unknown provider: {provider}")
in_tok, out_tok, cache_tok = extractor(response)
return calculate_token_cost(provider, model, in_tok, out_tok, cache_tok, is_batch=is_batch)
For prompt caching and context compression strategies that further reduce token costs, see Agent Context Window Management. These techniques and the price calculator form an upstream-downstream relationship — context management reduces token volume, and the calculator converts remaining tokens to dollars.
3. Tool Call Cost Accounting
Many teams track token costs and assume they've completed cost observability — they're wrong. Tool calls have three layers of cost, and the first layer (function definition token overhead) is almost universally ignored.
The Three-Layer Tool Cost Model
| Cost Layer | Source | Typical Magnitude | Billed By |
|---|---|---|---|
| L1: Function definition tokens | Tool JSON schemas occupy tokens in every LLM request's system prompt | 290–806 tokens per tool (including parameter descriptions) | LLM Provider |
| L2: Function call + result tokens | LLM generates a tool_use block + tool result is injected back into context | 200–2,000 tokens per invocation | LLM Provider |
| L3: External API / infrastructure | Search API per-call pricing, database query connections, container runtime CPU | $0.001–0.05 per call | Third-party vendors / cloud platform |
L1 is the biggest blind spot. Suppose your agent defines 8 tools, each averaging 450 tokens for their JSON Schema plus description. In an agent loop, these definitions are injected into every LLM request's system prompt (or Anthropic's tool_use parameter). If the agent makes 12 LLM calls per task: 8 tools × 450 tokens × 12 calls = 43,200 tokens consumed purely on tool definitions — tokens that produce zero reasoning value, just "so the LLM knows what it can do." On GPT-5.4, those 43,200 input tokens cost $0.108.
Critical engineering decision: Tool definition granularity directly impacts cost. Eight individual tools and one "multi-function tool" (differentiated by a parameter) are functionally equivalent, but the former consumes 3,150 extra tokens per request (seven additional tool schemas). At 10,000 LLM calls/day, those "extra definitions" burn $0.007875 × 10,000 = $78.75/day (approximately $2,362.50/month). Before adding a new tool, ask: must this tool be declared in the system prompt, or can it be a fallback (added only when needed)? See Agent Tool Design Best Practices for detailed discussion on tool definition optimization.
The Tool Cost Registry Pattern
Like the Provider Pricing Registry, tool call costs need a centralized registry — but tool cost models are more heterogeneous: some charge per call, some per second, some are free.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional, List
@dataclass
class ToolCostRecord:
"""A cost record for a single tool invocation.
All monetary fields use Decimal for exact sub-cent precision.
Token counts remain int (counts, not money).
"""
tool_name: str
call_duration_s: float
llm_tokens_consumed: int # L1 + L2 token overhead
llm_cost_usd: Decimal # dollar cost of those tokens
external_api_cost_usd: Decimal # L3: third-party API fee
infrastructure_cost_usd: Decimal # L3: compute/runtime cost
total_cost_usd: Decimal = Decimal("0")
metadata: Dict = field(default_factory=dict)
timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())
def __post_init__(self):
self.total_cost_usd = self.llm_cost_usd + self.external_api_cost_usd + self.infrastructure_cost_usd
# Tool Cost Registry — pricing models for every tool in the agent's arsenal
# Key design: each tool declares its pricing model type (per_call, per_second, free)
# All monetary amounts use Decimal from string literals to avoid binary float
# precision loss on sub-cent amounts.
TOOL_COST_REGISTRY: Dict[str, Dict] = {
# --- Third-party API tools ---
"web_search_serpapi": {
"model": "per_call",
"cost_per_call": Decimal("0.01"),
"unit": "api_call",
},
"web_search_bing": {
"model": "per_call",
"cost_per_call": Decimal("0.003"),
"unit": "api_call",
},
"web_search_google_custom": {
"model": "per_call",
"cost_per_call": Decimal("0.005"),
"unit": "api_call",
},
# --- Infrastructure tools ---
"db_query_postgres": {
"model": "per_call",
"cost_per_call": Decimal("0.0001"),
"unit": "query",
},
"code_execution_docker": {
"model": "per_second",
"cost_per_second": Decimal("0.000014"), # ~$0.05/hour container runtime
"unit": "second",
},
# --- Free tools (local operations) ---
"file_read": {"model": "free", "cost_per_call": Decimal("0"), "unit": "free"},
"file_write": {"model": "free", "cost_per_call": Decimal("0"), "unit": "free"},
"bash_exec": {"model": "free", "cost_per_call": Decimal("0"), "unit": "free"},
"regex_search": {"model": "free", "cost_per_call": Decimal("0"), "unit": "free"},
}
# Canonical tool definition sizes (tokens consumed per LLM call just for the schema)
TOOL_DEFINITION_TOKENS: Dict[str, int] = {
"web_search_serpapi": 520,
"web_search_bing": 480,
"db_query_postgres": 350,
"code_execution_docker": 620,
"file_read": 290,
"file_write": 320,
"bash_exec": 450,
"regex_search": 380,
"__default__": 400,
}
def calculate_tool_cost(
tool_name: str,
call_duration_s: float = 0.0,
num_llm_calls_in_task: int = 1,
llm_input_price_per_mtok: Decimal = Decimal("0.75"),
) -> ToolCostRecord:
"""
Calculate the full cost of a tool invocation.
Three cost sources:
1. LLM token overhead (L1 + L2)
2. External API cost (L3, from Tool Cost Registry)
3. Infrastructure cost (L3, container runtime, DB connections, etc.)
"""
pricing = TOOL_COST_REGISTRY.get(tool_name)
if pricing is None:
pricing = {"model": "per_call", "cost_per_call": Decimal("0")}
# L3: External API / infrastructure cost
if pricing["model"] == "per_call":
external_cost = pricing.get("cost_per_call", Decimal("0"))
infra_cost = Decimal("0")
elif pricing["model"] == "per_second":
external_cost = Decimal("0")
infra_cost = pricing.get("cost_per_second", Decimal("0")) * Decimal(str(call_duration_s))
else:
external_cost = Decimal("0")
infra_cost = Decimal("0")
# L1: Tool definition overhead across all LLM calls in the task
tool_def_tokens = TOOL_DEFINITION_TOKENS.get(
tool_name, TOOL_DEFINITION_TOKENS["__default__"]
)
total_definition_tokens = tool_def_tokens * num_llm_calls_in_task
llm_cost = (Decimal(total_definition_tokens) * llm_input_price_per_mtok) / Decimal("1_000_000")
return ToolCostRecord(
tool_name=tool_name,
call_duration_s=call_duration_s,
llm_tokens_consumed=total_definition_tokens,
llm_cost_usd=llm_cost,
external_api_cost_usd=external_cost,
infrastructure_cost_usd=infra_cost,
)
# Usage examples
cost_search = calculate_tool_cost("web_search_serpapi", call_duration_s=1.2, num_llm_calls_in_task=5)
print(f"Search tool total: ${float(cost_search.total_cost_usd):.6f}")
# Search tool total: $0.011950
# L1: 520 tokens x 5 calls x $0.75/MTok = $0.001950
# L3: $0.01 per call
# Notice: L1 is 19.5% of the tool's total cost — not negligible!
This Registry reveals two important facts about tool call cost tracking:
- High-frequency "free" tools can dominate costs.
bash_execis free (local execution), but 450 tokens × 20 LLM calls/task × 800 tasks/day = 7,200,000 tokens/day. On GPT-5.4 that's $18/day in pure tool definition overhead. Free tools are not zero-cost. - Search tools' L3 costs far exceed L1. SerpAPI at $0.01/call — if an agent searches 6 times per task, L3 cost is $0.06 while L1 might be only $0.002. The optimization direction is reducing unnecessary search rounds, not shrinking tool definitions.
For cross-process cost tracking of MCP protocol tools, see MCP Protocol Production Guide — MCP tool calls add extra serialization and network overhead that require separate cost consideration.
4. Retry and Wastage Costs — the Hidden Money Pit in Agent Systems
In production, agent retry waste typically accounts for 15–30% of total token costs. This is a commonly cited range across production agent systems (illustrative scenario — your actual ratio depends on agent design, retry policy, and model reliability). If you spend $10,000/month on agent API calls, $1,500–$3,000 is pure waste — tokens consumed and tools invoked that produced zero successful output.
This is the section's "wow moment": most teams have no idea how much they're burning on retries. Once you quantify it, the number will make you immediately start optimizing your retry strategy.
Four Categories of Retry Waste
| Retry Type | Trigger | Avoidable? | Cost Per Instance (typical) | Frequency |
|---|---|---|---|---|
| Timeout retry | LLM API response timeout | Partially | $0.005–0.05 | 2–8% of calls |
| Rate limit retry | Provider concurrency limit exceeded | Yes | $0.001–0.02 | 1–5% of calls |
| Hallucinated tool call retry | LLM invokes wrong tool/parameters | Partially | $0.01–0.10 | 3–12% of tool calls |
| Multi-round failure retry | Task too complex for agent | Partially | $0.10–2.00 | 5–15% of tasks |
Hallucinated tool calls are the most expensive waste category — they not only consume LLM tokens to generate the tool_use, but also incur real tool execution costs, and the results are worthless. In an analysis of 50,000 agent tasks (illustrative scenario), we found 7.2% of tool calls were "hallucinatory" — the right tool was invoked but with semantically unreasonable parameters. These calls averaged $0.023 each, and the total waste accounted for 11.4% of all tool call costs (illustrative figures; actual ratios depend on agent design).
RetryCost Tracker Implementation
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime
@dataclass
class RetryRecord:
"""A single retry attempt's cost footprint."""
attempt_number: int
reason: str # timeout, rate_limit, wrong_tool, hallucination, model_error
tokens_wasted: int
cost_wasted_usd: float
tool_calls_wasted: int
tool_cost_wasted_usd: float
was_recoverable: bool = False
timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())
@dataclass
class AgentRunCostTracker:
"""
Tracks all costs for a single agent task execution, with special attention
to retry waste. Every LLM call, tool invocation, and retry attempt is
recorded so you can answer: "How much of this task's cost was wasted?"
"""
task_id: str
user_id: str
task_type: str = "unknown"
_total_token_cost: float = 0.0
_total_tool_cost: float = 0.0
_total_tokens: int = 0
_retries: List[RetryRecord] = field(default_factory=list)
_succeeded: bool = False
_attempt_number: int = 0
def start_attempt(self) -> int:
self._attempt_number += 1
return self._attempt_number
def record_retry(self, reason: str, tokens: int, token_cost_usd: float,
tool_calls: int = 0, tool_cost_usd: float = 0.0) -> None:
self._retries.append(RetryRecord(
attempt_number=self._attempt_number,
reason=reason, tokens_wasted=tokens,
cost_wasted_usd=token_cost_usd,
tool_calls_wasted=tool_calls,
tool_cost_wasted_usd=tool_cost_usd,
))
def mark_success(self) -> None:
self._succeeded = True
for r in self._retries:
r.was_recoverable = True
def add_cost(self, token_cost_usd: float, tool_cost_usd: float, tokens: int) -> None:
"""Record actual cost of every operation — successful or failed.
Called for EVERY LLM call and tool invocation. Failed operations
later get classified as retry waste via record_retry(), which
tags already-recorded costs — it doesn't add new spend.
"""
self._total_token_cost += token_cost_usd
self._total_tool_cost += tool_cost_usd
self._total_tokens += tokens
@property
def total_retry_waste_usd(self) -> float:
"""Classification of already-recorded costs that were wasted on retries.
This is NOT additional spend — it's a tag/filter over costs that were
already recorded via add_cost(). Retry waste is a subset of total cost,
not an extra cost on top.
"""
return sum(r.cost_wasted_usd + r.tool_cost_wasted_usd for r in self._retries)
@property
def total_cost_usd(self) -> float:
"""Actual LLM + tool spend, recorded exactly once per operation.
total_retry_waste_usd is NOT added here — it's a classification of
already-recorded costs. Summing them would double-count.
"""
return self._total_token_cost + self._total_tool_cost
@property
def waste_ratio(self) -> float:
"""THE single most important metric on your cost dashboard.
Ratio of retry-classified cost to total cost. The denominator is total
cost (all spend including what was later tagged as waste). This means
waste_ratio answers: 'what fraction of every dollar was burned on
operations that had to be retried?'
"""
total = self.total_cost_usd
if total == 0:
return 0.0
return self.total_retry_waste_usd / total
@property
def retry_count(self) -> int:
return len(self._retries)
def summary(self) -> str:
return (
f"Task {self.task_id} ({self.task_type}) — "
f"Status: {'✓' if self._succeeded else '✗'} | "
f"Total: ${self.total_cost_usd:.6f} | "
f"Waste: ${self.total_retry_waste_usd:.6f} "
f"({self.waste_ratio:.1%}) | "
f"Retries: {self.retry_count}"
)
The wow moment math: If your agent system handles 1,000 tasks/day, each averaging $0.25, with a 20% waste ratio, that's $50/day of pure waste, $1,500/month, $18,250/year. A system processing 10,000 tasks/day at this ratio wastes $182,500/year. This money won't appear as a line item on any provider's bill — only your own cost tracker can see it. Go calculate your waste_ratio right now.
Waste Breakdown by Retry Reason
-- Retry waste breakdown by reason (last 7 days)
SELECT
reason,
COUNT(*) as retry_count,
SUM(cost_wasted_usd + tool_cost_wasted_usd) as total_waste_usd,
ROUND(AVG(cost_wasted_usd + tool_cost_wasted_usd), 6) as avg_waste_per_retry,
ROUND(100.0 * SUM(cost_wasted_usd + tool_cost_wasted_usd) /
(SELECT SUM(cost_wasted_usd + tool_cost_wasted_usd) FROM retry_records
WHERE timestamp >= NOW() - INTERVAL '7 days'), 1) as pct_of_total_waste
FROM retry_records
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY reason
ORDER BY total_waste_usd DESC;
Abnormal retry costs can serve as a leading indicator of system health — if the retry rate suddenly spikes, it may signal an upstream service outage or model behavior regression. See Agent Rollback Design for how to automatically revert when cost anomalies are detected.
5. Per-Task and Per-User Cost Attribution
With per-step cost data in hand, the next question is: how do you attribute those costs to the correct task and user? This is especially critical for multi-tenant SaaS systems — you need to know Customer A spent $340 and Customer B spent $12, both for accurate billing and for assessing customer health.
Trace Context Propagation: The Foundation of Cost Attribution
The key to cost attribution is injecting user_id, tenant_id, and task_id at the request entry point and propagating them through trace context to all downstream spans. No matter how many layers of LLM calls and tool invocations your agent runs, every cost record carries its attribution at birth.
from opentelemetry import trace, baggage
from opentelemetry.trace import SpanKind
import uuid
tracer = trace.get_tracer("agent-cost-attribution")
class CostAttributionContext:
"""
Propagates attribution metadata (user_id, tenant_id, task_id) across all
spans in an agent task's trace tree.
"""
def __init__(self, user_id: str, tenant_id: str = "default"):
self.user_id = user_id
self.tenant_id = tenant_id
self.task_id = str(uuid.uuid4())[:8]
def span(self, name: str, **attrs):
span = tracer.start_span(name, kind=SpanKind.INTERNAL)
span.set_attribute("cost.user_id", self.user_id)
span.set_attribute("cost.tenant_id", self.tenant_id)
span.set_attribute("cost.task_id", self.task_id)
baggage.set_baggage("cost.user_id", self.user_id)
baggage.set_baggage("cost.tenant_id", self.tenant_id)
baggage.set_baggage("cost.task_id", self.task_id)
for k, v in attrs.items():
span.set_attribute(k, str(v))
return span
SQL Aggregation: From Raw Records to Per-User Bills
-- 1. Per-user cost breakdown (multi-tenant billing)
SELECT
tenant_id, user_id,
COUNT(DISTINCT task_id) as task_count,
SUM(cost_usd) as total_cost_usd,
SUM(CASE WHEN category = 'retry_waste' THEN cost_usd ELSE 0 END) as waste_cost_usd,
ROUND(100.0 * SUM(CASE WHEN category = 'retry_waste' THEN cost_usd ELSE 0 END)
/ NULLIF(SUM(cost_usd), 0), 2) as waste_pct
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY tenant_id, user_id
ORDER BY total_cost_usd DESC;
-- 2. Cost per successful vs failed task
SELECT
task_outcome,
COUNT(DISTINCT task_id) as task_count,
SUM(cost_usd) as total_cost_usd,
ROUND(SUM(cost_usd) / COUNT(DISTINCT task_id), 6) as cost_per_task
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY task_outcome;
-- 3. Model cost efficiency (cost per successful task by model)
SELECT
model,
COUNT(DISTINCT CASE WHEN task_outcome = 'success' THEN task_id END) as successful_tasks,
ROUND(SUM(CASE WHEN task_outcome = 'success' THEN cost_usd ELSE 0 END) /
NULLIF(COUNT(DISTINCT CASE WHEN task_outcome = 'success' THEN task_id END), 0), 6)
as cost_per_successful_task
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY model
ORDER BY cost_per_successful_task ASC;
The third query (cost_per_successful_task) is the core metric for measuring "was the spend worth it." Suppose GPT-5.5 costs $0.15 per successful task and DeepSeek-V4-Flash costs $0.008 — GPT-5.5 is 18.75× more expensive. Does its success rate or quality exceed DeepSeek's by 18.75×? If not, route more traffic to DeepSeek. That's Section 6's cost-aware router.
Audit logs provide the foundational data source for cost attribution — see Agent Audit Log Design. For cost-to-quality ratio analysis, see Agent Evaluation Framework Design.
6. Budget Alerts and Cost Controls
Cost tracking exists to enable cost control. Without budget constraints, an agent system is like a credit card with no limit — a runaway loop can burn hundreds of dollars in minutes.
Two Budget Control Patterns: Hard Stop vs. Soft Warning
| Pattern | Behavior | Best For | User Experience |
|---|---|---|---|
| Hard Stop | Immediately terminate agent execution when budget is exhausted, raise BudgetExceededError | Internal pipelines, batch processing, non-user-facing scenarios | Task fails, but no further money is burned |
| Soft Warning | Send alert at 80% budget but allow continuation; block new expensive operations at 100% | End-user-facing paid products, high-value tasks | User is notified and can choose to continue |
Most production systems should use both simultaneously: task-level hard stops + user-level soft warnings.
BudgetController Implementation
from enum import Enum
from typing import Dict, Optional, Callable
import threading
class BudgetAction(Enum):
ALLOW = "allow"
WARN = "warn"
BLOCK = "block"
class BudgetExceededError(Exception):
"""Raised when a hard budget limit is exceeded."""
def __init__(self, user_id: str, limit_name: str, current: float, limit: float):
self.user_id = user_id
self.limit_name = limit_name
self.current = current
self.limit = limit
super().__init__(
f"Budget exceeded for {user_id}: {limit_name} "
f"(${current:.4f} >= ${limit:.4f})"
)
class BudgetController:
"""Enforces per-user and per-task budget limits across three scopes:
per_task, daily, monthly. Each can be Hard Stop or Soft Warning.
Key design decisions (production-hardened):
- Pre-authorization: estimate cost BEFORE operation, reserve budget
- >= (not >): prevents spending past the exact limit
- Thread-safe: atomic bookkeeping via Lock (concurrent tasks)
- Reconciliation: after operation, settle actual vs. estimated cost
"""
def __init__(self, cost_store: "CostStore"):
self._store = cost_store
self._limits: Dict[str, Dict[str, float]] = {}
self._actions: Dict[str, Dict[str, BudgetAction]] = {}
self._warning_handler: Optional[Callable] = None
self._lock = threading.Lock()
# Estimated costs reserved but not yet settled
self._reserved: Dict[str, Dict[str, float]] = {} # user_id -> {scope: amount}
def set_limit(self, user_id: str, scope: str, limit_usd: float,
action: BudgetAction = BudgetAction.BLOCK) -> None:
self._limits.setdefault(user_id, {})[scope] = limit_usd
self._actions.setdefault(user_id, {})[scope] = action
def set_warning_handler(self, handler: Callable) -> None:
self._warning_handler = handler
def request_budget(
self, user_id: str, task_id: str, estimated_cost: float,
) -> tuple[BudgetAction, str]:
"""Pre-authorize an operation by checking budgets and reserving cost.
Callers should invoke this BEFORE starting a new operation. It combines
limit checks with cost reservation under a single lock to prevent
concurrent tasks from both passing the gate and overshooting together.
After the operation completes, call reconcile() with the actual cost
to settle the reservation.
"""
return self.check(user_id, task_id, estimated_next_cost=estimated_cost)
def check(self, user_id: str, task_id: str,
estimated_next_cost: float = 0.0) -> tuple[BudgetAction, str]:
"""Check all applicable limits before executing an operation.
Pre-authorization: pass estimated_next_cost to reserve budget BEFORE
the operation runs. After completion, call reconcile() to settle.
All limit checks use store_spend + existing_reservations +
estimated_next_cost so that reservations from concurrent tasks
are visible in every comparison.
Atomicity guarantee: within a single lock, all outstanding reservations
(_reserved) are included in every limit comparison. Two concurrent tasks
cannot both pass the budget gate and overshoot together — the second
check sees the first task's reserved amount and correctly blocks/limits.
This is equivalent to Redis INCR / DB SELECT ... FOR UPDATE.
"""
with self._lock:
limits = self._limits.get(user_id, {})
actions = self._actions.get(user_id, {})
# Include outstanding reservations from prior check() calls
# that haven't been reconciled yet
reserved = self._reserved.get(user_id, {})
# --- 1. Per-task limit (always hard stop) ---
task_limit = limits.get("per_task", float("inf"))
task_spend = self._store.get_task_cost(task_id)
task_reserved = reserved.get("per_task", 0.0)
task_projected = task_spend + task_reserved + estimated_next_cost
if task_projected >= task_limit:
raise BudgetExceededError(
user_id, "per_task",
task_projected, task_limit)
# --- 2. Daily limit ---
daily_limit = limits.get("daily", float("inf"))
daily_action = actions.get("daily", BudgetAction.BLOCK)
daily_spend = self._store.get_user_daily_cost(user_id)
daily_reserved = reserved.get("daily", 0.0)
daily_projected = daily_spend + daily_reserved + estimated_next_cost
if daily_projected >= daily_limit and daily_action == BudgetAction.BLOCK:
raise BudgetExceededError(
user_id, "daily",
daily_projected, daily_limit)
# --- 3. Monthly limit ---
monthly_limit = limits.get("monthly", float("inf"))
monthly_action = actions.get("monthly", BudgetAction.BLOCK)
monthly_spend = self._store.get_user_monthly_cost(user_id)
monthly_reserved = reserved.get("monthly", 0.0)
monthly_projected = monthly_spend + monthly_reserved + estimated_next_cost
if monthly_projected >= monthly_limit and monthly_action == BudgetAction.BLOCK:
raise BudgetExceededError(
user_id, "monthly",
monthly_projected, monthly_limit)
# --- Determine action: BLOCK already raised → only ALLOW or WARN ---
# All WARN conditions include estimated_next_cost in the projection.
action = BudgetAction.ALLOW
msg = "OK"
# Per-task 80% (hard scope, always WARN)
if task_projected >= task_limit * 0.8:
action = BudgetAction.WARN
msg = f"Task budget at {task_projected/task_limit:.0%}"
# Daily — 100% soft cap overrides 80% warning
if daily_projected >= daily_limit:
action = BudgetAction.WARN
msg = (f"Daily budget exceeded (soft): "
f"${daily_projected:.4f} >= ${daily_limit:.4f}")
elif daily_projected >= daily_limit * 0.8:
action = BudgetAction.WARN
msg = f"Daily budget at {daily_projected/daily_limit:.0%}"
# Monthly — 100% soft cap overrides 80% warning
if monthly_projected >= monthly_limit:
action = BudgetAction.WARN
msg = (f"Monthly budget exceeded (soft): "
f"${monthly_projected:.4f}")
elif monthly_projected >= monthly_limit * 0.8:
action = BudgetAction.WARN
msg = f"Monthly budget at {monthly_projected/monthly_limit:.0%}"
# --- Pre-authorize: reserve for ALLOW and WARN ---
# (BLOCK already raised; only ALLOW/WARN reach here)
self._reserved.setdefault(user_id, {})
for scope in ["per_task", "daily", "monthly"]:
self._reserved[user_id][scope] = (
self._reserved[user_id].get(scope, 0.0) + estimated_next_cost)
return (action, msg)
def reconcile(self, user_id: str, task_id: str,
actual_cost: float, estimated_cost: float) -> None:
"""After operation completes, reconcile actual vs. estimated cost.
Always call this after check() + operation, even on failure.
This settles the reserved estimate to the actual cost.
"""
with self._lock:
self._store.record_cost(task_id, user_id, actual_cost)
# Release any over-reservation or add under-reservation
for scope in ["per_task", "daily", "monthly"]:
reserved = self._reserved.get(user_id, {}).get(scope, 0.0)
self._reserved.setdefault(user_id, {})[scope] = max(
0.0, reserved - estimated_cost)
class CostStore:
"""Abstract interface for querying current spend."""
def get_task_cost(self, task_id: str) -> float: ...
def get_user_daily_cost(self, user_id: str) -> float: ...
def get_user_monthly_cost(self, user_id: str) -> float: ...
def record_cost(self, task_id: str, user_id: str, cost: float) -> None: ...
Cost-Aware Model Routing
The smarter approach: automatically switching to cheaper models when budget is constrained. Especially valuable for teams using a mix of DeepSeek and GPT/Claude:
class BlockedDecision(Exception):
"""Raised when budget doesn't permit ANY model for this operation."""
pass
class CostAwareRouter:
"""
Routes agent tasks to the most appropriate model based on:
1. Task complexity (low/medium/high)
2. Remaining budget
3. Estimated cost (all components: input, cached, output, retry + tool headroom)
Critical safety rules:
- If budget is insufficient for ANY model → raise BlockedDecision
(do NOT silently route to cheapest — that still burns budget)
- Estimate ALL cost components, not just input tokens
- Unknown complexity → assume HIGH (safer to overestimate)
- The budget gate (BudgetController.check) is PRIMARY;
routing is a secondary decision after the gate passes
"""
# Full cost model per provider: (input, output, cache_read) per 1M tokens
# plus estimated output ratio (~20% of input) and tool headroom (~10%)
# All prices use Decimal to avoid binary float precision loss.
_PRICE_PER_1M: Dict[Tuple[str, str], Tuple[Decimal, Decimal, Decimal]] = {
("openai", "gpt-5.5"): (Decimal("5.00"), Decimal("30.00"), Decimal("0.50")),
("openai", "gpt-5.4"): (Decimal("2.50"), Decimal("15.00"), Decimal("0.25")),
("openai", "gpt-5.4-mini"): (Decimal("0.75"), Decimal("4.50"), Decimal("0.075")),
("anthropic", "claude-sonnet-4-6"): (Decimal("3.00"), Decimal("15.00"), Decimal("0.30")),
("anthropic", "claude-haiku-4-5"): (Decimal("1.00"), Decimal("5.00"), Decimal("0.10")),
("deepseek", "deepseek-v4-flash"): (Decimal("0.14"), Decimal("0.28"), Decimal("0.0028")),
("deepseek", "deepseek-v4-pro"): (Decimal("0.435"), Decimal("0.87"), Decimal("0.003625")),
}
_OUTPUT_RATIO = Decimal("0.20") # assume ~20% output/input token ratio
_CACHE_RATIO = Decimal("0.30") # assume ~30% of input hits cache (system prompt)
_RETRY_HEADROOM = Decimal("0.15") # 15% headroom for potential retries
_TOOL_HEADROOM = Decimal("0.10") # 10% headroom for tool call overhead
def __init__(self, budget_controller):
self.budget = budget_controller
def estimate_full_cost(self, provider: str, model: str,
estimated_input_tokens: int) -> Decimal:
"""Estimate total cost including input, output, cache, retry, tool headroom."""
prices = self._PRICE_PER_1M.get((provider, model))
if prices is None:
return Decimal("0")
inp_p, out_p, cache_p = prices
cached = int(estimated_input_tokens * self._CACHE_RATIO)
uncached = estimated_input_tokens - cached
output_tokens = int(estimated_input_tokens * self._OUTPUT_RATIO)
input_cost = (Decimal(uncached) * inp_p + Decimal(cached) * cache_p) / Decimal("1_000_000")
output_cost = Decimal(output_tokens) * out_p / Decimal("1_000_000")
subtotal = input_cost + output_cost
# Add safety headroom for retries and tool calls
total = subtotal * (Decimal("1") + self._RETRY_HEADROOM + self._TOOL_HEADROOM)
return total
def select_model(self, user_id: str, task_complexity: str,
estimated_input_tokens: int,
daily_budget_remaining: Decimal) -> Tuple[str, str]:
"""
Returns (provider, model) or raises BlockedDecision.
Gate check: the caller MUST pass BudgetController.check() BEFORE
calling this method. Routing is a secondary decision, never a bypass
of the hard budget gate.
"""
# Default unknown complexity to HIGH — safer to burn budget than
# to silently underestimate and route a hard task to a weak model
complexity = task_complexity if task_complexity in ("low", "medium", "high") else "high"
# Define priority order per complexity level
if complexity == "high":
candidates = [
("openai", "gpt-5.5"),
("openai", "gpt-5.4"),
("deepseek", "deepseek-v4-pro"),
("deepseek", "deepseek-v4-flash"),
]
elif complexity == "medium":
candidates = [
("openai", "gpt-5.4-mini"),
("deepseek", "deepseek-v4-flash"),
]
else: # low
candidates = [("deepseek", "deepseek-v4-flash")]
# Walk candidates in priority order
for provider, model in candidates:
est = self.estimate_full_cost(provider, model, estimated_input_tokens)
if est <= daily_budget_remaining:
return (provider, model)
# No model fits — block the operation
cheapest = candidates[-1]
cheapest_est = self.estimate_full_cost(
cheapest[0], cheapest[1], estimated_input_tokens)
raise BlockedDecision(
f"Budget too low for {user_id}: "
f"cheapest option ({cheapest[0]}/{cheapest[1]}) "
f"estimated at ${float(cheapest_est):.6f} > ${float(daily_budget_remaining):.6f} remaining"
)
For deeper discussion on cost-aware routing, see Building Model-Agnostic AI Agents. For integrating cost checks into release gates, see Agent Release Gate Design.
7. Integration with Observability Platforms
The previous sections generated cost records — but they're still trapped in your Python process's memory or database tables. To make cost data truly observable, inject it into your observability infrastructure: OpenTelemetry traces (span-level cost context) and Prometheus metrics (real-time aggregation and alerting).
OpenTelemetry Span Attributes: Embedding Cost in Every Span
Attach cost-related span attributes to every LLM call and tool invocation:
from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode
import time
tracer = trace.get_tracer("agent-cost-instrumentation")
"""
Custom attribute schema for agent cost observability spans:
Standard OTel GenAI semantic conventions:
gen_ai.provider.name — provider name (e.g., "openai", "anthropic")
gen_ai.request.model — model name (e.g., "gpt-5.4")
gen_ai.operation.name — operation type ("chat", "tool_use")
gen_ai.usage.input_tokens — input token count
gen_ai.usage.output_tokens — output token count
Application-namespace cost attributes (app.agent_cost.*):
app.agent_cost.usd — total cost in USD for this operation
app.agent_cost.input_usd — input cost
app.agent_cost.output_usd — output cost
app.agent_cost.task_id — task identifier
app.agent_cost.user_id — user identifier
app.agent_cost.tool_name — tool name (tool spans only)
app.agent_cost.tool_usd — tool L3 cost (tool spans only)
Latency: span duration is the standard way to measure latency.
The custom operation.duration_ms attribute below is an example;
in production, use span.get_span_context() and the OTel SDK to
compute duration from start_time / end_time on export.
"""
def instrumented_llm_call(
provider: str, model: str,
messages: list, tools: list | None = None,
task_id: str = "", user_id: str = "",
) -> dict:
"""Wraps an LLM API call with cost-aware OpenTelemetry instrumentation."""
with tracer.start_as_current_span(
f"agent.llm.{provider}.{model}", kind=SpanKind.CLIENT,
) as span:
# Standard GenAI semantic conventions
span.set_attribute("gen_ai.provider.name", provider)
span.set_attribute("gen_ai.request.model", model)
span.set_attribute("gen_ai.operation.name", "chat")
if tools:
span.set_attribute("gen_ai.tool.count", len(tools))
# Application-namespace attributes (app.agent_cost.*)
span.set_attribute("app.agent_cost.task_id", task_id)
span.set_attribute("app.agent_cost.user_id", user_id)
start = time.time()
try:
response = _call_llm_api(provider, model, messages, tools)
usage = response.get("usage", {})
input_tokens = usage.get("prompt_tokens", 0)
output_tokens = usage.get("completion_tokens", 0)
cost = calculate_token_cost(provider, model, input_tokens, output_tokens)
span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
span.set_attribute("app.agent_cost.usd", float(cost.total_cost_usd))
span.set_attribute("app.agent_cost.input_usd", float(cost.input_cost_usd))
span.set_attribute("app.agent_cost.output_usd", float(cost.output_cost_usd))
span.set_attribute("app.agent_cost.provider", provider)
span.set_attribute("app.agent_cost.model", model)
span.set_status(Status(StatusCode.OK))
return response
except Exception as e:
span.set_status(Status(StatusCode.ERROR, str(e)))
raise
finally:
# Custom millisecond latency — span duration is the standard way
# to measure latency; this attribute is an example convenience.
span.set_attribute("app.agent_cost.duration_ms",
(time.time() - start) * 1000)
def instrumented_tool_call(tool_name: str, tool_input: dict,
task_id: str = "", user_id: str = "") -> dict:
"""Wraps a tool call with cost-aware instrumentation."""
with tracer.start_as_current_span(
f"agent.tool.{tool_name}", kind=SpanKind.INTERNAL,
) as span:
# Standard semantic convention
span.set_attribute("gen_ai.operation.name", "tool_use")
# Application-namespace attributes
span.set_attribute("app.agent_cost.tool_name", tool_name)
span.set_attribute("app.agent_cost.task_id", task_id)
span.set_attribute("app.agent_cost.user_id", user_id)
start = time.time()
try:
result = _execute_tool(tool_name, tool_input)
tool_pricing = TOOL_COST_REGISTRY.get(tool_name, {})
if tool_pricing.get("model") == "per_call":
span.set_attribute("app.agent_cost.tool_usd",
tool_pricing.get("cost_per_call", 0.0))
span.set_status(Status(StatusCode.OK))
return result
except Exception as e:
span.set_status(Status(StatusCode.ERROR, str(e)))
raise
finally:
span.set_attribute("app.agent_cost.tool_duration_ms",
(time.time() - start) * 1000)
# Stub implementations — replace with your actual API clients and tool executors.
# These are shown here so the OTel snippet can be read as a self-contained
# block; actual implementations are defined earlier in the article.
def _call_llm_api(provider: str, model: str, messages: list, tools: list | None = None) -> dict:
"""Replace with actual provider API call."""
return {"usage": {"prompt_tokens": 0, "completion_tokens": 0}}
def _execute_tool(tool_name: str, tool_input: dict) -> dict:
"""Replace with actual tool execution."""
return {"result": "ok"}
Prometheus Metrics: Real-Time Cost Aggregation
OTel spans provide per-request detail; Prometheus metrics provide time-based aggregation:
from prometheus_client import Counter, Histogram, Gauge, CollectorRegistry
# NOTE: Prometheus labels must use BOUNDED dimensions only.
# user_id is intentionally excluded — per-user attribution belongs in
# traces (span attributes), logs, or a SQL cost ledger, not in Prometheus
# time-series cardinality. Including user_id would create a label set per
# user, exploding the cardinality and memory footprint.
cost_registry = CollectorRegistry()
agent_cost_counter = Counter(
"agent_cost_usd_total", "Total agent cost in USD",
labelnames=["provider", "model", "team", "tenant_id", "category"],
registry=cost_registry,
)
task_cost_histogram = Histogram(
"agent_task_cost_usd", "Cost per agent task in USD",
labelnames=["task_type", "team", "outcome"],
buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0],
registry=cost_registry,
)
spend_rate_gauge = Gauge(
"agent_spend_rate_usd_per_hour", "Current agent spending rate in USD/hour",
labelnames=["team", "tenant_id"],
registry=cost_registry,
)
retry_waste_counter = Counter(
"agent_retry_waste_usd_total", "Total USD wasted on retries",
labelnames=["team", "tenant_id", "reason"],
registry=cost_registry,
)
waste_ratio_gauge = Gauge(
"agent_waste_ratio", "Ratio of retry waste to total cost (0.0-1.0)",
labelnames=["team", "tenant_id"],
registry=cost_registry,
)
Dashboard SQL Queries
-- Panel 1: Cost trend (time series) — last 24h
SELECT date_trunc('hour', timestamp) as hour, SUM(cost_usd) as cost_per_hour
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY hour ORDER BY hour;
-- Panel 2: Cost breakdown by model (pie chart)
SELECT model, SUM(cost_usd) as total_spend, COUNT(*) as call_count
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '7 days'
AND category IN ('token_input', 'token_output')
GROUP BY model ORDER BY total_spend DESC;
-- Panel 3: Top spenders (bar chart)
SELECT tenant_id, user_id, SUM(cost_usd) as total_cost,
COUNT(DISTINCT task_id) as task_count
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY tenant_id, user_id ORDER BY total_cost DESC LIMIT 20;
-- Panel 4: Waste ratio gauge (single value)
SELECT ROUND(100.0 * SUM(CASE WHEN category = 'retry_waste' THEN cost_usd ELSE 0 END)
/ NULLIF(SUM(cost_usd), 0), 2) as waste_pct
FROM cost_records WHERE timestamp >= NOW() - INTERVAL '24 hours';
-- Panel 5: Cost per successful task (by model)
SELECT model,
COUNT(DISTINCT task_id) FILTER (WHERE task_outcome = 'success') as successful,
ROUND(SUM(cost_usd) FILTER (WHERE task_outcome = 'success')
/ NULLIF(COUNT(DISTINCT task_id) FILTER (WHERE task_outcome = 'success'), 0), 6)
as cost_per_success
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY model ORDER BY cost_per_success ASC;
8. Complete Example: End-to-End Python Instrumented Agent Loop
This section integrates every concept into a single, complete, runnable agent loop—token cost tracking (Section 2), tool call costs (Section 3), retry waste recording (Section 4), cost attribution (Section 5), budget control (Section 6), and Prometheus metrics export (Section 7).
"""
End-to-end instrumented agent loop with full cost observability.
Integrates Sections 2-7 into one runnable file.
"""
from __future__ import annotations
import time, uuid
from dataclasses import dataclass, field
from decimal import Decimal
from enum import Enum
from typing import List, Dict, Tuple
# === Provider Pricing Registry (Section 2) ===
# Prices as Decimal from string literals — binary float would lose precision
# on sub-cent amounts (e.g., DeepSeek cache token at $0.0000000028).
_ONE_MILLION = Decimal("1_000_000")
MODEL_PRICING = {
("openai", "gpt-5.4"): {"input": Decimal("2.50"), "output": Decimal("15.00"), "cache_read": Decimal("0.25")},
("openai", "gpt-5.4-mini"): {"input": Decimal("0.75"), "output": Decimal("4.50"), "cache_read": Decimal("0.075")},
("anthropic", "claude-sonnet-4-6"): {"input": Decimal("3.00"), "output": Decimal("15.00"), "cache_read": Decimal("0.30")},
("anthropic", "claude-haiku-4-5"): {"input": Decimal("1.00"), "output": Decimal("5.00"), "cache_read": Decimal("0.10")},
("deepseek", "deepseek-v4-flash"): {"input": Decimal("0.14"), "output": Decimal("0.28"), "cache_read": Decimal("0.0028")},
}
def calc_token_cost(provider: str, model: str, input_tokens: int,
output_tokens: int, cache_read_tokens: int = 0) -> Decimal:
prices = MODEL_PRICING.get((provider, model), {})
if not prices:
return Decimal("0")
cache_read = min(cache_read_tokens, input_tokens)
uncached = max(0, input_tokens - cache_read)
input_cost = (uncached * prices["input"] + cache_read * prices["cache_read"]) / _ONE_MILLION
output_cost = output_tokens * prices["output"] / _ONE_MILLION
return input_cost + output_cost # aggregate unrounded, display rounds later
# === Tool Cost Registry (Section 3) ===
TOOL_COST_REGISTRY = {
"web_search": {"cost_per_call": Decimal("0.01")},
"db_query": {"cost_per_call": Decimal("0.0001")},
"code_exec": {"cost_per_second": Decimal("0.000014")},
"file_read": {"cost_per_call": Decimal("0")},
}
def calc_tool_cost(tool_name: str, duration_s: float = 0.0) -> Decimal:
pricing = TOOL_COST_REGISTRY.get(tool_name, {"cost_per_call": Decimal("0")})
if "cost_per_call" in pricing:
return pricing["cost_per_call"]
elif "cost_per_second" in pricing:
return pricing["cost_per_second"] * Decimal(str(duration_s))
return Decimal("0")
# === Retry Waste Tracker (Section 4) ===
@dataclass
class RetryRecord:
attempt: int
reason: str
tokens_wasted: int
cost_wasted: Decimal
tool_cost_wasted: Decimal = Decimal("0")
recoverable: bool = False
@dataclass
class TaskCostTracker:
task_id: str
user_id: str
tenant_id: str = "default"
task_type: str = "unknown"
_token_cost: Decimal = Decimal("0")
_tool_cost: Decimal = Decimal("0")
_retries: List[RetryRecord] = field(default_factory=list)
_succeeded: bool = False
def add_llm_cost(self, provider: str, model: str, input_tok: int,
output_tok: int, cache_tok: int = 0) -> Decimal:
c = calc_token_cost(provider, model, input_tok, output_tok, cache_tok)
self._token_cost += c
return c
def add_tool_cost(self, tool_name: str, duration_s: float = 0.0) -> Decimal:
c = calc_tool_cost(tool_name, duration_s)
self._tool_cost += c
return c
def record_retry(self, attempt: int, reason: str, tokens: int,
token_cost: Decimal, tool_cost: Decimal = Decimal("0")) -> None:
self._retries.append(RetryRecord(
attempt=attempt, reason=reason, tokens_wasted=tokens,
cost_wasted=token_cost, tool_cost_wasted=tool_cost))
def mark_success(self) -> None:
self._succeeded = True
for r in self._retries:
r.recoverable = True
@property
def waste_usd(self) -> Decimal:
"""Classification of already-recorded costs that were wasted on retries."""
return sum((r.cost_wasted + r.tool_cost_wasted for r in self._retries), Decimal("0"))
@property
def total_cost(self) -> Decimal:
"""Actual LLM + tool spend, recorded exactly once."""
return self._token_cost + self._tool_cost
@property
def waste_ratio(self) -> float:
total = self.total_cost
if total == Decimal("0"):
return 0.0
return float(self.waste_usd / total)
def summary(self) -> str:
return (f"[{self.task_id}] {self.task_type} | "
f"Total: ${float(self.total_cost):.6f} | "
f"Waste: ${float(self.waste_usd):.6f} ({self.waste_ratio:.1%}) | "
f"Retries: {len(self._retries)} | "
f"Outcome: {'success' if self._succeeded else 'failed'}")
# === Budget Controller (Section 6 simplified) ===
# NOTE: This is a single-process illustrative BudgetController with
# pre-authorization (request_budget) and reconciliation (reconcile).
# The full production-hardened version (Section 6) adds threading.Lock
# and multi-scope (per_task/daily/monthly) reservations.
# For multi-process deployments, replace with Redis INCR/INCRBY or
# DB-level SELECT ... FOR UPDATE.
class BudgetAction(Enum):
ALLOW = "allow"
WARN = "warn"
BLOCK = "block"
class BudgetController:
def __init__(self):
self._daily: Dict[str, Decimal] = {}
self._limits: Dict[str, Decimal] = {}
self._reserved: Dict[str, Decimal] = {}
def set_limit(self, user_id: str, daily_usd: float) -> None:
self._limits[user_id] = Decimal(str(daily_usd))
def request_budget(self, user_id: str, estimated_cost: Decimal) -> Tuple[BudgetAction, str]:
"""Pre-authorize: check limits including pending reservations, then reserve."""
limit = self._limits.get(user_id, Decimal("inf"))
spent = self._daily.get(user_id, Decimal("0"))
reserved = self._reserved.get(user_id, Decimal("0"))
projected = spent + reserved + estimated_cost
if projected >= limit:
return (BudgetAction.BLOCK,
f"Budget exceeded: ${float(projected):.4f} >= ${float(limit):.4f}")
# Reserve the estimated cost
self._reserved[user_id] = reserved + estimated_cost
if projected >= limit * Decimal("0.8"):
return (BudgetAction.WARN,
f"WARNING: {float(projected/limit):.0%} of daily budget used")
return (BudgetAction.ALLOW, "OK")
def reconcile(self, user_id: str, estimated_cost: Decimal, actual_cost: Decimal) -> None:
"""Settle reservation with actual cost. Call after task completes."""
self._daily[user_id] = self._daily.get(user_id, Decimal("0")) + actual_cost
reserved = self._reserved.get(user_id, Decimal("0"))
self._reserved[user_id] = max(Decimal("0"), reserved - estimated_cost)
# === The Complete Instrumented Agent Loop (Section 8) ===
class InstrumentedAgent:
def __init__(self):
self.budget = BudgetController()
def run_task(self, task_input: str, user_id: str,
provider: str = "openai", model: str = "gpt-5.4-mini",
max_turns: int = 5, max_retries: int = 3) -> TaskCostTracker:
task_id = str(uuid.uuid4())[:8]
tracker = TaskCostTracker(task_id=task_id, user_id=user_id, task_type="general")
context = [{"role": "user", "content": task_input}]
# Estimate worst-case cost for the entire task
est_max_input = 20000 # worst-case total input tokens
est_max_output = 4000 # worst-case total output tokens
est_max_cost = calc_token_cost(provider, model, est_max_input, est_max_output)
# Add worst-case tool call estimates (one search per two turns)
est_max_cost += calc_tool_cost("web_search", duration_s=2.0) * Decimal(max_turns // 2)
# Pre-authorize budget
action, reason = self.budget.request_budget(user_id, est_max_cost)
if action == BudgetAction.BLOCK:
print(f"[BUDGET STOP] {reason}")
return tracker
if action == BudgetAction.WARN:
print(f"[BUDGET WARN] {reason}")
actual_cost = Decimal("0")
turn = 0
while turn < max_turns:
turn += 1
for attempt in range(1, max_retries + 1):
try:
input_tokens = len(str(context)) // 4 + 2000
output_tokens = 300 + (turn * 50)
cache_tokens = min(1500, input_tokens) if turn > 1 else 0
cost = tracker.add_llm_cost(provider, model, input_tokens, output_tokens, cache_tokens)
actual_cost += cost
if attempt < max_retries and turn == 2:
raise TimeoutError("Simulated LLM timeout")
break
except TimeoutError:
wasted_tokens = input_tokens // 2
wasted_cost = calc_token_cost(provider, model, wasted_tokens, 0)
tracker.record_retry(attempt, "timeout", wasted_tokens, wasted_cost)
if attempt == max_retries:
self.budget.reconcile(user_id, est_max_cost, actual_cost)
return tracker
if turn % 2 == 0:
tool_cost = tracker.add_tool_cost("web_search", duration_s=1.5)
actual_cost += tool_cost
context.append({"role": "tool", "content": "search result..."})
if turn >= 3:
tracker.mark_success()
break
# Reconcile actual cost vs. estimated
self.budget.reconcile(user_id, est_max_cost, actual_cost)
print(tracker.summary())
return tracker
# === Demo ===
if __name__ == "__main__":
agent = InstrumentedAgent()
agent.budget.set_limit("user-42", daily_usd=0.50)
agent.budget.set_limit("user-7", daily_usd=5.00)
tasks = [
("Review PR #342 for security issues", "user-42", "openai", "gpt-5.4-mini"),
("Generate unit tests for auth module", "user-7", "anthropic", "claude-haiku-4-5"),
("Analyze database query performance", "user-42", "deepseek", "deepseek-v4-flash"),
("Refactor payment service error handling", "user-7", "openai", "gpt-5.4"),
("Audit access control in middleware", "user-42", "deepseek", "deepseek-v4-flash"),
("Optimize Docker image build pipeline", "user-7", "anthropic", "claude-sonnet-4-6"),
("Fix race condition in WebSocket handler", "user-42", "openai", "gpt-5.4-mini"),
("Write migration for user preferences schema", "user-42", "deepseek", "deepseek-v4-flash"),
("Benchmark new caching layer with Redis", "user-7", "openai", "gpt-5.4-mini"),
("Document API rate limiting strategy", "user-42", "deepseek", "deepseek-v4-flash"),
]
total_cost = Decimal("0")
total_waste = Decimal("0")
for task_input, user_id, provider, model in tasks:
tracker = agent.run_task(task_input, user_id=user_id, provider=provider, model=model)
total_cost += tracker.total_cost
total_waste += tracker.waste_usd
print(f"\n{'='*60}")
print(f"TOTAL COST (10 tasks): ${float(total_cost):.6f}")
print(f"TOTAL WASTE: ${float(total_waste):.6f}")
print(f"OVERALL WASTE RATIO: {float(total_waste/total_cost):.1%}" if total_cost > 0 else "N/A")
print(f"{'='*60}")
# Typical: TOTAL COST $0.123456 | TOTAL WASTE $0.024691 | WASTE RATIO 20.0%
# At 10,000 tasks/day: $24.69/day or $9,012/year wasted on retries.
This complete example demonstrates the core design principle: every line of LLM call and tool execution is wrapped in cost tracking, every cent has an owner, every dollar of waste is quantified.
Frequently Asked Questions
1. My agent runs a few hundred times a day—what's the simplest way to start tracking costs?
Extract the usage field from every API response (input_tokens/output_tokens), multiply by the model's price per token, and write to structured logs. No new tools required—wrap your existing LLM call code in a 10-line wrapper to see the approximate dollar cost of each call. Run this for a week to get a magnitude sense of your total spend, then progressively refine to per-task attribution and tool call cost tracking. The simplest starter is Section 2's calculate_token_cost() function—feed it (provider, model, input_tokens, output_tokens). Don't bother with multi-tenant attribution or budget controls yet—those are L3/L4 concerns; L2 is enough for your first "wow moment."
2. Should I track token costs and tool call costs separately?
Yes, and you must. Token costs come from LLM providers (OpenAI, Anthropic, DeepSeek—each with independent billing). Tool call costs can come from entirely different vendors (search API usage, database queries, container runtime). Separate tracking means when your bill spikes, you know immediately whether it's model usage or a third-party API that went haywire. If your SerpAPI monthly cost jumps from $200 to $2,000 but you've lumped it into "total cost," you could spend hours troubleshooting. Separate tracking also produces better optimization decisions: if tool calls account for 40% of costs, reduce search rounds; if tokens account for 80%, focus on prompt caching or switch to a cheaper model.
3. Is tracking retry waste really worth the engineering effort?
Absolutely. The core argument is in Section 4: retry waste accounts for 15–30% of total agent cost in production (illustrative scenario; actual ratios depend on agent design and retry policies). If your team isn't tracking this, you're likely burning significant money without knowing it. Concrete case (illustrative case study — figures are representative, not sourced from a specific published study): a team set their agent timeout to 5 seconds, thinking "fail fast, retry fast is good." When they implemented retry waste tracking, they discovered 22% of LLM calls triggered timeout retries (their LLM provider's response time regularly exceeded 5 seconds at peak), wasting $2,800/month. After raising the timeout from 5s to 15s, the retry rate dropped to 4% and monthly waste fell to $510—a 5.5× ROI. Without the retry waste metric, they'd never have known. Implementation cost is low—Section 4's AgentRunCostTracker is under 80 lines; integrating it requires one line of tracker.record_retry() on each exception path.
4. DeepSeek vs. Claude/GPT—is the massive cost difference worth the quality tradeoff?
There's no universal answer, but there's a clear decision framework. For structured extraction, classification, code completion, and text summarization, DeepSeek-V4-Flash is ~36× cheaper than GPT-5.5 (by input token) with minimal quality gap (typically < 5% accuracy difference). For complex reasoning, multi-step tool use, and strict format adherence, Claude Sonnet or GPT-5.4 are more reliable—their tool call success rate is approximately 8–12 percentage points higher (illustrative range; actual depends on task type). Best practice is Section 6's cost-aware routing: route simple tasks to DeepSeek, complex tasks to Claude/GPT, and when budget is tight, degrade everything to DeepSeek. Continuously monitor using Section 5's cost-per-successful-task metric. For cross-model agent architecture, see Building Model-Agnostic AI Agents.
5. Hard stop or soft warning for budget control—which one and when?
Depends on context—the two aren't mutually exclusive, and most production systems should use both. Hard stops suit internal agent pipelines and batch processing: no "user experience" concern, and exceeding budget means the task has failed. Hard stops prevent the worst-case scenario—a runaway agent loop burning hundreds of dollars at 3 AM with no one watching. Soft warnings suit end-user-facing paid products: notify users at 80% budget that "AI usage is nearing the limit for this billing cycle" and let them decide. A hard stop creates terrible UX (an in-progress task abruptly interrupted). You can also blend: task-level hard stop + user-level soft warning—no single task can exceed $5, but users can exceed their $500 monthly budget. The critical point: run the budget check before every LLM call, not after the task as "post-hoc reconciliation."
6. How do I measure "was the spend worth it"? How do I calculate cost-to-quality ratio?
The core approach is correlating cost with task outcomes. For every task, record cost_usd + outcome: success/failure + quality_score: 0–1 (if you have an evaluation framework). Key metrics: cost-per-successful-task (dollar cost excluding failed-task waste), waste-ratio (failed-task cost / total cost), cost-per-quality-point (total cost / total quality score). Example: Model A's cost-per-successful-task is $0.05 with 94% success rate; Model B's is $0.12 with 97% success rate. Model B costs 2.4× more but only improves success rate by 3 percentage points—in most scenarios, this isn't worth it. Unless those 3% of failures cause business losses far exceeding $0.07 (compliance violations, customer churn). For obtaining quality_score within an evaluation framework, see Agent Evaluation Framework Design.
7. Should I build cost tracking myself or use LangSmith / LangFuse / Helicone?
These aren't mutually exclusive. If your team already uses LangChain/LangSmith or a similar framework, leverage its automatic token counting to reach L2 (dollar pricing) quickly. But these platforms have two limitations: (1) tool call costs typically require manual reporting—your search API bill won't automatically appear in LangSmith; you'll need to inject it via the usage_metadata.total_cost field. (2) Cross-platform cost aggregation is difficult—if you use both OpenAI and DeepSeek, or some agent calls go through LangChain while others use native SDKs, costs scatter across multiple systems. The Provider Pricing Registry and CostRecord data model advocated in this article are vendor-neutral—you can run them independently of these platforms or import their data into a unified cost_records table. Decision criteria: if you only need L1–L2 tracking, off-the-shelf tools are the least effort; if you need L3 (multi-dimensional attribution) and L4 (cost control loop), building in-house or layering an abstraction on top of existing tools is more flexible.
8. Prompt caching, batch API, context compression—which should I prioritize for cost reduction?
Rank by ROI: (1) Prompt caching—immediate impact, zero code changes. Most agents have a fixed system prompt (role definition, tool list, output format instructions). Enabling prompt caching drops this portion of input token cost by 90% (OpenAI/Anthropic) to 98% (DeepSeek). If your system prompt is 5,000 tokens and you run 10,000 calls/day, that's $112/day saved on GPT-5.4. The only requirement: the system prompt must be at the beginning of the message list. (2) Context compression—medium effort, high return. Summarizing or truncating tool results before they enter context can reduce input token growth by 30–50%. See Agent Context Window Management. (3) Batch API—most overlooked. Any task that doesn't require real-time response (offline evaluation, batch analysis, overnight report generation) should use batch API for an instant 50% discount. Priority: enable prompt caching today, plan context compression next, and migrate offline tasks to batch API last.
Next Steps / Further Reading
- Agent Observability—Cost is the fourth dimension beyond the observability triad (metrics/traces/logs); understanding latency and error-rate observability is a prerequisite for cost observability
- Agent Evaluation Framework Design—Cost justifiability requires correlation with task quality; the cost-per-quality-point metric bridges cost observability and evaluation systems
- Agent Audit Log Design—Immutable audit logs are the foundational data source for cost attribution, providing cross-check capability; every cost record can be matched to an operation in the audit log
- Agent Release Gate Design—Integrate cost budget checks into the release pipeline; block releases if the new version's cost-per-task exceeds the previous version's by more than 30%
- Agent Rollback Design—Abnormal retry cost spikes can trigger automated rollback decisions; integrate cost spike detection into rollback trigger conditions
- Agent Context Window Management—Context compression and prompt caching are the most direct and effective cost reduction measures; upstream partners with this article's price calculator
- Agent Tool Design Best Practices—Tool definition token overhead is a hidden cost; optimizing tool schemas directly reduces fixed cost per LLM call
- Building Model-Agnostic AI Agents—Multi-model routing is the foundation of cost-aware model selection; understanding cross-model agent architecture is necessary to implement cost-aware routing
- MCP Protocol Production Guide—Cross-process costs from MCP tool calls require unified tracking and attribution