Agent Cost Observability: Tracking Tokens, Tool Calls, and Retry Costs

Q: My agent runs a few hundred times a day—what's the simplest way to start tracking costs?

Extract the usage field from every API response (input_tokens/output_tokens), multiply by the model's price per token, and write to structured logs. No new tools required—wrap your existing LLM call code in a 10-line wrapper to see the approximate dollar cost of each call. Run this for a week to get a magnitude sense of your total spend, then progressively refine to per-task attribution and tool call cost tracking. The simplest starter is calculate_token_cost()—feed it (provider, model, input_tokens, output_tokens). Don't bother with multi-tenant attribution or budget controls yet—those are L3/L4 concerns; L2 is enough for your first wow moment.
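The wrapper described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `log_call_cost`, the `PRICES` table, and the logger name are hypothetical, and the two prices are the GPT-5.4 figures used later in this article.

```python
import json
import logging
from decimal import Decimal

logger = logging.getLogger("llm_cost")  # hypothetical logger name

# Hypothetical price subset, USD per 1M tokens: (input, output).
PRICES = {("openai", "gpt-5.4"): (Decimal("2.50"), Decimal("15.00"))}


def log_call_cost(provider, model, input_tokens, output_tokens):
    """Convert one call's token usage to dollars and emit a JSON log line."""
    in_price, out_price = PRICES[(provider, model)]
    cost = (input_tokens * in_price + output_tokens * out_price) / Decimal(1_000_000)
    record = {
        "provider": provider,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": str(cost),  # string keeps the exact Decimal value
    }
    logger.info(json.dumps(record))
    return record


record = log_call_cost("openai", "gpt-5.4", 8200, 450)
```

Summing `cost_usd` over a week of these log lines gives the order-of-magnitude view of spend before any per-task attribution exists.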

Q: Is tracking retry waste really worth the engineering effort?

Yes, in most production systems. Retry waste commonly accounts for 15–30% of total agent cost in production (illustrative scenario), so a team that isn't tracking it is likely burning significant money without knowing. A concrete case (illustrative case study; figures are representative, not sourced from a specific published study): a team set its agent timeout to 5 seconds; after implementing retry waste tracking, they discovered that 22% of LLM calls triggered timeout retries, wasting $2,800/month. Raising the timeout from 5s to 15s dropped the retry rate to 4% and monthly waste to $510, a roughly 5.5x reduction that saves about $2,290/month. Implementation cost is low: AgentRunCostTracker is under 80 lines, and integrating it takes one tracker.record_retry() call on each exception path.

Q: DeepSeek vs. Claude/GPT—is the massive cost difference worth the quality tradeoff?

There's no universal answer, but there is a clear decision framework. For structured extraction, classification, and code completion, DeepSeek-V4-Flash is ~36x cheaper than GPT-5.5 on input tokens with only a small quality gap (hypothetical: on one reference eval set the accuracy difference was < 5%; this is illustrative, and production teams must benchmark against their own eval set). For complex reasoning and multi-step tool use, Claude/GPT are more reliable: their tool call success rate is ~8–12 percentage points higher (illustrative range). Best practice is cost-aware routing: send simple tasks to DeepSeek and complex tasks to Claude/GPT, and degrade everything to DeepSeek when budget is tight. Monitor cost-per-successful-task: if DeepSeek's is 1/30 of GPT-5.5's while its success rate is only 3 percentage points lower, route more traffic to DeepSeek.
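The routing rule above can be sketched as a small pure function. This is a sketch under stated assumptions: `route_model`, the `SIMPLE_TASKS` set, and the 10% "tight budget" threshold are hypothetical choices for illustration, and the model identifiers are the ones used in this article's pricing table.

```python
# Task kinds the text treats as "simple" (hypothetical labels).
SIMPLE_TASKS = {"structured_extraction", "classification", "code_completion"}

CHEAP_MODEL = ("deepseek", "deepseek-v4-flash")
STRONG_MODEL = ("anthropic", "claude-sonnet-4-6")


def route_model(task_kind, budget_remaining_fraction, tight_budget_threshold=0.10):
    """Pick a (provider, model) pair from task type and remaining budget."""
    # When budget is tight, degrade everything to the cheap model.
    if budget_remaining_fraction <= tight_budget_threshold:
        return CHEAP_MODEL
    # Simple tasks go to the cheap model; everything else to the strong one.
    if task_kind in SIMPLE_TASKS:
        return CHEAP_MODEL
    return STRONG_MODEL
```

Keeping the decision in one function makes it easy to log which branch fired, which is what cost-per-successful-task monitoring needs in order to judge the routing.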

Q: Hard stop or soft warning for budget control—which one and when?

Depends on context—the two aren't mutually exclusive. Hard stops suit internal agent pipelines and batch processing—no user experience concern, and exceeding budget means task failure. Soft warnings suit end-user-facing paid products—notify at 80% budget and let users decide. Blend: task-level hard stop (single task can't exceed $5) + user-level soft warning (users can exceed $500 monthly for high-value tasks). Critical: run the budget check before every LLM call, not after the task as post-hoc reconciliation.
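The blend of a task-level hard stop and a user-level soft warning can be sketched as a check that runs before each call. This is a minimal sketch, not a definitive implementation: `BudgetGuard`, `BudgetExceeded`, and the `estimated_cost_usd` argument are hypothetical names, the $5 and $500 limits come from the example above, and the 80% warning threshold is applied to the monthly user budget.

```python
from dataclasses import dataclass, field
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised when a call would push a task past its hard limit."""


@dataclass
class BudgetGuard:
    task_hard_limit_usd: Decimal = Decimal("5")
    user_soft_limit_usd: Decimal = Decimal("500")
    warn_fraction: Decimal = Decimal("0.8")
    task_spend_usd: Decimal = Decimal("0")
    user_month_spend_usd: Decimal = Decimal("0")
    warnings: list = field(default_factory=list)

    def check_before_call(self, estimated_cost_usd):
        """Run before every LLM call, not as post-hoc reconciliation."""
        # Hard stop: the task may never exceed its own limit.
        if self.task_spend_usd + estimated_cost_usd > self.task_hard_limit_usd:
            raise BudgetExceeded("task would exceed its hard limit")
        # Soft warning: notify, but let the call proceed.
        projected = self.user_month_spend_usd + estimated_cost_usd
        if projected >= self.user_soft_limit_usd * self.warn_fraction:
            percent = int(projected / self.user_soft_limit_usd * 100)
            self.warnings.append(f"user at {percent}% of monthly budget")

    def record_actual(self, cost_usd):
        """Record the real cost once the call has completed."""
        self.task_spend_usd += cost_usd
        self.user_month_spend_usd += cost_usd
```

The estimate passed to `check_before_call` can be rough (for example, prompt tokens plus the configured output cap); the point is that the check happens while the spend can still be prevented.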

Q: How do I measure 'was the spend worth it'? How do I calculate cost-to-quality ratio?

Correlate cost with task outcomes. For every task, record cost_usd + outcome: success/failure + quality_score: 0–1. Key metrics: cost-per-successful-task (dollar cost excluding failed-task waste), waste-ratio (failed-task cost / total cost), cost-per-quality-point (total cost / total quality score). Example: Model A's cost-per-successful-task is $0.05 with a 94% success rate; Model B's is $0.12 with a 97% success rate. Model B costs 2.4x more but improves success rate by only 3 percentage points. The extra $0.07 per task buys roughly 3 fewer failures per 100 tasks, so Model B is worth it only if each avoided failure costs the business more than about $2.33 ($0.07 / 0.03).
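The three metrics can be computed directly from per-task records. This is a minimal sketch assuming a hypothetical `TaskOutcome` record and `cost_quality_metrics` helper; it follows the definitions in the paragraph above, so cost-per-successful-task divides only the cost of successful tasks by their count.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class TaskOutcome:
    cost_usd: Decimal
    success: bool
    quality_score: Decimal  # 0 to 1


def cost_quality_metrics(tasks):
    """Return the three cost-to-quality metrics for a batch of tasks."""
    total_cost = sum(t.cost_usd for t in tasks)
    failed_cost = sum(t.cost_usd for t in tasks if not t.success)
    n_success = sum(1 for t in tasks if t.success)
    total_quality = sum(t.quality_score for t in tasks)
    return {
        # Cost of successful tasks only, per successful task.
        "cost_per_successful_task":
            (total_cost - failed_cost) / n_success if n_success else None,
        # Share of spend that went to tasks that ultimately failed.
        "waste_ratio": failed_cost / total_cost if total_cost else Decimal("0"),
        "cost_per_quality_point":
            total_cost / total_quality if total_quality else None,
    }


tasks = [
    TaskOutcome(Decimal("0.05"), True, Decimal("0.9")),
    TaskOutcome(Decimal("0.05"), True, Decimal("0.8")),
    TaskOutcome(Decimal("0.05"), True, Decimal("1.0")),
    TaskOutcome(Decimal("0.05"), False, Decimal("0")),
]
metrics = cost_quality_metrics(tasks)
```

In this made-up batch one task in four fails, so a quarter of the spend is waste even though the per-success cost looks healthy.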

Q: Should I build cost tracking myself or use LangSmith / LangFuse / Helicone?

These aren't mutually exclusive. Off-the-shelf tools can quickly reach L2 (dollar pricing), but have two limitations: tool call costs usually require manual reporting, and cross-platform cost aggregation is difficult if you use multiple providers. The Provider Pricing Registry and CostRecord data model in this article are vendor-neutral—you can run them independently or import tool data into a unified cost_records table. Decision criteria: if you only need L1–L2 tracking, use off-the-shelf; if you need L3 (multi-dimensional attribution) and L4 (cost control loop), building in-house or layering an abstraction on top of existing tools is more flexible.

Q: Prompt caching, batch API, context compression—which should I prioritize for cost reduction?

Rank by ROI: (1) Prompt caching: immediate impact with little or no code change (some providers cache repeated prefixes automatically; others require you to mark the cacheable prefix). Most agents have fixed system prompts; enabling caching drops this input cost by 90–98%. With a 5,000-token system prompt at 10,000 calls/day, that saves about $112/day on GPT-5.4 (50M tokens/day at $2.50 vs. $0.25 per MTok). (2) Context compression: medium effort, high return. Summarizing tool results before they enter context reduces input token growth by 30–50%. (3) Batch API: the most overlooked lever. Offline tasks get an instant 50% discount with near-zero engineering effort. Priority: enable prompt caching today, plan context compression next, and migrate offline tasks to the batch API last.
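The prompt caching figure can be checked with a few lines of arithmetic. This is a sketch with a hypothetical helper name, `daily_cache_savings`; it assumes every call hits the cache and ignores any cache-write surcharge, so it is an upper bound on the savings.

```python
from decimal import Decimal


def daily_cache_savings(prompt_tokens, calls_per_day,
                        input_per_mtok, cache_read_per_mtok):
    """Daily savings (USD) from serving a fixed prompt prefix out of cache."""
    tokens_per_day = Decimal(prompt_tokens * calls_per_day)
    uncached = tokens_per_day * input_per_mtok / Decimal(1_000_000)
    cached = tokens_per_day * cache_read_per_mtok / Decimal(1_000_000)
    return uncached - cached


# GPT-5.4 prices from this article: $2.50 input, $0.25 cache read per MTok.
savings = daily_cache_savings(5_000, 10_000, Decimal("2.50"), Decimal("0.25"))
```

With these inputs the uncached cost is $125/day and the cached cost is $12.50/day, which is where the roughly $112/day figure comes from.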

June 13, 2026 · Intermediate

30-Second Takeaway

The Core Problem: When agents run in production, API bills climb but no one can answer: “How much did this task cost?”, “Where did the money go — tokens, tool calls, or retries?”, “Which user burned the most?” Without cost observability, you’re spending blind.
Four Tracking Dimensions: Token costs (per provider/model with exact pricing) → Tool call costs (LLM function definition overhead + third-party API fees) → Retry waste (15–30% of total cost — illustrative scenario, the most hidden drain) → Per-user/tenant attribution (who’s burning the budget in multi-tenant systems).
Key Design Insight: Cost tracking granularity must match agent loop granularity — every LLM call, every tool execution, every retry is an independent cost record. The Provider Pricing Registry pattern means one table changes when model prices change.
What You’ll Build: A complete cost observability layer for your agent system — from token pricing to tool cost registries, from retry waste quantification to per-user cost attribution, from budget hard stops to cost-aware model routing, all unified with OpenTelemetry + Prometheus into a single cost dashboard.

1. Why Cost Observability Is Different for Agents

June 2026. A production code review agent handles 800 PRs per day. Its monthly bill is $4,200. The team knows the number, but can't answer three questions: How much of that $4,200 went toward valuable reviews versus being consumed by retries? How is cost distributed across different customers? If traffic doubles next month, should the budget be $8,000 or $12,000?

These aren't "dashboard prettiness" problems — they're cost visibility gaps. Traditional observability's three pillars (metrics/traces/logs) tell you about latency and error rates, but they can't tell you where the dollars flow. Cost is the systematically neglected fourth dimension of observability.

Why Traditional Cost Tracking Breaks for Agents

If you only run single-shot LLM calls ("summarize this document," "translate this text"), cost tracking is straightforward: one cost record per request, sum at month-end. Agents are fundamentally different — they're multi-step loop bodies: think → call tool → observe result → think again → call another tool → ... → final output. A single task can trigger 5–30 LLM calls, 3–15 tool invocations, and 0–5 retries.

This means agent cost isn't "how much was one request" — it's a multi-dimensional cost vector:

Token costs: Every LLM call's input/output token spend, varying wildly by provider and model — the same prompt costs ~$0.0003 on DeepSeek-V4-Flash versus ~$0.015 on GPT-5.5 (a 50× difference).
Tool call costs: Agents call search APIs, run database queries, execute code in containers — all with independent pricing models unrelated to the LLM provider.
Retry costs: Agents invoke wrong tools, hit timeouts, suffer hallucinations — these failed steps consume tokens and tool fees but generate zero value.
Attribution costs: In multi-tenant SaaS, the same agent pipeline serves requests from 200 customers — each customer's cost must be independently tracked and billed.

Agent cost observability and traditional observability are complementary: latency tells you "which step was slow," cost tells you "which step was expensive" — cross-analyze both to answer "was the spend worth it?" For the complete observability framework (metrics/traces/logs), see Agent Observability. Cost is the fourth dimension — its data source is still traces and metrics, but its output is dollars, not milliseconds.

The Agent Loop Cost Structure: A Concrete Example

Here's a real cost breakdown for a single code review agent task (using GPT-5.4):

Step	Operation	Input Tokens	Output Tokens	Token Cost	Tool Cost	Cumulative
1	LLM reasoning (analyze PR diff)	8,200	450	$0.0273	—	$0.0273
2	Tool: git blame (check author)	10,500	180	$0.0290	$0.0001	$0.0564
3	LLM reasoning (analyze blame results)	11,200	320	$0.0328	—	$0.0892
4	Tool: search related issues	12,800	200	$0.0350	$0.003	$0.1272
5	Retry (Step 4 timed out)	12,800	200	$0.0350	$0.003	$0.1652
6	LLM reasoning (synthesize information)	15,000	500	$0.0450	—	$0.2102
7	LLM output review comments	15,800	1,200	$0.0575	—	$0.2677

Key finding: In a single $0.27 task, $0.038 (14.2%) paid for a duplicated attempt: Step 4 timed out and Step 5 repeated it. The retry succeeded, so the task "looks fine." But if every task carried one such retry, at 800 tasks/day this 14.2% would amount to about $30/day of waste, roughly $900/month. Without step-level cost tracking, you'd only see "each task costs ~$0.27" and never know this waste exists.

Core insight: Agent cost tracking must operate at step-level granularity. Treating the entire task as a single cost unit masks three critical questions: which steps burn the most money, what fraction goes to retry waste, and whether tool calls cost more than LLM reasoning. None of these can be answered without step-level data, and cost optimization without answers is guesswork.
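As a rough illustration of what step-level data makes possible, the retry share in the table above can be recomputed from per-step records. This is a sketch: the tuple layout and the `retry_share` helper are hypothetical, and the dollar values are copied from the table.

```python
from decimal import Decimal

# (step, is_retry, token_cost_usd, tool_cost_usd), copied from the table.
STEPS = [
    (1, False, "0.0273", "0"),
    (2, False, "0.0290", "0.0001"),
    (3, False, "0.0328", "0"),
    (4, False, "0.0350", "0.003"),
    (5, True,  "0.0350", "0.003"),   # repeat of Step 4 after the timeout
    (6, False, "0.0450", "0"),
    (7, False, "0.0575", "0"),
]


def retry_share(steps):
    """Return (total cost, retry cost, retry fraction) for one task."""
    total = sum(Decimal(tok) + Decimal(tool) for _, _, tok, tool in steps)
    retry = sum(Decimal(tok) + Decimal(tool)
                for _, is_retry, tok, tool in steps if is_retry)
    return total, retry, retry / total


total, retry, share = retry_share(STEPS)
```

With task-level records only, `total` is all that would be visible; the `is_retry` flag on each step is what exposes the 14.2%.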

The Cost Observability Maturity Model

Don't build all dimensions at once — start with a minimal viable version and add incrementally. The table below defines four maturity levels:

Level	What You Track	Questions You Can Answer	Engineering Effort
L1: Token Counting	Input/output tokens per LLM call	"How many tokens did I use this month?"	Extract `usage` from API response — 10 lines of code
L2: Dollar Pricing	L1 + Provider Pricing Registry → USD	"Is GPT-5.4 cheaper than DeepSeek-V4 for my workload?"	Maintain model pricing table + cost calculator function
L3: Multi-Dimensional Attribution	L2 + tool call costs + retry waste + per-user breakdown	"How much did Customer A spend last month? How much was wasted?"	CostRecord data model + trace context propagation + SQL aggregation
L4: Cost Control Loop	L3 + budget alerts + cost-aware routing + automated cost reduction	"Budget is nearly exhausted — auto-switch to cheaper model"	BudgetController + CostAwareRouter + Prometheus alerts

This article covers L2–L4, assuming you already have L1 (if not, Section 2's code will get you there in one shot).

2. Token Cost Tracking Per Provider

Token cost is the starting point of observability — if your agent can't track how many tokens each call consumes, tool costs, retry analysis, and budget controls have nothing to build on. This section's goal: track token costs across OpenAI, Anthropic, DeepSeek, and any other provider through a single, unified interface.

Provider Pricing Comparison (June 2026)

Before choosing models, look at the numbers. Here's the current per-million-token pricing for major models:

Provider	Model	Input ($/MTok)	Output ($/MTok)	Cache Read ($/MTok)	Batch Discount	Notes
OpenAI	GPT-5.5	$5.00	$30.00	$0.50	50%	Best reasoning, most expensive
OpenAI	GPT-5.4	$2.50	$15.00	$0.25	50%	Production workhorse
OpenAI	GPT-5.4-mini	$0.75	$4.50	$0.075	50%	Simple routing
OpenAI	GPT-5.4-nano	$0.20	$1.25	$0.02	50%	Cheapest GPT (account-specific pricing)
Anthropic	Claude Opus 4.8	$5.00	$25.00	$0.50	50%	Complex agents
Anthropic	Claude Sonnet 4.6	$3.00	$15.00	$0.30	50%	Recommended production
Anthropic	Claude Haiku 4.5	$1.00	$5.00	$0.10	50%	Lightweight tasks
DeepSeek	DeepSeek-V4-Flash	$0.14	$0.28	$0.0028	—	Extreme value
DeepSeek	DeepSeek-V4-Pro	$0.435	$0.87	$0.003625	—	DeepSeek flagship

Key takeaways from the table:

Massive price spread: GPT-5.5's input price is 35.7× DeepSeek-V4-Flash; output is 107×. A 10K-token prompt costs $0.05 on GPT-5.5 versus $0.0014 on DeepSeek-V4-Flash.
The power of prompt caching: OpenAI and Anthropic price cache reads at 1/10 of regular input; DeepSeek goes lower still, at 1/50 (V4-Flash) to 1/120 (V4-Pro). If your agent has a fixed system prompt (most agents do), enabling prompt caching cuts the cost of that prefix by 90% or more, which can roughly halve total input costs when the prefix makes up most of the context.
Batch API 50% discount: Tasks that don't require real-time responses (offline evaluation, batch analysis, overnight reports) should use batch APIs — this is the most overlooked cost reduction lever.

The Provider Pricing Registry Pattern

Each provider reports usage in an incompatible format. OpenAI returns usage.prompt_tokens, Anthropic returns usage.input_tokens, and DeepSeek follows the OpenAI format but uses different cache-hit field names. Worse, prompt caching details live in completely different field paths across providers.

The solution: a Provider Pricing Registry, a centralized price table paired with a standardized cost calculator. Its core design principle: when vendors change prices, you change one dictionary and every subsequent cost calculation picks up the new price.

from __future__ import annotations

from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Tuple, Optional
from enum import Enum

# ---------------------------------------------------------------------------
# Provider Pricing Registry — single source of truth for all model costs.
# Updated June 2026 with current pricing from OpenAI, Anthropic, DeepSeek.
#
# PRICING CONSTANTS USE Decimal: binary float cannot represent most dollar
# amounts exactly. A single DeepSeek cache-hit token costs $0.0000000028;
# float stores only an approximation, errors accumulate over millions of
# records, and early rounding would zero it out. Decimal preserves exact values.
# ---------------------------------------------------------------------------

class CostCategory(str, Enum):
    TOKEN_INPUT = "token_input"
    TOKEN_OUTPUT = "token_output"
    TOKEN_CACHE_READ = "token_cache_read"
    TOOL_CALL = "tool_call"
    RETRY_WASTE = "retry_waste"

# Prices in USD per 1M tokens — Decimal from string literals (June 2026)
MODEL_PRICING: Dict[Tuple[str, str], Dict[str, Decimal]] = {
    # OpenAI
    ("openai", "gpt-5.5"):        {"input": Decimal("5.00"), "output": Decimal("30.00"), "cache_read": Decimal("0.50")},
    ("openai", "gpt-5.4"):        {"input": Decimal("2.50"), "output": Decimal("15.00"), "cache_read": Decimal("0.25")},
    ("openai", "gpt-5.4-mini"):   {"input": Decimal("0.75"), "output": Decimal("4.50"),  "cache_read": Decimal("0.075")},
    ("openai", "gpt-5.4-nano"):   {"input": Decimal("0.20"), "output": Decimal("1.25"),  "cache_read": Decimal("0.02")},  # varies by account
    # Anthropic
    ("anthropic", "claude-opus-4-8"):   {"input": Decimal("5.00"), "output": Decimal("25.00"), "cache_read": Decimal("0.50")},
    ("anthropic", "claude-sonnet-4-6"): {"input": Decimal("3.00"), "output": Decimal("15.00"), "cache_read": Decimal("0.30")},
    ("anthropic", "claude-haiku-4-5"):  {"input": Decimal("1.00"), "output": Decimal("5.00"),  "cache_read": Decimal("0.10")},
    # DeepSeek: roughly 11–36× cheaper than GPT-5.5 on input (V4-Pro vs. V4-Flash)
    ("deepseek", "deepseek-v4-flash"): {"input": Decimal("0.14"), "output": Decimal("0.28"), "cache_read": Decimal("0.0028")},
    ("deepseek", "deepseek-v4-pro"):   {"input": Decimal("0.435"), "output": Decimal("0.87"), "cache_read": Decimal("0.003625")},
}

# Batch processing gives 50% discount for OpenAI and Anthropic
BATCH_DISCOUNT = Decimal("0.5")
# 1M token divisor
_ONE_MILLION = Decimal("1_000_000")


def _cost_to_display(c: Decimal) -> str:
    """Format a Decimal cost for display. No rounding until display time."""
    return f"${float(c):.8f}"


@dataclass
class TokenCost:
    """Result of a token cost calculation.

    Token counts are integers (not money). All cost fields use Decimal
    to avoid binary float precision loss on small dollar amounts.
    """
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    input_cost_usd: Decimal
    output_cost_usd: Decimal
    total_cost_usd: Decimal

    def __repr__(self) -> str:
        return (f"TokenCost({self.provider}/{self.model}: "
                f"in={self.input_tokens} out={self.output_tokens} cache={self.cache_read_tokens} "
                f"→ {_cost_to_display(self.total_cost_usd)})")


def calculate_token_cost(
    provider: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
    *,
    is_batch: bool = False,
) -> TokenCost:
    """
    Calculate dollar cost from token usage using the central pricing registry.

    All monetary arithmetic uses Decimal — aggregate unrounded, round only at
    display time. Token counts remain plain ints (they're counts, not money).
    """
    prices = MODEL_PRICING.get((provider.lower(), model.lower()))
    if prices is None:
        return TokenCost(
            provider=provider, model=model,
            input_tokens=input_tokens, output_tokens=output_tokens,
            cache_read_tokens=cache_read_tokens,
            input_cost_usd=Decimal("0"), output_cost_usd=Decimal("0"),
            total_cost_usd=Decimal("0"),
        )

    cache_read = min(cache_read_tokens, input_tokens)
    uncached_input = max(0, input_tokens - cache_read)

    input_cost = (
        uncached_input * prices["input"] + cache_read * prices["cache_read"]
    ) / _ONE_MILLION

    output_cost = output_tokens * prices["output"] / _ONE_MILLION

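    # Only OpenAI and Anthropic list a batch discount in the pricing table;
    # other providers (e.g. DeepSeek) are billed at full price even in batch mode.
    batch_eligible = is_batch and provider.lower() in ("openai", "anthropic")
    discount = BATCH_DISCOUNT if batch_eligible else Decimal("1")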
    total = (input_cost + output_cost) * discount

    return TokenCost(
        provider=provider, model=model,
        input_tokens=input_tokens, output_tokens=output_tokens,
        cache_read_tokens=cache_read,
        input_cost_usd=input_cost * discount,
        output_cost_usd=output_cost * discount,
        total_cost_usd=total,
    )


# ---------------------------------------------------------------------------
# Usage — the calculator works with all providers through a single interface
# ---------------------------------------------------------------------------

# Example: GPT-5.4 call with 8000 input, 2000 output, 3000 cache hits
cost_gpt = calculate_token_cost("openai", "gpt-5.4", 8000, 2000, 3000)
print(cost_gpt)
# TokenCost(openai/gpt-5.4: in=8000 out=2000 cache=3000 → $0.04325000)

# Example: Same tokens on DeepSeek-V4-Flash, about 34× cheaper
cost_ds = calculate_token_cost("deepseek", "deepseek-v4-flash", 8000, 2000, 3000)
print(cost_ds)
# TokenCost(deepseek/deepseek-v4-flash: in=8000 out=2000 cache=3000 → $0.00126840)

# Batch mode — additional 50% off for OpenAI/Anthropic
cost_gpt_batch = calculate_token_cost("openai", "gpt-5.4", 8000, 2000, 3000, is_batch=True)
print(f"Batch: {_cost_to_display(cost_gpt_batch.total_cost_usd)}")
# Batch: $0.02162500

This Registry design has several deliberate choices:

The lookup key is a (provider, model) tuple: Model names can collide across providers ("mini" means different things for OpenAI and Anthropic). Tuple keys eliminate ambiguity.
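Cache read token handling: Subtract cache tokens first, then price the remainder at the regular rate, which prevents double-counting. This assumes input_tokens includes the cached tokens. OpenAI and Anthropic report cache reads in different response fields (and Anthropic reports them separately from input_tokens, so its adapter must add them back), but once normalized they flow into the same function parameter.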
Unknown model → zero cost: This is intentional — in production, you should alert on this. Silently recording zero cost is better than throwing an exception, because cost tracking is a non-functional concern that should never block the agent's functional path.

Parsing Usage Responses Across Providers

The calculator above assumes you already have a unified (provider, model, input_tokens, output_tokens, cache_read_tokens) tuple. But each provider's API response structure is different, so you need a thin adapter layer:

def extract_usage_openai(response: dict) -> tuple[int, int, int]:
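    """Extract (input_tokens, output_tokens, cache_read_tokens) from an OpenAI
    Chat Completions response. prompt_tokens already includes cached tokens."""
    usage = response.get("usage") or {}
    # Chat Completions nests cache hits under usage.prompt_tokens_details.
    details = usage.get("prompt_tokens_details") or {}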
    return (
        usage.get("prompt_tokens", 0),
        usage.get("completion_tokens", 0),
        details.get("cached_tokens", 0),
    )

def extract_usage_anthropic(response: dict) -> tuple[int, int, int]:
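    """Extract (input_tokens, output_tokens, cache_read_tokens) from Anthropic response.

    Anthropic reports cache reads separately from input_tokens, so they are
    added back here: calculate_token_cost() expects input_tokens to include
    them. Cache-write tokens (cache_creation_input_tokens) are not priced.
    """
    usage = response.get("usage") or {}
    cache_read = usage.get("cache_read_input_tokens") or 0
    return (
        usage.get("input_tokens", 0) + cache_read,
        usage.get("output_tokens", 0),
        cache_read,
    )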

def extract_usage_deepseek(response: dict) -> tuple[int, int, int]:
    """Extract (input_tokens, output_tokens, cache_read_tokens) from DeepSeek response."""
    usage = response.get("usage", {})
    return (
        usage.get("prompt_tokens", 0),
        usage.get("completion_tokens", 0),
        usage.get("prompt_cache_hit_tokens", 0),
    )

# Provider-specific extractors registry
USAGE_EXTRACTORS = {
    "openai": extract_usage_openai,
    "anthropic": extract_usage_anthropic,
    "deepseek": extract_usage_deepseek,
}

def track_llm_call(
    provider: str, model: str, response: dict, *, is_batch: bool = False
) -> TokenCost:
    """Unified entry point: extract usage from any provider response and calculate cost."""
    extractor = USAGE_EXTRACTORS.get(provider.lower())
    if extractor is None:
        raise ValueError(f"Unknown provider: {provider}")
    in_tok, out_tok, cache_tok = extractor(response)
    return calculate_token_cost(provider, model, in_tok, out_tok, cache_tok, is_batch=is_batch)

For prompt caching and context compression strategies that further reduce token costs, see Agent Context Window Management. These techniques and the price calculator form an upstream-downstream relationship — context management reduces token volume, and the calculator converts remaining tokens to dollars.

3. Tool Call Cost Accounting

Many teams track token costs and assume they've completed cost observability — they're wrong. Tool calls have three layers of cost, and the first layer (function definition token overhead) is almost universally ignored.

The Three-Layer Tool Cost Model

Cost Layer	Source	Typical Magnitude	Billed By
L1: Function definition tokens	Tool JSON schemas occupy tokens in every LLM request's system prompt	290–806 tokens per tool (including parameter descriptions)	LLM Provider
L2: Function call + result tokens	LLM generates a tool_use block + tool result is injected back into context	200–2,000 tokens per invocation	LLM Provider
L3: External API / infrastructure	Search API per-call pricing, database query connections, container runtime CPU	$0.001–0.05 per call	Third-party vendors / cloud platform

L1 is the biggest blind spot. Suppose your agent defines 8 tools, each averaging 450 tokens for its JSON Schema plus description. In an agent loop, these definitions are sent with every LLM request, whether in the system prompt or through the API's tools parameter. If the agent makes 12 LLM calls per task: 8 tools × 450 tokens × 12 calls = 43,200 tokens consumed purely on tool definitions. These tokens produce zero reasoning value; they exist only so the LLM knows what it can do. On GPT-5.4, those 43,200 input tokens cost $0.108.

Critical engineering decision: Tool definition granularity directly impacts cost. Eight individual tools and one "multi-function tool" (differentiated by a parameter) can be functionally equivalent, but the former consumes 3,150 extra tokens per request (seven additional 450-token schemas). At GPT-5.4's $2.50/MTok input price that is $0.007875 per request; at 10,000 LLM calls/day, those extra definitions burn $0.007875 × 10,000 = $78.75/day (approximately $2,362.50 per 30-day month). Before adding a new tool, ask: must this tool be declared on every request, or can it be a fallback (added only when needed)? See Agent Tool Design Best Practices for a detailed discussion of tool definition optimization.
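Both overhead figures above come from the same multiplication, sketched here with a hypothetical helper, `definition_overhead_usd`. The 450-token schema size and the $2.50/MTok GPT-5.4 input price are the figures used in the text.

```python
from decimal import Decimal


def definition_overhead_usd(num_tools, tokens_per_tool, llm_calls, input_per_mtok):
    """Tokens and dollars spent purely on re-sending tool definitions."""
    tokens = num_tools * tokens_per_tool * llm_calls
    return tokens, Decimal(tokens) * input_per_mtok / Decimal(1_000_000)


GPT_5_4_INPUT = Decimal("2.50")  # USD per 1M input tokens

# 8 tools re-sent on each of 12 LLM calls within one task.
per_task = definition_overhead_usd(8, 450, 12, GPT_5_4_INPUT)

# 7 extra schemas re-sent on each of 10,000 LLM calls per day.
extra_daily = definition_overhead_usd(7, 450, 10_000, GPT_5_4_INPUT)
```

Plugging in a cheaper model's input price shows how quickly the same schema overhead shrinks, which is one reason definition size matters most on expensive models.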

The Tool Cost Registry Pattern

Like the Provider Pricing Registry, tool call costs need a centralized registry — but tool cost models are more heterogeneous: some charge per call, some per second, some are free.

from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from typing import Dict, Optional, List

@dataclass
class ToolCostRecord:
    """A cost record for a single tool invocation.

    All monetary fields use Decimal for exact sub-cent precision.
    Token counts remain int (counts, not money).
    """
    tool_name: str
    call_duration_s: float
    llm_tokens_consumed: int          # L1 + L2 token overhead
    llm_cost_usd: Decimal             # dollar cost of those tokens
    external_api_cost_usd: Decimal    # L3: third-party API fee
    infrastructure_cost_usd: Decimal  # L3: compute/runtime cost
    total_cost_usd: Decimal = Decimal("0")
    metadata: Dict = field(default_factory=dict)
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())

    def __post_init__(self):
        self.total_cost_usd = self.llm_cost_usd + self.external_api_cost_usd + self.infrastructure_cost_usd


# Tool Cost Registry — pricing models for every tool in the agent's arsenal
# Key design: each tool declares its pricing model type (per_call, per_second, free)
# All monetary amounts use Decimal from string literals to avoid binary float
# precision loss on sub-cent amounts.
TOOL_COST_REGISTRY: Dict[str, Dict] = {
    # --- Third-party API tools ---
    "web_search_serpapi": {
        "model": "per_call",
        "cost_per_call": Decimal("0.01"),
        "unit": "api_call",
    },
    "web_search_bing": {
        "model": "per_call",
        "cost_per_call": Decimal("0.003"),
        "unit": "api_call",
    },
    "web_search_google_custom": {
        "model": "per_call",
        "cost_per_call": Decimal("0.005"),
        "unit": "api_call",
    },

    # --- Infrastructure tools ---
    "db_query_postgres": {
        "model": "per_call",
        "cost_per_call": Decimal("0.0001"),
        "unit": "query",
    },
    "code_execution_docker": {
        "model": "per_second",
        "cost_per_second": Decimal("0.000014"),  # ~$0.05/hour container runtime
        "unit": "second",
    },

    # --- Free tools (local operations) ---
    "file_read":     {"model": "free", "cost_per_call": Decimal("0"), "unit": "free"},
    "file_write":    {"model": "free", "cost_per_call": Decimal("0"), "unit": "free"},
    "bash_exec":     {"model": "free", "cost_per_call": Decimal("0"), "unit": "free"},
    "regex_search":  {"model": "free", "cost_per_call": Decimal("0"), "unit": "free"},
}

# Canonical tool definition sizes (tokens consumed per LLM call just for the schema)
TOOL_DEFINITION_TOKENS: Dict[str, int] = {
    "web_search_serpapi": 520,
    "web_search_bing": 480,
    "db_query_postgres": 350,
    "code_execution_docker": 620,
    "file_read": 290,
    "file_write": 320,
    "bash_exec": 450,
    "regex_search": 380,
    "__default__": 400,
}


def calculate_tool_cost(
    tool_name: str,
    call_duration_s: float = 0.0,
    num_llm_calls_in_task: int = 1,
    llm_input_price_per_mtok: Decimal = Decimal("0.75"),
) -> ToolCostRecord:
    """
    Calculate the full cost of a tool invocation.

    Three cost sources:
      1. LLM token overhead (this sketch prices L1 definition tokens only)
      2. External API cost (L3, from Tool Cost Registry)
      3. Infrastructure cost (L3, container runtime, DB connections, etc.)
    """
    pricing = TOOL_COST_REGISTRY.get(tool_name)
    if pricing is None:
        pricing = {"model": "per_call", "cost_per_call": Decimal("0")}

    # L3: External API / infrastructure cost
    if pricing["model"] == "per_call":
        external_cost = pricing.get("cost_per_call", Decimal("0"))
        infra_cost = Decimal("0")
    elif pricing["model"] == "per_second":
        external_cost = Decimal("0")
        infra_cost = pricing.get("cost_per_second", Decimal("0")) * Decimal(str(call_duration_s))
    else:
        external_cost = Decimal("0")
        infra_cost = Decimal("0")

    # L1: Tool definition overhead across all LLM calls in the task
    tool_def_tokens = TOOL_DEFINITION_TOKENS.get(
        tool_name, TOOL_DEFINITION_TOKENS["__default__"]
    )
    total_definition_tokens = tool_def_tokens * num_llm_calls_in_task
    llm_cost = (Decimal(total_definition_tokens) * llm_input_price_per_mtok) / Decimal("1_000_000")

    return ToolCostRecord(
        tool_name=tool_name,
        call_duration_s=call_duration_s,
        llm_tokens_consumed=total_definition_tokens,
        llm_cost_usd=llm_cost,
        external_api_cost_usd=external_cost,
        infrastructure_cost_usd=infra_cost,
    )


# Usage examples
cost_search = calculate_tool_cost("web_search_serpapi", call_duration_s=1.2, num_llm_calls_in_task=5)
print(f"Search tool total: ${float(cost_search.total_cost_usd):.6f}")
# Search tool total: $0.011950
#    L1: 520 tokens x 5 calls x $0.75/MTok (the default input price) = $0.001950
#    L3: $0.01 per call
# Notice: L1 is 16.3% of the tool's total cost ($0.00195 of $0.01195), which is not negligible!

This Registry reveals two important facts about tool call cost tracking:

High-frequency "free" tools can dominate costs. bash_exec is free (local execution), but 450 tokens × 20 LLM calls/task × 800 tasks/day = 7,200,000 tokens/day. On GPT-5.4 that's $18/day in pure tool definition overhead. Free tools are not zero-cost.
Search tools' L3 costs far exceed L1. SerpAPI at $0.01/call — if an agent searches 6 times per task, L3 cost is $0.06 while L1 might be only $0.002. The optimization direction is reducing unnecessary search rounds, not shrinking tool definitions.

For cross-process cost tracking of MCP protocol tools, see MCP Protocol Production Guide — MCP tool calls add extra serialization and network overhead that require separate cost consideration.

4. Retry and Wastage Costs — the Hidden Money Pit in Agent Systems

In production, agent retry waste typically accounts for 15–30% of total token costs. This is a commonly cited range across production agent systems (illustrative scenario — your actual ratio depends on agent design, retry policy, and model reliability). If you spend $10,000/month on agent API calls, $1,500–$3,000 is pure waste — tokens consumed and tools invoked that produced zero successful output.

This is the section's "wow moment": most teams have no idea how much they're burning on retries. Once you quantify it, the number will make you immediately start optimizing your retry strategy.

Four Categories of Retry Waste

Retry Type	Trigger	Avoidable?	Cost Per Instance (typical)	Frequency
Timeout retry	LLM API response timeout	Partially	$0.005–0.05	2–8% of calls
Rate limit retry	Provider concurrency limit exceeded	Yes	$0.001–0.02	1–5% of calls
Hallucinated tool call retry	LLM invokes wrong tool/parameters	Partially	$0.01–0.10	3–12% of tool calls
Multi-round failure retry	Task too complex for agent	Partially	$0.10–2.00	5–15% of tasks

Hallucinated tool calls are the most expensive per-call waste category: they consume LLM tokens to generate the tool_use block, incur real tool execution costs, and return worthless results. In an analysis of 50,000 agent tasks (illustrative scenario), we found 7.2% of tool calls were "hallucinatory": the right tool was invoked, but with semantically unreasonable parameters. These calls averaged $0.023 each, and the total waste accounted for 11.4% of all tool call costs (illustrative figures; actual ratios depend on agent design).

RetryCost Tracker Implementation

from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime, timezone


@dataclass
class RetryRecord:
    """A single retry attempt's cost footprint."""
    attempt_number: int
    reason: str          # timeout, rate_limit, wrong_tool, hallucination, model_error
    tokens_wasted: int
    cost_wasted_usd: float
    tool_calls_wasted: int
    tool_cost_wasted_usd: float
    was_recoverable: bool = False
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


@dataclass
class AgentRunCostTracker:
    """
    Tracks all costs for a single agent task execution, with special attention
    to retry waste. Every LLM call, tool invocation, and retry attempt is
    recorded so you can answer: "How much of this task's cost was wasted?"
    """
    task_id: str
    user_id: str
    task_type: str = "unknown"

    _total_token_cost: float = 0.0
    _total_tool_cost: float = 0.0
    _total_tokens: int = 0
    _retries: List[RetryRecord] = field(default_factory=list)
    _succeeded: bool = False
    _attempt_number: int = 0

    def start_attempt(self) -> int:
        self._attempt_number += 1
        return self._attempt_number

    def record_retry(self, reason: str, tokens: int, token_cost_usd: float,
                     tool_calls: int = 0, tool_cost_usd: float = 0.0) -> None:
        self._retries.append(RetryRecord(
            attempt_number=self._attempt_number,
            reason=reason, tokens_wasted=tokens,
            cost_wasted_usd=token_cost_usd,
            tool_calls_wasted=tool_calls,
            tool_cost_wasted_usd=tool_cost_usd,
        ))

    def mark_success(self) -> None:
        self._succeeded = True
        for r in self._retries:
            r.was_recoverable = True

    def add_cost(self, token_cost_usd: float, tool_cost_usd: float, tokens: int) -> None:
        """Record actual cost of every operation — successful or failed.

        Called for EVERY LLM call and tool invocation. Failed operations
        later get classified as retry waste via record_retry(), which
        tags already-recorded costs — it doesn't add new spend.
        """
        self._total_token_cost += token_cost_usd
        self._total_tool_cost += tool_cost_usd
        self._total_tokens += tokens

    @property
    def total_retry_waste_usd(self) -> float:
        """Classification of already-recorded costs that were wasted on retries.

        This is NOT additional spend — it's a tag/filter over costs that were
        already recorded via add_cost(). Retry waste is a subset of total cost,
        not an extra cost on top.
        """
        return sum(r.cost_wasted_usd + r.tool_cost_wasted_usd for r in self._retries)

    @property
    def total_cost_usd(self) -> float:
        """Actual LLM + tool spend, recorded exactly once per operation.

        total_retry_waste_usd is NOT added here — it's a classification of
        already-recorded costs. Summing them would double-count.
        """
        return self._total_token_cost + self._total_tool_cost

    @property
    def waste_ratio(self) -> float:
        """THE single most important metric on your cost dashboard.

        Ratio of retry-classified cost to total cost. The denominator is total
        cost (all spend including what was later tagged as waste). This means
        waste_ratio answers: 'what fraction of every dollar was burned on
        operations that had to be retried?'
        """
        total = self.total_cost_usd
        if total == 0:
            return 0.0
        return self.total_retry_waste_usd / total

    @property
    def retry_count(self) -> int:
        return len(self._retries)

    def summary(self) -> str:
        return (
            f"Task {self.task_id} ({self.task_type}) — "
            f"Status: {'✓' if self._succeeded else '✗'} | "
            f"Total: ${self.total_cost_usd:.6f} | "
            f"Waste: ${self.total_retry_waste_usd:.6f} "
            f"({self.waste_ratio:.1%}) | "
            f"Retries: {self.retry_count}"
        )
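
The sketch below shows one way the "record_retry() on each exception path" pattern could look inside a retry loop. MiniTracker is a trimmed stand-in with the same add_cost/record_retry shape as the tracker above, and CallFailed is a hypothetical exception that carries the spend of a failed attempt; neither comes from a real library.

```python
from typing import Callable, List, Tuple


class CallFailed(Exception):
    """Hypothetical error carrying what a failed attempt already cost."""
    def __init__(self, reason: str, tokens: int, cost_usd: float):
        super().__init__(reason)
        self.reason, self.tokens, self.cost_usd = reason, tokens, cost_usd


class MiniTracker:
    """Trimmed stand-in for the tracker above (assumption, for illustration)."""
    def __init__(self) -> None:
        self.total_usd = 0.0
        self.retries: List[Tuple[str, float]] = []

    def add_cost(self, token_cost_usd: float, tool_cost_usd: float, tokens: int) -> None:
        self.total_usd += token_cost_usd + tool_cost_usd

    def record_retry(self, reason: str, tokens: int, token_cost_usd: float) -> None:
        self.retries.append((reason, token_cost_usd))

    @property
    def waste_ratio(self) -> float:
        wasted = sum(cost for _, cost in self.retries)
        return wasted / self.total_usd if self.total_usd else 0.0


def run_with_retry_tracking(call: Callable[[], Tuple[str, int, float]],
                            tracker: MiniTracker, max_attempts: int = 3) -> str:
    """Run `call` up to max_attempts times; `call` returns (text, tokens, cost_usd)."""
    for attempt in range(1, max_attempts + 1):
        try:
            text, tokens, cost = call()
        except CallFailed as exc:
            # Record the spend once, then tag the same spend as retry waste.
            tracker.add_cost(exc.cost_usd, 0.0, exc.tokens)
            tracker.record_retry(exc.reason, exc.tokens, exc.cost_usd)
            if attempt == max_attempts:
                raise
            continue
        tracker.add_cost(cost, 0.0, tokens)
        return text
    raise ValueError("max_attempts must be >= 1")
```

The failed attempt is added to the total exactly once and then tagged, so waste stays a subset of total cost rather than an extra charge on top.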

The wow moment math: If your agent system handles 1,000 tasks/day, each averaging $0.25, with a 20% waste ratio, that's $50/day of pure waste, $1,500/month, $18,250/year. A system processing 10,000 tasks/day at this ratio wastes $182,500/year. This money won't appear as a line item on any provider's bill — only your own cost tracker can see it. Go calculate your waste_ratio right now.
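
The projection above is simple enough to keep as a helper so you can plug in your own numbers. This is a minimal sketch; project_waste is a hypothetical name, and the 30-day month and 365-day year match the figures in the paragraph.

```python
from decimal import Decimal


def project_waste(tasks_per_day: int, avg_cost_usd: str, waste_ratio: str) -> dict:
    """Project retry waste in USD from daily volume, average task cost, and waste ratio.

    Prices and ratios are passed as strings so Decimal parses them exactly.
    """
    daily = Decimal(tasks_per_day) * Decimal(avg_cost_usd) * Decimal(waste_ratio)
    return {"daily": daily, "monthly": daily * 30, "yearly": daily * 365}


small = project_waste(1_000, "0.25", "0.20")
large = project_waste(10_000, "0.25", "0.20")
```

Here small reproduces the $50/day, $1,500/month, $18,250/year figures, and large reproduces the $182,500/year figure.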

Waste Breakdown by Retry Reason

-- Retry waste breakdown by reason (last 7 days)
SELECT
    reason,
    COUNT(*) as retry_count,
    SUM(cost_wasted_usd + tool_cost_wasted_usd) as total_waste_usd,
    ROUND(AVG(cost_wasted_usd + tool_cost_wasted_usd), 6) as avg_waste_per_retry,
    ROUND(100.0 * SUM(cost_wasted_usd + tool_cost_wasted_usd) /
        (SELECT SUM(cost_wasted_usd + tool_cost_wasted_usd) FROM retry_records
         WHERE timestamp >= NOW() - INTERVAL '7 days'), 1) as pct_of_total_waste
FROM retry_records
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY reason
ORDER BY total_waste_usd DESC;

Abnormal retry costs can serve as a leading indicator of system health — if the retry rate suddenly spikes, it may signal an upstream service outage or model behavior regression. See Agent Rollback Design for how to automatically revert when cost anomalies are detected.
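
One simple way to turn retry rate into a leading indicator is to compare each interval against a rolling baseline. The sketch below is an assumption-laden starting point, not a tuned detector: RetryRateMonitor, the 2x spike factor, and the 1% baseline floor are all hypothetical choices.

```python
from collections import deque
from statistics import mean


class RetryRateMonitor:
    """Flags an interval whose retry rate is well above the recent average."""

    def __init__(self, window: int = 24, spike_factor: float = 2.0,
                 min_baseline: float = 0.01) -> None:
        self._history = deque(maxlen=window)   # recent per-interval retry rates
        self._spike_factor = spike_factor
        self._min_baseline = min_baseline      # floor so a near-zero baseline doesn't over-alert

    def observe(self, calls: int, retries: int) -> bool:
        """Record one interval; return True if it looks like a spike."""
        rate = retries / calls if calls else 0.0
        baseline = mean(self._history) if self._history else None
        self._history.append(rate)
        if baseline is None:
            return False                       # first interval: nothing to compare against
        return rate > max(baseline, self._min_baseline) * self._spike_factor
```

Note that spiking intervals are still appended to the history, so a sustained outage raises the baseline over time; whether that is desirable depends on how you want alerts to behave during long incidents.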

5. Per-Task and Per-User Cost Attribution

With per-step cost data in hand, the next question is: how do you attribute those costs to the correct task and user? This is especially critical for multi-tenant SaaS systems — you need to know Customer A spent $340 and Customer B spent $12, both for accurate billing and for assessing customer health.

Trace Context Propagation: The Foundation of Cost Attribution

The key to cost attribution is injecting user_id, tenant_id, and task_id at the request entry point and propagating them through trace context to all downstream spans. No matter how many layers of LLM calls and tool invocations your agent runs, every cost record carries its attribution at birth.

from contextlib import contextmanager
from opentelemetry import trace, baggage, context
from opentelemetry.trace import SpanKind
import uuid

tracer = trace.get_tracer("agent-cost-attribution")


class CostAttributionContext:
    """
    Propagates attribution metadata (user_id, tenant_id, task_id) across all
    spans in an agent task's trace tree.

    Usage: `with ctx.span("plan") as span: ...`
    """
    def __init__(self, user_id: str, tenant_id: str = "default"):
        self.user_id = user_id
        self.tenant_id = tenant_id
        # Full UUID: an 8-character prefix risks task_id collisions at high volume.
        self.task_id = uuid.uuid4().hex

    @contextmanager
    def span(self, name: str, **attrs):
        # Propagate via OpenTelemetry baggage so downstream services
        # (e.g., a separate tool-execution service) can extract them.
        # baggage.set_baggage() returns a new Context; chain the calls,
        # attach the result, and detach it when the span ends.
        ctx = baggage.set_baggage("app.agent_cost.user_id", self.user_id)
        ctx = baggage.set_baggage("app.agent_cost.tenant_id", self.tenant_id, context=ctx)
        ctx = baggage.set_baggage("app.agent_cost.task_id", self.task_id, context=ctx)
        token = context.attach(ctx)
        try:
            # start_as_current_span makes this span the parent of any span
            # started inside the `with` block.
            with tracer.start_as_current_span(name, kind=SpanKind.INTERNAL) as span:
                span.set_attribute("app.agent_cost.user_id", self.user_id)
                span.set_attribute("app.agent_cost.tenant_id", self.tenant_id)
                span.set_attribute("app.agent_cost.task_id", self.task_id)
                for k, v in attrs.items():
                    span.set_attribute(k, str(v))
                yield span
        finally:
            # Always detach, otherwise the baggage leaks into unrelated work
            # that later runs on the same thread or task.
            context.detach(token)
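
To illustrate the "attribution at birth" idea without an OpenTelemetry dependency, here is a stdlib-only sketch using contextvars. It is an analogy for how baggage rides along with the active context, not a replacement for the OTel code above; Attribution, attribution_scope, and make_cost_record are hypothetical names.

```python
import contextvars
from contextlib import contextmanager
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Attribution:
    user_id: str
    tenant_id: str
    task_id: str


# Holds the attribution for whatever request is currently executing.
_current = contextvars.ContextVar("attribution", default=None)


@contextmanager
def attribution_scope(attr: Attribution):
    """Set attribution at the request entry point; restore the previous value on exit."""
    token = _current.set(attr)
    try:
        yield attr
    finally:
        _current.reset(token)


def make_cost_record(category: str, cost_usd: float) -> dict:
    """Build a cost record that carries its attribution from the ambient context."""
    attr = _current.get()
    if attr is None:
        raise RuntimeError("cost recorded outside an attribution scope")
    return {**asdict(attr), "category": category, "cost_usd": cost_usd}


def call_tool_several_layers_deep() -> dict:
    # No user_id/tenant_id/task_id parameters are threaded through here.
    return make_cost_record("tool", 0.01)
```

Failing loudly when no scope is active is a deliberate choice in this sketch: an unattributed cost record is harder to repair later than a crash in testing.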

SQL Aggregation: From Raw Records to Per-User Bills

-- 1. Per-user cost breakdown (multi-tenant billing)
SELECT
    tenant_id, user_id,
    COUNT(DISTINCT task_id) as task_count,
    SUM(cost_usd) as total_cost_usd,
    SUM(CASE WHEN category = 'retry_waste' THEN cost_usd ELSE 0 END) as waste_cost_usd,
    ROUND(100.0 * SUM(CASE WHEN category = 'retry_waste' THEN cost_usd ELSE 0 END)
          / NULLIF(SUM(cost_usd), 0), 2) as waste_pct
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY tenant_id, user_id
ORDER BY total_cost_usd DESC;

-- 2. Cost per successful vs failed task
SELECT
    task_outcome,
    COUNT(DISTINCT task_id) as task_count,
    SUM(cost_usd) as total_cost_usd,
    ROUND(SUM(cost_usd) / COUNT(DISTINCT task_id), 6) as cost_per_task
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY task_outcome;

-- 3. Model cost efficiency (cost per successful task by model)
SELECT
    model,
    COUNT(DISTINCT CASE WHEN task_outcome = 'success' THEN task_id END) as successful_tasks,
    ROUND(SUM(CASE WHEN task_outcome = 'success' THEN cost_usd ELSE 0 END) /
          NULLIF(COUNT(DISTINCT CASE WHEN task_outcome = 'success' THEN task_id END), 0), 6)
        as cost_per_successful_task
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY model
ORDER BY cost_per_successful_task ASC;

The third query (cost_per_successful_task) is the core metric for measuring "was the spend worth it." Suppose GPT-5.5 costs $0.15 per successful task and DeepSeek-V4-Flash costs $0.008: GPT-5.5 is 18.75× more expensive. A success rate cannot be 18.75× higher, so the practical question is whether GPT-5.5's gain in success rate and output quality is worth an 18.75× premium for this task type. If not, route more traffic to DeepSeek. That's Section 6's cost-aware router.
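
For readers who want to sanity-check the third query's logic outside the database, the following sketch computes the same metric in plain Python. The records are made-up rows chosen to reproduce the $0.15 and $0.008 figures; cost_per_successful_task here is a hypothetical helper, not part of any library.

```python
from collections import defaultdict
from decimal import Decimal


def cost_per_successful_task(records):
    """records: iterable of (model, task_id, outcome, cost_usd_as_str).

    Mirrors the third query: only cost rows from successful tasks are summed,
    then divided by the number of distinct successful tasks per model.
    """
    cost = defaultdict(Decimal)   # Decimal() is 0
    tasks = defaultdict(set)
    for model, task_id, outcome, cost_usd in records:
        if outcome == "success":
            cost[model] += Decimal(cost_usd)
            tasks[model].add(task_id)
    return {model: cost[model] / len(tasks[model]) for model in tasks}


records = [
    ("gpt-5.5", "a1", "success", "0.10"),
    ("gpt-5.5", "a1", "success", "0.05"),   # second cost row for the same task
    ("gpt-5.5", "a2", "failed", "0.20"),    # excluded, as in the SQL
    ("deepseek-v4-flash", "b1", "success", "0.008"),
    ("deepseek-v4-flash", "b2", "success", "0.008"),
]
result = cost_per_successful_task(records)
premium = result["gpt-5.5"] / result["deepseek-v4-flash"]
```

Because failed-task spend is excluded, this metric alone does not penalize a model for failing often; pair it with the success rate from the second query before changing routing.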

Audit logs provide the foundational data source for cost attribution — see Agent Audit Log Design. For cost-to-quality ratio analysis, see Agent Evaluation Framework Design.

6. Budget Alerts and Cost Controls

Cost tracking exists to enable cost control. Without budget constraints, an agent system is like a credit card with no limit — a runaway loop can burn hundreds of dollars in minutes.

Two Budget Control Patterns: Hard Stop vs. Soft Warning

Pattern	Behavior	Best For	User Experience
Hard Stop	Immediately terminate agent execution when budget is exhausted, raise BudgetExceededError	Internal pipelines, batch processing, non-user-facing scenarios	Task fails, but no further money is burned
Soft Warning	Send alert at 80% budget but allow continuation; block new expensive operations at 100%	End-user-facing paid products, high-value tasks	User is notified and can choose to continue

Most production systems should use both simultaneously: task-level hard stops + user-level soft warnings.

BudgetController Implementation

from enum import Enum
from typing import Dict, Optional, Callable
import threading


class BudgetAction(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


class BudgetExceededError(Exception):
    """Raised when a hard budget limit is exceeded."""
    def __init__(self, user_id: str, limit_name: str, current: float, limit: float):
        self.user_id = user_id
        self.limit_name = limit_name
        self.current = current
        self.limit = limit
        super().__init__(
            f"Budget exceeded for {user_id}: {limit_name} "
            f"(${current:.4f} >= ${limit:.4f})"
        )


class BudgetController:
    """Enforces per-user and per-task budget limits across three scopes:
    per_task, daily, monthly. Each can be Hard Stop or Soft Warning.

    Key design decisions (production-hardened):
      - Pre-authorization: estimate cost BEFORE operation, reserve budget
      - >= (not >): prevents spending past the exact limit
      - Thread-safe: atomic bookkeeping via Lock (concurrent tasks)
      - Reconciliation: after operation, settle actual vs. estimated cost
    """

    def __init__(self, cost_store: "CostStore"):
        self._store = cost_store
        self._limits: Dict[str, Dict[str, float]] = {}
        self._actions: Dict[str, Dict[str, BudgetAction]] = {}
        self._warning_handler: Optional[Callable] = None
        self._lock = threading.Lock()
        # Estimated costs reserved but not yet settled
        self._reserved: Dict[str, Dict[str, float]] = {}  # user_id -> {scope: amount}

    def set_limit(self, user_id: str, scope: str, limit_usd: float,
                  action: BudgetAction = BudgetAction.BLOCK) -> None:
        self._limits.setdefault(user_id, {})[scope] = limit_usd
        self._actions.setdefault(user_id, {})[scope] = action

    def set_warning_handler(self, handler: Callable) -> None:
        self._warning_handler = handler

    def request_budget(
        self, user_id: str, task_id: str, estimated_cost: float,
    ) -> tuple[BudgetAction, str]:
        """Pre-authorize an operation by checking budgets and reserving cost.

        Callers should invoke this BEFORE starting a new operation. It combines
        limit checks with cost reservation under a single lock to prevent
        concurrent tasks from both passing the gate and overshooting together.

        After the operation completes, call reconcile() with the actual cost
        to settle the reservation.
        """
        return self.check(user_id, task_id, estimated_next_cost=estimated_cost)

    def check(self, user_id: str, task_id: str,
              estimated_next_cost: float = 0.0) -> tuple[BudgetAction, str]:
        """Check all applicable limits before executing an operation.

        Pre-authorization: pass estimated_next_cost to reserve budget BEFORE
        the operation runs. After completion, call reconcile() to settle.

        All limit checks use store_spend + existing_reservations +
        estimated_next_cost so that reservations from concurrent tasks
        are visible in every comparison.

        Atomicity guarantee: within a single lock, all outstanding reservations
        (_reserved) are included in every limit comparison. Two concurrent tasks
        in the same process cannot both pass the budget gate and overshoot
        together: the second check sees the first task's reserved amount and
        correctly blocks/limits. threading.Lock only covers one process, so
        multi-process or multi-host deployments need a shared atomic store
        (e.g., Redis INCRBYFLOAT or DB SELECT ... FOR UPDATE) for the same guarantee.
        """
        with self._lock:
            limits = self._limits.get(user_id, {})
            actions = self._actions.get(user_id, {})
            # Include outstanding reservations from prior check() calls
            # that haven't been reconciled yet
            reserved = self._reserved.get(user_id, {})

            # --- 1. Per-task limit (always hard stop) ---
            # Per-task reservations are keyed by task_id so concurrent tasks
            # from the same user do not count against each other's task limit.
            task_key = f"per_task:{task_id}"
            task_limit = limits.get("per_task", float("inf"))
            task_spend = self._store.get_task_cost(task_id)
            task_reserved = reserved.get(task_key, 0.0)
            task_projected = task_spend + task_reserved + estimated_next_cost

            if task_projected >= task_limit:
                raise BudgetExceededError(
                    user_id, "per_task",
                    task_projected, task_limit)

            # --- 2. Daily limit ---
            daily_limit = limits.get("daily", float("inf"))
            daily_action = actions.get("daily", BudgetAction.BLOCK)
            daily_spend = self._store.get_user_daily_cost(user_id)
            daily_reserved = reserved.get("daily", 0.0)
            daily_projected = daily_spend + daily_reserved + estimated_next_cost

            if daily_projected >= daily_limit and daily_action == BudgetAction.BLOCK:
                raise BudgetExceededError(
                    user_id, "daily",
                    daily_projected, daily_limit)

            # --- 3. Monthly limit ---
            monthly_limit = limits.get("monthly", float("inf"))
            monthly_action = actions.get("monthly", BudgetAction.BLOCK)
            monthly_spend = self._store.get_user_monthly_cost(user_id)
            monthly_reserved = reserved.get("monthly", 0.0)
            monthly_projected = monthly_spend + monthly_reserved + estimated_next_cost

            if monthly_projected >= monthly_limit and monthly_action == BudgetAction.BLOCK:
                raise BudgetExceededError(
                    user_id, "monthly",
                    monthly_projected, monthly_limit)

            # --- Determine action: BLOCK already raised → only ALLOW or WARN ---
            # All WARN conditions include estimated_next_cost in the projection.
            action = BudgetAction.ALLOW
            msg = "OK"

            # Per-task 80% (hard scope, always WARN)
            if task_projected >= task_limit * 0.8:
                action = BudgetAction.WARN
                msg = f"Task budget at {task_projected/task_limit:.0%}"

            # Daily: 100% soft cap overrides 80% warning
            if daily_projected >= daily_limit:
                action = BudgetAction.WARN
                msg = (f"Daily budget exceeded (soft): "
                       f"${daily_projected:.4f} >= ${daily_limit:.4f}")
            elif daily_projected >= daily_limit * 0.8:
                action = BudgetAction.WARN
                msg = f"Daily budget at {daily_projected/daily_limit:.0%}"

            # Monthly: 100% soft cap overrides 80% warning
            if monthly_projected >= monthly_limit:
                action = BudgetAction.WARN
                msg = (f"Monthly budget exceeded (soft): "
                       f"${monthly_projected:.4f}")
            elif monthly_projected >= monthly_limit * 0.8:
                action = BudgetAction.WARN
                msg = f"Monthly budget at {monthly_projected/monthly_limit:.0%}"

            # --- Pre-authorize: reserve for ALLOW and WARN ---
            # (BLOCK already raised; only ALLOW/WARN reach here)
            user_reserved = self._reserved.setdefault(user_id, {})
            for scope in (task_key, "daily", "monthly"):
                user_reserved[scope] = (
                    user_reserved.get(scope, 0.0) + estimated_next_cost)

            return (action, msg)

    def reconcile(self, user_id: str, task_id: str,
                  actual_cost: float, estimated_cost: float) -> None:
        """After operation completes, reconcile actual vs. estimated cost.

        Always call this after check() + operation, even on failure.
        This records the actual cost in the store and releases the
        estimate that check() reserved.
        """
        with self._lock:
            self._store.record_cost(task_id, user_id, actual_cost)
            # Release the reservation made in check(); actual spend now lives
            # in the store, so over- and under-estimates settle automatically.
            user_reserved = self._reserved.setdefault(user_id, {})
            task_key = f"per_task:{task_id}"
            for scope in (task_key, "daily", "monthly"):
                remaining = max(0.0, user_reserved.get(scope, 0.0) - estimated_cost)
                if scope == task_key and remaining == 0.0:
                    user_reserved.pop(scope, None)  # don't accumulate finished-task keys
                else:
                    user_reserved[scope] = remaining


class CostStore:
    """Abstract interface for querying current spend."""
    def get_task_cost(self, task_id: str) -> float: ...
    def get_user_daily_cost(self, user_id: str) -> float: ...
    def get_user_monthly_cost(self, user_id: str) -> float: ...
    def record_cost(self, task_id: str, user_id: str, cost: float) -> None: ...
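
For local testing, the CostStore interface can be backed by plain dictionaries. The sketch below is one possible in-memory implementation; InMemoryCostStore and its injectable clock parameter are assumptions for illustration, and a production store would more likely sit on Redis or a database.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Callable, Dict, Tuple


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


class InMemoryCostStore:
    """Dictionary-backed store with UTC day and month buckets (not thread-safe on its own)."""

    def __init__(self, clock: Callable[[], datetime] = _utc_now) -> None:
        self._clock = clock   # injectable so tests can control "now"
        self._by_task: Dict[str, float] = defaultdict(float)
        self._by_user_day: Dict[Tuple[str, str], float] = defaultdict(float)
        self._by_user_month: Dict[Tuple[str, str], float] = defaultdict(float)

    def record_cost(self, task_id: str, user_id: str, cost: float) -> None:
        now = self._clock()
        self._by_task[task_id] += cost
        self._by_user_day[(user_id, now.strftime("%Y-%m-%d"))] += cost
        self._by_user_month[(user_id, now.strftime("%Y-%m"))] += cost

    def get_task_cost(self, task_id: str) -> float:
        return self._by_task.get(task_id, 0.0)

    def get_user_daily_cost(self, user_id: str) -> float:
        key = (user_id, self._clock().strftime("%Y-%m-%d"))
        return self._by_user_day.get(key, 0.0)

    def get_user_monthly_cost(self, user_id: str) -> float:
        key = (user_id, self._clock().strftime("%Y-%m"))
        return self._by_user_month.get(key, 0.0)
```

Bucketing by UTC date means daily budgets reset at midnight UTC; if your billing day follows a tenant's local time zone, the bucket key would need to change accordingly.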

Cost-Aware Model Routing

A smarter approach is to switch automatically to cheaper models when budget is constrained. This is especially valuable for teams using a mix of DeepSeek and GPT/Claude:

from decimal import Decimal
from typing import Dict, Tuple, Optional


class BlockedDecision(Exception):
    """Raised when budget doesn't permit ANY model for this operation."""
    pass


class CostAwareRouter:
    """
    Routes agent tasks to the most appropriate model based on:
      1. Task complexity (low/medium/high)
      2. Remaining budget
      3. Estimated cost (all components: input, cached, output, retry + tool headroom)

    Critical safety rules:
      - If budget is insufficient for ANY model → raise BlockedDecision
        (do NOT silently route to cheapest — that still burns budget)
      - Estimate ALL cost components, not just input tokens
      - Unknown complexity → assume HIGH (safer to overestimate)
      - The budget gate (BudgetController.check) is PRIMARY;
        routing is a secondary decision after the gate passes
    """

    # Full cost model per provider: (input, output, cache_read) per 1M tokens
    # plus estimated output ratio (~20% of input) and tool headroom (~10%)
    # All prices use Decimal to avoid binary float precision loss.
    _PRICE_PER_1M: Dict[Tuple[str, str], Tuple[Decimal, Decimal, Decimal]] = {
        ("openai", "gpt-5.5"):      (Decimal("5.00"), Decimal("30.00"), Decimal("0.50")),
        ("openai", "gpt-5.4"):      (Decimal("2.50"), Decimal("15.00"), Decimal("0.25")),
        ("openai", "gpt-5.4-mini"): (Decimal("0.75"), Decimal("4.50"),  Decimal("0.075")),
        ("anthropic", "claude-sonnet-4-6"): (Decimal("3.00"), Decimal("15.00"), Decimal("0.30")),
        ("anthropic", "claude-haiku-4-5"):  (Decimal("1.00"), Decimal("5.00"),  Decimal("0.10")),
        ("deepseek", "deepseek-v4-flash"):  (Decimal("0.14"), Decimal("0.28"), Decimal("0.0028")),
        ("deepseek", "deepseek-v4-pro"):    (Decimal("0.435"), Decimal("0.87"), Decimal("0.003625")),
    }

    _OUTPUT_RATIO = Decimal("0.20")       # assume ~20% output/input token ratio
    _CACHE_RATIO = Decimal("0.30")        # assume ~30% of input hits cache (system prompt)
    _RETRY_HEADROOM = Decimal("0.15")     # 15% headroom for potential retries
    _TOOL_HEADROOM = Decimal("0.10")      # 10% headroom for tool call overhead

    def __init__(self, budget_controller):
        self.budget = budget_controller

    def estimate_full_cost(self, provider: str, model: str,
                           estimated_input_tokens: int) -> Decimal:
        """Estimate total cost including input, output, cache, retry, tool headroom."""
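        prices = self._PRICE_PER_1M.get((provider, model))
        if prices is None:
            # An unpriced model must not look free to the budget comparison.
            raise ValueError(f"No pricing configured for {provider}/{model}")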
        inp_p, out_p, cache_p = prices
        cached = int(estimated_input_tokens * self._CACHE_RATIO)
        uncached = estimated_input_tokens - cached
        output_tokens = int(estimated_input_tokens * self._OUTPUT_RATIO)

        input_cost = (Decimal(uncached) * inp_p + Decimal(cached) * cache_p) / Decimal("1_000_000")
        output_cost = Decimal(output_tokens) * out_p / Decimal("1_000_000")
        subtotal = input_cost + output_cost

        # Add safety headroom for retries and tool calls
        total = subtotal * (Decimal("1") + self._RETRY_HEADROOM + self._TOOL_HEADROOM)
        return total

    def select_model(self, user_id: str, task_complexity: str,
                     estimated_input_tokens: int,
                     daily_budget_remaining: Decimal) -> Tuple[str, str]:
        """
        Returns (provider, model) or raises BlockedDecision.

        Gate check: the caller MUST pass BudgetController.check() BEFORE
        calling this method. Routing is a secondary decision, never a bypass
        of the hard budget gate.
        """
        # Default unknown complexity to HIGH — safer to burn budget than
        # to silently underestimate and route a hard task to a weak model
        complexity = task_complexity if task_complexity in ("low", "medium", "high") else "high"

        # Define priority order per complexity level
        if complexity == "high":
            candidates = [
                ("openai", "gpt-5.5"),
                ("openai", "gpt-5.4"),
                ("deepseek", "deepseek-v4-pro"),
                ("deepseek", "deepseek-v4-flash"),
            ]
        elif complexity == "medium":
            candidates = [
                ("openai", "gpt-5.4-mini"),
                ("deepseek", "deepseek-v4-flash"),
            ]
        else:  # low
            candidates = [("deepseek", "deepseek-v4-flash")]

        # Walk candidates in priority order
        for provider, model in candidates:
            est = self.estimate_full_cost(provider, model, estimated_input_tokens)
            if est <= daily_budget_remaining:
                return (provider, model)

        # No model fits — block the operation
        cheapest = candidates[-1]
        cheapest_est = self.estimate_full_cost(
            cheapest[0], cheapest[1], estimated_input_tokens)
        raise BlockedDecision(
            f"Budget too low for {user_id}: "
            f"cheapest option ({cheapest[0]}/{cheapest[1]}) "
            f"estimated at ${float(cheapest_est):.6f} > ${float(daily_budget_remaining):.6f} remaining"
        )
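
To make the estimate concrete, here is the same formula as estimate_full_cost worked through for a 10,000-token input, using the GPT-5.5 and DeepSeek-V4-Flash prices from the table above. The standalone estimate function and the $0.05 remaining budget are assumptions for illustration.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)
OUTPUT_RATIO = Decimal("0.20")    # output tokens as a share of input tokens
CACHE_RATIO = Decimal("0.30")     # share of input tokens served from cache
HEADROOM = Decimal("1") + Decimal("0.15") + Decimal("0.10")   # retry + tool headroom


def estimate(input_price: str, output_price: str, cache_price: str,
             input_tokens: int) -> Decimal:
    """Prices are USD per 1M tokens, passed as strings for exact Decimal parsing."""
    cached = int(input_tokens * CACHE_RATIO)
    uncached = input_tokens - cached
    output_tokens = int(input_tokens * OUTPUT_RATIO)
    input_cost = (uncached * Decimal(input_price)
                  + cached * Decimal(cache_price)) / MILLION
    output_cost = output_tokens * Decimal(output_price) / MILLION
    return (input_cost + output_cost) * HEADROOM


# 10,000 input tokens -> 7,000 uncached, 3,000 cached, 2,000 output
gpt_55 = estimate("5.00", "30.00", "0.50", 10_000)
flash = estimate("0.14", "0.28", "0.0028", 10_000)
remaining_budget = Decimal("0.05")   # hypothetical daily budget left
```

The GPT-5.5 estimate comes to $0.120625 and the DeepSeek-V4-Flash estimate to $0.0019355, so with $0.05 remaining the candidate walk would skip GPT-5.5 and, for a high-complexity task, settle on the first cheaper candidate that fits.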

For deeper discussion on cost-aware routing, see Building Model-Agnostic AI Agents. For integrating cost checks into release gates, see Agent Release Gate Design.

7. Integration with Observability Platforms

The previous sections generated cost records — but they're still trapped in your Python process's memory or database tables. To make cost data truly observable, inject it into your observability infrastructure: OpenTelemetry traces (span-level cost context) and Prometheus metrics (real-time aggregation and alerting).

OpenTelemetry Span Attributes: Embedding Cost in Every Span

Attach cost-related span attributes to every LLM call and tool invocation:

from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode
import time

tracer = trace.get_tracer("agent-cost-instrumentation")

"""
Custom attribute schema for agent cost observability spans:

  Standard OTel GenAI semantic conventions:
    gen_ai.provider.name         — provider name (e.g., "openai", "anthropic")
    gen_ai.request.model         — model name (e.g., "gpt-5.4")
    gen_ai.operation.name        — operation type ("chat", "execute_tool")
    gen_ai.usage.input_tokens    — input token count
    gen_ai.usage.output_tokens   — output token count

  Application-namespace cost attributes (app.agent_cost.*):
    app.agent_cost.usd            — total cost in USD for this operation
    app.agent_cost.input_usd      — input cost
    app.agent_cost.output_usd     — output cost
    app.agent_cost.task_id        — task identifier
    app.agent_cost.user_id        — user identifier
    app.agent_cost.tool_name      — tool name (tool spans only)
    app.agent_cost.tool_usd       — tool L3 cost (tool spans only)

  Latency: span duration is the standard way to measure latency.
  The custom app.agent_cost.duration_ms attribute below is a convenience
  example; in production, rely on the span's own start and end timestamps,
  which the OTel SDK records automatically and exporters include.
"""


def instrumented_llm_call(
    provider: str, model: str,
    messages: list, tools: list | None = None,
    task_id: str = "", user_id: str = "",
) -> dict:
    """Wraps an LLM API call with cost-aware OpenTelemetry instrumentation."""
    with tracer.start_as_current_span(
        f"agent.llm.{provider}.{model}", kind=SpanKind.CLIENT,
    ) as span:
        # Standard GenAI semantic conventions
        span.set_attribute("gen_ai.provider.name", provider)
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.operation.name", "chat")
        if tools:
            span.set_attribute("gen_ai.tool.count", len(tools))

        # Application-namespace attributes (app.agent_cost.*)
        span.set_attribute("app.agent_cost.task_id", task_id)
        span.set_attribute("app.agent_cost.user_id", user_id)

        start = time.time()
        try:
            response = _call_llm_api(provider, model, messages, tools)
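            usage = response.get("usage", {})
            # Providers name these fields differently: OpenAI-style responses use
            # prompt_tokens/completion_tokens, Anthropic-style use input_tokens/output_tokens.
            input_tokens = usage.get("input_tokens", usage.get("prompt_tokens", 0))
            output_tokens = usage.get("output_tokens", usage.get("completion_tokens", 0))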

            cost = calculate_token_cost(provider, model, input_tokens, output_tokens)

            span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
            span.set_attribute("app.agent_cost.usd", float(cost.total_cost_usd))
            span.set_attribute("app.agent_cost.input_usd", float(cost.input_cost_usd))
            span.set_attribute("app.agent_cost.output_usd", float(cost.output_cost_usd))
            span.set_attribute("app.agent_cost.provider", provider)
            span.set_attribute("app.agent_cost.model", model)
            span.set_status(Status(StatusCode.OK))
            return response
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
        finally:
            # Custom millisecond latency — span duration is the standard way
            # to measure latency; this attribute is an example convenience.
            span.set_attribute("app.agent_cost.duration_ms",
                             (time.time() - start) * 1000)


def instrumented_tool_call(tool_name: str, tool_input: dict,
                           task_id: str = "", user_id: str = "") -> dict:
    """Wraps a tool call with cost-aware instrumentation."""
    with tracer.start_as_current_span(
        f"agent.tool.{tool_name}", kind=SpanKind.INTERNAL,
    ) as span:
        # Standard semantic convention ("execute_tool" is the GenAI well-known value)
        span.set_attribute("gen_ai.operation.name", "execute_tool")

        # Application-namespace attributes
        span.set_attribute("app.agent_cost.tool_name", tool_name)
        span.set_attribute("app.agent_cost.task_id", task_id)
        span.set_attribute("app.agent_cost.user_id", user_id)

        start = time.time()
        try:
            result = _execute_tool(tool_name, tool_input)
            tool_pricing = TOOL_COST_REGISTRY.get(tool_name, {})
            if tool_pricing.get("model") == "per_call":
                span.set_attribute("app.agent_cost.tool_usd",
                                 tool_pricing.get("cost_per_call", 0.0))
            span.set_status(Status(StatusCode.OK))
            return result
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
        finally:
            span.set_attribute("app.agent_cost.tool_duration_ms",
                             (time.time() - start) * 1000)


# Stub implementations — replace with your actual API clients and tool executors.
# These are shown here so the OTel snippet can be read as a self-contained
# block; actual implementations are defined earlier in the article.
def _call_llm_api(provider: str, model: str, messages: list, tools: list | None = None) -> dict:
    """Replace with actual provider API call."""
    return {"usage": {"prompt_tokens": 0, "completion_tokens": 0}}

def _execute_tool(tool_name: str, tool_input: dict) -> dict:
    """Replace with actual tool execution."""
    return {"result": "ok"}

Prometheus Metrics: Real-Time Cost Aggregation

OTel spans provide per-request detail; Prometheus metrics provide time-based aggregation:

from prometheus_client import Counter, Histogram, Gauge, CollectorRegistry

# NOTE: Prometheus labels must use BOUNDED dimensions only.
# user_id is intentionally excluded — per-user attribution belongs in
# traces (span attributes), logs, or a SQL cost ledger, not in Prometheus
# time-series cardinality. Including user_id would create a label set per
# user, exploding the cardinality and memory footprint.

cost_registry = CollectorRegistry()

agent_cost_counter = Counter(
    "agent_cost_usd_total", "Total agent cost in USD",
    labelnames=["provider", "model", "team", "tenant_id", "category"],
    registry=cost_registry,
)

task_cost_histogram = Histogram(
    "agent_task_cost_usd", "Cost per agent task in USD",
    labelnames=["task_type", "team", "outcome"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0],
    registry=cost_registry,
)

spend_rate_gauge = Gauge(
    "agent_spend_rate_usd_per_hour", "Current agent spending rate in USD/hour",
    labelnames=["team", "tenant_id"],
    registry=cost_registry,
)

retry_waste_counter = Counter(
    "agent_retry_waste_usd_total", "Total USD wasted on retries",
    labelnames=["team", "tenant_id", "reason"],
    registry=cost_registry,
)

waste_ratio_gauge = Gauge(
    "agent_waste_ratio", "Ratio of retry waste to total cost (0.0-1.0)",
    labelnames=["team", "tenant_id"],
    registry=cost_registry,
)
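
The cardinality warning in the comment above is easy to quantify: the worst-case number of time series for a metric is the product of the distinct values of each label. The label counts below are hypothetical, chosen only to show the scale of the difference.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical distinct-value counts for the agent_cost_usd_total labels
bounded = {"provider": 3, "model": 7, "team": 10, "tenant_id": 50, "category": 5}
with_user_id = {**bounded, "user_id": 20_000}

bounded_series = max_series(bounded)
unbounded_series = max_series(with_user_id)
```

This is an upper bound, since not every label combination occurs in practice, but it shows why a single per-user label can multiply the series count by the size of the user base.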

Dashboard SQL Queries

-- Panel 1: Cost trend (time series) — last 24h
SELECT date_trunc('hour', timestamp) as hour, SUM(cost_usd) as cost_per_hour
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '24 hours'
GROUP BY hour ORDER BY hour;

-- Panel 2: Cost breakdown by model (pie chart)
SELECT model, SUM(cost_usd) as total_spend, COUNT(*) as call_count
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '7 days'
  AND category IN ('token_input', 'token_output')
GROUP BY model ORDER BY total_spend DESC;

-- Panel 3: Top spenders (bar chart)
SELECT tenant_id, user_id, SUM(cost_usd) as total_cost,
       COUNT(DISTINCT task_id) as task_count
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '7 days'
GROUP BY tenant_id, user_id ORDER BY total_cost DESC LIMIT 20;

-- Panel 4: Waste ratio gauge (single value)
SELECT ROUND(100.0 * SUM(CASE WHEN category = 'retry_waste' THEN cost_usd ELSE 0 END)
       / NULLIF(SUM(cost_usd), 0), 2) as waste_pct
FROM cost_records WHERE timestamp >= NOW() - INTERVAL '24 hours';

-- Panel 5: Cost per successful task (by model)
SELECT model,
    COUNT(DISTINCT task_id) FILTER (WHERE task_outcome = 'success') as successful,
    ROUND(SUM(cost_usd) FILTER (WHERE task_outcome = 'success')
          / NULLIF(COUNT(DISTINCT task_id) FILTER (WHERE task_outcome = 'success'), 0), 6)
        as cost_per_success
FROM cost_records
WHERE timestamp >= NOW() - INTERVAL '30 days'
GROUP BY model ORDER BY cost_per_success ASC;
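
The queries above use PostgreSQL syntax (date_trunc, FILTER, INTERVAL). As a quick way to sanity-check the Panel 4 logic without a database server, the sketch below runs the same CASE/NULLIF expression against an in-memory SQLite table. The three-column schema and the sample rows are hypothetical, and the time-window filter is dropped for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema: only the columns Panel 4 needs.
conn.execute(
    "CREATE TABLE cost_records (task_id TEXT, category TEXT, cost_usd REAL)"
)

WASTE_PCT_SQL = """
SELECT ROUND(100.0 * SUM(CASE WHEN category = 'retry_waste' THEN cost_usd ELSE 0 END)
       / NULLIF(SUM(cost_usd), 0), 2) AS waste_pct
FROM cost_records
"""


def waste_pct(connection):
    """Return the waste percentage, or None when there is no spend to divide by."""
    return connection.execute(WASTE_PCT_SQL).fetchone()[0]


# With no rows the query yields NULL instead of a division error.
empty_result = waste_pct(conn)

conn.executemany(
    "INSERT INTO cost_records VALUES (?, ?, ?)",
    [
        ("t1", "token_input", 0.50),
        ("t1", "token_output", 0.25),
        ("t1", "retry_waste", 0.25),
    ],
)
filled_result = waste_pct(conn)

print(empty_result, filled_result)  # None 25.0
```

The NULLIF guard matters for dashboards: a panel over a quiet 24-hour window shows "no data" rather than erroring out.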

8. Complete Example: End-to-End Python Instrumented Agent Loop

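This section integrates the preceding concepts into a single runnable agent loop: token cost tracking (Section 2), tool call costs (Section 3), retry waste recording (Section 4), per-task and per-user attribution (Section 5), and budget control (Section 6). LLM calls and tool executions are simulated so the file runs with only the standard library; exporting the resulting numbers through the Section 7 Prometheus metrics is not wired into this listing.

"""
End-to-end instrumented agent loop with full cost observability.

Integrates Sections 2-6 into one runnable file; LLM and tool calls are simulated.
"""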
from __future__ import annotations
import uuid
from dataclasses import dataclass, field
from decimal import Decimal
from enum import Enum
from typing import List, Dict, Tuple

# === Provider Pricing Registry (Section 2) ===
# Prices as Decimal from string literals — binary float would lose precision
# on sub-cent amounts (e.g., DeepSeek cache token at $0.0000000028).
_ONE_MILLION = Decimal("1_000_000")
MODEL_PRICING = {
    ("openai", "gpt-5.4"):      {"input": Decimal("2.50"), "output": Decimal("15.00"), "cache_read": Decimal("0.25")},
    ("openai", "gpt-5.4-mini"): {"input": Decimal("0.75"), "output": Decimal("4.50"),  "cache_read": Decimal("0.075")},
    ("anthropic", "claude-sonnet-4-6"): {"input": Decimal("3.00"), "output": Decimal("15.00"), "cache_read": Decimal("0.30")},
    ("anthropic", "claude-haiku-4-5"):  {"input": Decimal("1.00"), "output": Decimal("5.00"),  "cache_read": Decimal("0.10")},
    ("deepseek", "deepseek-v4-flash"): {"input": Decimal("0.14"), "output": Decimal("0.28"), "cache_read": Decimal("0.0028")},
}

def calc_token_cost(provider: str, model: str, input_tokens: int,
                    output_tokens: int, cache_read_tokens: int = 0) -> Decimal:
    prices = MODEL_PRICING.get((provider, model), {})
    if not prices:
        return Decimal("0")
    cache_read = min(cache_read_tokens, input_tokens)
    uncached = max(0, input_tokens - cache_read)
    input_cost = (uncached * prices["input"] + cache_read * prices["cache_read"]) / _ONE_MILLION
    output_cost = output_tokens * prices["output"] / _ONE_MILLION
    return input_cost + output_cost  # aggregate unrounded, display rounds later

# === Tool Cost Registry (Section 3) ===
TOOL_COST_REGISTRY = {
    "web_search": {"cost_per_call": Decimal("0.01")},
    "db_query":   {"cost_per_call": Decimal("0.0001")},
    "code_exec":  {"cost_per_second": Decimal("0.000014")},
    "file_read":  {"cost_per_call": Decimal("0")},
}

def calc_tool_cost(tool_name: str, duration_s: float = 0.0) -> Decimal:
    pricing = TOOL_COST_REGISTRY.get(tool_name, {"cost_per_call": Decimal("0")})
    if "cost_per_call" in pricing:
        return pricing["cost_per_call"]
    elif "cost_per_second" in pricing:
        return pricing["cost_per_second"] * Decimal(str(duration_s))
    return Decimal("0")

# === Retry Waste Tracker (Section 4) ===
@dataclass
class RetryRecord:
    attempt: int
    reason: str
    tokens_wasted: int
    cost_wasted: Decimal
    tool_cost_wasted: Decimal = Decimal("0")
    recoverable: bool = False

@dataclass
class TaskCostTracker:
    task_id: str
    user_id: str
    tenant_id: str = "default"
    task_type: str = "unknown"
    _token_cost: Decimal = Decimal("0")
    _tool_cost: Decimal = Decimal("0")
    _retries: List[RetryRecord] = field(default_factory=list)
    _succeeded: bool = False

    def add_llm_cost(self, provider: str, model: str, input_tok: int,
                     output_tok: int, cache_tok: int = 0) -> Decimal:
        c = calc_token_cost(provider, model, input_tok, output_tok, cache_tok)
        self._token_cost += c
        return c

    def add_tool_cost(self, tool_name: str, duration_s: float = 0.0) -> Decimal:
        c = calc_tool_cost(tool_name, duration_s)
        self._tool_cost += c
        return c

    def record_retry(self, attempt: int, reason: str, tokens: int,
                     token_cost: Decimal, tool_cost: Decimal = Decimal("0")) -> None:
        self._retries.append(RetryRecord(
            attempt=attempt, reason=reason, tokens_wasted=tokens,
            cost_wasted=token_cost, tool_cost_wasted=tool_cost))

    def mark_success(self) -> None:
        self._succeeded = True
        for r in self._retries:
            r.recoverable = True

    @property
    def waste_usd(self) -> Decimal:
        """Classification of already-recorded costs that were wasted on retries."""
        return sum((r.cost_wasted + r.tool_cost_wasted for r in self._retries), Decimal("0"))

    @property
    def total_cost(self) -> Decimal:
        """Actual LLM + tool spend, recorded exactly once."""
        return self._token_cost + self._tool_cost

    @property
    def waste_ratio(self) -> float:
        total = self.total_cost
        if total == Decimal("0"):
            return 0.0
        return float(self.waste_usd / total)

    def summary(self) -> str:
        return (f"[{self.task_id}] {self.task_type} | "
                f"Total: ${float(self.total_cost):.6f} | "
                f"Waste: ${float(self.waste_usd):.6f} ({self.waste_ratio:.1%}) | "
                f"Retries: {len(self._retries)} | "
                f"Outcome: {'success' if self._succeeded else 'failed'}")

# === Budget Controller (Section 6 simplified) ===
# NOTE: This is a single-process illustrative BudgetController with
# pre-authorization (request_budget) and reconciliation (reconcile).
# The full production-hardened version (Section 6) adds threading.Lock
# and multi-scope (per_task/daily/monthly) reservations.
# For multi-process deployments, replace with Redis INCR/INCRBY or
# DB-level SELECT ... FOR UPDATE.

class BudgetAction(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"

class BudgetController:
    def __init__(self):
        self._daily: Dict[str, Decimal] = {}
        self._limits: Dict[str, Decimal] = {}
        self._reserved: Dict[str, Decimal] = {}

    def set_limit(self, user_id: str, daily_usd: float) -> None:
        self._limits[user_id] = Decimal(str(daily_usd))

    def request_budget(self, user_id: str, estimated_cost: Decimal) -> Tuple[BudgetAction, str]:
        """Pre-authorize: check limits including pending reservations, then reserve."""
        limit = self._limits.get(user_id, Decimal("inf"))
        spent = self._daily.get(user_id, Decimal("0"))
        reserved = self._reserved.get(user_id, Decimal("0"))
        projected = spent + reserved + estimated_cost

        if projected >= limit:
            return (BudgetAction.BLOCK,
                    f"Budget exceeded: ${float(projected):.4f} >= ${float(limit):.4f}")

        # Reserve the estimated cost
        self._reserved[user_id] = reserved + estimated_cost

        if projected >= limit * Decimal("0.8"):
            return (BudgetAction.WARN,
                    f"WARNING: {float(projected/limit):.0%} of daily budget used")
        return (BudgetAction.ALLOW, "OK")

    def reconcile(self, user_id: str, estimated_cost: Decimal, actual_cost: Decimal) -> None:
        """Settle reservation with actual cost. Call after task completes."""
        self._daily[user_id] = self._daily.get(user_id, Decimal("0")) + actual_cost
        reserved = self._reserved.get(user_id, Decimal("0"))
        self._reserved[user_id] = max(Decimal("0"), reserved - estimated_cost)

# === The Complete Instrumented Agent Loop (Section 8) ===
class InstrumentedAgent:
    def __init__(self):
        self.budget = BudgetController()

    def run_task(self, task_input: str, user_id: str,
                 provider: str = "openai", model: str = "gpt-5.4-mini",
                 max_turns: int = 5, max_retries: int = 3) -> TaskCostTracker:
        task_id = str(uuid.uuid4())[:8]
        tracker = TaskCostTracker(task_id=task_id, user_id=user_id, task_type="general")
        context = [{"role": "user", "content": task_input}]

        # Estimate worst-case cost for the entire task
        est_max_input = 20000   # worst-case total input tokens
        est_max_output = 4000   # worst-case total output tokens
        est_max_cost = calc_token_cost(provider, model, est_max_input, est_max_output)
        # Add worst-case tool call estimates (one search per two turns)
        est_max_cost += calc_tool_cost("web_search", duration_s=2.0) * Decimal(max_turns // 2)

        # Pre-authorize budget
        action, reason = self.budget.request_budget(user_id, est_max_cost)
        if action == BudgetAction.BLOCK:
            print(f"[BUDGET STOP] {reason}")
            return tracker
        if action == BudgetAction.WARN:
            print(f"[BUDGET WARN] {reason}")

        actual_cost = Decimal("0")
        turn = 0

        while turn < max_turns:
            turn += 1

            for attempt in range(1, max_retries + 1):
                try:
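                    # Simulated call: token counts are rough estimates (about 4 characters
                    # per token plus a fixed 2,000-token allowance), not real API usage.
                    input_tokens = len(str(context)) // 4 + 2000
                    output_tokens = 300 + (turn * 50)
                    cache_tokens = min(1500, input_tokens) if turn > 1 else 0
                    cost = tracker.add_llm_cost(provider, model, input_tokens, output_tokens, cache_tokens)
                    actual_cost += cost
                    # Inject a timeout on every attempt of turn 2 except the last,
                    # so the demo produces retry waste.
                    if attempt < max_retries and turn == 2:
                        raise TimeoutError("Simulated LLM timeout")
                    break
                except TimeoutError:
                    # Demo assumption: half the prompt is treated as billed but wasted.
                    # With real usage data, record the failed attempt's actual billed cost.
                    wasted_tokens = input_tokens // 2
                    wasted_cost = calc_token_cost(provider, model, wasted_tokens, 0)
                    tracker.record_retry(attempt, "timeout", wasted_tokens, wasted_cost)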
                    if attempt == max_retries:
                        self.budget.reconcile(user_id, est_max_cost, actual_cost)
                        return tracker

            if turn % 2 == 0:
                tool_cost = tracker.add_tool_cost("web_search", duration_s=1.5)
                actual_cost += tool_cost
                context.append({"role": "tool", "content": "search result..."})

            if turn >= 3:
                tracker.mark_success()
                break

        # Reconcile actual cost vs. estimated
        self.budget.reconcile(user_id, est_max_cost, actual_cost)
        print(tracker.summary())
        return tracker


# === Demo ===
if __name__ == "__main__":
    agent = InstrumentedAgent()
    agent.budget.set_limit("user-42", daily_usd=0.50)
    agent.budget.set_limit("user-7", daily_usd=5.00)

    tasks = [
        ("Review PR #342 for security issues", "user-42", "openai", "gpt-5.4-mini"),
        ("Generate unit tests for auth module", "user-7", "anthropic", "claude-haiku-4-5"),
        ("Analyze database query performance", "user-42", "deepseek", "deepseek-v4-flash"),
        ("Refactor payment service error handling", "user-7", "openai", "gpt-5.4"),
        ("Audit access control in middleware", "user-42", "deepseek", "deepseek-v4-flash"),
        ("Optimize Docker image build pipeline", "user-7", "anthropic", "claude-sonnet-4-6"),
        ("Fix race condition in WebSocket handler", "user-42", "openai", "gpt-5.4-mini"),
        ("Write migration for user preferences schema", "user-42", "deepseek", "deepseek-v4-flash"),
        ("Benchmark new caching layer with Redis", "user-7", "openai", "gpt-5.4-mini"),
        ("Document API rate limiting strategy", "user-42", "deepseek", "deepseek-v4-flash"),
    ]

    total_cost = Decimal("0")
    total_waste = Decimal("0")
    for task_input, user_id, provider, model in tasks:
        tracker = agent.run_task(task_input, user_id=user_id, provider=provider, model=model)
        total_cost += tracker.total_cost
        total_waste += tracker.waste_usd

    print(f"\n{'='*60}")
    print(f"TOTAL COST (10 tasks):    ${float(total_cost):.6f}")
    print(f"TOTAL WASTE:              ${float(total_waste):.6f}")
    ratio_text = f"{float(total_waste / total_cost):.1%}" if total_cost > 0 else "N/A"
    print(f"OVERALL WASTE RATIO:      {ratio_text}")
    print(f"{'='*60}")
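    # Illustrative figures only (not the output of this simulation):
    # TOTAL COST $0.123456 | TOTAL WASTE $0.024691 | WASTE RATIO 20.0%
    # Extrapolating that waste to 10,000 tasks/day gives about $24.69/day,
    # or roughly $9,012/year wasted on retries.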

This complete example demonstrates the core design principle: every LLM call and tool execution is wrapped in cost tracking, every cent has an owner, and every dollar of waste is quantified.

Frequently Asked Questions

1. My agent runs a few hundred times a day—what's the simplest way to start tracking costs?

Extract the usage field from every API response (input_tokens/output_tokens), multiply by the model's price per token, and write the result to structured logs. No new tools are required: wrap your existing LLM call code in a 10-line wrapper to see the approximate dollar cost of each call. Run this for a week to get an order-of-magnitude sense of your total spend, then progressively refine to per-task attribution and tool call cost tracking. The simplest starter is Section 2's calculate_token_cost() function; feed it (provider, model, input_tokens, output_tokens). Don't bother with multi-tenant attribution or budget controls yet. Those are L3/L4 concerns, and L2 is enough for your first "wow moment."
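
As a rough illustration of that starter wrapper, the sketch below prices one call from a usage dictionary and emits a JSON log line. The log_call_cost name, the logger name, and the input_tokens/output_tokens keys are assumptions (providers name these fields differently); the two prices are taken from the Section 8 pricing registry.

```python
import json
import logging
from decimal import Decimal

logger = logging.getLogger("agent.cost")  # hypothetical logger name

# USD per million tokens: (input, output). Values copied from the
# Section 8 registry; extend with whatever models you actually call.
PRICES_PER_MILLION = {
    ("openai", "gpt-5.4-mini"): (Decimal("0.75"), Decimal("4.50")),
    ("deepseek", "deepseek-v4-flash"): (Decimal("0.14"), Decimal("0.28")),
}


def log_call_cost(provider: str, model: str, usage: dict) -> dict:
    """Price one LLM call from its usage dict and write a structured log line."""
    input_price, output_price = PRICES_PER_MILLION[(provider, model)]
    cost = (usage["input_tokens"] * input_price
            + usage["output_tokens"] * output_price) / Decimal("1000000")
    record = {
        "provider": provider,
        "model": model,
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cost_usd": str(cost),  # string keeps the exact Decimal value in JSON
    }
    logger.info(json.dumps(record))
    return record


# Usage: call this right after each LLM response comes back.
record = log_call_cost("openai", "gpt-5.4-mini",
                       {"input_tokens": 2000, "output_tokens": 400})
```

Summing cost_usd over a week of these log lines gives the order-of-magnitude figure the answer describes; an unknown (provider, model) pair raises KeyError here, which is a deliberate choice so that unpriced models are noticed rather than silently logged as free.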

2. Should I track token costs and tool call costs separately?

Yes, and you must. Token costs come from LLM providers (OpenAI, Anthropic, DeepSeek—each with independent billing). Tool call costs can come from entirely different vendors (search API usage, database queries, container runtime). Separate tracking means when your bill spikes, you know immediately whether it's model usage or a third-party API that went haywire. If your SerpAPI monthly cost jumps from $200 to $2,000 but you've lumped it into "total cost," you could spend hours troubleshooting. Separate tracking also produces better optimization decisions: if tool calls account for 40% of costs, reduce search rounds; if tokens account for 80%, focus on prompt caching or switch to a cheaper model.
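
To make the "which side of the bill moved" question concrete, here is a small sketch that splits spend by kind (token vs. tool) and by vendor. The record tuples and vendor names are hypothetical sample data.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical cost records: (kind, vendor, cost_usd)
RECORDS = [
    ("token", "openai", Decimal("6.00")),
    ("token", "deepseek", Decimal("0.50")),
    ("tool", "search_api", Decimal("3.00")),
    ("tool", "container_runtime", Decimal("0.50")),
]


def share_by(records, key_index):
    """Return each key's fraction of total spend (0.0-1.0)."""
    totals = defaultdict(Decimal)  # Decimal() starts each bucket at 0
    for rec in records:
        totals[rec[key_index]] += rec[2]
    grand_total = sum(totals.values(), Decimal("0"))
    if grand_total == 0:
        return {}
    return {key: float(value / grand_total) for key, value in totals.items()}


by_kind = share_by(RECORDS, 0)
by_vendor = share_by(RECORDS, 1)
print(by_kind)  # {'token': 0.65, 'tool': 0.35}
```

With both views available, a spike in by_vendor["search_api"] points at the third-party API directly instead of being buried in a single total.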

3. Is tracking retry waste really worth the engineering effort?

Absolutely. The core argument is in Section 4: retry waste accounts for 15–30% of total agent cost in production (illustrative scenario; actual ratios depend on agent design and retry policies). If your team isn't tracking this, you're likely burning significant money without knowing it. A concrete case (illustrative case study; figures are representative, not sourced from a specific published study): a team set their agent timeout to 5 seconds, thinking "fail fast, retry fast is good." When they implemented retry waste tracking, they discovered that 22% of LLM calls triggered timeout retries (their LLM provider's response time regularly exceeded 5 seconds at peak), wasting $2,800/month. After raising the timeout from 5s to 15s, the retry rate dropped to 4% and monthly waste fell to $510, a roughly 5.5× reduction that saves about $2,290/month. Without the retry waste metric, they would never have known. Implementation cost is low: Section 4's AgentRunCostTracker is under 80 lines, and integrating it requires one line of tracker.record_retry() on each exception path.

4. DeepSeek vs. Claude/GPT—is the massive cost difference worth the quality tradeoff?

There's no universal answer, but there is a clear decision framework. For structured extraction, classification, code completion, and text summarization, DeepSeek-V4-Flash is ~36× cheaper than GPT-5.5 (by input token) with a small quality gap (hypothetical: on one reference eval set the accuracy difference was < 5%; this is illustrative, and production teams must benchmark against their own eval set before assuming any gap). For complex reasoning, multi-step tool use, and strict format adherence, Claude Sonnet or GPT-5.4 are more reliable: their tool call success rate is approximately 8–12 percentage points higher (illustrative range; actual depends on task type). Best practice is Section 6's cost-aware routing: route simple tasks to DeepSeek, complex tasks to Claude/GPT, and when budget is tight, degrade everything to DeepSeek. Continuously monitor using Section 5's cost-per-successful-task metric. For cross-model agent architecture, see Building Model-Agnostic AI Agents.

5. Hard stop or soft warning for budget control—which one and when?

Depends on context—the two aren't mutually exclusive, and most production systems should use both. Hard stops suit internal agent pipelines and batch processing: no "user experience" concern, and exceeding budget means the task has failed. Hard stops prevent the worst-case scenario—a runaway agent loop burning hundreds of dollars at 3 AM with no one watching. Soft warnings suit end-user-facing paid products: notify users at 80% budget that "AI usage is nearing the limit for this billing cycle" and let them decide. A hard stop creates terrible UX (an in-progress task abruptly interrupted). You can also blend: task-level hard stop + user-level soft warning—no single task can exceed $5, but users can exceed their $500 monthly budget. The critical point: run the budget check before every LLM call, not after the task as "post-hoc reconciliation."
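
The blended pattern (task-level hard stop plus user-level soft warning, checked before every call) can be sketched as follows. CallBudgetGuard and TaskBudgetExceeded are hypothetical names, the 80% warning threshold mirrors the example in the answer, and the guard is single-process only.

```python
from decimal import Decimal


class TaskBudgetExceeded(RuntimeError):
    """Hard stop: the next call would push this task past its cap."""


class CallBudgetGuard:
    def __init__(self, task_cap: Decimal, monthly_budget: Decimal,
                 monthly_spent: Decimal = Decimal("0"),
                 warn_fraction: Decimal = Decimal("0.8")) -> None:
        self.task_cap = task_cap
        self.monthly_budget = monthly_budget
        self.monthly_spent = monthly_spent
        self.warn_fraction = warn_fraction
        self.task_spent = Decimal("0")
        self.warnings: list[str] = []

    def check(self, estimated_call_cost: Decimal) -> None:
        """Run BEFORE each LLM call, using an estimate of that call's cost."""
        if self.task_spent + estimated_call_cost > self.task_cap:
            raise TaskBudgetExceeded(
                f"task would reach ${self.task_spent + estimated_call_cost} "
                f"(cap ${self.task_cap})")
        projected = self.monthly_spent + estimated_call_cost
        if projected >= self.monthly_budget * self.warn_fraction:
            # Soft warning: surface it to the user, but let the call proceed.
            self.warnings.append(
                f"monthly usage at ${projected} of ${self.monthly_budget}")

    def record(self, actual_call_cost: Decimal) -> None:
        """Run AFTER each call, with the amount actually billed."""
        self.task_spent += actual_call_cost
        self.monthly_spent += actual_call_cost


guard = CallBudgetGuard(task_cap=Decimal("5"), monthly_budget=Decimal("500"),
                        monthly_spent=Decimal("399"))
calls_made = 0
stop_reason = ""
try:
    for _ in range(5):
        guard.check(Decimal("2"))    # pre-call check
        # ... the real LLM call would go here ...
        guard.record(Decimal("2"))   # post-call settlement
        calls_made += 1
except TaskBudgetExceeded as exc:
    stop_reason = str(exc)

print(calls_made, len(guard.warnings))  # 2 2
```

In this run the third call is refused because it would take the task to $6 against a $5 cap, while the two completed calls each produced a soft warning because the user was already past 80% of the monthly budget.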

6. How do I measure "was the spend worth it"? How do I calculate cost-to-quality ratio?

The core approach is correlating cost with task outcomes. For every task, record cost_usd + outcome: success/failure + quality_score: 0–1 (if you have an evaluation framework). Key metrics: cost-per-successful-task (dollar cost excluding failed-task waste), waste-ratio (failed-task cost / total cost), cost-per-quality-point (total cost / total quality score). Example: Model A's cost-per-successful-task is $0.05 with a 94% success rate; Model B's is $0.12 with a 97% success rate. Model B costs 2.4× as much but improves the success rate by only 3 percentage points. In most scenarios this isn't worth it, unless the failures it avoids cause business losses far exceeding the $0.07 per-task premium (compliance violations, customer churn). As a rough break-even, the premium pays for itself only when an avoided failure is worth more than about $0.07 / 0.03 ≈ $2.33. For obtaining quality_score within an evaluation framework, see Agent Evaluation Framework Design.
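
The three metrics can be computed from per-task records in a few lines. The sketch below follows the definitions given in the answer; the TaskResult dataclass and the sample values are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class TaskResult:
    cost_usd: Decimal
    success: bool
    quality_score: float  # 0.0-1.0, from your evaluation framework


def cost_quality_metrics(results):
    """Compute the three cost-to-quality metrics; None where undefined."""
    total_cost = sum((r.cost_usd for r in results), Decimal("0"))
    successes = [r for r in results if r.success]
    success_cost = sum((r.cost_usd for r in successes), Decimal("0"))
    failed_cost = total_cost - success_cost
    total_quality = sum(r.quality_score for r in results)
    return {
        # Cost of successful tasks only, per successful task.
        "cost_per_successful_task":
            success_cost / len(successes) if successes else None,
        # Share of spend that went to tasks that ultimately failed.
        "waste_ratio": failed_cost / total_cost if total_cost > 0 else None,
        # Dollars spent per unit of quality delivered.
        "cost_per_quality_point":
            total_cost / Decimal(str(total_quality)) if total_quality > 0 else None,
    }


results = [
    TaskResult(Decimal("0.04"), True, 1.0),
    TaskResult(Decimal("0.06"), True, 0.5),
    TaskResult(Decimal("0.05"), True, 0.5),
    TaskResult(Decimal("0.05"), False, 0.0),
]
metrics = cost_quality_metrics(results)
```

For this sample the successful tasks average $0.05 each, a quarter of the spend went to the failed task, and each quality point cost $0.10. Comparing these numbers per model is what turns the Model A vs. Model B discussion into a routine query.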

7. Should I build cost tracking myself or use LangSmith / LangFuse / Helicone?

These aren't mutually exclusive. If your team already uses LangChain/LangSmith or a similar framework, leverage its automatic token counting to reach L2 (dollar pricing) quickly. But these platforms have two limitations: (1) tool call costs typically require manual reporting—your search API bill won't automatically appear in LangSmith; you'll need to inject it via the usage_metadata.total_cost field. (2) Cross-platform cost aggregation is difficult—if you use both OpenAI and DeepSeek, or some agent calls go through LangChain while others use native SDKs, costs scatter across multiple systems. The Provider Pricing Registry and CostRecord data model advocated in this article are vendor-neutral—you can run them independently of these platforms or import their data into a unified cost_records table. Decision criteria: if you only need L1–L2 tracking, off-the-shelf tools are the least effort; if you need L3 (multi-dimensional attribution) and L4 (cost control loop), building in-house or layering an abstraction on top of existing tools is more flexible.

8. Prompt caching, batch API, context compression—which should I prioritize for cost reduction?

Rank by ROI: (1) Prompt caching—immediate impact, zero code changes. Most agents have a fixed system prompt (role definition, tool list, output format instructions). Enabling prompt caching drops this portion of input token cost by 90% (OpenAI/Anthropic) to 98% (DeepSeek). If your system prompt is 5,000 tokens and you run 10,000 calls/day, that's $112/day saved on GPT-5.4. The only requirement: the system prompt must be at the beginning of the message list. (2) Context compression—medium effort, high return. Summarizing or truncating tool results before they enter context can reduce input token growth by 30–50%. See Agent Context Window Management. (3) Batch API—most overlooked. Any task that doesn't require real-time response (offline evaluation, batch analysis, overnight report generation) should use batch API for an instant 50% discount. Priority: enable prompt caching today, plan context compression next, and migrate offline tasks to batch API last.
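
The prompt caching figure can be reproduced with a short calculation. The sketch below uses the input and cache-read prices from the Section 8 registry and assumes, optimistically, that every call reads the full system prompt from cache and that no cache-write surcharge applies; daily_cache_savings is a hypothetical helper name.

```python
from decimal import Decimal

ONE_MILLION = Decimal("1000000")


def daily_cache_savings(prompt_tokens: int, calls_per_day: int,
                        input_price_per_m: Decimal,
                        cache_read_price_per_m: Decimal) -> Decimal:
    """USD saved per day when the fixed prompt is billed at the cache-read rate."""
    cached_tokens = Decimal(prompt_tokens) * Decimal(calls_per_day)
    price_gap = input_price_per_m - cache_read_price_per_m
    return cached_tokens * price_gap / ONE_MILLION


# 5,000-token system prompt, 10,000 calls/day.
gpt_savings = daily_cache_savings(5000, 10000, Decimal("2.50"), Decimal("0.25"))
deepseek_savings = daily_cache_savings(5000, 10000, Decimal("0.14"), Decimal("0.0028"))

# Discount as a fraction of the uncached input price.
gpt_discount = (Decimal("2.50") - Decimal("0.25")) / Decimal("2.50")        # 0.9
deepseek_discount = (Decimal("0.14") - Decimal("0.0028")) / Decimal("0.14")  # 0.98
```

Under these assumptions the GPT-5.4 saving comes to $112.50/day, matching the figure quoted above, while the same prompt on DeepSeek-V4-Flash saves about $6.86/day: the percentage discount is larger, but the absolute saving is small because the base price is already low.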

Next Steps / Further Reading

Agent Observability: Cost is the fourth dimension beyond the observability triad (metrics/traces/logs); understanding latency and error-rate observability is a prerequisite for cost observability
Agent Evaluation Framework Design: Cost justifiability requires correlation with task quality; the cost-per-quality-point metric bridges cost observability and evaluation systems
Agent Audit Log Design: Immutable audit logs are the foundational data source for cost attribution, providing cross-check capability; every cost record can be matched to an operation in the audit log
Agent Release Gate Design: Integrate cost budget checks into the release pipeline; block releases if the new version's cost-per-task exceeds the previous version's by more than 30%
Agent Rollback Design: Abnormal retry cost spikes can trigger automated rollback decisions; integrate cost spike detection into rollback trigger conditions
Agent Context Window Management: Context compression and prompt caching are the most direct and effective cost reduction measures; they work upstream of this article's price calculator by shrinking the token counts it prices
Agent Tool Design Best Practices: Tool definition token overhead is a hidden cost; optimizing tool schemas directly reduces fixed cost per LLM call
Building Model-Agnostic AI Agents: Multi-model routing is the foundation of cost-aware model selection; understanding cross-model agent architecture is necessary to implement cost-aware routing
MCP Protocol Production Guide: Cross-process costs from MCP tool calls require unified tracking and attribution
