Q: What are the minimum metrics to start with for agent observability?

Start with five metrics: agent_requests_total (Counter, for volume and error rate), agent_request_duration_seconds (Histogram, for latency P50/P95/P99), agent_tokens_consumed_total (Counter, for cost tracking), agent_tool_calls_total (Counter, for tool usage and failure rate), and agent_reasoning_steps (Histogram, for reasoning round count). These five cover 80% of production debugging scenarios. Add hallucination counters and cost-per-task gauges incrementally.

Q: OpenTelemetry vs LangSmith: which should I choose for agent observability?

They are complementary, not alternatives. Use OpenTelemetry as the instrumentation layer — it generates spans, metrics, and logs from your agent code using open standards. Use LangSmith (or LangFuse, or Datadog LLM) as the storage, visualization, and analysis layer. Instrument with OpenTelemetry first — OTel spans can be ingested by all major platforms, giving you permanent backend flexibility without re-instrumenting your code.

Agent Observability: Metrics, Tracing, and Real-Time Alerting for Production AI Agents

Q: How much does agent observability infrastructure cost?

Cost depends on scale and tooling: OSS self-hosted (Prometheus + Jaeger + Grafana + Loki) costs $200–800/month in compute for ~100 requests/minute. LangFuse self-hosted adds $50–200/month. LangFuse Cloud: free tier for 50K traces/month, $99–499/month for Pro/Team plans. LangSmith: $500–2,000/month at ~4.3M traces/month. Datadog LLM: $1,000–5,000/month. OSS is cheapest in dollars but most expensive in engineering time (2–4 weeks initial setup).
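The ~4.3M traces/month figure is consistent with the ~100 requests/minute workload used for the self-hosted estimate. The arithmetic below is a small sketch that assumes one trace per request and a 30-day month; neither assumption is stated in the pricing above.

```python
# Assumptions (not from the pricing text): one trace per request, 30-day month.
REQUESTS_PER_MINUTE = 100


def traces_per_month(requests_per_minute: int, days: int = 30) -> int:
    """Monthly trace volume for a steady request rate."""
    minutes_per_day = 60 * 24
    return requests_per_minute * minutes_per_day * days


monthly = traces_per_month(REQUESTS_PER_MINUTE)
print(f"{monthly:,} traces/month")
```

If your agent emits more than one trace per request, or traffic is bursty, scale the estimate accordingly before comparing plans.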

Q: How do I prevent alert fatigue with agent monitoring?

Four strategies: (1) Use the 'for' duration aggressively — minimum 5 minutes for P1, 30 minutes for P3 trend alerts. (2) Use composite conditions — add volume guards so low-traffic noise doesn't trigger alerts. (3) Route by severity — P1 to PagerDuty, P2 to Slack, P3 as dashboard annotation only. (4) Weekly alert review — delete alerts that have never fired or always fire falsely. An alert that fires 20 times a week with no action is worse than no alert.
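Strategies (1) to (3) can be sketched in plain Python to show the logic an alert rule encodes. This is an illustration only: the `Sample` type, the 5% threshold, the 10 requests/minute volume guard, and the `ROUTES` names are all hypothetical, and a real deployment would express the same idea in Prometheus rule files and Alertmanager routes.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ts: float           # evaluation time in seconds
    error_rate: float   # fraction of failed requests, 0.0 to 1.0
    req_per_min: float  # traffic volume at this evaluation


# Hypothetical severity routing, mirroring strategy (3).
ROUTES = {"P1": "pagerduty", "P2": "slack", "P3": "dashboard"}


def should_fire(
    samples: list[Sample],
    threshold: float = 0.05,
    min_req_per_min: float = 10.0,
    for_seconds: float = 300.0,
) -> bool:
    """Fire only if the condition held continuously for `for_seconds`.

    `samples` must be ordered by time. Any non-breaching sample resets the
    pending timer, which is how a 'for' duration filters out short blips.
    """
    pending_since = None
    for s in samples:
        # Volume guard: ignore error spikes when there is almost no traffic.
        breaching = s.error_rate > threshold and s.req_per_min >= min_req_per_min
        if breaching:
            if pending_since is None:
                pending_since = s.ts
        else:
            pending_since = None
    if pending_since is None:
        return False
    return samples[-1].ts - pending_since >= for_seconds


steady = [Sample(ts=t, error_rate=0.2, req_per_min=50) for t in range(0, 301, 60)]
quiet = [Sample(ts=t, error_rate=0.5, req_per_min=2) for t in range(0, 301, 60)]
print(should_fire(steady), should_fire(quiet))
```

The steady breach fires after five minutes, while the low-traffic window never does even though its error rate is higher.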

Q: I already have Prometheus + Grafana — what changes for agents?

Your existing infrastructure is 70% of the solution. Add agent-specific metrics to your /metrics endpoint using a custom CollectorRegistry. Build two new Grafana dashboards: Agent Overview (L1+L2 signals) and Agent Deep-Dive (L3+L4 signals). Add the agent-alerts.yml rule file to your Prometheus config — it uses your existing Alertmanager. Add agent service endpoints as new scrape targets. Optionally add distributed tracing (Jaeger/Tempo) if you don't have it. Total effort: approximately 2 weeks.

May 24, 2026 · Intermediate-Advanced Part 7 of 7

⚡ 30-Second Takeaway

Traditional microservice monitoring (request count, latency, error rate) is completely insufficient for AI agents — agents make non-deterministic decisions, fail silently (wrong answers with HTTP 200), and require a fundamentally different signal taxonomy
Agent observability needs four layers of signals: Standard → LLM-native → Agent-native → Business. Miss any layer and you'll discover critical blind spots during your next incident
OpenTelemetry + Prometheus is free to start in licensing terms (both are open source; you still pay for compute): use OTel spans to trace LLM calls and tool chains, expose Prometheus Counters/Gauges/Histograms, and build agent-specific dashboards in Grafana

1. Why Agent Observability Is Different From Traditional Service Monitoring

The Production Incident at 3:14 AM

3:14 AM. Your phone didn't ring. No PagerDuty, no Slack alert, no notification of any kind. When you opened Grafana the next morning, everything looked fine — QPS was steady, P99 latency within budget, error rate at 0%. It wasn't until the support team forwarded seven user complaints that you discovered the truth: for three consecutive hours overnight, your agent had been responding to every single refund request with "unable to process."

This wasn't a "failure" in the traditional sense. The API didn't return 500. The database didn't throw a connection timeout. No pod OOM-killed. Every layer of infrastructure was healthy — the LLM had simply made the wrong tool selection on a specific prompt path, and you had zero visibility into it.

This scenario exposes a fundamental problem: the three pillars of traditional service monitoring — Metrics, Tracing, and Alerting — need to be redefined for agent workloads.

Traditional Microservice Monitoring vs. Agent Observability

In traditional microservice architectures, a set of RED metrics (Rate, Errors, Duration) covers roughly 90% of monitoring needs: request rate tells you about traffic volume, error rate tells you whether the service is healthy, and latency tells you about user experience. This framework works because microservice execution paths are deterministic — the same HTTP request goes through the same code branches and produces the same result.

Agents break that determinism assumption, and with it the usefulness of each of the three RED signals on its own:

Dimension	Traditional Microservice	AI Agent
Execution Path	Deterministic code branches (if/else → fixed path)	LLM reasoning → dynamic tool selection (non-deterministic)
Failure Mode	Explicit failure (500 error, exception, timeout)	Silent failure (wrong answer with HTTP 200, hallucination, wrong tool choice)
Latency Composition	Network IO + database queries + computation	LLM inference time + multiple tool call durations + multi-step reasoning chains
Cost Model	Fixed (CPU/memory/bandwidth)	Variable (tokens consumed × model pricing × reasoning steps)
Alert Conditions	error_rate > 1% → alert	error_rate > 1% AND hallucination_rate > 5% AND tool_call_failure > 3% — requires composite signals
Debugging Starting Point	Check error logs → locate code line	Check trace → expand LLM decision chain → determine "why wrong tool was chosen"

The critical difference lies in rows 3 and 4 (Latency Composition and Cost Model). In traditional services, latency comes from the network and database, both of which are predictable and optimizable. In agents, LLM inference time can account for 80% of total latency, and the same prompt can differ by 3× in inference time between two calls (depending on model load, token count, and output length). Similarly, traditional service costs are fixed infrastructure expenses, while agent costs are variable: each request consumes a different number of tokens, directly affecting your bill.

This is why understanding how AI agents fundamentally work is a prerequisite for observability — if you don't know what each step of the agent's reasoning loop does, you can't know what to monitor.

The Three Pillars: Metrics (What Happened) · Tracing (How It Happened) · Alerting (When to Act)

Agent observability rests on three pillars, each answering a different question:

1. Metrics — answers "What." What is the current task completion rate in production? What is the P99 latency? Is hourly token consumption within normal range? Is the tool call success rate trending downward? Metrics provide aggregated situational awareness. Without metrics, you're waiting for user complaints.

2. Tracing — answers "How." When metrics show a P99 latency spike, you need Tracing to tell you "which step is slow." The internal chain of a single agent request might look like: LLM reasoning (800ms) → tool call A (200ms) → LLM reasoning (1200ms) → tool call B (4500ms ← this is the bottleneck) → LLM reasoning (600ms). Without Tracing, you'd only know the total latency was 7300ms — you wouldn't know that 4500ms was spent on tool call B. As we'll explore in detail in the next section, span hierarchy is the core of agent tracing.

3. Alerting — answers "When." Not every metric fluctuation deserves waking up on-call. Alerting is especially challenging in agent scenarios: traditional alert logic is straightforward (error_rate > 1%), but agent alerting requires composite signals — error_rate alone being high might be a transient LLM hiccup, but error_rate being high AND token_cost spiking AND task_completion_rate dropping means something is genuinely wrong.
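The composite-signal idea from the Alerting pillar can be sketched as a small classifier. Only the 1% error-rate threshold comes from the text; the `WindowStats` type, the cost ratio of 1.5, the 90% completion floor, and the `page`/`watch`/`ok` labels are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    error_rate: float            # fraction of requests that errored
    token_cost_ratio: float      # current token cost / baseline for the same window
    task_completion_rate: float  # fraction of tasks completed


def classify(
    stats: WindowStats,
    max_error_rate: float = 0.01,
    max_cost_ratio: float = 1.5,
    min_completion: float = 0.90,
) -> str:
    """Return 'page' only when all three signals agree something is wrong."""
    breaches = [
        stats.error_rate > max_error_rate,
        stats.token_cost_ratio > max_cost_ratio,
        stats.task_completion_rate < min_completion,
    ]
    count = sum(breaches)
    if count == 3:
        return "page"   # genuine incident: wake someone up
    if count >= 1:
        return "watch"  # possibly a transient hiccup: annotate, do not page
    return "ok"


print(classify(WindowStats(0.03, 1.0, 0.97)))  # errors alone
print(classify(WindowStats(0.03, 2.1, 0.70)))  # all three signals
```

Requiring agreement between signals trades a little detection latency for far fewer false pages; whether "all three" or "any two" is the right bar depends on your traffic.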

The relationship between these three pillars is: Metrics are the dashboard, Tracing is the microscope, Alerting is the sentry. Without Metrics, you don't know something happened. Without Tracing, you don't know why it happened. Without Alerting, nobody knows it happened even when it did.
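To make the "microscope" concrete, the request chain from the Tracing pillar can be reduced to a list of per-step durations. The sketch below uses the durations from that example; the step labels and the `find_bottleneck` helper are hypothetical names, and in practice these numbers would come from span durations in your tracing backend.

```python
# Per-step durations (ms) from the tracing example above; labels are illustrative.
steps = [
    ("llm.reasoning#1", 800),
    ("tool.A", 200),
    ("llm.reasoning#2", 1200),
    ("tool.B", 4500),
    ("llm.reasoning#3", 600),
]


def find_bottleneck(steps: list[tuple[str, int]]) -> tuple[str, int, int, float]:
    """Return (step name, step ms, total ms, share of total) for the slowest step."""
    total = sum(ms for _, ms in steps)
    name, ms = max(steps, key=lambda step: step[1])
    return name, ms, total, ms / total


name, ms, total, share = find_bottleneck(steps)
print(f"{name}: {ms}ms of {total}ms ({share:.0%})")
```

A metric alone would report only the 7300ms total; the per-step breakdown is what points at tool call B.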

Why You Can't Just Reuse Your Existing Monitoring Stack

If you already have a mature monitoring stack — Prometheus + Grafana + ELK — why can't you just point it at your agent? Three structural reasons:

1. Missing LLM-dimensional metrics. Your existing Prometheus metrics probably cover HTTP request counts, database connection pools, CPU/memory utilization — but none of them tell you how many tokens your agent consumes per hour, the token efficiency per task (tokens consumed to complete a task), or the model latency vs. tool latency ratio. These metrics are essential for cost optimization and performance tuning, but they simply don't exist in traditional monitoring stacks.

2. Missing agent-specific failure modes. A hallucination is not an exception. A tool call failure is not an HTTP 500. An agent choosing an operation it shouldn't (e.g., deleting data it wasn't supposed to) — from the API response code perspective it's a 200 success, but from a business outcome perspective it's a disaster. Traditional monitoring's definition of "error" cannot cover these scenarios.

3. Trace semantics don't map. Traditional distributed tracing has span hierarchies like: HTTP → Service → DB Query. Agent span hierarchies look like: request → LLM reasoning → tool selection → tool execution → LLM reasoning (loop). If you flatten every reasoning round into a peer span instead of a nested parent-child relationship, your trace visualization becomes a tangled mess — you can't distinguish "round 1 reasoning invoked tool A" from "round 3 reasoning invoked tool B."

Here is the full architecture of agent observability — from application code to final alerting:

┌─────────────────────────────────────────────────────────────────┐
│                     Agent Observability Architecture               │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────┐    ┌──────────────────┐    ┌─────────────────┐ │
│  │  Agent Code   │───▶│ OpenTelemetry SDK │───▶│ OTLP Collector  │ │
│  │ (Python/TS)  │    │ (Span + Metrics)  │    │ (gRPC/HTTP)     │ │
│  └──────┬───────┘    └──────────────────┘    └───────┬─────────┘ │
│         │                                            │            │
│         │  ┌──────────────────┐          ┌───────────┴─────────┐ │
│         └─▶│ Prometheus       │          │                      │ │
│            │ Metrics Registry │          ▼                      ▼ │
│            └────────┬─────────┘   ┌───────────┐   ┌────────────┐ │
│                     │             │   Jaeger   │   │  Grafana    │ │
│                     ▼             │  / Tempo   │   │ (Dashboards)│ │
│            ┌───────────────┐      └───────────┘   └──────┬─────┘ │
│            │ /metrics      │                              │       │
│            │ (Prometheus)  │                     ┌────────▼─────┐ │
│            └───────────────┘                     │  Alertmanager │ │
│                                                   └──────┬───────┘ │
│                                                          │         │
│                                                  ┌───────▼───────┐ │
│                                                  │ Slack/PagerDuty│ │
│                                                  │ /Feishu        │ │
│                                                  └───────────────┘ │
└──────────────────────────────────────────────────────────────────┘

In the following three sections, we'll build this stack layer by layer. Section 2 defines "what to monitor" (signal taxonomy), Section 3 implements "how to trace" (OpenTelemetry), and Section 4 builds "how to expose and visualize" (Prometheus + Grafana). Later sections (Part 2) will cover alerting rules, trace-metric-log correlation, and incremental adoption paths.

2. The Agent Observability Signal Taxonomy: What to Monitor

Defining "what to monitor" is the most important step in any observability initiative — if you don't know what to look for, the best tooling in the world won't help you. Agent signals can't just be "add a few more Prometheus counters." They need systematic classification across four layers. Each layer answers different operational questions and serves different audiences.

The Four-Layer Signal Taxonomy

We classify agent observability signals into four layers, progressing from infrastructure fundamentals to top-level business value:

Layer	Focus	Typical Audience	Example Metrics
L1 · Standard Signals	"Is the agent service alive?"	SRE / On-call engineers	request_count, latency_p50/p95/p99, error_count
L2 · LLM-Native Signals	"How much does the LLM cost? How fast is it?"	Platform engineers / Cost owners	tokens_consumed, tokens_per_step, model_latency_ms
L3 · Agent-Native Signals	"How good are the agent's decisions?"	Agent developers / AI engineers	tool_call_success_rate, reasoning_step_count, hallucination_detected, recovery_attempt_count
L4 · Business Signals	"Is the agent delivering value?"	Product managers / Business stakeholders	task_completion_rate, user_feedback_score, cost_per_task

The four layers have a progressive dependency: L1 being healthy does not mean L3 is healthy (an agent can produce wrong answers at high QPS with zero HTTP errors), and L3 being healthy does not mean L4 is healthy (an agent can complete tasks correctly but deliver a poor user experience). Only when all four layers show green can you truly say "the agent is running correctly in production."

Let's revisit the 3:14 AM incident through this lens: L1 metrics were all normal (QPS steady, latency fine, error_count = 0). At L3, tool_call_success_rate was also normal — tool calls were succeeding — but the tool selection was wrong: the LLM was choosing the "return unable to process" termination path instead of the "process refund" tool. This means you need more than just "did the tool call succeed?" — you need tool selection distribution, a concept that simply doesn't exist in traditional monitoring.
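A tool selection distribution check might look like the following sketch. It compares each tool's share of calls in a baseline window against the current window and flags tools that were common before but have nearly vanished. The `lookup_order` tool name, the 5% minimum baseline share, and the 0.2 drop factor are hypothetical; tune them against your own traffic.

```python
from collections import Counter


def selection_shares(calls: list[str]) -> dict[str, float]:
    """Share of calls per tool name within one window."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


def shifted_tools(
    baseline: list[str],
    current: list[str],
    min_baseline_share: float = 0.05,
    drop_factor: float = 0.2,
) -> list[str]:
    """Tools that were common in the baseline but nearly vanished now."""
    base = selection_shares(baseline)
    cur = selection_shares(current)
    return sorted(
        tool
        for tool, share in base.items()
        if share >= min_baseline_share and cur.get(tool, 0.0) < share * drop_factor
    )


baseline = (["process_refund"] * 30 + ["lookup_order"] * 60
            + ["return_unable_to_process"] * 10)
current = ["lookup_order"] * 55 + ["return_unable_to_process"] * 45
print(shifted_tools(baseline, current))
```

In this window the refund tool disappears entirely while every call still "succeeds", which is exactly the pattern an error counter cannot see.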

This also explains why the design of agent tools is an upstream dependency for observability — if you haven't defined clear tool categories during design (read vs. write vs. high-risk operations), you can't distinguish "the refund tool was called" from "the refund tool was called correctly" in monitoring.

Python Signal Definitions

Below we define the complete four-layer signal taxonomy using Python dataclasses and Enums. This definition serves both as a precise specification of the signal categories and as the foundation for building your metrics registry in production code:

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


# ── L1: Standard Signals ──────────────────────────────────────────────

@dataclass
class StandardSignals:
    """Standard RED metrics every HTTP service needs. Agent services are no exception."""
    request_count: int = 0
    request_count_by_status: dict[str, int] = field(default_factory=dict)
    latency_p50_ms: float = 0.0
    latency_p95_ms: float = 0.0
    latency_p99_ms: float = 0.0
    error_count: int = 0


# ── L2: LLM-Native Signals ────────────────────────────────────────────

@dataclass
class LLMSignals:
    """LLM-specific signals — these dimensions don't exist in traditional microservices."""
    tokens_consumed: int = 0
    tokens_per_step: list[float] = field(default_factory=list)
    model_latency_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    tokens_per_dollar: float = 0.0


# ── L3: Agent-Native Signals ──────────────────────────────────────────

class AgentStepOutcome(str, Enum):
    """Outcome of a single agent reasoning step — finer-grained than L1 success/failure."""
    COMPLETED = "completed"
    TOOL_CALL_FAILED = "tool_call_failed"
    HALLUCINATION_DETECTED = "hallucination"
    RECOVERY_ATTEMPTED = "recovery"
    MAX_STEPS_EXCEEDED = "max_steps"
    SAFETY_BLOCKED = "safety_blocked"

@dataclass
class AgentSignals:
    """Agent-specific signals — measuring decision quality and reliability."""
    tool_call_success_rate: float = 0.0
    tool_call_count_per_run: list[int] = field(default_factory=list)
    reasoning_step_count: list[int] = field(default_factory=list)
    hallucination_detected: int = 0
    recovery_attempt_count: int = 0
    step_outcome_distribution: dict[str, int] = field(default_factory=dict)
    tool_selection_distribution: dict[str, int] = field(default_factory=dict)
    approval_trigger_count: int = 0
    approval_timeout_count: int = 0


# ── L4: Business Signals ──────────────────────────────────────────────

@dataclass
class BusinessSignals:
    """Business-level signals — answering 'is the agent delivering value?'"""
    task_completion_rate: float = 0.0
    user_feedback_score: float = 0.0
    user_feedback_count: int = 0
    cost_per_task: float = 0.0
    cost_per_1k_requests: float = 0.0
    abandonment_rate: float = 0.0


# ── Complete SignalBundle ──────────────────────────────────────────────

@dataclass
class AgentSignalBundle:
    """Aggregates all four signal layers into a single structure — the data source
    for your metrics registry."""
    standard: StandardSignals = field(default_factory=StandardSignals)
    llm: LLMSignals = field(default_factory=LLMSignals)
    agent: AgentSignals = field(default_factory=AgentSignals)
    business: BusinessSignals = field(default_factory=BusinessSignals)

    def snapshot(self) -> dict:
        """Return a flattened dict of all signals, suitable for /metrics endpoint."""
        return {
            "agent_requests_total": self.standard.request_count,
            "agent_errors_total": self.standard.error_count,
            "agent_latency_p99_ms": self.standard.latency_p99_ms,
            "agent_tokens_consumed_total": self.llm.tokens_consumed,
            "agent_model_latency_ms": self.llm.model_latency_ms,
            "agent_tool_call_success_rate": self.agent.tool_call_success_rate,
            "agent_hallucination_total": self.agent.hallucination_detected,
            "agent_recovery_attempts_total": self.agent.recovery_attempt_count,
            "agent_task_completion_rate": self.business.task_completion_rate,
            "agent_user_feedback_score": self.business.user_feedback_score,
            "agent_cost_per_task_usd": self.business.cost_per_task,
        }

Several design decisions in this signal definition deserve attention:

1. Use dataclasses, not dicts. Typed field definitions give you IDE autocompletion and static type checking (with a checker such as mypy), which helps when you later integrate with the Prometheus client. Note that dataclasses do not validate values at runtime; add explicit checks if you need them. In production, you can use these dataclasses as internal state containers and export them to Prometheus Counters/Gauges/Histograms via the snapshot() method.

2. AgentStepOutcome is a critical enum. It adds agent-specific semantics on top of L1's binary success/failure — a tool call failure and a hallucination detection are two completely different failure modes requiring different response strategies. A tool call failure might be a network issue (just retry), while a hallucination might indicate a prompt design problem (roll back the prompt version). If you collapse both into "error," you lose the ability to do fine-grained alerting.

3. tool_selection_distribution is your silent-failure detector. Returning to the 3:14 AM incident — if you had been monitoring tool selection distribution, you would have immediately noticed that "process_refund" tool calls dropped to zero while "return_unable_to_process" termination calls spiked. This signal would have exposed the problem earlier than any error counter. This directly corresponds to the online evaluation metrics in the Agent Evaluation Framework — the evaluation framework tells you about answer quality, while the signal taxonomy tells you about behavioral patterns.

4. cost_per_task bridges engineering and business. Token consumption is a purely technical metric that engineers care about. Product managers care about "how much did each customer support conversation cost?" The conversion is cost_per_task = (tokens consumed × model price per token) ÷ task count, which turns a technical metric into business language. This is the key to getting observability data out of the engineering silo.
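The cost_per_task conversion from point 4 can be worked through in a few lines. This sketch assumes separate input and output token prices quoted per million tokens; the prices used (2.50 and 10.00 USD) are placeholders, not any provider's actual rates.

```python
def cost_per_task(
    input_tokens: int,
    output_tokens: int,
    tasks: int,
    usd_per_1m_input: float,
    usd_per_1m_output: float,
) -> float:
    """Average USD cost per completed task over a reporting window."""
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    total_usd = (
        input_tokens / 1_000_000 * usd_per_1m_input
        + output_tokens / 1_000_000 * usd_per_1m_output
    )
    return total_usd / tasks


# Placeholder prices and volumes, for illustration only.
daily_cost = cost_per_task(
    input_tokens=12_000_000,
    output_tokens=3_000_000,
    tasks=5_000,
    usd_per_1m_input=2.50,
    usd_per_1m_output=10.00,
)
print(f"${daily_cost:.4f} per task")
```

Splitting input and output tokens matters because providers commonly price them differently, which is why LLMSignals tracks both fields.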

In the next section, we'll see how these signals get embedded into OpenTelemetry's span system, making the leap from "defining signals" to "collecting signals."

3. Instrumenting Agent Traces with OpenTelemetry

If Metrics are the needles on your dashboard, Tracing is the high-speed camera under the hood — it can replay, frame by frame, the complete behavioral chain of an agent during a single request. For agents — with their multi-step reasoning and multi-tool invocations — Tracing is practically the only way to understand "what the agent actually did internally."

Span Hierarchy Design

Agent OpenTelemetry span hierarchies follow a natural nesting structure — each user request is a Root Span, containing multiple LLM call spans, each of which may contain multiple tool call spans. The hierarchy looks like this:

root_span (agent.request)
├── llm_call_span (agent.llm.reasoning)     ← Round 1 reasoning
│   ├── tool_call_span (agent.tool.execute) ← Tool call A
│   └── tool_call_span (agent.tool.execute) ← Tool call B (optional: multiple tools per round)
├── llm_call_span (agent.llm.reasoning)     ← Round 2 reasoning
│   └── tool_call_span (agent.tool.execute) ← Tool call C
└── llm_call_span (agent.llm.reasoning)     ← Final reasoning (produces the final reply)

Several key design points in this hierarchy:

1. Each LLM reasoning round is an independent span. A typical agent request may go through 3–8 reasoning rounds (ReAct / tool-use loop). Each round should be its own span — this is what lets you see "round 2 was slow" rather than just "the whole request was slow" in your tracing UI. Additionally, reasoning_step_count (the L3 signal defined in Section 2) can be calculated directly from the span count.

2. Tool call spans are children of LLM spans. This correctly expresses causality — a tool call is the result of LLM reasoning; without the LLM's tool_choice decision, there is no tool call. Nesting tool_call_span as a child of llm_call_span means that when you discover a slow tool call, you can immediately trace upward: "Which reasoning round triggered this tool call?"

3. The Root Span represents one complete user request. Its duration is the end-to-end latency — the source of your P50/P95/P99 latency metrics. The Root Span's status also indicates the overall request outcome (OK or ERROR).
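The two payoffs of this hierarchy, counting reasoning rounds and tracing a tool call back to the round that triggered it, can be sketched over exported span records. The `SpanRecord` type and the span IDs are hypothetical stand-ins for whatever your tracing backend returns; `step` corresponds to a per-round step number recorded on each LLM span.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    span_id: str
    parent_id: Optional[str]
    name: str
    step: Optional[int] = None  # reasoning round, set on LLM spans only


def reasoning_step_count(spans: list[SpanRecord]) -> int:
    """Number of reasoning rounds = number of LLM reasoning spans."""
    return sum(1 for s in spans if s.name == "agent.llm.reasoning")


def triggering_round(spans: list[SpanRecord], tool_span_id: str) -> Optional[int]:
    """Walk one level up from a tool span to the round that invoked it."""
    by_id = {s.span_id: s for s in spans}
    parent = by_id.get(by_id[tool_span_id].parent_id)
    if parent is not None and parent.name == "agent.llm.reasoning":
        return parent.step
    return None


# Mirrors the hierarchy above: three rounds, tools A and B in round 1, C in round 2.
trace_spans = [
    SpanRecord("root", None, "agent.request"),
    SpanRecord("llm-1", "root", "agent.llm.reasoning", step=1),
    SpanRecord("tool-a", "llm-1", "agent.tool.execute"),
    SpanRecord("tool-b", "llm-1", "agent.tool.execute"),
    SpanRecord("llm-2", "root", "agent.llm.reasoning", step=2),
    SpanRecord("tool-c", "llm-2", "agent.tool.execute"),
    SpanRecord("llm-3", "root", "agent.llm.reasoning", step=3),
]
print(reasoning_step_count(trace_spans), triggering_round(trace_spans, "tool-c"))
```

With a flattened hierarchy, where every span is a direct child of the root, `triggering_round` would have nothing to walk up to.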

Span Attributes

OpenTelemetry spans carry context through key-value attributes. For agent scenarios, here is the critical attribute design:

Attribute	Applies To	Type	Description
`agent.request.id`	root	string	Request ID (UUID v7; the example code below uses `uuid.uuid4()` as a stand-in)
`agent.agent.id`	root	string	Agent instance identifier
`agent.model.name`	llm_call	string	Model name (e.g., gpt-4o, claude-sonnet-4)
`agent.model.temperature`	llm_call	float	Temperature parameter
`agent.llm.tokens.input`	llm_call	int	Input token count
`agent.llm.tokens.output`	llm_call	int	Output token count
`agent.llm.duration_ms`	llm_call	int	LLM inference duration (ms)
`agent.tool.name`	tool_call	string	Tool name
`agent.tool.duration_ms`	tool_call	int	Tool execution duration (ms)
`agent.tool.status`	tool_call	string	success / failure / timeout
`agent.reasoning.step`	llm_call	int	Current reasoning round (1-indexed)
`agent.run.total_steps`	root	int	Total reasoning steps for this request

The design principle behind these attributes: precise enough to locate problems, but not excessive — don't put full LLM prompts and tool parameters into span attributes (those belong in audit logs — see Agent Audit Log Design). Span attributes are for "search and filter"; audit logs are for "full replay." The workflow is: filter traces in your tracing UI by agent.tool.name=refund_processor → get the trace_id → expand the full decision chain and parameters in your audit log.
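The filter-then-replay workflow can be sketched with two in-memory stores. Everything here is illustrative: the span and audit record shapes, the trace IDs, and the `find_trace_ids` and `replay` helpers are hypothetical, standing in for a tracing backend query and an audit log lookup.

```python
# Lightweight span attributes: small, searchable key-value pairs only.
spans = [
    {"trace_id": "t-001", "attributes": {"agent.tool.name": "refund_processor",
                                         "agent.tool.status": "success"}},
    {"trace_id": "t-002", "attributes": {"agent.tool.name": "search_knowledge_base",
                                         "agent.tool.status": "success"}},
    {"trace_id": "t-003", "attributes": {"agent.tool.name": "refund_processor",
                                         "agent.tool.status": "timeout"}},
]

# Full payloads live in the audit log, keyed by trace_id.
audit_log = {
    "t-001": {"tool_params": {"order_id": "A-17", "amount": 42.0}},
    "t-003": {"tool_params": {"order_id": "B-09", "amount": 310.0}},
}


def find_trace_ids(spans: list[dict], key: str, value: str) -> list[str]:
    """Step 1: search spans by attribute, as a tracing UI filter would."""
    return sorted({s["trace_id"] for s in spans if s["attributes"].get(key) == value})


def replay(trace_ids: list[str], audit_log: dict) -> list[dict]:
    """Step 2: pull the full records for the matching traces."""
    return [audit_log[t] for t in trace_ids if t in audit_log]


matches = find_trace_ids(spans, "agent.tool.name", "refund_processor")
print(matches, replay(matches, audit_log))
```

The trace_id is the join key between the two systems, so the audit pipeline has to record it on every entry for this lookup to work.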

Complete Code Implementation

Below is a runnable Python agent tracing implementation using opentelemetry-api and opentelemetry-sdk, exporting spans via OTLP to Jaeger or Grafana Tempo:

"""
Agent Observability — OpenTelemetry Distributed Tracing Complete Example
Dependencies: pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

Start local Jaeger (dev environment):
    docker run -d --name jaeger \
      -p 16686:16686 \
      -p 4318:4318 \
      jaegertracing/all-in-one:latest
Then visit http://localhost:16686 to view traces.
"""

import time
import uuid
from contextlib import contextmanager
from typing import Any, Optional

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import SpanKind, Status, StatusCode


# ── 1. Initialize TracerProvider ───────────────────────────────────────

resource = Resource.create({
    SERVICE_NAME: "agent-service-prod",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)

# OTLP HTTP exporter → local Jaeger (port 4318)
# Replace with your OTel Collector address for production
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces",
)

provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


# ── 2. OpenTelemetry Context Manager ────────────────────────────────────

@contextmanager
def traced_span(
    name: str,
    kind: SpanKind = SpanKind.INTERNAL,
    attributes: Optional[dict[str, Any]] = None,
    parent: Optional[trace.Span] = None,
):
    """Create a traced span as a context manager.

    Usage:
        with traced_span("agent.llm.reasoning",
                         attributes={"agent.model.name": "gpt-4o"}) as span:
            span.set_attribute("agent.llm.tokens.input", 1520)
    """
    ctx = trace.set_span_in_context(parent) if parent else None
    span = tracer.start_span(name, kind=kind, attributes=attributes, context=ctx)
    try:
        yield span
    except Exception as e:
        span.set_status(Status(StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise
    finally:
        span.end()


# ── 3. Simulated Agent Service ─────────────────────────────────────────

def simulate_llm_call(model: str, step: int, input_tokens: int) -> tuple[int, float]:
    """Simulate an LLM inference call. Returns (output_tokens, latency_ms)."""
    output_tokens = int(input_tokens * 0.6)
    latency_ms = input_tokens * 0.3 + 200
    time.sleep(latency_ms / 1000 * 0.01)  # Accelerated simulation (remove in real code)
    return output_tokens, latency_ms


def simulate_tool_call(tool_name: str) -> tuple[float, bool]:
    """Simulate a tool call. Returns (latency_ms, success)."""
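    # Note: str hashes are randomized per process (PYTHONHASHSEED), so the
    # simulated latency differs between runs but is stable within one run.
    latency_ms = 50 + hash(tool_name) % 200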
    time.sleep(latency_ms / 1000 * 0.01)
    success = True
    return latency_ms, success


def agent_run(request_id: str, user_query: str) -> str:
    """A complete agent request — simulates a ReAct loop."""
    total_steps = 3
    tools_to_call = ["search_knowledge_base", "calculate", "format_response"]

    # ── Root Span: represents the entire user request ──
    with traced_span(
        "agent.request",
        kind=SpanKind.SERVER,
        attributes={
            "agent.request.id": request_id,
            "agent.agent.id": "prod-agent-03",
        },
    ) as root_span:

        final_reply: str = ""

        for step in range(1, total_steps + 1):
            model_name = "gpt-4o"
            input_tokens = 800 + step * 200

            # ── LLM Span: Round N reasoning ──
            with traced_span(
                "agent.llm.reasoning",
                kind=SpanKind.INTERNAL,
                attributes={
                    "agent.model.name": model_name,
                    "agent.model.temperature": 0.3,
                    "agent.reasoning.step": step,
                },
                parent=root_span,
            ) as llm_span:

                output_tokens, llm_latency = simulate_llm_call(
                    model_name, step, input_tokens
                )

                llm_span.set_attributes({
                    "agent.llm.tokens.input": input_tokens,
                    "agent.llm.tokens.output": output_tokens,
                    "agent.llm.duration_ms": int(llm_latency),
                })

                # ── Tool Call Span: tool execution ──
                if step <= len(tools_to_call):
                    tool_name = tools_to_call[step - 1]
                    with traced_span(
                        "agent.tool.execute",
                        kind=SpanKind.CLIENT,
                        attributes={"agent.tool.name": tool_name},
                        parent=llm_span,
                    ) as tool_span:

                        tool_latency, tool_success = simulate_tool_call(tool_name)
                        tool_span.set_attributes({
                            "agent.tool.duration_ms": int(tool_latency),
                            "agent.tool.status": (
                                "success" if tool_success else "failure"
                            ),
                        })
                        if not tool_success:
                            tool_span.set_status(
                                Status(StatusCode.ERROR, f"{tool_name} failed")
                            )

                final_reply = (
                    "Task completed" if step == total_steps else "Continuing reasoning..."
                )

        root_span.set_attribute("agent.run.total_steps", total_steps)
        root_span.set_status(Status(StatusCode.OK))

    return final_reply


# ── 4. Demo Entry Point ────────────────────────────────────────────────

if __name__ == "__main__":
    for i in range(3):
        req_id = str(uuid.uuid4())
        print(f"[Request {req_id[:8]}...] Starting...")
        reply = agent_run(req_id, f"User query #{i+1}: Analyze recent sales data")
        print(f"[Request {req_id[:8]}...] Done → {reply}\n")

    provider.shutdown()
    print("✅ All spans sent to OTLP Collector (Jaeger: http://localhost:16686)")

Code Walkthrough

The core of this implementation is the three nested traced_span context managers — they precisely model the agent's span hierarchy:

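1. The traced_span context manager. It encapsulates span creation, exception handling, and lifecycle management. The span is ended by span.end() in the finally block when the with block exits, whether or not an exception was raised, so you never manage span lifetimes by hand. Because tracer.start_span does not make the new span the current one, the parent is passed explicitly through the parent argument. In production, you can extract this context manager into a standalone decorator or middleware.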

2. Attribute setting timing. Notice that LLM span attributes (token count, latency) are set after the LLM call returns, because this data only exists in the API response. In a tracing backend that supports numeric attribute queries (for example, Grafana Tempo's TraceQL), you can then filter by agent.llm.tokens.input > 2000 to find "large prompt" requests.

3. BatchSpanProcessor. Spans aren't sent in real time — they accumulate in memory and flush periodically or when a batch threshold is reached. This reduces network overhead, but also means if your process crashes before the batch flushes, recent spans are lost. For agent scenarios, consider a shorter batch interval (e.g., 1 second) to minimize data loss risk.

4. Coordination with audit logs. Note that this code only records span attributes (lightweight key-value pairs), not full LLM prompts and tool parameters. OpenTelemetry SDKs let you cap the number of attributes per span and the length of attribute values, and many backends and exporters impose their own limits on attribute and event sizes, so large payloads may be truncated or rejected. Complete data (prompts, tool parameters, tool return values, approval records) should be persisted separately through an audit log pipeline, as discussed in the Agent Audit Log Design article. Spans are for search and location; audit logs are for full replay. The two are complementary, not substitutes.

5. OTLP Exporter configuration. The code above uses the HTTP exporter (port 4318), which is the simplest to run locally. In production, a common setup is the OTLP gRPC exporter (port 4317) pointed at your OTel Collector or directly at Grafana Tempo / Jaeger, although OTLP over HTTP is equally valid where gRPC is awkward to route. Key parameters include endpoint, headers (auth tokens), and TLS configuration.

With Tracing in place, you have your "microscope." But a microscope needs to work alongside a "dashboard" — the next section builds the agent metrics pipeline, exposing the four-layer signals defined in Section 2 to Prometheus and Grafana.

4. Building a Metrics Pipeline with Prometheus

Tracing gives you a microscope — you can see the internal chain of every request. But a microscope can't tell you "how is the agent doing overall, across the last 5 minutes?" You need Metrics — aggregated, time-series, queryable numerical data. This section builds the agent metrics pipeline: from in-code metrics registries, through a Prometheus /metrics endpoint, to Grafana visualization panels.

Architecture Overview

The agent metrics pipeline flows in one direction, through distinct layers:

Agent Code (Python)
    │
    │  Counter.inc() / Gauge.set() / Histogram.observe()
    ▼
Prometheus Client Registry (in-memory)
    │
    │  HTTP GET /metrics (Prometheus text format)
    ▼
Prometheus Server (scrapes every 15s)
    │
    │  Time-series DB (TSDB) + PromQL queries
    ▼
Grafana (visualization + alerting panels)
    │
    │  Alert rules triggered
    ▼
Alertmanager → Slack / PagerDuty / Feishu

In this architecture, the agent code does exactly one thing: update the right metrics at the right time. The Prometheus server handles periodic scraping and storage, and Grafana handles visualization and alert triggering.

Metric Types: Counter / Gauge / Histogram

Prometheus provides four core metric types; agent scenarios use three of them:

Type	Semantics	Agent Scenario Examples
Counter	Cumulative, monotonic (only increases)	`agent_requests_total`, `agent_tokens_consumed_total`, `agent_hallucination_total`
Gauge	Instantaneous, can go up or down	`agent_tool_call_success_rate`, `agent_cost_per_task_usd`, `agent_active_sessions`
Histogram	Distribution of observations in configurable buckets (quantiles are estimated at query time)	`agent_request_duration_seconds` (yields P50/P95/P99 via histogram_quantile), `agent_llm_tokens_per_request`

A common mistake is using a Gauge to calculate success rate (manually doing success/total × 100). The correct approach is to use two Counters — agent_tool_calls_total and agent_tool_calls_failed_total — and compute the rate in PromQL:

# PromQL: Agent tool call success rate (5-minute window)
# Parentheses matter: "/" binds tighter than "-" in PromQL.
(sum(rate(agent_tool_calls_total[5m])) - sum(rate(agent_tool_calls_failed_total[5m])))
  /
sum(rate(agent_tool_calls_total[5m]))

The advantage: Counters preserve their semantics across service restarts. The counter starts again from 0, and rate() detects the reset and still computes the correct per-second average over the window. A Gauge, by contrast, jumps on restart and can trigger false alerts.
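To make the restart behaviour concrete, here is a simplified sketch of how a rate()-style calculation can tolerate a counter reset. It is an illustration only, not Prometheus's actual implementation (which also extrapolates to the window boundaries); the function name counter_increase and the sample values are invented for this example.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across scraped counter samples, tolerating resets.

    A sample lower than its predecessor is treated as a restart: the counter
    began again from 0, so the new value itself is the increase for that step.
    """
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            increase += curr  # reset detected: count up from zero
    return increase


# Hypothetical scrapes every 15s; the process restarts before the 4th sample.
samples = [100.0, 130.0, 160.0, 20.0, 50.0]
window_seconds = 15 * (len(samples) - 1)

total = counter_increase(samples)
per_second = total / window_seconds
print(f"increase={total}, rate={per_second:.3f}/s")
```

A Gauge holding a precomputed percentage has no equivalent recovery path: after a restart its value is simply whatever the new process last set.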

Python Implementation: Prometheus Client + Decorator Pattern

Below is a complete agent metrics collection implementation using the prometheus_client library, with a decorator pattern that decouples metric collection from business logic:

"""
Agent Observability — Prometheus Metrics Pipeline Complete Example
Dependencies: pip install prometheus_client
After starting, visit http://localhost:9091/metrics to view metrics.
"""

import time
import functools
import threading
from typing import Callable, Any

from prometheus_client import (
    Counter, Gauge, Histogram, generate_latest,
    CollectorRegistry, REGISTRY,
)


# ── 1. Create Custom Registry (isolate agent metrics) ──────────────────

agent_registry = CollectorRegistry()

# ── 2. Define Agent Metrics ────────────────────────────────────────────

# L1 · Standard Signals
agent_requests_total = Counter(
    "agent_requests_total", "Total agent requests",
    labelnames=["agent_id", "status"],
    registry=agent_registry,
)

agent_request_duration_seconds = Histogram(
    "agent_request_duration_seconds", "Agent request latency (seconds)",
    labelnames=["agent_id"],
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 60.0, 120.0],
    registry=agent_registry,
)

# L2 · LLM-Native Signals
agent_tokens_consumed_total = Counter(
    "agent_tokens_consumed_total", "Cumulative token consumption",
    labelnames=["agent_id", "model", "direction"],
    registry=agent_registry,
)

agent_model_latency_seconds = Histogram(
    "agent_model_latency_seconds", "LLM model inference latency (seconds)",
    labelnames=["agent_id", "model"],
    buckets=[0.1, 0.3, 0.5, 1.0, 2.0, 5.0, 10.0],
    registry=agent_registry,
)

# L3 · Agent-Native Signals
agent_tool_calls_total = Counter(
    "agent_tool_calls_total", "Total tool calls",
    labelnames=["agent_id", "tool_name", "status"],
    registry=agent_registry,
)

agent_hallucination_total = Counter(
    "agent_hallucination_total", "Total detected hallucinations",
    labelnames=["agent_id", "hallucination_type"],
    registry=agent_registry,
)

agent_reasoning_steps = Histogram(
    "agent_reasoning_steps", "LLM reasoning steps per request",
    labelnames=["agent_id"],
    buckets=[1, 2, 3, 5, 8, 12, 20],
    registry=agent_registry,
)

agent_recovery_attempts_total = Counter(
    "agent_recovery_attempts_total", "Total error recovery attempts",
    labelnames=["agent_id", "recovery_strategy"],
    registry=agent_registry,
)

# L4 · Business Signals
agent_task_completion_rate = Gauge(
    "agent_task_completion_rate", "Task completion rate (0.0 ~ 1.0)",
    labelnames=["agent_id", "task_type"],
    registry=agent_registry,
)

agent_cost_per_task_usd = Gauge(
    "agent_cost_per_task_usd", "Average cost per task (USD)",
    labelnames=["agent_id", "model"],
    registry=agent_registry,
)

# ── 3. Decorator: Automatic Metric Collection ──────────────────────────

def observe_agent_request(agent_id: str = "prod-agent-03"):
    """Decorator: automatically collect agent request metrics for the decorated function."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> Any:
            start = time.perf_counter()  # monotonic; immune to wall-clock adjustments
            status = "success"
            try:
                result = func(*args, **kwargs)
                return result
            except Exception:
                status = "failure"
                raise
            finally:
                duration = time.perf_counter() - start
                agent_requests_total.labels(
                    agent_id=agent_id, status=status
                ).inc()
                agent_request_duration_seconds.labels(
                    agent_id=agent_id
                ).observe(duration)
        return wrapper
    return decorator


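def emit_tool_call_metric(
    agent_id: str, tool_name: str, status: str, duration_ms: float
):
    """Call this after every tool invocation to update metrics.

    Note: duration_ms is accepted but not recorded yet, because this example
    defines no tool-latency histogram to observe it into.
    """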
    agent_tool_calls_total.labels(
        agent_id=agent_id, tool_name=tool_name, status=status
    ).inc()


def emit_llm_metric(
    agent_id: str, model: str, input_tokens: int,
    output_tokens: int, latency_seconds: float,
):
    """Call this after every LLM call to update metrics."""
    agent_tokens_consumed_total.labels(
        agent_id=agent_id, model=model, direction="input"
    ).inc(input_tokens)
    agent_tokens_consumed_total.labels(
        agent_id=agent_id, model=model, direction="output"
    ).inc(output_tokens)
    agent_model_latency_seconds.labels(
        agent_id=agent_id, model=model
    ).observe(latency_seconds)


# ── 4. Example Agent Request Handler ───────────────────────────────────

@observe_agent_request(agent_id="prod-agent-03")
def handle_customer_query(query: str) -> str:
    """Simulate an agent handling a customer query."""
    agent_id = "prod-agent-03"
    model = "gpt-4o"

    # Round 1: LLM reasoning → tool call
    emit_llm_metric(agent_id, model,
                    input_tokens=1200, output_tokens=350,
                    latency_seconds=0.8)
    emit_tool_call_metric(agent_id, "search_knowledge_base",
                          "success", duration_ms=180)

    # Round 2: LLM reasoning → tool call
    emit_llm_metric(agent_id, model,
                    input_tokens=1800, output_tokens=400,
                    latency_seconds=1.1)
    emit_tool_call_metric(agent_id, "calculate",
                          "success", duration_ms=95)

    # Round 3: Final LLM reasoning
    emit_llm_metric(agent_id, model,
                    input_tokens=1500, output_tokens=200,
                    latency_seconds=0.6)

    # Update business-level signal
    agent_task_completion_rate.labels(
        agent_id=agent_id, task_type="customer_query"
    ).set(0.92)

    return "Here are your latest sales figures: Q2 revenue up 12.4% year-over-year."


# ── 5. Start Metrics HTTP Server ───────────────────────────────────────

def start_metrics_server(port: int = 9091) -> threading.Thread:
    """Start a standalone /metrics HTTP endpoint for Prometheus scraping."""
    from http.server import HTTPServer, BaseHTTPRequestHandler

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/metrics":
                self.send_response(200)
                self.send_header(
                    "Content-Type", "text/plain; version=0.0.4; charset=utf-8"
                )
                self.end_headers()
                self.wfile.write(generate_latest(agent_registry))
            elif self.path == "/health":
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(b"ok")
            else:
                self.send_response(404)
                self.end_headers()

    server = HTTPServer(("0.0.0.0", port), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    print(
        f"📊 Agent metrics server on http://0.0.0.0:{port}/metrics"
    )
    return thread


if __name__ == "__main__":
    start_metrics_server(port=9091)

    print("Simulating agent requests...")
    for i in range(5):
        query = f"User query #{i+1}: Look up the latest figures"
        reply = handle_customer_query(query)
        print(f"  [{i+1}/5] Done → {reply}")

    print("\n✅ Metrics ready. Visit http://localhost:9091/metrics to inspect.")
    import signal
    signal.pause()

Key Design Decisions

1. Custom Registry isolates agent metrics. If your service has both agent and non-agent components (e.g., a FastAPI service with both REST endpoints and an agent handler), using a custom CollectorRegistry separates agent metrics from application metrics, avoiding naming collisions and semantic confusion.

2. Label design is an art form. Each metric's labels determine how you can slice and aggregate in PromQL. For agent_tool_calls_total above, we chose [agent_id, tool_name, status] — this lets you aggregate by tool name (sum by(tool_name)), filter by status (status="failure"), or compare across agent instances (sum by(agent_id)). But avoid high-cardinality labels like user_id — every unique label combination creates a new time series in Prometheus, and too many series cause memory bloat. User-level metrics are better tracked through tracing and audit logs.

3. Histogram buckets must match your actual data distribution. Agent latency distributions are typically much wider than traditional APIs — P50 might be 3 seconds, P99 might be 60 seconds. The agent_request_duration_seconds histogram above uses [0.5, 1.0, ..., 120.0] buckets, covering everything from fast responses to long-running tasks. If your agent regularly has 3+ minute tasks, add 180.0 and 300.0 buckets.

4. The decorator pattern minimizes invasiveness. The observe_agent_request decorator makes metric collection nearly transparent to business code — add one line of decorator and request counting plus latency are automatically collected. For LLM calls and tool calls, we use explicit emit_llm_metric() and emit_tool_call_metric() functions because, within the agent's reasoning loop, these calls carry step-specific context (model name, tool name, token count) that's hard to handle uniformly with a decorator. The recovery retry count is also a critical observability signal — Agent Error Recovery discusses retry strategies in depth.
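To see why bucket boundaries matter (point 3 above), the following sketch approximates the linear interpolation that histogram_quantile applies to cumulative bucket counts. It is a simplified model, not Prometheus's code: the function name estimate_quantile and the request counts are hypothetical, and only classic (non-native) histograms are considered.

```python
from typing import Sequence


def estimate_quantile(q: float, bounds: Sequence[float],
                      cumulative: Sequence[int]) -> float:
    """Estimate a quantile from cumulative bucket counts by interpolation.

    bounds[i] is the upper bound (le) of bucket i; cumulative[i] counts all
    observations <= bounds[i]. The final entry of `cumulative` is the +Inf
    bucket, i.e. the total number of observations.
    """
    total = cumulative[-1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in zip(bounds, cumulative):
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    # The rank falls in the +Inf bucket: the highest finite bound is the best
    # available answer, so a too-small top bucket hides tail latency.
    return bounds[-1]


buckets = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 60.0, 120.0]
# Hypothetical cumulative counts for 1,000 requests (last value = +Inf bucket).
counts = [50, 200, 450, 800, 930, 980, 995, 999, 1000]

p50 = estimate_quantile(0.50, buckets, counts)
p99 = estimate_quantile(0.99, buckets, counts)
print(f"P50 ~ {p50:.2f}s, P99 ~ {p99:.2f}s")
```

With these numbers the P99 estimate lands inside the wide 20 to 60 second bucket, so its precision is limited by that bucket's width. Adding boundaries where your tail actually sits tightens the estimate.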

Prometheus Scrape Configuration

To have Prometheus periodically pull metrics from your agent service, add a scrape job to prometheus.yml:

# prometheus.yml — Agent metrics scrape config

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "agent-service"
    metrics_path: "/metrics"
    scrape_interval: 15s
    scrape_timeout: 10s
    static_configs:
      - targets:
          - "agent-prod-01:9091"
          - "agent-prod-02:9091"
          - "agent-prod-03:9091"
        labels:
          cluster: "production"
          service: "agent-service"

  - job_name: "agent-service-staging"
    metrics_path: "/metrics"
    scrape_interval: 60s
    static_configs:
      - targets:
          - "agent-staging:9091"
        labels:
          cluster: "staging"
          service: "agent-service"

Key configuration notes:

scrape_interval: 15s — The recommended value for production. For latency-sensitive agent workloads, 15 seconds is enough to catch P99 latency spikes (spikes typically last 30–60 seconds). If your agent QPS is high (> 100), consider a 10-second interval.
scrape_timeout: 10s — The timeout should be less than the scrape_interval to prevent a slow scrape from blocking the next one.
static_configs — Suitable for fixed-IP deployments. If your agents run in Kubernetes, use kubernetes_sd_configs for service discovery to automatically track pod creation and destruction.
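cluster and service target labels: Attached to every series scraped from these targets (they are not Prometheus external_labels, which are configured under global). They are invaluable for switching dashboards by environment in Grafana and aggregating metrics across multi-cluster deployments.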

Once Prometheus is running, you can validate correct metric collection by running PromQL queries in the Prometheus UI (http://localhost:9090):

# Validate: agent request rate (past 5 minutes)
rate(agent_requests_total{cluster="production"}[5m])

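# Validate: tool call failure rate by tool
# Aggregate both sides by tool_name; otherwise the status label on the
# numerator only matches itself and the ratio is always 1.
sum by (tool_name) (rate(agent_tool_calls_total{status="failure"}[5m]))
  /
sum by (tool_name) (rate(agent_tool_calls_total[5m]))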

# Validate: P99 latency
histogram_quantile(0.99,
  rate(agent_request_duration_seconds_bucket[5m]))

At this point, the agent metrics pipeline is established: agent code updates Counters/Gauges/Histograms via prometheus_client → Prometheus periodically scrapes the /metrics endpoint → metrics are stored in TSDB and queryable via PromQL. Sections 5 and 6 build alerting rules and the trace-metric-log correlation system on top of this foundation.

It's worth emphasizing that observability coverage has to extend to every integration point. As discussed in the MCP Protocol Production Guide, MCP calls need full tracing coverage like any other tool call: every MCP tool invocation in your agent stack should produce an agent.tool.execute span and corresponding Prometheus metrics, so that your observability coverage has no blind spots.

5. Real-Time Alerting for Agent Failures

Metrics and Tracing answer "what happened" and "how it happened." But if nobody is watching the dashboard at 3:14 AM, you're still blind. Alerting is the sentry that never sleeps — and for AI agents, alerting rules must be fundamentally redesigned because the failure modes don't look anything like traditional service failures.

Recall the opening incident: error_rate was 0%, latency was normal, QPS was steady — yet the agent was silently failing every refund request. A traditional alerting rule (error_rate > 1%) would have slept right through it. This is the silent failure problem: agents fail in ways that produce HTTP 200 responses, making HTTP-status-based alerting useless for catching them.

Agent alerting requires composite signals — combinations of metrics that, together, indicate genuine degradation — and agent-native alert conditions that go beyond error codes.

What to Alert On: The Five Agent Alert Categories

We define five alert categories, each with specific PromQL expressions and severity assignments. Every category targets a distinct failure mode:

#	Alert Category	Threshold	Severity	What It Catches
1	Error Rate Spike	>5% over 5 min	P1	LLM API returning errors, tool execution failures, downstream dependency outage
2	P99 Latency Spike	>3× baseline over 10 min	P1	LLM inference slowdown, tool timeout cascade, model overload
3	Token Cost Spike	>2× daily avg over 15 min	P2	Prompt injection causing verbose outputs, reasoning loop explosion, cost runaway
4	Tool Call Failure Spike	>3% over 5 min	P2	Tool API degradation, permission changes, schema mismatch after tool update
5	Hallucination Spike	>0.05/s sustained 30 min	P3	Prompt degradation, model version regression, RAG pipeline corruption

Let's examine each category in detail, including why each threshold was chosen:

1. Error Rate Spike (P1). While traditional services also alert on error rate, agent errors include LLM API failures (rate limiting, context length exceeded, content filter blocks) in addition to tool call failures. Note that 5% is a looser percentage than the 1% often used for microservices; what is stricter is the window, 5 minutes rather than the 10 minutes common elsewhere. The shorter window matters because agent errors compound: one erroring request might mean a user gets a completely incorrect answer, not just a retryable failure. The for: 5m duration prevents flapping from transient LLM hiccups.

2. P99 Latency Spike (P1). The alert fires when P99 stays above 3× its value one hour earlier for a sustained 10-minute window. Why P99 and not average? Because agent latency distributions are heavy-tailed: a single slow LLM call can drag the average up while most users are fine. P99 tells you that the slowest 1% of requests are experiencing very poor latency. The 10-minute sustain period is critical: LLM inference latency has inherent variance (the same prompt can take 800ms or 1200ms depending on model load), so we need a window long enough to filter out noise.

3. Token Cost Spike (P2). This alert has no equivalent in traditional monitoring. Token costs can spike for two reasons: (a) a prompt injection attack causing the model to produce excessively long outputs, or (b) a reasoning loop explosion where the agent gets stuck in a cycle. The threshold — 2× the 24-hour rolling average over a 15-minute window — catches abrupt cost anomalies while tolerating normal diurnal patterns (higher traffic during business hours). This is a P2 because it's a cost issue, not a user-experience outage — important but not wake-up-at-3AM urgent.

4. Tool Call Failure Spike (P2). When a specific tool's failure rate spikes above 3%, it typically indicates a downstream dependency problem — the database is slow, the API changed its schema, or an auth token expired. Unlike generic error rate, this alert is labeled by tool_name, allowing the on-call engineer to immediately identify which tool is broken. For recovery strategies when tools fail, see Agent Error Recovery.

5. Hallucination Spike (P3). This is the most agent-native of all alerts — it tracks the rate at which your hallucination detection system flags outputs as fabricated or incorrect. A sustained elevation over 30 minutes (above 0.05 hallucinations per second) suggests a systemic issue: the model version changed, the system prompt was modified incorrectly, or the RAG knowledge base is returning corrupted data. This is P3 because hallucinations are a quality issue, not a service outage — it warrants investigation during business hours, not immediate paging. For a complete framework on detecting and measuring hallucinations, see Agent Evaluation Framework.
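As a rough worked example of the five thresholds above, the sketch below applies them to a single hypothetical snapshot of rates. The Snapshot fields, the breached function, and all numbers are assumptions made for illustration; sustain windows are deliberately ignored, so this shows only the threshold arithmetic, not a real alert evaluator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One point-in-time reading; all field names are illustrative."""
    requests_per_s: float
    failed_requests_per_s: float
    p99_seconds: float
    p99_baseline_seconds: float
    tokens_per_s: float
    tokens_per_s_daily_avg: float
    tool_calls_per_s: float
    failed_tool_calls_per_s: float
    hallucinations_per_s: float


def breached(s: Snapshot) -> List[str]:
    """Return the alert categories whose threshold is exceeded right now."""
    hits: List[str] = []
    if s.requests_per_s > 0 and s.failed_requests_per_s / s.requests_per_s > 0.05:
        hits.append("P1 error_rate")
    if s.p99_baseline_seconds > 0 and s.p99_seconds / s.p99_baseline_seconds > 3:
        hits.append("P1 p99_latency")
    if s.tokens_per_s > 2 * s.tokens_per_s_daily_avg:
        hits.append("P2 token_cost")
    if s.tool_calls_per_s > 0 and s.failed_tool_calls_per_s / s.tool_calls_per_s > 0.03:
        hits.append("P2 tool_failure")
    if s.hallucinations_per_s > 0.05:
        hits.append("P3 hallucination")
    return hits


snapshot = Snapshot(
    requests_per_s=2.0, failed_requests_per_s=0.02,       # 1% errors
    p99_seconds=18.2, p99_baseline_seconds=5.1,           # about 3.6x baseline
    tokens_per_s=900.0, tokens_per_s_daily_avg=400.0,     # 2.25x daily average
    tool_calls_per_s=4.0, failed_tool_calls_per_s=0.08,   # 2% tool failures
    hallucinations_per_s=0.01,
)
print(breached(snapshot))
```

Note that the error rate in this snapshot is a healthy 1%, yet two other categories are breached, which is the kind of degradation an error-rate-only rule would miss.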

Alert Severity: P1 / P2 / P3

Severity	Meaning	Response	Notification Channel	Example
P1 · Critical	User-facing outage or severe degradation	Wake up on-call, immediate response (target: <15 min)	PagerDuty + Slack @channel	Error rate > 5%, P99 latency > 3× baseline
P2 · Warning	Partial degradation or cost anomaly	Investigate during business hours (target: <4 hr)	Slack @here + ticket auto-creation	Token cost spike, tool call failure spike
P3 · Trend	Quality degradation or emerging pattern	Observe and plan (target: next sprint)	Slack channel (no ping) + dashboard annotation	Hallucination trend, task completion rate drift

The severity assignment follows a simple rule: if it affects users right now, it's P1; if it is a partial degradation or affects costs, it's P2; if it affects quality trends, it's P3.

Prometheus Alerting Rules (YAML)

Below is a complete Prometheus alerting rule file covering all five alert categories with the PromQL expressions and thresholds defined above. Prometheus evaluates these rules and hands firing alerts to Alertmanager for routing. Deploy this alongside your Prometheus configuration:

# agent-alerts.yml — Prometheus Alert Rules for Agent Observability
# Place in /etc/prometheus/rules/agent-alerts.yml and reference in prometheus.yml:
#   rule_files:
#     - "/etc/prometheus/rules/agent-alerts.yml"

groups:
  - name: agent_observability
    interval: 30s
    rules:

    # ── P1: Error Rate Spike ─────────────────────────────────────────
    - alert: AgentErrorRateSpike
      expr: |
        (
          sum by (agent_id) (rate(agent_requests_total{status="failure"}[5m]))
          /
          sum by (agent_id) (rate(agent_requests_total[5m]))
        ) > 0.05
      for: 5m
      labels:
        severity: P1
        component: agent
      annotations:
        summary: "Agent error rate exceeds 5%"
        description: |
          Agent {{ $labels.agent_id }} error rate:
          {{ $value | humanizePercentage }} over 5 minutes.
          Runbook: https://wiki.internal/agent-error-spike

    # ── P1: P99 Latency Spike ────────────────────────────────────────
    - alert: AgentP99LatencySpike
      expr: |
        # Absolute guard first, so $value is the P99 in seconds (used below).
        histogram_quantile(0.99,
          rate(agent_request_duration_seconds_bucket[5m])) > 5
        and
        (
          histogram_quantile(0.99,
            rate(agent_request_duration_seconds_bucket[5m]))
          /
          histogram_quantile(0.99,
            rate(agent_request_duration_seconds_bucket[5m] offset 1h))
        ) > 3
      for: 10m
      labels:
        severity: P1
        component: agent
      annotations:
        summary: "Agent P99 latency > 3× baseline for 10 minutes"
        description: |
          Agent {{ $labels.agent_id }} P99: {{ $value | humanizeDuration }}.
          Baseline (1h ago): {{ with query (printf "histogram_quantile(0.99, rate(agent_request_duration_seconds_bucket{agent_id=\"%s\"}[5m] offset 1h))" $labels.agent_id) }}{{ . | first | value | humanizeDuration }}{{ end }}.
          Check Jaeger for slow traces: agent_id={{ $labels.agent_id }}.

    # ── P2: Token Cost Spike ─────────────────────────────────────────
    - alert: AgentTokenCostSpike
      expr: |
        rate(agent_tokens_consumed_total[30m])
        >
        2 * rate(agent_tokens_consumed_total[24h])
      for: 15m
      labels:
        severity: P2
        component: agent
      annotations:
        summary: "Agent token consumption > 2× daily average"
        description: |
          30-minute token rate is {{ $value | humanize }}/s,
          exceeding 2× the 24-hour rolling average.
          Check for prompt injection or reasoning loop explosion.

    # ── P2: Tool Call Failure Spike ──────────────────────────────────
    - alert: AgentToolCallFailureSpike
      expr: |
        (
          sum by (agent_id, tool_name) (rate(agent_tool_calls_total{status="failure"}[5m]))
          /
          sum by (agent_id, tool_name) (rate(agent_tool_calls_total[5m]))
        ) > 0.03
      for: 5m
      labels:
        severity: P2
        component: agent
      annotations:
        summary: "Agent tool call failure rate > 3%"
        description: |
          Tool {{ $labels.tool_name }} failure rate:
          {{ $value | humanizePercentage }}.
          Verify downstream API health for {{ $labels.tool_name }}.

    # ── P3: Hallucination Rate Trend ─────────────────────────────────
    - alert: AgentHallucinationTrend
      expr: |
        rate(agent_hallucination_total[30m]) > 0.05
      for: 30m
      labels:
        severity: P3
        component: agent
      annotations:
        summary: "Agent hallucination rate trending up (P3 — observe)"
        description: |
          Hallucination rate: {{ $value | humanize }}/s over 30 minutes.
          Review recent prompt changes and RAG pipeline health.

Key configuration notes:

for: 5m / 10m / 15m / 30m — The "for" duration is your defense against flapping. P1 alerts use 5-10 minute windows (you want to know fast); P3 uses 30 minutes (you want to filter noise). Never set for: 0m on agent alerts — LLM latency variance and token count jitter will generate false positives.
P99 latency guard condition (and ... > 5) — The ratio comparison alone can trigger when the baseline is tiny (e.g., 3× 0.1s = 0.3s, which is not a problem). The guard > 5 ensures we only alert when P99 is actually above 5 seconds, filtering out trivial spikes.
rate(...[24h]) for token cost baseline: Using a 24-hour window for the baseline captures full diurnal patterns. A 1-hour window would cause false alerts during normal traffic ramps (morning vs. midnight).
Prometheus rule_files reference — Don't forget to add the path to your prometheus.yml, or use a Prometheus Operator PrometheusRule CRD if you're on Kubernetes.
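The effect of the "for" duration can be sketched as a small state machine. This is a simplified model of the pending/firing behaviour, assuming one boolean result per rule evaluation; the function name alert_states is hypothetical, and a 60-second evaluation interval is used for readability (the rule file above uses 30s).

```python
from typing import List, Optional, Sequence


def alert_states(condition_true: Sequence[bool], interval_s: int,
                 for_s: int) -> List[str]:
    """Map per-evaluation condition results to inactive/pending/firing.

    The alert is pending from the first evaluation where the condition holds
    and starts firing only once it has held continuously for `for_s` seconds.
    Any evaluation where the condition is false resets the clock.
    """
    states: List[str] = []
    active_since: Optional[int] = None
    for i, ok in enumerate(condition_true):
        now = i * interval_s
        if not ok:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# A two-minute blip, then a sustained breach (one evaluation per minute).
results = [True, True, False, True, True, True, True, True, True, True]
states = alert_states(results, interval_s=60, for_s=300)
for minute, state in enumerate(states):
    print(f"t+{minute}m: {state}")
```

The early blip never leaves the pending state, so nobody is notified; only the breach that persists for the full five minutes reaches firing.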

Python Webhook Handler: Alertmanager → Slack / PagerDuty

Prometheus fires alerts and Alertmanager groups and dispatches them, but you still need to route them to the right people through the right channels. Below is a Python webhook handler that receives Alertmanager POST notifications and forwards them to Slack (all severities) and PagerDuty (P1 only):

"""
Alertmanager Webhook Forwarder → Slack + PagerDuty
Receives Alertmanager webhook POST at :9093, routes by severity.

Start:  python alert_webhook.py
Config: Set SLACK_WEBHOOK_URL and PAGERDUTY_ROUTING_KEY env vars.
"""
import json
import os
from http.server import HTTPServer, BaseHTTPRequestHandler

import requests

SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK_URL", "")
PD_ROUTING_KEY = os.getenv("PAGERDUTY_ROUTING_KEY", "")

EMOJI = {"P1": "🔴", "P2": "🟡", "P3": "🔵"}
COLOR = {"P1": "danger", "P2": "warning", "P3": "#439FE0"}


def route_alert(alert: dict) -> None:
    """Route a single alert: P1 → PagerDuty + Slack; P2/P3 → Slack only."""
    severity = alert.get("labels", {}).get("severity", "P3")
    summary = alert.get("annotations", {}).get("summary", "No summary")
    desc = alert.get("annotations", {}).get("description", "")
    starts_at = alert.get("startsAt", "")
    status = alert.get("status", "firing")

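    # P1 → PagerDuty incident (opened on firing, closed on resolved)
    if severity == "P1" and PD_ROUTING_KEY:
        requests.post("https://events.pagerduty.com/v2/enqueue", json={
            "routing_key": PD_ROUTING_KEY,
            # send_resolved: true delivers resolved alerts here too; map them to
            # "resolve" so they close the incident instead of opening a new one.
            "event_action": "resolve" if status == "resolved" else "trigger",
            # The per-alert fingerprint pairs each resolve with its trigger.
            "dedup_key": alert.get("fingerprint", summary),
            "payload": {
                "summary": summary,
                "severity": "critical",
                "source": "alertmanager",
                "custom_details": alert.get("annotations", {}),
            },
        }, timeout=10)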

    # All severities → Slack
    if SLACK_WEBHOOK:
        requests.post(SLACK_WEBHOOK, json={
            "text": f"{EMOJI.get(severity, '⚪')} *[{severity}] {summary}*",
            "attachments": [{
                "color": COLOR.get(severity, "#CCCCCC"),
                "fields": [
                    {"title": "Description", "value": desc or "N/A", "short": False},
                    {"title": "Starts At", "value": starts_at, "short": True},
                    {"title": "Status", "value": status, "short": True},
                ],
            }],
        }, timeout=10)


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        for alert in body.get("alerts", []):
            route_alert(alert)
        self.send_response(200)
        self.end_headers()

    def do_GET(self) -> None:
        """Health check for load balancer."""
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"alert-webhook-ok")


if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", 9093), WebhookHandler)
    print("🚨 Alert webhook listening on :9093 → Slack + PagerDuty")
    server.serve_forever()

Alertmanager configuration: To connect Alertmanager to this webhook, add to your alertmanager.yml:

# alertmanager.yml — Route alerts to the webhook handler
receivers:
  - name: "agent-webhook"
    webhook_configs:
      - url: "http://agent-webhook:9093/alert"
        send_resolved: true

route:
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "agent-webhook"

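The grouping settings are critical for agent workloads. With group_by: ["alertname", "severity"], alerts that share an alert name and severity are batched into one notification to the webhook instead of 50 individual ones: group_wait: 30s collects the initial burst, and group_interval: 5m limits how often updates are sent when new alerts join an existing group. Note that the handler above still posts one Slack message per alert in the batch, so collapse the batch into a single message there if volume is a concern. The 4-hour repeat_interval prevents alert fatigue from a persistent condition. For additional strategies on managing alert fatigue, see the FAQ section at the end of this article.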
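The grouping key can be illustrated with a few lines of Python. This is only a sketch of how group_by partitions alerts, not Alertmanager's implementation: timing (group_wait, group_interval, repeat_interval) is not modelled, and the function name group_alerts and the sample tool names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_alerts(alerts: Sequence[dict],
                 group_by: Sequence[str] = ("alertname", "severity"),
                 ) -> Dict[Tuple[str, ...], List[dict]]:
    """Partition alerts by the values of the group_by labels.

    Each resulting group corresponds to one notification; alerts missing a
    label are grouped under the empty string for that label.
    """
    groups: Dict[Tuple[str, ...], List[dict]] = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Three tools failing at once plus one error-rate alert (hypothetical data).
alerts = [
    {"labels": {"alertname": "AgentToolCallFailureSpike", "severity": "P2",
                "tool_name": "search_knowledge_base"}},
    {"labels": {"alertname": "AgentToolCallFailureSpike", "severity": "P2",
                "tool_name": "calculate"}},
    {"labels": {"alertname": "AgentToolCallFailureSpike", "severity": "P2",
                "tool_name": "process_refund"}},
    {"labels": {"alertname": "AgentErrorRateSpike", "severity": "P1",
                "agent_id": "prod-agent-03"}},
]

groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Four alerts collapse into two notifications here. Adding tool_name to group_by would split the tool failures back into separate notifications, which is the trade-off to weigh when choosing grouping labels.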

The Silent Failure Problem Revisited

Even with all five alert categories above, there is one class of failure that none of those rules can detect: the agent consistently chooses the wrong tool, but all tool calls return success. This is what happened in the 3:14 AM incident: the agent selected return_unable_to_process instead of process_refund, and every tool call returned HTTP 200 with status=success.

The defense against silent failures is tool selection distribution monitoring — tracking the ratio of which tools are called over time and alerting on distribution drift. This is implemented as a P3 trend alert (observe, don't page):

# PromQL: Detect tool selection distribution drift
# Compares the current 30-min tool distribution with the 24-hour baseline
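# sum() drops the tool_name/status labels so the two sides can be divided;
# without it each series only matches itself and both ratios are always 1.
(
  sum(rate(agent_tool_calls_total{tool_name="process_refund"}[30m]))
  /
  sum(rate(agent_tool_calls_total[30m]))
)
<
0.5 * (
  sum(rate(agent_tool_calls_total{tool_name="process_refund"}[24h]))
  /
  sum(rate(agent_tool_calls_total[24h]))
)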

If the process_refund tool's share of total tool calls drops below 50% of its historical proportion, something has changed in the agent's decision-making — investigate. This pattern should be replicated for every high-stakes tool in your agent. For a deeper discussion of tool categorization and permission boundaries, see Agent Tool Permission Control.
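The same drift check can be sketched offline in Python, which is handy for replaying historical call counts before committing to a threshold. The helper names, the search_kb tool, and every count below are hypothetical; only process_refund and return_unable_to_process come from the incident described above.

```python
from typing import Dict, List, Sequence


def tool_share(calls: Dict[str, float], tool: str) -> float:
    """Fraction of all tool calls that went to `tool` (0.0 if no calls)."""
    total = sum(calls.values())
    return calls.get(tool, 0.0) / total if total else 0.0


def drifted_tools(recent: Dict[str, float], baseline: Dict[str, float],
                  high_stakes: Sequence[str], ratio: float = 0.5) -> List[str]:
    """Return high-stakes tools whose recent share fell below ratio x baseline."""
    flagged: List[str] = []
    for tool in high_stakes:
        base = tool_share(baseline, tool)
        if base > 0 and tool_share(recent, tool) < ratio * base:
            flagged.append(tool)
    return flagged


# Hypothetical call counts: 24-hour baseline vs. the last 30 minutes.
baseline_24h = {"process_refund": 1200, "return_unable_to_process": 300,
                "search_kb": 4500}
recent_30m = {"process_refund": 5, "return_unable_to_process": 40,
              "search_kb": 95}

print("baseline share:", round(tool_share(baseline_24h, "process_refund"), 3))
print("recent share:  ", round(tool_share(recent_30m, "process_refund"), 3))
print("drifted:", drifted_tools(recent_30m, baseline_24h, ["process_refund"]))
```

Every call in this data could have returned success, yet the refund tool's share fell from 20% to under 4%, which is the signal the distribution check is designed to surface.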

6. Correlating Traces, Metrics, and Logs

In isolation, Metrics tell you "P99 latency is spiking," Tracing tells you "this specific request took 7.2 seconds," and Logs tell you "the payment API returned a timeout at 03:14:22." But in production debugging, none of these signals are useful alone — the power is in correlation. The canonical debug workflow for agent incidents is a chain: alert → trace → span → log. Each link in the chain carries an identifier that bridges to the next.

The Debug Workflow: Alert → Trace → Span → Log

Here is the complete correlation workflow, illustrated with a concrete example:

┌─────────────────────────────────────────────────────────────────┐
│  DEBUG WORKFLOW: Agent P99 Latency Spike at 03:14 AM            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. Alert fires: "AgentP99LatencySpike — P99=18.2s (baseline    │
│     was 5.1s)" → Slack notification received.                   │
│                                                                  │
│  2. Open Grafana → Agent Latency Dashboard → Identify time      │
│     window (03:10–03:25 UTC) → Switch to "Slowest Traces"       │
│     panel, which lists trace_ids sorted by duration.             │
│                                                                  │
│  3. Top slow trace: trace_id=4f8a2c1b... → Click through to     │
│     Jaeger → Expand span tree:                                   │
│                                                                  │
│     agent.request (18.2s)                                        │
│     ├── llm_call #1 (1.1s)                                      │
│     │   └── tool: search_kb (0.3s)                              │
│     ├── llm_call #2 (1.4s)                                      │
│     │   └── tool: calculate_price (15.1s) ← BOTTLENECK          │
│     └── llm_call #3 (0.4s)                                      │
│                                                                  │
│  4. Jaeger shows trace_id and span_id for the slow span.        │
│     Copy span_id: a1b2c3d4e5f6...                                │
│                                                                  │
│  5. Open Loki / ELK → Query:                                    │
│     {service="agent"} |= "a1b2c3d4e5f6"                         │
│     → Returns the structured log for this tool call:            │
│                                                                  │
│     {"timestamp":"03:14:22","level":"ERROR",                    │
│      "message":"calculate_price timeout after 15s",              │
│      "trace_id":"4f8a2c1b...","span_id":"a1b2c3d4...",          │
│      "tool_name":"calculate_price","tool_params":{...}}          │
│                                                                  │
│  6. Root cause identified: pricing API upstream degraded,       │
│     causing 15-second timeouts on all calculate_price calls.     │
└─────────────────────────────────────────────────────────────────┘

This workflow works because of one critical design decision: trace_id and span_id are injected into every log entry, and trace_ids are exposed as Prometheus exemplars. Without these cross-signal identifiers, you'd be manually correlating timestamps across three different UIs — a process that takes 20 minutes for what should take 2.
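
Step 5 of the workflow boils down to filtering JSON log lines on a shared identifier. The sketch below shows that lookup in plain Python over in-memory lines. The helper name logs_for_span and the sample log lines are made up for illustration; a real deployment would run the equivalent query in Loki or Elasticsearch.

```python
import json
from typing import Iterable


def logs_for_span(lines: Iterable[str], span_id_prefix: str) -> list[dict]:
    """Return parsed log entries whose span_id starts with the given prefix."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        if str(entry.get("span_id", "")).startswith(span_id_prefix):
            matches.append(entry)
    return matches


sample_lines = [
    '{"level":"INFO","message":"search_kb ok","trace_id":"4f8a2c1b","span_id":"0f0e0d0c"}',
    '{"level":"ERROR","message":"calculate_price timeout after 15s",'
    '"trace_id":"4f8a2c1b","span_id":"a1b2c3d4"}',
    "plain text line that is not JSON",
]

hits = logs_for_span(sample_lines, "a1b2c3d4")
```

The lookup only works because every entry carries span_id as a top-level field, which is the design decision described above.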

Prometheus Exemplars: Bridging Metrics and Traces

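Prometheus exemplars attach a trace_id to a specific metric data point. When you see a latency spike on a Grafana dashboard, you can click the data point and jump directly to the corresponding trace in Jaeger. Two prerequisites are easy to miss: the Python client only exposes exemplars in the OpenMetrics exposition format, and the Prometheus server only stores them when started with --enable-feature=exemplar-storage. On the agent side, you attach an exemplar by passing a trace_id with each Histogram observation: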

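# Python: Attach trace_id as an exemplar to a Prometheus Histogram
from opentelemetry import trace
from prometheus_client import Histogram

# Defined once at import time (shown here so the snippet is self-contained)
agent_request_duration_seconds = Histogram(
    "agent_request_duration_seconds",
    "End-to-end agent request latency in seconds",
    ["agent_id"],
)


def record_duration(duration_seconds: float, agent_id: str = "prod-03") -> None:
    span_ctx = trace.get_current_span().get_span_context()
    # Attach an exemplar only when a valid trace context is active
    exemplar = {"trace_id": format(span_ctx.trace_id, "032x")} if span_ctx.is_valid else None
    agent_request_duration_seconds.labels(agent_id=agent_id).observe(
        duration_seconds, exemplar=exemplar
    )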

With exemplars enabled, your Grafana dashboard's latency panel becomes interactive: hover over any anomalous spike → click "View Trace" → Jaeger opens with the exact span tree for that slow request. This eliminates the "which trace caused this spike?" guessing game.

Structured Logging with OTel Trace Injection

The final piece of the correlation puzzle is structured logging that automatically captures trace context. Below is a Python logging setup that injects trace_id and span_id into every log entry via a custom JSON formatter:

"""
Structured JSON Logging with OpenTelemetry Trace Context Injection

Every log line automatically includes trace_id and span_id from the
active OpenTelemetry span. Compatible with Loki, ELK, Datadog, and
any log aggregator that indexes JSON fields.

Usage:
    from agent_logging import setup_logging, get_logger
    setup_logging()
    logger = get_logger(__name__)
    logger.info("LLM reasoning completed", extra={"step": 3, "tokens": 450})
"""
import json
import logging
import sys
from datetime import datetime, timezone

from opentelemetry.trace import get_current_span


# Attributes present on every LogRecord; anything else arrived via `extra=`.
_STANDARD_ATTRS = frozenset(vars(logging.makeLogRecord({}))) | {"message", "asctime"}


class OtelJsonFormatter(logging.Formatter):
    """JSON log formatter that injects OTel trace context into every record."""

    def format(self, record: logging.LogRecord) -> str:
        # get_current_span() never returns None; with no active span it
        # returns a non-recording span whose context is invalid.
        span_ctx = get_current_span().get_span_context()
        has_trace = span_ctx.is_valid

        log_entry = {
            # Use the record's own creation time, not the time of formatting
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "module": record.module,
            "line": record.lineno,
            # OTel trace context: empty strings if no active span
            "trace_id": format(span_ctx.trace_id, "032x") if has_trace else "",
            "span_id": format(span_ctx.span_id, "016x") if has_trace else "",
        }
        # Merge fields passed via `extra=` without overwriting the core schema
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS and key not in log_entry:
                log_entry[key] = value
        if record.exc_info and record.exc_info[0]:
            log_entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(log_entry, default=str)


def setup_logging(level: int = logging.INFO) -> None:
    """Configure root logger with JSON formatting and OTel trace injection."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(OtelJsonFormatter())
    logging.basicConfig(level=level, handlers=[handler], force=True)


def get_logger(name: str) -> logging.Logger:
    """Get a logger for the given module name (convenience wrapper)."""
    return logging.getLogger(name)


# ── Usage example ─────────────────────────────────────────────────
if __name__ == "__main__":
    setup_logging()
    logger = get_logger("agent.reasoning")
    logger.info(
        "LLM call completed",
        extra={"agent_step": 3, "input_tokens": 1520, "output_tokens": 480},
    )
    # Illustrative output (one line; wrapped and compacted here). trace_id and
    # span_id are filled in when the call runs inside an active span; run
    # standalone like this, both are "":
    # {"timestamp":"2026-05-24T10:15:22.123456+00:00","level":"INFO",
    #  "logger":"agent.reasoning","message":"LLM call completed",
    #  "module":"agent_logging","line":58,
    #  "trace_id":"4f8a2c1b9d3e5f7a8b1c2d3e4f5a6b7c",
    #  "span_id":"a1b2c3d4e5f6a7b8",
    #  "agent_step":3,"input_tokens":1520,"output_tokens":480}

Design decisions in this logging setup:

get_current_span() is called inside format(), not stored at logger creation time. This ensures every log entry captures the span that is active at the moment the log is emitted, not the span that was active when the logger was initialized. In an agent's multi-step reasoning loop, the active span changes with each LLM call — if you cached the span reference, you'd get stale trace_ids.
Empty strings for missing trace context. When logging outside an active span (e.g., during startup or in a background thread), trace_id and span_id are empty strings rather than omitted entirely. This preserves JSON schema consistency, making it safe to filter on an empty trace_id (in Loki, for example, {service="agent"} | json | trace_id="") to find logs that are outside any trace context.
JSON output to stdout. In containerized deployments, stdout JSON logs are collected by the container runtime (Docker/containerd) and forwarded to your log aggregator. No file rotation, no disk I/O concerns — the simplest and most reliable logging pipeline.
Compatibility with the audit logging pipeline. Note that this structured logger captures operational context (trace IDs, timestamps, log levels), but does not capture full LLM prompts and tool parameters. Those belong in the dedicated audit log — as detailed in Agent Audit Log Design. The trace_id in both systems is the bridge: find the incident in operational logs, then expand the full decision chain in audit logs using the same trace_id.

With this setup in place, the full correlation chain is operational: Alert (Prometheus) → Trace (Jaeger, via exemplar) → Log (Loki, via trace_id/span_id) → Audit Log (via trace_id). Every signal type is connected by a common identifier, turning a 20-minute debugging session into a 2-minute one.

7. Incremental Adoption Path

Reading this article, you might feel overwhelmed — four layers of signals, OpenTelemetry instrumentation, Prometheus metrics, Alertmanager rules, structured logging, and three-way correlation. But you don't need to implement everything at once. Agent observability is best adopted incrementally, with each phase delivering concrete debugging value before the next phase begins.

The adoption path below is designed for a team of 2–4 engineers with existing Python/Prometheus experience. Each phase has a clear deliverable and immediate benefit — you start getting value from Phase 0 in the first week.

Phase	Timeline	Deliverable	Immediate Benefit
Phase 0	Week 1	Structured JSON logging with trace_id injection. Every agent request generates a UUID v7 request_id. Every log line includes trace_id.	You can trace a single request across all log lines. Debugging time for production issues drops from "guess which logs are mine" to "grep trace_id."
Phase 1	Week 2–3	Basic OTel spans: root span per request + LLM call spans. Export to local Jaeger (Docker). Span attributes: model name, token counts, latency.	You can see the full span tree for any request. "Where is time being spent?" becomes a visual question, not a code-instrumentation question.
Phase 2	Week 4	Prometheus metrics pipeline: agent_requests_total, agent_tokens_consumed_total, agent_tool_calls_total. Two Grafana dashboards (L1+L2 overview, L3+L4 deep-dive). Prometheus scrape config in production.	You have a dashboard you can look at. "How is the agent doing?" becomes a 10-second dashboard glance. Aggregated metrics reveal trends invisible in individual traces.
Phase 3	Week 5–6	Alerting rules for the five agent alert categories. Alertmanager → Slack webhook. On-call runbooks written for each alert type.	You no longer need to watch the dashboard. The system tells you when something is wrong. On-call engineers know exactly what to do for each alert type.
Phase 4	Month 2+	Full correlation: Prometheus exemplars (metrics → traces), structured logging in Loki with trace_id indexing (traces → logs), audit log pipeline with shared trace_id (logs → full replay). Harden: tune alert thresholds, add silent-failure detection rules, expand to all agent instances.	The complete debug workflow (alert → trace → span → log → audit log) is operational. Mean time to resolution (MTTR) for agent incidents drops from hours to minutes.

Why this order? The phases follow a deliberate progression from observing individual requests (Phase 0–1) to observing aggregate behavior (Phase 2–3) to connecting everything (Phase 4). You can't build dashboards without metrics, you can't write alerts without dashboards telling you what "normal" looks like, and you can't correlate without all three signals in place.

Phase 0 is non-negotiable. If you only do one thing from this entire article, do Phase 0: structured JSON logging with trace_id. It costs one afternoon of work and gives you the ability to trace individual requests — the foundation upon which everything else is built. Without trace_id in your logs, every subsequent phase is building on sand.
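
As a rough illustration of Phase 0 before any OpenTelemetry spans exist, the sketch below uses only the Python standard library: a per-request ID is stored in a ContextVar and a logging filter stamps it onto every record. All names here (request_id_var, RequestIdFilter, JsonFormatter, handle_request) are hypothetical, and uuid.uuid4() is used as a stand-in for UUID v7 to keep the sketch dependency-free.

```python
import io
import json
import logging
import uuid
from contextvars import ContextVar

# Holds the ID of the request currently being handled ("" outside a request).
request_id_var: ContextVar[str] = ContextVar("request_id", default="")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto each record at emit time."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Minimal JSON formatter: level, message, and the injected trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def handle_request(logger: logging.Logger, question: str) -> str:
    """Hypothetical request handler: assign an ID, then log under it."""
    request_id = uuid.uuid4().hex  # stand-in for a UUID v7
    token = request_id_var.set(request_id)
    try:
        logger.info("request received")
        logger.info("answer produced for: %s", question)
    finally:
        request_id_var.reset(token)  # do not leak the ID into the next request
    return request_id


stream = io.StringIO()  # in-memory stream so the output is easy to inspect
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(JsonFormatter())
demo_logger = logging.getLogger("agent.phase0")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False

rid = handle_request(demo_logger, "refund status?")
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Every entry produced while handling the request carries the same trace_id, which is all that "grep trace_id" needs. Once Phase 1 introduces spans, the same field can be populated from the active span context instead, so queries written against Phase 0 logs should keep working.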

For teams with zero observability infrastructure, Phase 0 and Phase 1 can be done entirely locally (Docker Compose: your agent + Jaeger). No production changes, no infrastructure approvals — just pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp and 50 lines of span instrumentation.

8. OSS Tools vs. Platforms: Choosing Your Observability Stack

Every section so far has used the OSS stack (OpenTelemetry + Prometheus + Jaeger + Grafana + Loki) as the reference implementation. But there are now dedicated platforms — both open-source and commercial — purpose-built for LLM and agent observability. This section provides a decision framework to help you choose.

Decision Matrix

Your choice of observability stack depends on four factors. Use this matrix to narrow your options:

Factor	OSS Stack (OTel+Prom+Jaeger+Grafana+Loki)	OSS Platforms (LangFuse, Arize Phoenix)	Commercial (LangSmith, Datadog LLM)
Team size	2+ engineers comfortable with infra	1–2 engineers, prefer managed	Any — platform handles ops
Monthly budget	Infra cost only ($200–800/mo for medium scale)	Free tier → self-hosted; cloud plans $0–500/mo	$500–$5,000+/mo (per-seat or usage-based pricing, e.g. per trace or per span)
Compliance	Full data sovereignty — everything runs in your VPC	Self-hosted option available; cloud plans share data with vendor	Data leaves your environment; SOC 2, GDPR compliance varies by vendor
Scale	No inherent limits; scales with your Prometheus/Jaeger cluster	Self-hosted scales with your infra; cloud tiers have rate limits	Enterprise tiers handle high volume; pricing scales linearly
Agent-specific features	None natively — you instrument everything manually	LLM call auto-instrumentation, prompt versioning, evaluation pipelines	Full agent lifecycle: tracing, evaluation, prompt management, A/B testing
Integration effort	High — 2–4 weeks for full setup	Medium — SDK integration, 1–2 weeks	Low — drop-in SDK, hours to days
Vendor lock-in	None — OTel is an open standard	Low — OTel-compatible export; can migrate data	Moderate — proprietary SDK; data export limited

OSS Stack: OTel + Jaeger + Prometheus + Grafana + Loki

Best for: Teams that already have Prometheus/Grafana in production, have infrastructure engineering capacity, and require complete data sovereignty (finance, healthcare, defense).

This is the stack we've used throughout this article. Its primary advantage is zero vendor dependency — every component is open-source, every signal format is an open standard (OpenTelemetry for traces, Prometheus text format for metrics, JSON for logs), and you can swap any component (replace Jaeger with Grafana Tempo, replace Loki with Elasticsearch) without changing your instrumentation code.

The trade-off is integration effort: you need to define span hierarchies, metric labels, log schemas, and dashboard panels yourself — nothing is pre-built for agent workloads. The four-layer signal taxonomy and all code examples in this article are designed to fill exactly this gap.

OSS Platforms: LangFuse, Arize Phoenix

Best for: Teams that want agent-specific observability features (automatic LLM call tracing, prompt version tracking, evaluation dashboards) without the commercial price tag.

LangFuse (langfuse.com, MIT license) is the most mature OSS LLM observability platform. It provides automatic tracing for LangChain, LlamaIndex, and direct OpenAI/Anthropic API calls; prompt versioning and management; evaluation pipelines; and a built-in UI for exploring traces and scores. You can self-host with Docker or use their cloud tier.

Arize Phoenix (github.com/Arize-AI/phoenix) focuses on embedding-based analysis — it can detect embedding drift, cluster problematic responses, and surface semantic anomalies that metric-based alerting would miss. It exports data in OTel format, so it integrates with your existing Jaeger/Grafana stack rather than replacing it.

Both platforms provide Python SDKs that wrap the LLM call, automatically capturing input/output tokens, latency, model name, and prompt templates — eliminating the need to manually instrument every LLM call with traced_span as we did in Section 3.

Commercial Platforms: LangSmith, Datadog LLM

Best for: Teams that want the fastest time-to-value, have budget for commercial tooling, and prefer a unified platform over assembling OSS components.

LangSmith (by LangChain) provides end-to-end agent observability: tracing, evaluation, prompt management, dataset curation, and A/B testing — all in one platform. Its pricing is per-trace or per-seat, and it offers deep integration with the LangChain/LangGraph ecosystem. If your agents are built on LangChain, LangSmith is the path of least resistance.

Datadog LLM Observability extends Datadog's APM platform with LLM-specific features: token usage tracking, prompt and response clustering, cost attribution per model/team/task, and integration with Datadog's existing alerting and dashboard infrastructure. If your organization already uses Datadog for infrastructure monitoring, adding LLM observability is a configuration change, not a new platform deployment.

Recommendation Flowchart

┌─────────────────────────────────────────────────────────────┐
│            Which Agent Observability Stack Should You Use?   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Do you already have Prometheus + Grafana in production?     │
│      │                                                       │
│      ├── YES ──▶ Do you have strict data sovereignty reqs?   │
│      │              │                                        │
│      │              ├── YES ──▶ Use OSS Stack (this article) │
│      │              │                                        │
│      │              └── NO ───▶ Add LangFuse/Phoenix         │
│      │                          as agent-specific layer on   │
│      │                          top of your existing stack   │
│      │                                                       │
│      └── NO ────▶ What's your monthly budget?                │
│                     │                                        │
│                     ├── < $500 ──▶ Self-host LangFuse        │
│                     │               + Prometheus/Grafana     │
│                     │                                        │
│                     ├── $500–2k ──▶ LangSmith or             │
│                     │               Datadog LLM (trial tier) │
│                     │                                        │
│                     └── > $2k ────▶ LangSmith Enterprise     │
│                                     or Datadog LLM full      │
│                                                              │
│  UNIVERSAL RECOMMENDATION:                                    │
│  Regardless of platform choice, always emit OTel spans.      │
│  OTel is the common denominator — it lets you switch         │
│  platforms without re-instrumenting your agent code.         │
└─────────────────────────────────────────────────────────────┘

The universal recommendation — always emit OTel spans — cannot be overstated. Whether you choose the full OSS stack, LangFuse, or LangSmith, OpenTelemetry instrumentation is a one-time investment that protects you from platform lock-in. All major platforms (LangFuse, Phoenix, Datadog, LangSmith) accept OTel spans via OTLP. Instrument once with OTel, and you can switch backends with a configuration change.
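
The flowchart above can also be read as a small decision function. The sketch below encodes its branches literally. The function name recommend_stack and its parameters are hypothetical, and the returned strings simply restate the flowchart's leaves rather than adding any new recommendation.

```python
from typing import Optional


def recommend_stack(
    has_prometheus_grafana: bool,
    strict_data_sovereignty: bool = False,
    monthly_budget_usd: Optional[float] = None,
) -> str:
    """Walk the recommendation flowchart and return the matching leaf."""
    if has_prometheus_grafana:
        if strict_data_sovereignty:
            return "OSS stack (OTel + Prometheus + Jaeger + Grafana + Loki)"
        return "Existing stack + LangFuse/Phoenix as agent-specific layer"

    # Without Prometheus + Grafana, the flowchart branches on budget.
    if monthly_budget_usd is None:
        raise ValueError("monthly_budget_usd is required on this branch")
    if monthly_budget_usd < 500:
        return "Self-hosted LangFuse + Prometheus/Grafana"
    if monthly_budget_usd <= 2000:
        return "LangSmith or Datadog LLM (trial tier)"
    return "LangSmith Enterprise or Datadog LLM (full)"


regulated_team = recommend_stack(True, strict_data_sovereignty=True)
small_team = recommend_stack(False, monthly_budget_usd=300)
```

Whichever leaf the function returns, the universal recommendation still applies: emit OTel spans regardless of the backend.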

Frequently Asked Questions

1. What are the minimum metrics to start with?

Start with exactly five metrics, grouped into two categories:

Must-have (Day 1):

agent_requests_total{status} — Counter. Tells you volume and error rate. Without this, you don't know if the agent is even running.
agent_request_duration_seconds — Histogram. Tells you latency distribution (P50/P95/P99). The single most useful metric for detecting degradation.
agent_tokens_consumed_total{direction} — Counter. Tells you cost. Token consumption directly maps to your LLM bill.

Next priority (Week 2):

agent_tool_calls_total{tool_name, status} — Counter. Tells you which tools are being used and whether they're failing.
agent_reasoning_steps — Histogram. Tells you how many reasoning rounds each request takes. A sudden increase means the agent is struggling.

These five metrics cover 80% of production debugging scenarios. Everything else — hallucination counters, cost-per-task gauges, tool distribution ratios — can be added incrementally as you discover which signals matter for your specific agent.

2. OpenTelemetry vs. LangSmith — which should I choose?

This is a false dichotomy — they're complementary, not alternatives. The real question is what layer of your stack you use each for:

OpenTelemetry is the instrumentation layer. It's the open standard for generating spans, metrics, and logs from your application code. Use OTel for: defining span hierarchies (Section 3), emitting Prometheus metrics (Section 4), and injecting trace context into logs (Section 6).

LangSmith (or LangFuse, or Datadog LLM) is the storage, visualization, and analysis layer. It receives OTel data and provides agent-specific UIs, evaluation pipelines, and prompt management. LangSmith can ingest OTel spans via its OTLP endpoint — meaning you can instrument with OTel and visualize in LangSmith.

Our recommendation: instrument with OpenTelemetry, choose your backend later. The OTel instrumentation code you write today works with Jaeger, Grafana Tempo, LangFuse, Phoenix, LangSmith, and Datadog — it's a one-time investment that gives you permanent backend flexibility.

3. How much does agent observability infrastructure cost?

Cost depends on scale and tooling choice. Here are realistic estimates for a production agent handling ~100 requests/minute (~4.3M requests/month):

Stack	Monthly Cost	What's Included
OSS (self-hosted)	$200–800	Compute for Prometheus (8GB RAM, 200GB SSD), Jaeger/Tempo (4GB RAM, 500GB SSD), Grafana (2GB RAM), Loki (4GB RAM, 300GB SSD). All run on your existing Kubernetes cluster or VMs.
LangFuse (self-hosted)	$50–200	One additional VM (4GB RAM, 100GB SSD) + PostgreSQL. No per-trace pricing.
LangFuse Cloud	$0–500	Free tier: 50K traces/month. Pro: $99+/month. Team: $499+/month.
LangSmith	$500–2,000	Per-trace pricing. At 4.3M traces/month, roughly $0.0001–0.0005/trace depending on plan.
Datadog LLM	$1,000–5,000	Per-host or per-million-spans pricing. Includes infrastructure monitoring, APM, and LLM observability in one bill.

The OSS stack is the cheapest in dollars but the most expensive in engineering time (2–4 weeks initial setup, ongoing maintenance). Commercial platforms reverse this: higher dollar cost, lower engineering cost. Choose based on whether your constraint is budget or engineering bandwidth.
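
To adapt these estimates to your own traffic, the arithmetic is simple enough to script. The sketch below reproduces the ~4.3M traces/month figure and the per-trace price range from the table. The function names are hypothetical, it assumes one trace per request and a 30-day month, and the prices are this article's estimates rather than vendor quotes.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 30-day month, matching the ~4.3M figure


def monthly_traces(requests_per_minute: int) -> int:
    """One trace per request, sustained for a 30-day month."""
    return requests_per_minute * MINUTES_PER_MONTH


def per_trace_bill(traces: int, price_per_trace_usd: float) -> int:
    """Monthly bill under flat per-trace pricing, rounded to whole dollars."""
    return round(traces * price_per_trace_usd)


traces = monthly_traces(100)           # 4,320,000 traces/month
low = per_trace_bill(traces, 0.0001)   # 432
high = per_trace_bill(traces, 0.0005)  # 2160
```

The result, roughly $430 to $2,160 per month, lines up with the $500–2,000 range quoted for per-trace pricing at this volume.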

4. How do I prevent alert fatigue with agent monitoring?

Alert fatigue — when engineers start ignoring alerts because too many are false positives — is the #1 killer of observability programs. For agent workloads, four strategies are especially important:

Use the 'for' duration aggressively. Never fire an alert on the first data point that crosses a threshold. For P1 alerts, set for: 5m at minimum; for P3 trend alerts, for: 30m. LLM latency and token counts have inherent variance, so a single 2-second spike is not an incident.
Composite conditions, not single thresholds. Don't alert on error_rate > 5% alone. Add a volume guard: AND rate(agent_requests_total[5m]) > 0.1 (rate() is per-second, so this means more than about 6 requests/minute). A 5% error rate at 2 requests/hour is noise; at 200 requests/minute it's an incident.
Severity-based routing. P1 alerts go to PagerDuty (wake someone up). P2 alerts go to a Slack channel (look during business hours). P3 alerts are dashboard annotations only (no notification). The key insight: not every alert deserves a human waking up at 3 AM.
Weekly alert review. Schedule 15 minutes every Monday to review the past week's alerts: Which ones fired? Were they actionable? Which thresholds need tuning? Delete alerts that have never fired or always fire falsely. An alert that fires 20 times a week with no action taken is worse than no alert — it trains engineers to ignore the channel.
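
The first three strategies can be sketched together in a few lines of Python. This is a simplified model of how a 'for' duration and a volume guard interact, not a reimplementation of Prometheus rule evaluation. The names Sample, should_fire, and ROUTES are hypothetical, and the routing destinations are placeholders.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    timestamp: float     # seconds
    error_rate: float    # fraction of failed requests, 0.0 to 1.0
    request_rate: float  # requests per second


def should_fire(
    samples: Sequence[Sample],
    error_threshold: float = 0.05,
    min_request_rate: float = 0.1,
    for_seconds: float = 300.0,
) -> bool:
    """Fire only if the composite condition held continuously for `for_seconds`.

    `samples` must be sorted by timestamp, oldest first.
    """
    breach_started = None
    for s in samples:
        breaching = (
            s.error_rate > error_threshold and s.request_rate > min_request_rate
        )
        if not breaching:
            breach_started = None         # condition cleared: reset the timer
        elif breach_started is None:
            breach_started = s.timestamp  # first breaching sample in this run
    if breach_started is None or not samples:
        return False
    return samples[-1].timestamp - breach_started >= for_seconds


# Severity-based routing (destinations are placeholders).
ROUTES = {"P1": "pagerduty", "P2": "slack", "P3": "dashboard-annotation"}

# One sample per minute over five minutes.
sustained = [Sample(t, 0.08, 3.0) for t in range(0, 301, 60)]
blip = [Sample(t, 0.08 if t == 120 else 0.01, 3.0) for t in range(0, 301, 60)]
quiet = [Sample(t, 0.50, 0.0005) for t in range(0, 301, 60)]
```

Here should_fire(sustained) is true, while the single-sample blip and the high-error but near-zero-traffic series both stay silent, which is the behavior the strategies above aim for.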

5. How does observability relate to audit logging and evaluation?

Observability, audit logging, and evaluation are three distinct but interconnected systems. Understanding the boundaries between them prevents scope creep and architectural confusion:

System	Purpose	Data Stored	Retention	Article
Observability (this article)	Answer "is the agent working right now?"	Metrics (aggregated), span attributes (key-value), operational logs (ERROR/WARN)	Days to weeks (metrics: 30d, traces: 7d)	—
Audit Logging	Answer "what exactly did the agent do for request X?"	Full LLM prompts, tool parameters, tool return values, approval decisions, user context	Months to years (compliance: 7yr)	Audit Log Design
Evaluation	Answer "how good are the agent's answers?"	Human ratings, LLM-as-judge scores, task success/failure labels, golden dataset results	Permanent (store evaluation results alongside model versions)	Evaluation Framework

The three systems are linked by trace_id: observability tells you which requests are problematic, audit logs give you the full replay of those requests, and evaluation tells you how good the answers were. Together, they form a complete understanding of agent behavior in production.

6. I already have Prometheus + Grafana — what changes for agents?

Great news: your existing infrastructure is 70% of the solution. Here's what you need to add (not replace):

New metric definitions. Add the agent-specific metrics from Section 4 to your existing /metrics endpoint. Use a custom CollectorRegistry to keep agent metrics separate from your existing service metrics — this prevents naming collisions and makes it easy to build agent-specific dashboards.
New Grafana dashboards. Build two new dashboards (don't clutter your existing service dashboard with agent panels): (a) Agent Overview — L1+L2 signals (request rate, latency, token cost), and (b) Agent Deep-Dive — L3+L4 signals (tool success rate, reasoning steps, task completion). Import the JSON dashboard templates referenced in Section 4.
New alert rules. Add the agent-alerts.yml rule file from Section 5 to your Prometheus config. These rules use your existing Alertmanager — no new infrastructure needed.
New scrape targets. Add your agent service endpoints to prometheus.yml as additional scrape targets alongside your existing microservice targets.
Optional: Tracing. If you don't already have distributed tracing (Jaeger/Tempo), add it. Agent debugging without traces is like debugging microservices without stack traces — possible, but painful. Start with a single Docker Jaeger instance (Phase 1 in Section 7).

The total effort for an existing Prometheus+Grafana user: approximately 2 weeks (Phase 0–2 from Section 7). You're not building a new observability stack — you're extending your existing one with agent-specific signals.

Next Steps

Agent observability is one piece of the production AI agent puzzle. These articles cover the adjacent concerns you'll need to address:

Agent Audit Log Design — Full-fidelity logging of LLM prompts, tool parameters, and approval decisions for compliance and debugging. The audit log is the "full replay" companion to the observability "live dashboard."
Agent Evaluation Framework — Systematic evaluation of agent answer quality, including hallucination detection, task completion scoring, and online vs. offline evaluation strategies.
Agent Error Recovery — Retry strategies, fallback chains, and graceful degradation patterns for when LLM calls and tool invocations fail. Directly feeds into the recovery_attempt_count metric defined in this article.
Agent Tool Permission Control — Defining which tools an agent can call under which conditions. Tool call observability (Section 4) and tool permission enforcement are two sides of the same coin.
Agent Security Evaluation — Prompt injection testing, tool abuse detection, and security boundary validation. Security incidents often first surface as observability anomalies — a hallucination spike can indicate a prompt injection attack.
MCP Protocol Production Guide — Running Model Context Protocol servers in production with full observability coverage. Every MCP tool call should produce an agent.tool.execute span and Prometheus metrics.

This article is Part 7 of the AI Agent Production Engineering series. Each article is self-contained but they form a complete operational manual when read together.