LLM Observability in 2026: Tracing, Evals & Cost Tracking for AI Agents
The definitive guide to observability for LLM and agent applications in 2026 -- distributed tracing with traces, spans and generations, the OpenTelemetry GenAI semantic conventions, instrumenting agents with the Langfuse SDK, tracking tokens, cost and latency, LLM-as-judge evaluation, drift and regression detection, prompt management, and production dashboards and alerting. Covers Langfuse, LangSmith, Arize Phoenix and Helicone.
Table of Contents
- What Is LLM Observability?
- Why LLMs Break Traditional Monitoring
- Core Concepts: Traces, Spans, Generations, Scores
- Instrumenting with the Langfuse SDK
- OpenTelemetry GenAI Semantic Conventions
- Instrumenting Agents & Multi-Step Workflows
- Cost, Latency & Token Tracking
- Evals & LLM-as-Judge
- Tool Comparison
- Dashboards, Alerting & Drift Detection
- Prompt Management & Versioning
1. What Is LLM Observability?
LLM observability is the practice of instrumenting, capturing, and analyzing what happens inside an LLM or agent application so you can debug it, control its cost, and prove its quality. A single user request to a modern agent can fan out into dozens of steps: prompt construction, retrieval, multiple model calls, tool executions, retries, and post-processing. Traditional application monitoring records the HTTP request and a status code; LLM observability records the full non-deterministic execution tree -- every prompt, every completion, every token count, every latency, and a quality score for the final answer.
The discipline rests on three pillars. Tracing reconstructs the causal tree of an execution so you can see exactly which prompt produced which output and where time and money were spent. Metrics aggregate cost, latency, token usage, and error rates across thousands of requests. Evaluation attaches quality scores -- from LLM-as-judge, code assertions, or human review -- to traces so you can measure whether the system is actually getting better or worse. Together they turn a black-box model call into a debuggable, measurable system.
The 2026 ecosystem has consolidated around a small set of tools -- Langfuse (open-source, MIT), LangSmith (LangChain's managed platform), and Arize Phoenix (source-available under the Elastic License 2.0, built on OpenInference) -- plus a fast-maturing standard, the OpenTelemetry GenAI semantic conventions, that lets any tracing backend ingest LLM telemetry. This composability means observability is a cross-cutting layer that sits alongside your agent framework rather than being locked to it: you instrument once and can send the same traces to multiple backends.
2. Why LLMs Break Traditional Monitoring
Classic APM tooling assumes deterministic code: the same input yields the same output, latency is stable, and a 200 status code means success. LLM applications violate every one of those assumptions. The same prompt can return different answers on each call, latency swings with output length and model load, and a request can return HTTP 200 while the content is a hallucination, a refusal, or malformed JSON. A green dashboard tells you nothing about whether the model is actually doing its job.
Three properties make LLM systems uniquely hard to observe. Non-determinism means you cannot reproduce a bug by replaying an input -- you need the exact captured prompt, model, parameters, and completion. Cost is per-token and unbounded: a runaway agent loop or a bloated context window can multiply spend by 10x with no code change, so token counts and cost must be first-class telemetry, not an afterthought. Quality is subjective and drifts: a prompt that worked last month can silently degrade when a provider updates a model behind the same version string, and there is no exception to catch -- only a slow decline in answer quality that you can measure only if you are scoring outputs.
This is why LLM observability adds a dedicated data model on top of ordinary tracing: a generation (a single model call with its prompt, completion, token usage, and cost), a trace (the full request), scores (quality signals attached to any span), and sessions and users (to group multi-turn conversations). The rest of this guide shows how to capture that data with real SDKs, standardize it with OpenTelemetry, evaluate it, and alert on it in production.
3. Core Concepts: Traces, Spans, Generations, Scores
Most LLM observability tools share broadly the same underlying data model, borrowed from distributed tracing and specialized for LLMs. Master these four primitives and you can read any Langfuse, LangSmith, or Phoenix trace view. They map cleanly onto OpenTelemetry: a trace is a trace, a span is a span, a generation is a span with LLM-specific attributes, and a score is an attribute or event attached to a span.
Trace
The top-level record of one request through your application -- a chat turn, an API call, or an agent run. A trace has an input, an output, a total latency, an aggregated cost, and metadata (user, session, tags, environment). It is the unit you filter, search, and share. Everything else nests inside it. In Langfuse a trace also carries a stable trace_id you can attach to your own logs to jump from a business event straight to the execution.
Span (Observation)
A span is any nested unit of work inside a trace: a retrieval step, a tool call, a parsing function, or a whole sub-agent. Spans form a tree via parent-child relationships and each records its own start/end time, input, output, and status. This is how you localize a bug -- you drill from the slow or failing trace down to the exact span that caused it, seeing the input that broke it.
Generation
A generation is a specialized span for a single model call. Beyond timing it captures the model name, parameters (temperature, max tokens, tools), the full prompt messages, the completion, and token usage (input, output, cached, reasoning). From token counts and a price table the platform derives cost automatically. Generations are where prompt-engineering debugging happens: you see the exact rendered prompt, not your template.
Scores
A score is a quality signal attached to a trace or span: a numeric value, a boolean pass/fail, or a categorical label, with an optional comment. Scores come from LLM-as-judge evaluators, deterministic code checks (valid JSON, contains citation), explicit user feedback (thumbs up/down), or human annotation. Scores are what make quality measurable over time and are the foundation for evals, regression gates, and drift alerts.
# pip install langfuse openai  (Langfuse v3 Python SDK, OpenTelemetry-based)
import openai
from langfuse import get_client, observe

langfuse = get_client()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from env

# vector_store and build_prompt are your own application helpers (not shown).

@observe()  # this function becomes the root trace
def answer_question(question: str) -> str:
    docs = retrieve(question)  # a nested span (see @observe below)
    completion = call_llm(question, docs)
    # attach a quality score to the current trace
    langfuse.score_current_trace(name="has_citation",
                                 value=1 if "[" in completion else 0)
    return completion

@observe()  # nested span, auto-parented via OTEL context
def retrieve(question: str) -> list[str]:
    return vector_store.search(question, k=4)

@observe(as_type="generation")  # mark as an LLM generation
def call_llm(question: str, docs: list[str]) -> str:
    prompt = build_prompt(question, docs)
    resp = openai.chat.completions.create(
        model="gpt-5.1", messages=prompt, temperature=0.2)
    # record model + token usage so cost is computed automatically
    langfuse.update_current_generation(
        model="gpt-5.1",
        usage_details={"input": resp.usage.prompt_tokens,
                       "output": resp.usage.completion_tokens})
    return resp.choices[0].message.content
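To make the four primitives concrete without any SDK, here is a minimal sketch of the data model as plain Python dataclasses. The class and field names (Span, Generation, Score, Trace, total_tokens) are illustrative assumptions, not any vendor's actual schema. The sketch only shows that a generation is a span with extra fields and that token usage rolls up the tree.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Score:
    name: str
    value: float
    comment: str = ""


@dataclass
class Span:
    name: str
    children: list[Span] = field(default_factory=list)
    scores: list[Score] = field(default_factory=list)

    def total_tokens(self) -> int:
        # A plain span has no tokens of its own; it sums its children.
        return sum(child.total_tokens() for child in self.children)


@dataclass
class Generation(Span):
    # A generation is a span that also records the model and token usage.
    model: str = ""
    input_tokens: int = 0
    output_tokens: int = 0

    def total_tokens(self) -> int:
        own = self.input_tokens + self.output_tokens
        return own + super().total_tokens()


@dataclass
class Trace:
    trace_id: str
    root: Span
    session_id: str | None = None
    user_id: str | None = None


def build_example_trace() -> Trace:
    # Hypothetical request: one retrieval span and one model call.
    retrieve = Span(name="retrieve")
    call = Generation(name="call_llm", model="example-model",
                      input_tokens=812, output_tokens=47)
    root = Span(name="answer_question", children=[retrieve, call],
                scores=[Score(name="has_citation", value=1.0)])
    return Trace(trace_id="t-1", root=root, user_id="u_4417")
```

Calling build_example_trace().root.total_tokens() walks the tree and sums usage from every generation beneath the root, which is the same roll-up a trace view performs when it shows total tokens per trace.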
4. Instrumenting with the Langfuse SDK
Langfuse is one of the most widely adopted open-source (MIT) LLM observability platforms, and its Python SDK v3 is built directly on OpenTelemetry -- so instrumenting your code also produces standard OTEL spans. There are three ways to instrument, and you can freely mix them in the same application. Pick the lightest one that captures what you need, then drop to the lower levels for the spans that matter most.
Three instrumentation styles, from least to most code:
Decorators (@observe)
The fastest path: annotate any function with @observe() and it becomes a span; the top-most one becomes the trace. Nesting is automatic via OpenTelemetry context propagation, so you get a correct tree without passing IDs around. Add as_type="generation" for model calls. Works in sync and async code. Best for your own business logic and glue functions.
Integrations & auto-instrumentation
Use the wrapped OpenAI client (from langfuse.openai import openai) or the LangChain CallbackHandler to capture model calls, prompts, and token usage with zero manual code. Because the SDK is OTEL-native, any OpenTelemetry or OpenInference instrumentation (Anthropic, LlamaIndex, the Vercel AI SDK) also flows into Langfuse. Best for framework and provider calls you do not want to wrap by hand.
Low-level context managers
For maximum control use langfuse.start_as_current_span() / start_as_current_generation() context managers to create spans explicitly, set inputs/outputs, usage, and metadata, and link scores. This is what you reach for inside hot paths, custom retry logic, or streaming handlers where you need to record time-to-first-token separately from total latency.
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or your self-hosted URL

# 1) Drop-in: the wrapped client traces every call automatically
from langfuse.openai import openai

resp = openai.chat.completions.create(
    model="gpt-5.1",
    messages=[{"role": "user", "content": "Summarize this ticket."}],
)

# 2) Low-level: explicit span with manual usage + a score
from langfuse import get_client

langfuse = get_client()
with langfuse.start_as_current_generation(
        name="classify", model="claude-sonnet-4-5") as gen:
    out = call_model(...)
    gen.update(output=out,
               usage_details={"input": 812, "output": 47})
    gen.score(name="valid_label", value=1)

langfuse.flush()  # ensure spans are exported before the process exits
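The streaming case mentioned above, where time-to-first-token is recorded separately from total latency, can be sketched without any SDK. The helper names (measure_stream, fake_stream) and the injectable clock are assumptions made for illustration; a real handler would iterate over the provider's streaming response instead of the stub generator.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and record TTFT separately from total latency."""
    start = clock()
    first_token_at: Optional[float] = None
    parts: list[str] = []
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = clock()  # the first chunk has just arrived
        parts.append(chunk)
    end = clock()
    return {
        "output": "".join(parts),
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
        "chunks": len(parts),
    }


def fake_stream() -> Iterator[str]:
    # Stand-in for a provider's streaming response (hypothetical).
    for piece in ["Hel", "lo", " world"]:
        yield piece


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

The returned timings would then be written onto the generation inside the context manager. Langfuse generations have a field for the completion start time, but check the SDK reference for the exact parameter before relying on it.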
5. OpenTelemetry GenAI Semantic Conventions
The OpenTelemetry GenAI semantic conventions are the emerging vendor-neutral standard for LLM telemetry. They define a common set of span names and attribute keys so that a model call instrumented once can be understood by any compliant backend -- Langfuse, Phoenix, Grafana, Datadog, Honeycomb -- without custom mapping. As of 2026 the conventions remain in Development status overall, but the client (model-call) spans stabilized in early 2026 and agent/tool spans, while still experimental, have been stable in practice through the year. Opt into the newest attributes with the environment variable OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental.
The convention is organized around a small vocabulary of attributes and span kinds:
gen_ai.* attributes
The core keys are gen_ai.system (the provider, e.g. openai, anthropic; newer revisions of the conventions rename this key to gen_ai.provider.name), gen_ai.operation.name (chat, embeddings), gen_ai.request.model and gen_ai.response.model, request parameters like gen_ai.request.temperature and gen_ai.request.max_tokens, and the all-important usage counters gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. Standardizing these keys is what lets one dashboard aggregate cost across every provider.
Span kinds
The conventions name three levels of span: inference spans for a single model call (named chat {model}), execute_tool spans for tool/function calls, and invoke_agent spans for an agent step. Nesting these correctly reproduces the agent's reasoning tree, so a reader can follow which model call decided to call which tool with which arguments.
Events & content capture
Prompt and completion content is verbose and sensitive, so the conventions carry it as structured span events (or, in newer revisions, as the gen_ai.input.messages / gen_ai.output.messages attributes) that you can disable for PII. This separation lets you keep cheap metric attributes (model, tokens, latency) always-on while gating expensive full-payload capture behind a sampling or redaction policy.
# Emit GenAI-convention spans with the vanilla OpenTelemetry SDK.
# Any OTEL backend (incl. Langfuse's /otel endpoint) can ingest these.
import base64
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Langfuse's OTLP endpoint expects HTTP Basic auth built from the project keys;
# without this header the export is rejected.
auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(
        endpoint="https://cloud.langfuse.com/api/public/otel/v1/traces",
        headers={"Authorization": f"Basic {auth}"})))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-app")

with tracer.start_as_current_span("chat gpt-5.1") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gpt-5.1")
    resp = call_openai(...)  # your own wrapper around the provider SDK
    span.set_attribute("gen_ai.usage.input_tokens", resp.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", resp.usage.completion_tokens)
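The split between always-on metric attributes and gated content capture can be sketched as a small attribute builder. This is an illustration under stated assumptions: the function names (genai_attributes, redact) are hypothetical, the redaction policy handles only e-mail addresses, and the message payload is serialized as plain role/content pairs, which is simpler than the structured message format in the conventions.

```python
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace e-mail addresses with a placeholder (a deliberately simple policy)."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def genai_attributes(provider: str, model: str, input_tokens: int,
                     output_tokens: int, messages: list[dict],
                     capture_content: bool = False) -> dict:
    # Cheap metric attributes are always recorded.
    attrs = {
        "gen_ai.system": provider,
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Full prompt content is opt-in and redacted before it leaves the process.
    if capture_content:
        safe = [{**m, "content": redact(m["content"])} for m in messages]
        attrs["gen_ai.input.messages"] = json.dumps(safe)
    return attrs


if __name__ == "__main__":
    msgs = [{"role": "user", "content": "Mail me at ana@example.com"}]
    print(genai_attributes("openai", "gpt-5.1", 10, 5, msgs, capture_content=True))
```

The returned dictionary can be applied to a span with one set_attribute call per key. In practice capture_content would be driven by a sampling decision or an environment flag rather than a hard-coded argument.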
6. Instrumenting Agents & Multi-Step Workflows
Agents are the hardest thing to observe because a single request explodes into a variable-depth tree of reasoning steps, tool calls, and sub-agents. The goal is a trace whose shape mirrors the agent's actual control flow, so a slow run or a wrong answer can be traced to the exact step that caused it. Three techniques cover most agent stacks, and they layer on top of the SDK styles from section 4.
Nested spans for steps & tools
Wrap the agent loop in a root span and each iteration -- model call, tool execution, sub-agent -- in a child span. Record the tool name, arguments, and result on execute_tool spans and the reasoning on the model span. The resulting tree lets you answer the two questions that matter for agents: why did it pick that tool, and which step burned the tokens or the time.
Framework auto-instrumentation
Most agent frameworks emit telemetry you can capture without hand-wiring. LangGraph and LangChain flow through the Langfuse CallbackHandler; CrewAI, the OpenAI Agents SDK, and LlamaIndex are covered by OpenInference/OTEL instrumentors; the Claude Agent SDK exposes its own hooks. Enable the instrumentor once and every agent step becomes a span.
Sessions, users & metadata
Multi-turn agents need traces grouped into sessions and tagged with a user id so you can replay a whole conversation and compute per-user cost. Attach environment, release version, and business identifiers as metadata. This is also how distributed agents stay coherent: propagate the trace context across service and process boundaries so a sub-agent on another machine nests under the same trace.
# LangGraph / LangChain agent -> Langfuse with zero manual spans
from langfuse.langchain import CallbackHandler

handler = CallbackHandler()
result = agent.invoke(
    {"messages": [("user", "Book the cheapest flight to Bogota")]},
    config={
        "callbacks": [handler],
        # group turns + attribute cost per user
        "metadata": {
            "langfuse_session_id": "conv-2291",
            "langfuse_user_id": "u_4417",
            "langfuse_tags": ["prod", "travel-agent"],
        },
    },
)

# Manual nesting for a custom loop (OTEL context auto-parents children)
from langfuse import get_client

langfuse = get_client()
with langfuse.start_as_current_span(name="agent-run", input=user_msg) as root:
    for step in range(max_steps):
        # tool and args come from your own planning step (not shown)
        with langfuse.start_as_current_span(name=f"tool:{tool.name}",
                                            input=args) as tool_span:
            tool_span.update(output=tool.run(args))
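Propagating trace context across process boundaries usually means forwarding a W3C Trace Context traceparent header. The sketch below builds and parses that header by hand to show what is being carried; the helper names are hypothetical, and it skips parts of the specification such as rejecting all-zero IDs. In real code the OpenTelemetry propagation API (inject and extract in opentelemetry.propagate) does this for you.

```python
import re
import secrets

# version "00", 32 hex chars of trace id, 16 hex chars of span id, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new sampled trace."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flag 01 means sampled


def parse_traceparent(header: str) -> dict:
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"invalid traceparent: {header!r}")
    trace_id, parent_span_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_span_id,
            "sampled": bool(int(flags, 16) & 1)}


def child_traceparent(header: str) -> str:
    """Header a sub-agent sends onward: same trace id, fresh span id."""
    ctx = parse_traceparent(header)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    parent = new_traceparent()
    print(parent)
    print(child_traceparent(parent))
```

Because the trace ID is preserved and only the span ID changes, a sub-agent on another machine that starts its spans from the received header nests under the original trace.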
7. Cost, Latency & Token Tracking
Because LLM spend is metered per token and grows without any code change, cost and latency telemetry has to be built in from day one, not bolted on after the first surprise invoice. Once every generation records its model and token usage, the platform derives cost, and you can slice it by user, feature, prompt version, or environment to find exactly where the money and the milliseconds go.
Three signals to capture on every generation, plus how to aggregate them:
Token usage
Record input, output, cached, and (for reasoning models) reasoning tokens separately. Provider SDKs return these in the response usage object; auto-instrumentation captures them for you. Cached input tokens are billed at a fraction of the full price, so tracking them separately is what makes your cost numbers actually match the invoice.
Cost from a price model
Cost is derived, not measured: the platform maps (model, token type) to a per-token price and multiplies. Langfuse ships a maintained price table and lets you define custom models and prices for self-hosted or fine-tuned deployments. Because cost is computed from the recorded model string, keeping that string accurate is the whole game.
Latency & time-to-first-token
Total span duration is not enough for streaming UIs. Capture time-to-first-token (TTFT) separately from total generation time, since TTFT is what users perceive as responsiveness. Track output tokens per second to spot a degraded provider, and watch p95/p99 latency, not the average -- tail latency is where agent workflows time out.
Aggregation & attribution
The payoff is grouping: cost per user to find your whales, cost per feature to justify a model swap, cost per prompt version to catch a regression that doubled context size. Dashboards aggregate these from the raw generations, and metric APIs export them to Grafana or a data warehouse for finance-grade reporting.
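The arithmetic behind derived cost, per-user attribution, and tail latency fits in a few lines. This is a sketch under stated assumptions: the prices in PRICES_PER_MTOK are made-up placeholders, the model name is hypothetical, and cached tokens are treated as a subset of input tokens, which matches how some providers report usage but not necessarily all of them.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical USD prices per one million tokens; real prices vary by model and date.
PRICES_PER_MTOK = {
    "example-model": {"input": 2.00, "cached_input": 0.50, "output": 8.00},
}


def generation_cost(model: str, input_tokens: int, output_tokens: int,
                    cached_tokens: int = 0) -> float:
    """cached_tokens is the part of input_tokens served from the prompt cache."""
    price = PRICES_PER_MTOK[model]
    uncached = input_tokens - cached_tokens
    total = (uncached * price["input"]
             + cached_tokens * price["cached_input"]
             + output_tokens * price["output"])
    return total / 1_000_000


def cost_per_user(generations: list[dict]) -> dict[str, float]:
    """Sum derived cost over raw generation records, grouped by user."""
    totals: dict[str, float] = defaultdict(float)
    for g in generations:
        totals[g["user_id"]] += generation_cost(
            g["model"], g["input_tokens"], g["output_tokens"],
            g.get("cached_tokens", 0))
    return dict(totals)


def p95(latencies_s: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(latencies_s, n=20)[-1]
```

Dropping the cached_tokens argument in the first function prices every input token at the full rate, which is exactly the mismatch against the invoice that separate tracking avoids. The same grouping works for feature, prompt version, or environment by changing the key.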
8. Evals & LLM-as-Judge
Tracing tells you what happened; evaluation tells you whether it was any good. Evals run in two modes. Offline evals run before you ship: you assemble a dataset of inputs (and often expected outputs), run your app over every item as an experiment, score the results, and compare against the last version -- a quality gate for CI/CD. Online evals run continuously on live production traces, sampling real traffic and scoring it so you catch regressions the moment they reach users. Every mature platform (Langfuse, LangSmith, Phoenix) supports both.
Scores come from three kinds of evaluator. Deterministic code checks the cheap, objective things -- valid JSON, contains a citation, exact match, latency under budget. LLM-as-judge uses a strong model with a rubric to score the subjective things -- faithfulness to the retrieved context, relevance, tone, helpfulness -- and returns a score plus a written rationale you can audit. Human annotation is the ground truth you calibrate the judge against and reserve for the ambiguous cases. Combine all three: use code where you can, the judge where you must, and humans to keep the judge honest.
LLM-as-judge is powerful but has known failure modes you must design around: position bias (favoring the first answer shown), verbosity bias (favoring longer answers), and self-preference (a model rating its own family higher). Mitigate by using a different, strong model as judge, giving it a concrete rubric with few-shot examples, asking for a rationale before the score, and periodically checking judge-vs-human agreement. Treat the judge itself as a system you evaluate, not an oracle you trust blindly.
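One common way to handle position bias in pairwise judging is to ask twice with the answer order swapped and accept only a consistent verdict. The sketch below shows that pattern with stub judges so it runs offline; the PairJudge signature and the "first"/"second" return values are assumptions made for illustration, and a real judge would wrap a model call.

```python
from typing import Callable

# A pairwise judge takes (question, first_answer, second_answer)
# and returns "first" or "second".
PairJudge = Callable[[str, str, str], str]


def debiased_preference(judge: PairJudge, question: str,
                        answer_a: str, answer_b: str) -> str:
    """Ask twice with the order swapped; accept only a consistent verdict."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # the verdict flipped with the order, so treat it as position bias


def always_first(question: str, first: str, second: str) -> str:
    # Stub judge with maximal position bias, for demonstration only.
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    # Stub judge that ignores order but shows verbosity bias.
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    print(debiased_preference(always_first, "q", "x", "y"))
    print(debiased_preference(longer_wins, "q", "long answer", "short"))
```

The second stub passes the order-swap check while still preferring the longer answer, a reminder that swapping only addresses position bias. Verbosity bias and self-preference need the rubric and the judge-versus-human checks described above.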
# Offline experiment: run a dataset, score each item with an LLM judge (Langfuse)
import openai
from langfuse import get_client

langfuse = get_client()
dataset = langfuse.get_dataset("qa-regression-v3")

def faithfulness_judge(query, answer, context) -> float:
    verdict = openai.chat.completions.create(
        model="gpt-5.1", temperature=0,
        messages=[{"role": "system",
                   "content": "Score 0-1 how fully the ANSWER is supported "
                              "by CONTEXT. Give a one-line reason, then the number."},
                  {"role": "user",
                   "content": f"Q:{query}\nCONTEXT:{context}\nANSWER:{answer}"}])
    # parse_score is your own helper that extracts the trailing number
    return parse_score(verdict.choices[0].message.content)

for item in dataset.items:
    with item.run(run_name="prompt-v12") as root:  # links trace to the dataset run
        # assumes my_app returns the answer together with the retrieved context
        out, out_context = my_app(item.input)
        root.score(name="faithfulness",
                   value=faithfulness_judge(item.input, out, out_context))

# Compare run "prompt-v12" vs "prompt-v11" in the UI to catch regressions.
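The experiment relies on a parse_score helper that the source does not show. One possible version is sketched below, together with a simple judge-versus-human agreement check. Both functions are assumptions: the parser takes the last number in the reply and rejects anything outside 0 to 1, and agreement is measured by thresholding the judge score against a human pass/fail label.

```python
import re


def parse_score(text: str) -> float:
    """Extract the last number in the judge's reply and validate its range."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    if not numbers:
        raise ValueError("judge reply contains no number")
    score = float(numbers[-1])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


def agreement_rate(judge_scores: list[float], human_labels: list[bool],
                   threshold: float = 0.5) -> float:
    """Fraction of items where the thresholded judge score matches the human label."""
    if len(judge_scores) != len(human_labels) or not judge_scores:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum((score >= threshold) == bool(label)
               for score, label in zip(judge_scores, human_labels))
    return hits / len(judge_scores)


if __name__ == "__main__":
    print(parse_score("Two of three claims are supported.\n0.67"))
    print(agreement_rate([0.9, 0.2, 0.7, 0.4], [True, False, False, False]))
```

Raising on an unparseable reply is a deliberate choice: a silent default score would hide judge failures inside the average. A structured-output response format, where the provider supports it, makes the parsing step more robust than a regular expression.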
9. Tool Comparison
The observability landscape has consolidated in 2026 around a few tools with different trade-offs on licensing, instrumentation model, and eval depth. Here is how the major options compare:
| Dimension | Langfuse | LangSmith | Arize Phoenix | Helicone | MLflow Tracing |
|---|---|---|---|---|---|
| License / hosting | MIT open source; self-host or cloud | Proprietary; managed cloud + self-host enterprise | Elastic License 2.0; self-host or Arize cloud | Open source; cloud + self-host (maintenance mode) | Apache 2.0; self-host or managed |
| Instrumentation | SDK (@observe), integrations, OTEL-native | LangChain callbacks, SDK, OTEL ingest | OpenInference/OTEL auto-instrumentors | Proxy/gateway (swap base URL) + async | Autolog + OTEL-compatible tracing |
| Evals & LLM-judge | Built-in judge, datasets, experiments | Deep: datasets, evaluators, experiments | 50+ research-backed metrics, judge | Basic scoring / feedback | GenAI eval + scorers |
| Prompt management | Versioned prompts + playground | Prompt Hub + playground | Prompt playground / versioning | Prompt tracking | Prompt registry |
| Cost tracking | Maintained price table + custom models | Per-run token & cost | Token & cost per span | Strong (its original focus) | Token usage; cost via config |
| OTEL / standards | Native OTEL + GenAI conventions | OTEL ingest supported | Owns OpenInference conventions | OTEL export available | OTEL-compatible |
| Best for | Self-hosted, full-stack OSS + prompt mgmt | LangChain/LangGraph-native teams | Notebook & eval-heavy, drift analysis | Quick gateway-level cost/log visibility | Teams already on the MLflow platform |
These tools are increasingly interoperable because they converge on the same OpenTelemetry/OpenInference wire format -- you can instrument once and fan traces out to more than one backend. Choose by your primary constraint: Langfuse for a permissive (MIT) self-hosted stack with tracing, evals, and prompt management in one place; LangSmith if your team lives in LangGraph/LangChain and wants the deepest native integration; Arize Phoenix for eval- and drift-heavy, notebook-driven workflows; Helicone for the fastest gateway-level cost and log visibility (now in maintenance mode after its 2026 acquisition, so weigh long-term support); and MLflow Tracing if you already run the MLflow platform. Whichever you pick, instrumenting to the GenAI conventions keeps you portable.
10. Dashboards, Alerting & Drift Detection
Instrumentation and evals are inputs; the operational payoff is dashboards that surface health at a glance, alerts that page you before users complain, and drift detection that catches the slow quality decay unique to LLMs. These close the loop from raw traces back to action.
Production dashboards
Track the vital signs on one screen: request volume, error rate, p50/p95/p99 latency and TTFT, total and per-user cost, token throughput, and average eval scores over time -- sliced by model, prompt version, and environment. Langfuse and Phoenix ship built-in dashboards; for a single pane of glass, export metrics via the OTEL/metrics API into Grafana alongside the rest of your infrastructure.
Alerting & SLOs
Define SLOs and alert on breaches: cost per hour above a budget (runaway-loop guard), error or timeout rate above threshold, p95 latency regression, or an online eval score dropping below a floor. Route alerts to Slack/PagerDuty. The most valuable LLM-specific alert is a quality alert -- a sustained drop in judge scores -- because nothing else will tell you the model got worse.
Drift detection
LLM systems degrade silently when input distributions shift or a provider updates a model behind a stable version string. Detect it by monitoring embedding-space drift of inputs and outputs, tracking eval-score trends, and comparing production distributions to your reference dataset. Phoenix specializes in embedding drift analysis; a rising drift metric is your early warning to re-evaluate and re-tune before users notice.
Regression gates in CI
Wire offline evals into your pipeline so a prompt or model change cannot merge if it lowers scores on the golden dataset. Run the dataset experiment on every PR, fail the build when the aggregate score drops beyond a tolerance, and post the diff of newly-failing cases. This turns "did my prompt tweak break anything?" from a vibe check into an automated quality gate.
# Regression gate: fail CI when a prompt change lowers the eval score
import os
import statistics
import sys

from langfuse import get_client

GIT_SHA = os.environ.get("GIT_SHA", "local")  # assumed to be set by your CI system
langfuse = get_client()
dataset = langfuse.get_dataset("qa-regression-v3")
scores = []
for item in dataset.items:
    with item.run(run_name=f"ci-{GIT_SHA}") as root:
        # my_app and faithfulness_judge are the helpers from the section 8 example;
        # my_app is assumed to return the answer plus the retrieved context
        out, out_context = my_app(item.input)
        s = faithfulness_judge(item.input, out, out_context)
        root.score(name="faithfulness", value=s)
        scores.append(s)

mean = statistics.mean(scores)
BASELINE, TOLERANCE = 0.86, 0.03
if mean < BASELINE - TOLERANCE:
    print(f"FAIL: faithfulness {mean:.3f} < {BASELINE - TOLERANCE:.3f}")
    sys.exit(1)  # blocks the merge
print(f"OK: faithfulness {mean:.3f}")
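The quality alert and the drift check described in this section both reduce to small computations over recent data. The sketch below is one simple way to express them, not how any particular platform implements them: sustained_drop fires only when the mean of a recent window falls below a floor, and embedding_drift measures the cosine distance between the centroids of a reference set and a production set. All function names and the window size are assumptions.

```python
import math
from statistics import mean


def sustained_drop(scores: list[float], floor: float, window: int = 5) -> bool:
    """True when the mean of the most recent `window` scores is below the floor."""
    if len(scores) < window:
        return False  # not enough data to call the drop sustained
    return mean(scores[-window:]) < floor


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a set of equal-length vectors."""
    return [mean(component) for component in zip(*vectors)]


def cosine_distance(u: list[float], v: list[float]) -> float:
    # Assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def embedding_drift(reference: list[list[float]],
                    production: list[list[float]]) -> float:
    """Cosine distance between the centroids of the two embedding sets."""
    return cosine_distance(centroid(reference), centroid(production))
```

Averaging over a window keeps a single bad judge score from paging anyone. Centroid distance is a coarse signal that can miss shifts which leave the mean unchanged, so it works best alongside eval-score trends rather than as a replacement for them.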
11. Prompt Management & Versioning
Prompts are the source code of an LLM app, yet they are too often hard-coded and edited in production without a paper trail. Prompt management pulls them into a versioned registry that lives next to your observability data. You author and label prompts (for example production vs staging), fetch them at runtime by name and label, and -- crucially -- every trace records which prompt version produced it. That link is what makes the rest of observability actionable: when eval scores drop or cost spikes, you can see exactly which prompt version is responsible and roll back a label without shipping code.
In Langfuse the client caches fetched prompts and refreshes them in the background, so runtime overhead is negligible and your app keeps serving the last known-good prompt even if the API is briefly unavailable. Combined with the dataset experiments from section 8, prompt versioning gives you a full change-management workflow: edit a prompt in the playground, run it against the golden dataset offline, compare scores to the current production version, promote the label if it wins, and watch the online eval scores confirm the improvement in production -- all without a redeploy. This is the loop that separates teams that iterate on prompts safely from teams that break production with a "quick" wording change.
import openai
from langfuse import get_client

langfuse = get_client()

# Fetch the version currently labelled "production" (cached + auto-refreshed).
# type="chat" assumes the prompt is stored as a chat prompt, so compile()
# returns a list of messages rather than a single string.
prompt = langfuse.get_prompt("support-classifier", type="chat", label="production")

# Compile the template with variables
compiled = prompt.compile(ticket=ticket_text, categories=categories)

# Link the generation to this exact prompt version so the trace shows it
with langfuse.start_as_current_generation(
    name="classify",
    model="gpt-5.1",
    prompt=prompt,  # associates trace <-> prompt version
) as gen:
    resp = openai.chat.completions.create(
        model="gpt-5.1", messages=compiled)
    gen.update(output=resp.choices[0].message.content)

# In the UI: filter traces by prompt version, compare cost/latency/scores across versions.
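The "keep serving the last known-good prompt" behaviour can be illustrated with a small cache. This is a simplified sketch and not the Langfuse client's implementation: it refreshes synchronously on the first read after expiry rather than in the background, and the PromptCache class, its fetch callback, and the injectable clock are all hypothetical.

```python
import time
from typing import Callable


class PromptCache:
    """TTL cache that falls back to the last known-good value when a refresh fails."""

    def __init__(self, fetch: Callable[[str], str], ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def get(self, name: str) -> str:
        cached = self._entries.get(name)
        now = self._clock()
        if cached is not None and now - cached[1] < self._ttl_s:
            return cached[0]  # still fresh, no network call
        try:
            value = self._fetch(name)
        except Exception:
            if cached is not None:
                return cached[0]  # serve the stale prompt rather than fail
            raise  # nothing cached yet, so the error must surface
        self._entries[name] = (value, now)
        return value
```

A failed refresh leaves the old timestamp in place, so the next read tries the registry again instead of waiting a full TTL. The one case that cannot be hidden is a cold start with the registry down, which is why the first fetch is allowed to raise.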