Advancing AI Observability: From Metrics to Meaningful Insights

27th Jun 2025
7 min read
Tags:
AI,
Observability,
LLM

Updated on 26th Mar 2026
See changes

AI observability means instrumenting your LLM applications to track token usage, latency, cost per request, quality scores and model drift in real time. Using OpenTelemetry with custom metrics (Counters for token counts, Histograms for latency, Gauges for quality scores), you can build Grafana dashboards that answer “how much is this AI costing us?” and “is output quality degrading?”

Earlier I shared a Production Readiness Checklist outlining the essential steps before rolling an AI system into production. One key section focused on observability and tracing. This post expands on that foundation, adding hands‑on examples, code snippets, and real‑world war stories so you can move from aspirational slides to shipping dashboards.

    flowchart LR
Users --> AI-Application --> Observability

Why Observability Matters

Modern AI applications (from Retrieval‑Augmented Generation (RAG) pipelines to fine‑tuned LLMs) are complex socio‑technical systems. Standard application metrics alone aren’t enough when a slight model‑drift or data‑quality issue can torpedo business outcomes.

Headline stats worth sharing with the board
• 30 % of generative‑AI pilots will be abandoned after proof‑of‑concept by 2025 because of poor data, inadequate risk controls, or unclear value ^[1]
• 88 % of enterprises say a single hour of downtime now exceeds $300 k, and 34 % report $1–5 million per hour ^[2]
• Large organisations already spend a median $1.4 million per year on AIOps & observability tooling ^[3]
• Early adopters are realising $3.7 in value for every $1 invested in generative‑AI programmes ^[4]

Real‑World Failure Case

Zillow Offers famously wrote down $569 million and shuttered its home‑flipping arm after its price‑prediction model drifted in the frothy 2021 housing market ^[5]. Had Zillow continuously monitored prediction error against market movement and flagged drift early, it could have throttled purchases before the losses snowballed. Share this cautionary tale whenever someone claims “it’s just a proof of concept.”

    flowchart LR
Production --> Observation --> Business-KPI

Pillars of AI Observability

1. Instrumentation (Deep Dive)

Instrumentation is the groundwork. Capture the right signals from your application code, model‑serving layers, and data pipelines at the point they happen, not in an after‑the‑fact batch.

# FastAPI + LangChain example with OpenTelemetry
from fastapi import FastAPI
from langchain.chains import RetrievalQA
from opentelemetry import trace, metrics
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.langchain import LangChainInstrumentor

provider = trace.get_tracer_provider()
FastAPIInstrumentor().instrument_app(app := FastAPI())
RequestsInstrumentor().instrument()
LangChainInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.post("/ask")
async def ask(q: str):
    with tracer.start_as_current_span("rag_query") as span:
        chain = RetrievalQA.from_chain_type("gpt-4o")
        answer = chain.run(q)
        span.set_attribute("rag.tokens", answer["token_usage"])
        return answer

The snippet auto‑instruments FastAPI, Requests, and LangChain using OpenTelemetry so every RAG span, latency, and token count lands in your tracing backend in <5 lines of code ^[6]. Enable log/trace correlation with a single env‑var (OTEL_PYTHON_LOG_CORRELATION=true) so JSON logs include the active trace‑ID ^[7].

Tip Create a lightweight observability internal package that wraps the boilerplate so teams can add a single import and two lines of config.

Example Structured Log (truncated)

{
  "timestamp": "2025‑06‑27T17:04:13.217Z",
  "trace_id": "5f1c…",
  "span_id": "6b3f…",
  "route": "/ask",
  "user_id": "42",
  "query": "What are AIOps best practices?",
  "retrieved_chunks": 5,
  "model": "gpt‑4o‑2025‑06‑20",
  "latency_ms": 412,
  "tokens_prompt": 77,
  "tokens_completion": 121,
  "cost_usd": 0.014
}

2. Monitoring and Dashboards (Deep Dive)

Metrics are only useful when actionable. Build dashboards that combine system health with AI‑specific indicators.

Layer	Key Metrics	Example PromQL	Alert Threshold
API & Gateway	`http_request_duration_seconds{route="/ask"}` (p95)	`histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))`	>2 s for 5 m
Model Serving	`rag_tokens_total` by `model` & `customer`	`sum by(model)(rate(rag_tokens_total[1h]))`	20 % ↑ from 30‑day baseline
Retrieval Store	`vector_query_latency_ms` p99	`histogram_quantile(0.99, rate(vector_query_latency_ms_bucket[5m]))`	>300 ms
Cost	`rag_cost_usd_total`	`increase(rag_cost_usd_total[1d])`	>$1k/day budget

Grafana Cloud ships a plug‑and‑play OpenAI usage dashboard that graphs prompt vs. completion tokens, cost, and latency by model ^[8]. Pair it with Azure OpenAI’s built‑in “Tokens‑Based Usage” workbook if you are on Microsoft’s cloud ^[9].

3. Closing the Loop with Evaluation

In The Hidden Cost of LLM‑as‑a‑Judge we discussed smart approaches to LLM evaluation. Observability ties into that by storing evaluation results alongside live metrics.

# arize‑evals.yaml – evaluate 5 % of prod traffic nightly
project: ai‑assistant
source: prod‑llm‑traces
sample_rate: 0.05
checks:
  - name: factuality
    type: rouge
    threshold: 0.25
  - name: toxicity
    type: perspective
    threshold: 0.10
sink: arize‑cloud
schedule: "0 2 * * *"

Arize’s online evaluation feature attaches these scores directly to each trace so you can slice dashboards by factuality, toxicity, or “hallucination probability” in real time ^[10].

For open‑source stacks, Evidently AI 0.4+ now supports LLM testing suites with >100 metrics, drift reports, and a self‑hosted monitoring UI ^[11].

    flowchart LR
Metrics --> Evaluation
Evaluation -->|compare| Newer
Newer --> Retrain

Putting It All Together

Plan instrumentation early. Decide which traces, metrics, and logs map to business KPIs. Bake observability into your Definition of Done and Terraform modules.
Automate monitoring. Dashboards and alerts should deploy with every model push via CI/CD so you never ship blind code.
Review & retrain. Use evaluation dashboards in weekly triage and quarterly OKR reviews. Promote only models whose eval scores beat the current champion by ≥ X % on guarded benchmarks.
Optimise cost. Track tokens * price_per_token for every request. Kill features that deliver low ROI. Auto‑scale GPU pods based on real‑time throughput.

AI observability is a living discipline. By moving beyond simple metrics and focusing on actionable insights, you ensure your models stay reliable and aligned with business goals, and you build the confidence executives need to keep funding your roadmap.

    flowchart LR
Plan --> Automate --> Review --> Improve

Production Readiness Checklist – Deployment best practices
LLMOps: Introduction – Where observability fits into the LLM lifecycle
The Hidden Cost of LLM‑as‑a‑Judge – Smarter evaluation strategies

Need expert support? Contact me to discuss how tailored observability practices can accelerate your AI roadmap.

Frequently Asked Questions

What metrics should I track for AI/LLM applications in production?

Track cost (token usage per request, API spend), latency (time-to-first-token, total response time), quality (evaluation scores, hallucination rates) and reliability (error rates, timeout rates). These extend the standard RED method (Rate, Errors, Duration) with AI-specific dimensions.

How do I instrument an LLM application with OpenTelemetry?

Use OpenTelemetry’s Python SDK to create custom spans for each LLM call. Add attributes for model, tokens_in, tokens_out, latency_ms and cost. Use the auto-instrumentation for HTTP clients, and add manual spans for business logic like RAG retrieval and prompt construction.

What is model drift and how do I detect it?

Model drift occurs when an LLM’s output quality degrades over time, often due to changes in the model provider’s weights, shifting user queries or stale retrieval data. Detect it by tracking evaluation scores over time and alerting when scores drop below your SLO threshold.

How much does AI observability infrastructure cost?

The observability infrastructure itself (Prometheus, Grafana, OTel Collector) is open-source and free. The main cost is storage for traces and metrics. With proper sampling (keep 100% of errors, sample 1-10% of successes), most teams spend less than 5% of their AI API costs on observability.

Should I use a dedicated AI observability platform or build my own?

Start with OpenTelemetry + Grafana for full control and no vendor lock-in. Consider dedicated platforms (Arize, Langsmith, Evidently) when you need specialized features like automatic drift detection, prompt versioning or built-in evaluation pipelines that would take months to build yourself.