A focused review for teams shipping LLM, RAG, and agent systems: trace coverage, evaluation gaps, token cost visibility, failure modes, and an OpenTelemetry instrumentation plan.
Your AI system is in production. Now the hard questions start:
Why did the agent take that action?
Which retrieval step caused the wrong answer?
What did this request cost?
Which model/provider degraded?
Can you explain failures without reading raw logs for two hours?
Are evaluations catching regressions before users do?
I help teams turn fragile LLM, RAG, and agent systems into observable, evaluated, and cost-aware production systems.
Who this is for
This is for teams that have shipped, or are about to ship:
RAG pipelines
AI agents
LLM-powered internal tools
customer-facing GenAI features
evaluation pipelines
OpenTelemetry-based observability stacks
It is not for teams looking for a generic chatbot demo.
What I review
1. Trace coverage
Which spans exist?
Which spans are missing?
Are retrieval, generation, tool calls, routing, and guardrails visible?
Are trace IDs propagated across services?
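A minimal sketch of what a trace coverage check can look like, assuming spans have already been exported as plain dictionaries. The stage names and the `coverage_report` helper are hypothetical; real span names depend on how your system is instrumented.

```python
# Hypothetical stage names; replace with the stages your pipeline actually has.
EXPECTED_STAGES = {"routing", "retrieval", "generation", "tool_call", "guardrail"}


def coverage_report(spans):
    """Summarize stage coverage and trace ID consistency for one request.

    `spans` is a list of dicts with "trace_id" and "stage" keys.
    """
    seen = {s["stage"] for s in spans}
    trace_ids = {s["trace_id"] for s in spans}
    return {
        "missing_stages": sorted(EXPECTED_STAGES - seen),
        "unexpected_stages": sorted(seen - EXPECTED_STAGES),
        # More than one trace ID for a single request suggests broken propagation.
        "trace_id_propagated": len(trace_ids) == 1,
    }


# Example request: generation ran under a different trace ID,
# and no tool call or guardrail span was recorded.
spans = [
    {"trace_id": "a1", "stage": "routing"},
    {"trace_id": "a1", "stage": "retrieval"},
    {"trace_id": "b7", "stage": "generation"},
]
report = coverage_report(spans)
print(report)
```

A report like this, run over a sample of production requests, gives a quick first answer to which spans exist and which are missing.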
2. Evaluation readiness
Are there test sets?
Are evals tied to deployment gates?
Are LLM-as-judge results validated against human labels, or trusted blindly?
Are regressions tracked over time?
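One way to tie evals to a deployment gate is to compare the current run against a stored baseline and block the release on any meaningful drop. The sketch below assumes per-metric scores between 0 and 1; the metric names, the scores, and the `tolerance` value are illustrative, not measured results.

```python
def gate(baseline, current, tolerance=0.02):
    """Return (passed, regressions) for a candidate release.

    A metric counts as regressed when it drops more than `tolerance`
    below its baseline score, or when it is missing from the current run.
    """
    regressions = {}
    for metric, base in baseline.items():
        score = current.get(metric)
        if score is None or score < base - tolerance:
            regressions[metric] = (base, score)
    return (not regressions, regressions)


# Illustrative scores only.
baseline = {"answer_correctness": 0.86, "citation_grounding": 0.91}
current = {"answer_correctness": 0.87, "citation_grounding": 0.84}

passed, regressions = gate(baseline, current)
print("deploy" if passed else f"blocked: {regressions}")
```

Storing each run's scores alongside the release identifier is what makes regressions trackable over time rather than visible only at the moment of the gate.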
3. Token cost and latency visibility
Can the team attribute cost by model, use case, team, or customer?
Are latency spikes tied to model, retrieval, tool call, or network layers?
Is sampling preserving expensive or failing traces?
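The questions above can be sketched in a few lines, assuming each request is logged with its model, token counts, customer, and error status. The price table is hypothetical and is not real provider pricing; the record fields and the `cost_threshold` default are also assumptions.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens: (input, output). Not real price data.
PRICES = {"model-small": (0.50, 1.50), "model-large": (5.00, 15.00)}


def request_cost(model, input_tokens, output_tokens):
    """Cost of one request in USD under the price table above."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def cost_by(records, key):
    """Total cost grouped by any record field, e.g. "model" or "customer"."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += request_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)


def keep_trace(record, cost_threshold=0.01):
    """Sampling rule: always keep failing or expensive traces."""
    cost = request_cost(record["model"], record["input_tokens"], record["output_tokens"])
    return record["error"] or cost >= cost_threshold


records = [
    {"customer": "acme", "model": "model-large", "input_tokens": 2000, "output_tokens": 500, "error": False},
    {"customer": "acme", "model": "model-small", "input_tokens": 1000, "output_tokens": 200, "error": True},
    {"customer": "globex", "model": "model-small", "input_tokens": 4000, "output_tokens": 1000, "error": False},
]
by_customer = cost_by(records, "customer")
kept = [keep_trace(r) for r in records]
print(by_customer, kept)
```

In a real pipeline the keep rule would sit in front of a probabilistic sampler, so that cheap, healthy traces are still sampled at a base rate while expensive and failing ones are never dropped.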
4. RAG failure modes
retrieval miss
bad chunking
stale index
prompt/context mismatch
hallucinated answer despite correct retrieval
poor citation grounding
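A rough triage sketch for the failure modes listed above, assuming each evaluated example carries the retrieved document IDs, the known-relevant IDs, a correctness label, and the IDs the answer cited. The `triage` function and its labels are hypothetical. It separates only three of the modes; telling bad chunking, a stale index, or a prompt/context mismatch apart needs additional signals such as index timestamps or chunk boundaries.

```python
def triage(retrieved_ids, relevant_ids, answer_correct, cited_ids):
    """Assign a coarse failure label to one evaluated RAG example."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    cited = set(cited_ids)

    # No relevant document reached the prompt at all.
    if not retrieved & relevant:
        return "retrieval_miss"
    # A relevant document was retrieved, but the answer is still wrong.
    if not answer_correct:
        return "wrong_answer_despite_correct_retrieval"
    # The answer is right, but it cites nothing that supports it.
    if not cited & relevant:
        return "poor_citation_grounding"
    return "ok"


examples = [
    (["d1", "d2"], ["d9"], False, []),
    (["d1", "d9"], ["d9"], False, ["d9"]),
    (["d1", "d9"], ["d9"], True, ["d1"]),
    (["d1", "d9"], ["d9"], True, ["d9"]),
]
labels = [triage(*e) for e in examples]
print(labels)
```

Counting these labels across a test set shows whether the next fix belongs in retrieval, generation, or citation handling.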
5. OpenTelemetry instrumentation plan
span naming
semantic attributes
resource attributes
collector pipeline
export path to Grafana, Arize, Langfuse, Phoenix, Honeycomb, or another backend
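To make span naming and attributes concrete, here is a small sketch that builds a span name and attribute set for one LLM call as plain data, with no SDK dependency. The `gen_ai.*` keys and the "operation model" span name follow the OpenTelemetry GenAI semantic conventions as I understand them; those conventions are still evolving, so check the current specification before adopting them. The `llm_span` helper, the `app.use_case` attribute, and all values are hypothetical.

```python
# Resource attributes describe the emitting service, not the individual call.
RESOURCE_ATTRIBUTES = {
    "service.name": "support-assistant",  # hypothetical service
    "service.version": "1.4.2",           # hypothetical version
}


def llm_span(operation, model, input_tokens, output_tokens, use_case):
    """Return (span_name, attributes) for one model call."""
    name = f"{operation} {model}"
    attributes = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Application-specific key under a hypothetical namespace,
        # useful for attributing cost by use case.
        "app.use_case": use_case,
    }
    return name, attributes


name, attributes = llm_span("chat", "model-large", 1200, 300, "ticket_summary")
print(name)
print(attributes)
```

Agreeing on names and keys like these before writing instrumentation keeps dashboards and cost queries portable across whichever backend the collector exports to.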
Deliverable
You get a short, implementation-ready report:
current observability map
missing traces and metrics
evaluation gaps
cost/latency blind spots
prioritized fixes
OpenTelemetry instrumentation plan
7-day and 30-day action plan
Why me
I am an AI Observability Architect. I work on production observability for GenAI, AI agents, RAG, computer vision, and ML systems. I am OpenTelemetry certified and have built LLMOps/MLOps systems across enterprise environments.
I write and build in public around AI observability, OpenTelemetry, RAG, and LLM evaluation.