LLM Observability & FinOps Implementation Guide
Companion to the AI-First Engineering Framework v6.5
1. Why Observe
Without observability, AI applications are black boxes:
- You don't know what each AI feature costs.
- You don't detect quality degradation until a user reports it.
- You can't optimize what you don't measure.
- You can't comply with the EU AI Act (Aug 2026) without an audit trail.
“AI is an amplifier — it magnifies an organization’s existing strengths and weaknesses.” — DORA 2025
2. Observability Stack
2.1 Layers
```
┌─────────────────────────────────────────────────┐
│                   DASHBOARDS                    │
│          (Grafana / Datadog / Custom)           │
├─────────────────────────────────────────────────┤
│                 LLMOps PLATFORM                 │
│   (Langfuse / LangSmith / Arize / Helicone)     │
├─────────────────────────────────────────────────┤
│                TRACING / METRICS                │
│         (OpenTelemetry / Prometheus)            │
├─────────────────────────────────────────────────┤
│                APPLICATION CODE                 │
│   (spans, token counters, cost attribution)     │
└─────────────────────────────────────────────────┘
```
2.2 Complete Tool Landscape (15+ platforms, Mar 2026)
GAP-18: an exhaustive map of the ecosystem: not just the 6 main platforms, but every relevant tool an enterprise team should know.
2.2.1 LLMOps Platforms (tracing + eval + cost)
| Platform | Best for | Tracing | Evals | Cost | OSS | Pricing (Mar 2026) |
|---|---|---|---|---|---|---|
| Langfuse | Startups, self-hosted | ✅ Full | ✅ Built-in | ✅ | ✅ Yes | Free tier + self-host |
| LangSmith | LangChain teams | ✅ Full | ✅ Built-in | ✅ | ❌ No | $39/seat/mo |
| Braintrust | Evals + observability | ✅ | ✅ Best-in-class | ✅ | ❌ No | Free tier, $25/seat pro |
| Arize Phoenix | ML teams, model monitoring | ✅ Full | ✅ LLM judges | ✅ | ✅ Partial | Free tier + enterprise |
| Helicone | Fast cost tracking | ✅ Proxy | ❌ Basic | ✅ Best | ✅ Proxy | Generous free tier |
| Weights & Biases (Weave) | ML teams already on W&B | ✅ Full | ✅ Built-in | ✅ | ❌ No | $50/seat/mo |
| Galileo | LLM quality scoring | ✅ | ✅ Guardrail Studio | ✅ | ❌ No | Enterprise pricing |
| HoneyHive | Pipeline debugging | ✅ Full | ✅ Data-driven | ✅ | ❌ No | Free tier + enterprise |
2.2.2 Infra-Observability Platforms (APM + AI)
| Platform | Best for | LLM Tracing | Native APM | AI Cost | Pricing |
|---|---|---|---|---|---|
| Datadog AI Observability | Enterprise, already on Datadog | ✅ LLM traces | ✅ Full APM | ✅ | Included in DD plan |
| New Relic AI Monitoring | Enterprise, already on NR | ✅ LLM traces | ✅ Full APM | ✅ | Included in NR plan |
| Dynatrace AI Observability | Enterprises wanting auto-instrumentation | ✅ Auto-detect | ✅ Full APM | ✅ | Enterprise license |
| Elastic AI Observability | Self-hosted enterprise | ✅ LLM spans | ✅ Full APM | ✅ | Free (self) / Cloud |
2.2.3 Specialized Tools
| Tool | Focus | Differentiator | OSS | Pricing |
|---|---|---|---|---|
| OpenLLMetry | OTel auto-instrumentation for LLMs | Drop-in SDK: one line instruments Anthropic/OpenAI/Cohere | ✅ Yes | Free |
| Traceloop | Developer experience for OTel LLM | Builds on OpenLLMetry, UI included | ✅ Core | Free + cloud |
| PromptLayer | Prompt versioning + monitoring | Prompt registry + A/B testing | ❌ No | Free tier |
| Log10 | LLM feedback + debugging | Integrated human feedback loops | ❌ No | Free tier |
| Portkey AI Gateway | AI gateway with built-in observability | Load balancing + caching + observability | ✅ Yes | Free tier + enterprise |
2.2.4 Quick Decision Matrix
```
Already using an enterprise APM (Datadog/NR/Dynatrace)?
├── YES → Enable the AI module of your existing APM
└── NO → Do you need self-hosting?
    ├── YES → Langfuse (self-hosted) + OpenLLMetry
    └── NO → What's the priority?
        ├── Cost → Helicone (proxy, 5 min setup)
        ├── Evals → Braintrust (evals + traces)
        ├── Tracing → Langfuse Cloud (traces + dashboard)
        └── Everything → Langfuse + Portkey (gateway + traces)
```
2.2.5 Base Instrumentation: OpenLLMetry (one line)
```python
# The fastest way to instrument ANY LLM app.
# Works with Langfuse, Datadog, Jaeger, or any OTLP backend.
# pip install openllmetry-sdk
from openllmetry import init_tracing

# This automatically instruments: Anthropic, OpenAI, Cohere,
# LangChain, LlamaIndex, Pinecone, ChromaDB, etc.
init_tracing(
    exporter_endpoint="http://localhost:4317",  # OTLP endpoint
    # or: exporter_endpoint="https://cloud.langfuse.com"
    # or: exporter_endpoint="https://api.helicone.ai"
)

# Your normal code is now instrumented automatically
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
# → Span created automatically with: model, tokens, cost, latency
```
2.3 Recommendation by Track
| Track | Recommended | Reason |
|---|---|---|
| Solo | Helicone (proxy) | 5-minute setup, free cost tracking |
| Lean | Langfuse (self-hosted) | Open source, full traces, free |
| Full | Langfuse + Datadog | AI traces + integrated infra monitoring |
3. Instrumentation with OpenTelemetry
3.1 Mandatory Span Attributes
```yaml
# Attributes that MUST be present on every LLM span
required_span_attributes:
  - agent_id            # Agent identifier
  - llm.system          # "anthropic", "openai", "google"
  - llm.model           # "claude-sonnet-4-5", "gpt-4o"
  - llm.tokens.input    # Input tokens
  - llm.tokens.output   # Output tokens
  - llm.tokens.total    # Total tokens
  - llm.cost.usd        # Cost in USD
  - llm.latency_ms      # Latency in ms
  - trace_id            # OpenTelemetry trace ID
  - tenant_id           # Tenant/project ID
  - feature_id          # Feature that originated the call
  - spec_version        # Spec/contract version

optional_span_attributes:
  - rag.recall_at_5     # RAG recall
  - rag.precision       # RAG precision
  - cache.hit           # Whether it was a cache hit
  - cache.similarity    # Semantic similarity score
  - user_id             # For per-user cost attribution
  - session_id          # Agent session
```
3.2 Python Implementation
```python
"""OpenTelemetry instrumentation for LLM calls."""
import functools
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-first-framework")

# Per-model pricing (USD per 1K tokens, Mar 2026)
MODEL_PRICING = {
    "claude-sonnet-4-5": {"input": 0.003, "output": 0.015},
    "claude-haiku-3-5": {"input": 0.00025, "output": 0.00125},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gemini-2.5-pro": {"input": 0.00125, "output": 0.005},
    "gemini-2.0-flash": {"input": 0.0001, "output": 0.0004},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Calculate cost in USD."""
    prices = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})
    return (input_tokens * prices["input"] / 1000
            + output_tokens * prices["output"] / 1000)


def traced_llm_call(func):
    """Decorator that instruments LLM calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span("llm.call") as span:
            model = kwargs.get("model", "unknown")
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.system", _get_system(model))
            span.set_attribute("agent_id", kwargs.get("agent_id", "default"))
            span.set_attribute("feature_id", kwargs.get("feature_id", "unknown"))

            start = time.time()
            result = func(*args, **kwargs)
            latency = (time.time() - start) * 1000

            # Extract token usage from the response
            usage = getattr(result, "usage", None)
            if usage:
                input_tokens = getattr(usage, "input_tokens", 0)
                output_tokens = getattr(usage, "output_tokens", 0)
                span.set_attribute("llm.tokens.input", input_tokens)
                span.set_attribute("llm.tokens.output", output_tokens)
                span.set_attribute("llm.tokens.total", input_tokens + output_tokens)
                span.set_attribute("llm.cost.usd",
                                   calculate_cost(model, input_tokens, output_tokens))

            span.set_attribute("llm.latency_ms", latency)
            return result
    return wrapper


def _get_system(model: str) -> str:
    if "claude" in model:
        return "anthropic"
    if "gpt" in model:
        return "openai"
    if "gemini" in model:
        return "google"
    return "unknown"
```
3.3 Implementation with Langfuse
```python
"""Langfuse setup for LLM observability."""
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

# Initialization
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"  # or self-hosted
)


@observe(as_type="generation")
def call_llm(prompt: str, model: str = "claude-sonnet-4-5"):
    """LLM call instrumented with Langfuse."""
    import anthropic
    client = anthropic.Anthropic()

    langfuse_context.update_current_observation(
        model=model,
        metadata={"feature": "search", "version": "1.0"}
    )

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}]
    )

    langfuse_context.update_current_observation(
        usage={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        }
    )
    return response.content[0].text


@observe()
def rag_pipeline(query: str):
    """Instrumented RAG pipeline."""
    # Retrieve
    langfuse_context.update_current_trace(
        tags=["rag", "search"],
        metadata={"tenant_id": "project-x"}
    )

    with langfuse_context.observe(name="retrieval") as span:
        docs = retrieve_documents(query)
        span.update(metadata={"doc_count": len(docs)})

    # Generate
    context = "\n".join(docs)
    result = call_llm(f"Context: {context}\n\nQuestion: {query}")

    return result
```
4. FinOps: Cost Control
4.1 Budget Model
```yaml
# project/project-config.yaml (finops section)
finops:
  daily_budget_usd: 100
  alert_threshold_pct: 80

  cost_allocation:
    by_feature: true       # Cost per feature/endpoint
    by_user: true          # Cost per user
    by_model: true         # Cost per model
    by_environment: true   # dev vs staging vs prod

  optimization:
    semantic_cache:
      enabled: true
      similarity_threshold: 0.92
      ttl_seconds:
        faq: 86400            # 24h
        documentation: 43200  # 12h
        dynamic: 3600         # 1h
      target_hit_rate: 0.30
      target_cost_savings: 0.40

    model_routing:
      enabled: true
      rules:
        - condition: "token_count < 500 AND complexity == 'low'"
          model: "claude-haiku-3-5"
        - condition: "task_type == 'code_review'"
          model: "claude-sonnet-4-5"
        - condition: "task_type == 'summarization'"
          model: "gemini-2.0-flash"
```
4.2 Semantic Caching
```python
"""Semantic cache to cut LLM call costs."""
import time
from typing import Optional

import numpy as np


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92, ttl: int = 3600):
        self.threshold = similarity_threshold
        self.ttl = ttl
        self.cache = {}  # In prod: Redis + vector store

    def get(self, query: str, embedding: list[float]) -> Optional[str]:
        """Look up a similar cached response."""
        for key, entry in self.cache.items():
            similarity = self._cosine_similarity(embedding, entry["embedding"])
            if similarity >= self.threshold and not self._is_expired(entry):
                return entry["response"]
        return None

    def set(self, query: str, embedding: list[float], response: str):
        """Store a response in the cache."""
        self.cache[hash(query)] = {
            "embedding": embedding,
            "response": response,
            "timestamp": time.time(),
            "ttl": self.ttl,
        }

    def _cosine_similarity(self, a, b) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def _is_expired(self, entry) -> bool:
        return (time.time() - entry["timestamp"]) > entry["ttl"]
```
4.3 Cost-per-Session Tracking
```python
"""Per-session agent cost tracking."""


class SessionCostTracker:
    """Tracks cost per agent session."""

    # Thresholds (USD)
    THRESHOLDS = {
        "warning": 5.0,      # Yellow alert
        "critical": 20.0,    # Red alert, requires review
        "hard_limit": 50.0,  # Stops the session
    }

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.calls = []
        self.total_cost = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int, cost: float):
        self.calls.append({
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
        })
        self.total_cost += cost
        self._check_thresholds()

    def _check_thresholds(self):
        if self.total_cost >= self.THRESHOLDS["hard_limit"]:
            raise SessionBudgetExceeded(
                f"Session {self.session_id} exceeded hard limit: "
                f"${self.total_cost:.2f}"
            )
        elif self.total_cost >= self.THRESHOLDS["critical"]:
            print(f"🔴 CRITICAL: Session cost ${self.total_cost:.2f}")
        elif self.total_cost >= self.THRESHOLDS["warning"]:
            print(f"🟡 WARNING: Session cost ${self.total_cost:.2f}")

    def report(self) -> dict:
        return {
            "session_id": self.session_id,
            "total_cost_usd": round(self.total_cost, 4),
            "total_calls": len(self.calls),
            "by_model": self._group_by_model(),
            "avg_cost_per_call": round(
                self.total_cost / max(len(self.calls), 1), 4
            ),
        }

    def _group_by_model(self) -> dict:
        groups = {}
        for call in self.calls:
            m = call["model"]
            if m not in groups:
                groups[m] = {"calls": 0, "cost": 0, "tokens": 0}
            groups[m]["calls"] += 1
            groups[m]["cost"] += call["cost"]
            groups[m]["tokens"] += call["input_tokens"] + call["output_tokens"]
        return groups


class SessionBudgetExceeded(Exception):
    pass
```
5. Dashboard Templates
5.1 Essential Metrics (Grafana/Prometheus)
```yaml
# Prometheus queries for the LLM dashboard
panels:
  - title: "Cost per Day"
    query: sum(rate(llm_cost_usd_total[24h]))
    type: stat
    thresholds: [50, 80, 100]

  - title: "Tokens per Hour"
    query: sum(rate(llm_tokens_total[1h]))
    type: graph

  - title: "Latency p50/p95/p99"
    queries:
      - histogram_quantile(0.50, rate(llm_latency_seconds_bucket[5m]))
      - histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m]))
      - histogram_quantile(0.99, rate(llm_latency_seconds_bucket[5m]))
    type: graph

  - title: "Cache Hit Rate"
    query: |
      rate(semantic_cache_hits_total[1h]) /
      (rate(semantic_cache_hits_total[1h]) + rate(semantic_cache_misses_total[1h]))
    type: gauge
    thresholds: [0.20, 0.30, 0.50]

  - title: "Cost by Feature"
    query: sum by (feature_id) (rate(llm_cost_usd_total[24h]))
    type: bar

  - title: "Error Rate"
    query: |
      rate(llm_errors_total[5m]) / rate(llm_calls_total[5m])
    type: stat
    thresholds: [0.01, 0.05, 0.10]

  - title: "RAG Recall@5"
    query: avg(rag_recall_at_5)
    type: gauge
    thresholds: [0.70, 0.85, 0.95]

  - title: "Model Distribution"
    query: sum by (llm_model) (rate(llm_calls_total[24h]))
    type: pie
```
5.2 Alerts
```yaml
groups:
  - name: llm_cost_alerts
    rules:
      - alert: DailyBudgetWarning
        expr: sum(increase(llm_cost_usd_total[24h])) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Daily LLM cost approaching budget (>$80)"

      - alert: DailyBudgetCritical
        expr: sum(increase(llm_cost_usd_total[24h])) > 100
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Daily LLM budget exceeded ($100)"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning

      - alert: HighErrorRate
        expr: rate(llm_errors_total[5m]) / rate(llm_calls_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical

      - alert: LowCacheHitRate
        expr: |
          rate(semantic_cache_hits_total[1h]) /
          (rate(semantic_cache_hits_total[1h]) + rate(semantic_cache_misses_total[1h]))
          < 0.15
        for: 30m
        labels:
          severity: warning
```
6. Agent-Legible Observability
Source: OpenAI “Harness Engineering” (Feb 2026): agents that query metrics and logs directly for autonomous debugging.
6.1 The Concept
Agent-Legible Observability is the practice of exposing logs, metrics, and traces directly to agents so they can reason about system behavior, diagnose problems, and make informed decisions without human intervention.
“We exposed LogQL and PromQL to the agents. When a test fails, the agent queries the logs, identifies the error, correlates it with metrics, and proposes a fix — all autonomously.” — OpenAI Harness Engineering, Feb 2026
6.2 Pattern: Observability-as-Tool
Expose observability as MCP tools or CLI wrappers that agents can invoke:
```python
"""MCP server exposing metrics and logs to agents."""
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("observability-tools", version="1.0.0")


@mcp.tool()
async def query_logs(service: str, level: str = "error",
                     duration: str = "1h", limit: int = 20) -> str:
    """Query service logs. Returns recent filtered log entries.

    Args:
        service: Service name (e.g. 'api', 'worker', 'frontend')
        level: Minimum level (debug, info, warn, error)
        duration: Time window (e.g. '1h', '30m', '24h')
        limit: Maximum entries to return
    """
    # LogQL query
    query = f'{{app="{service}"}} |= "{level}" | json | line_format "{{.timestamp}} {{.level}} {{.message}}"'
    # In production: call the Loki API
    return f"# Logs for {service} (last {duration}, level>={level})\n..."


@mcp.tool()
async def query_metrics(metric: str, duration: str = "5m",
                        quantile: float = 0.95) -> str:
    """Query Prometheus metrics. Returns current values.

    Args:
        metric: Metric name (e.g. 'http_duration_seconds', 'llm_cost_usd_total')
        duration: Window for rate() (e.g. '5m', '1h')
        quantile: Percentile for histograms (0.50, 0.95, 0.99)
    """
    # PromQL query
    query = f'histogram_quantile({quantile}, rate({metric}_bucket[{duration}]))'
    return f"# Metric: {metric} p{int(quantile * 100)} over {duration}\n..."


@mcp.tool()
async def check_golden_principles() -> str:
    """Check the status of all the project's Golden Principles.
    Returns a compliance report."""
    checks = [
        {"id": "GP-01", "name": "Startup < 800ms", "status": "PASS", "value": "623ms"},
        {"id": "GP-02", "name": "No span > 2s", "status": "PASS", "value": "max 1.2s"},
        {"id": "GP-03", "name": "No direct DB from handlers", "status": "PASS"},
    ]
    report = "# Golden Principles Status\n"
    for c in checks:
        emoji = "✅" if c["status"] == "PASS" else "❌"
        report += f"{emoji} {c['id']}: {c['name']} — {c['status']}\n"
    return report


@mcp.tool()
async def get_cost_report(period: str = "today") -> str:
    """Get the LLM cost report for the period.

    Args:
        period: 'today', 'week', 'month'
    """
    return f"# Cost Report ({period})\nTotal: $47.23\nBy model: claude-sonnet-4-5: $32.10, gpt-4o: $15.13\n..."


if __name__ == "__main__":
    mcp.run(transport="stdio")
```
6.3 Claude Code Configuration
```json
{
  "mcpServers": {
    "observability-tools": {
      "command": "python",
      "args": ["mcp_servers/observability_tools.py"],
      "env": {
        "LOKI_URL": "http://localhost:3100",
        "PROMETHEUS_URL": "http://localhost:9090",
        "LANGFUSE_PUBLIC_KEY": "pk-...",
        "LANGFUSE_HOST": "https://cloud.langfuse.com"
      }
    }
  }
}
```
6.4 CLI Wrappers (Alternative without MCP)
For teams not using MCP, expose the same queries as scripts the agent can execute:
```bash
#!/bin/bash
# scripts/agent-query-logs.sh — Wrapper for agents
# Usage: ./scripts/agent-query-logs.sh <service> [level] [duration]
SERVICE=${1:?"Usage: agent-query-logs.sh <service> [level] [duration]"}
LEVEL=${2:-"error"}
DURATION=${3:-"1h"}

curl -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode "query={app=\"$SERVICE\"} |= \"$LEVEL\"" \
  --data-urlencode "start=$(date -d "-$DURATION" +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode "limit=20" | \
  jq -r '.data.result[].values[][1]' | head -30
```
```bash
#!/bin/bash
# scripts/agent-query-metrics.sh — Wrapper for agents
# Usage: ./scripts/agent-query-metrics.sh <promql_query>
QUERY=${1:?"Usage: agent-query-metrics.sh '<promql_query>'"}

curl -s "http://localhost:9090/api/v1/query" \
  --data-urlencode "query=$QUERY" | \
  jq -r '.data.result[] | "\(.metric | to_entries | map("\(.key)=\(.value)") | join(", ")): \(.value[1])"'
```
6.5 Use Cases
| Scenario | Agent query | Autonomous action |
|---|---|---|
| Test fails with timeout | query_logs("api", "error", "5m") | Identifies slow query, suggests an index |
| Daily cost climbing | get_cost_report("today") | Detects feature with excessive usage, suggests caching |
| Degraded p95 latency | query_metrics("http_duration_seconds", "10m", 0.95) | Correlates with a recent deployment |
| Golden Principle violation | check_golden_principles() | Identifies span > 2s, opens a fix |
| Low RAG recall | query_metrics("rag_recall_at_5", "1h") | Reviews embeddings, suggests re-indexing |
6.6 Recommendation by Track
| Track | Implementation | Effort |
|---|---|---|
| Solo | CLI wrappers (scripts/agent-*.sh) | 1 hour |
| Lean | Basic MCP server (3-4 tools) | 1 day |
| Full | Full MCP server + Langfuse API + proactive alerts | 1 week |
6.7 Anti-patterns
| Anti-pattern | Problem | Solution |
|---|---|---|
| Exposing all logs unfiltered | Floods the context window | Limit to 20-30 lines, relevant level only |
| Queries without timeout | Agent waits indefinitely | 5s timeout on all queries |
| Metrics without context | Agent doesn't know what “normal” is | Include thresholds and baselines in the response |
| No rate limiting | Agent in a loop queries 100x | Max 10 queries per session |
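The mitigations in the last three rows can be combined in a single wrapper around any observability tool. A minimal sketch, assuming an in-memory per-session counter; the helper name `guarded_query` and the constants are illustrative, not part of any SDK:

```python
"""Guardrails for agent-facing observability queries: output cap,
per-session quota, and a hard timeout (hypothetical helper)."""
import concurrent.futures

MAX_QUERIES_PER_SESSION = 10  # stop agents stuck in query loops
MAX_OUTPUT_LINES = 30         # don't flood the context window
QUERY_TIMEOUT_S = 5           # never wait indefinitely

_session_query_count: dict = {}


def guarded_query(session_id: str, query_fn, *args) -> str:
    """Run query_fn(*args) with rate limit, timeout, and output cap."""
    count = _session_query_count.get(session_id, 0)
    if count >= MAX_QUERIES_PER_SESSION:
        return "ERROR: query quota exhausted for this session (max 10 queries)"
    _session_query_count[session_id] = count + 1

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(query_fn, *args)
        raw = future.result(timeout=QUERY_TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        # Note: the thread itself can't be killed; the caller just moves on.
        return f"ERROR: query timed out after {QUERY_TIMEOUT_S}s"
    finally:
        pool.shutdown(wait=False)

    lines = raw.splitlines()
    if len(lines) > MAX_OUTPUT_LINES:
        total = len(lines)
        lines = lines[:MAX_OUTPUT_LINES] + [f"... truncated ({total} lines total)"]
    return "\n".join(lines)
```

The same wrapper can sit behind both the MCP tools and the CLI scripts, so the limits are enforced once regardless of transport.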
7. Compaction & Context Window Management
7.1 Compaction Strategies
| Strategy | When to use | How |
|---|---|---|
| Tool result clearing | Always | Clear old tool outputs from the history |
| Summarize & restart | Context > 80% capacity | Summarize the conversation, restart with the summary |
| HANDOFF.md automation | Session/agent handover | Write state to a file; the new agent reads it |
| Progressive pruning | Long sessions (>1h) | Drop intermediate messages, keep decisions |
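The trigger logic behind this table can be sketched as one dispatch function. The thresholds mirror the rows above; the function name and return strings are our own illustration:

```python
"""Pick a compaction strategy from current context usage and session age."""


def pick_compaction_strategy(tokens_used: int, tokens_available: int,
                             session_minutes: float) -> str:
    """Return the compaction action to apply, mirroring the table above."""
    utilisation = tokens_used / tokens_available
    if utilisation > 0.80:
        return "summarize_and_restart"  # context > 80% capacity
    if session_minutes > 60:
        return "progressive_pruning"    # long session: drop intermediate messages
    return "tool_result_clearing"       # always-on default
```

For example, `pick_compaction_strategy(170_000, 200_000, 10)` is at 85% utilisation, so it returns `"summarize_and_restart"` even though the session is young.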
7.2 Automated HANDOFF.md
```markdown
<!-- HANDOFF.md — Auto-generated at session end -->
# Session Handoff

## Timestamp
2026-03-19T15:30:00Z

## Current State
- Phase: F06_Build
- Task: Implementing search endpoint
- Branch: feature/ai-search
- Tests: 14/16 passing (2 pending for edge cases)

## Key Decisions Made
1. Chose vector search over keyword (ADR-002)
2. Using pgvector extension for embeddings
3. Batch indexing every 15 min (not real-time)

## Open Issues
- [ ] Edge case: empty query string returns 500
- [ ] Performance test needed for >10K documents
- [ ] Cache invalidation strategy TBD

## Files Modified
- src/api/search.ts (new)
- src/services/embedding.ts (new)
- tests/search.test.ts (14 tests)
- project-config.yaml (updated phase)

## Context for Next Session
The search API is functional but needs edge case handling.
Start by fixing the empty query bug in search.ts:42.
Then run the full test suite: `npm test -- --grep search`
```
7.3 Compact Instructions (Preservation Priority)
When the agent runs /compact, preserve content in this order:
- 🔴 Never summarize: ADRs, architecture decisions, stack choices
- 🟠 Keep key facts: modified files, test status, open errors
- 🟡 Summarize: tool outputs, intermediate logs, explorations
- 🟢 Discard: casual conversation, failed attempts already resolved
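A hedged sketch of how these tiers might drive a /compact pass. The `kind` labels and tier names below are assumptions for illustration, not framework APIs:

```python
"""Tag each history entry with a preservation tier for compaction."""

PRESERVE_VERBATIM = "never_summarize"  # 🔴 ADRs, architecture decisions
KEEP_FACTS = "keep_key_facts"          # 🟠 modified files, test status
SUMMARIZE = "summarize"                # 🟡 tool outputs, logs
DISCARD = "discard"                    # 🟢 casual chat, resolved dead ends


def classify_for_compaction(entry: dict) -> str:
    """Map a history entry to its preservation tier."""
    kind = entry.get("kind", "chat")
    if kind in ("adr", "architecture_decision", "stack_choice"):
        return PRESERVE_VERBATIM
    if kind in ("file_change", "test_result", "open_error"):
        return KEEP_FACTS
    if kind in ("tool_output", "log", "exploration"):
        return SUMMARIZE
    return DISCARD
```

A compactor would then keep 🔴 entries verbatim, compress 🟠 into one-line facts, replace 🟡 runs with a summary, and drop 🟢 entirely.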
8. EU AI Act Compliance (Aug 2026)
For the complete classification and conformity assessment workflow, see the OWASP Agentic Security Guide §8.
8.1 Logging Requirements
For high-risk AI systems, the EU AI Act requires:
| Requirement | Implementation | Article |
|---|---|---|
| Audit trail | Langfuse traces with full context | Art. 12 |
| Decision explainability | Store reasoning in span attributes | Art. 13 |
| Data provenance | Source tag on every RAG document | Art. 10 |
| Human oversight records | Log of human approvals/rejections | Art. 14 |
| Performance monitoring | Dashboards with continuous metrics | Art. 15 |
| Incident reporting | Alerts + documented runbook | Art. 62 |
| AI literacy training records | Record of completed trainings | Art. 4 |
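As a hedged illustration of the Art. 10/13/14 rows, a helper that builds the compliance-related span attributes in one place. The `compliance.*` attribute names are our own convention, not an official schema:

```python
"""Build compliance span attributes for high-risk AI system logging."""


def compliance_attributes(reasoning: str, sources: list,
                          human_decision: str = None) -> dict:
    """Assemble the attributes required by the table above."""
    attrs = {
        "compliance.reasoning": reasoning,             # Art. 13 explainability
        "compliance.data_sources": ",".join(sources),  # Art. 10 provenance
    }
    if human_decision is not None:                     # Art. 14 human oversight
        attrs["compliance.human_decision"] = human_decision
    return attrs


# Attach to the current OpenTelemetry span, e.g.:
#   for key, value in compliance_attributes(...).items():
#       span.set_attribute(key, value)
```

Because the attributes travel on the same traces Langfuse already stores (Art. 12), no separate audit store is needed.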
8.2 ISO 42001 Alignment
```yaml
# Mapping our framework to ISO 42001
iso_42001_mapping:
  "4.1 Understanding context": F01_Strategy
  "5.1 Leadership": project-config.yaml governance
  "6.1 Risk assessment": F08_Security compliance_matrix
  "7.1 Resources": finops budget tracking
  "8.1 AI system lifecycle": F01-F10 phases
  "9.1 Monitoring": Observability dashboards
  "10.1 Continual improvement": F10_Evolution
```
9. Agent Production Monitoring — Health Score & KPIs
GAP-19: production metrics specific to AI agents: composite health score, tool success rate, context utilisation, cost efficiency.
9.1 The Problem
Monitoring an AI agent in production is different from monitoring an HTTP service:
- Latency and error rate are not enough: an agent can “work” (200 OK) yet give terrible answers.
- Cost is variable and can explode without warning.
- Tools can fail silently.
- The context window can saturate, causing gradual degradation (no errors).
9.2 Agent Health Score (AHS) — Composite Metric
A single number (0-100) that indicates an agent's overall health in production:
```yaml
agent_health_score:
  name: "Agent Health Score (AHS)"
  range: 0-100
  target: ">= 80 (Green), 60-79 (Yellow), < 60 (Red)"

  components:
    task_success_rate:
      weight: 0.30
      description: "% of tasks completed correctly"
      metric: "agent_tasks_success_total / agent_tasks_total"
      thresholds:
        green: ">= 0.90"
        yellow: ">= 0.70"
        red: "< 0.70"

    tool_success_rate:
      weight: 0.20
      description: "% of successful tool calls"
      metric: "agent_tool_calls_success / agent_tool_calls_total"
      thresholds:
        green: ">= 0.95"
        yellow: ">= 0.85"
        red: "< 0.85"

    context_utilisation:
      weight: 0.15
      description: "Efficiency of context window usage"
      metric: "agent_context_tokens_used / agent_context_tokens_available"
      thresholds:
        green: "0.30 - 0.70"     # sweet spot
        yellow: "0.70 - 0.90"    # approaching the limit
        red: "> 0.90 OR < 0.10"  # saturated or underused

    cost_efficiency:
      weight: 0.15
      description: "Cost per successful task (USD)"
      metric: "agent_cost_total_usd / agent_tasks_success_total"
      thresholds:
        green: "< budget_per_task * 1.0"
        yellow: "< budget_per_task * 1.5"
        red: ">= budget_per_task * 1.5"

    latency_p95:
      weight: 0.10
      description: "Agent end-to-end p95 latency"
      metric: "histogram_quantile(0.95, agent_task_duration_seconds)"
      thresholds:
        green: "< 10s"
        yellow: "< 30s"
        red: ">= 30s"

    error_escalation_rate:
      weight: 0.10
      description: "% of tasks that required human escalation"
      metric: "agent_escalations_total / agent_tasks_total"
      thresholds:
        green: "< 0.10"
        yellow: "< 0.25"
        red: ">= 0.25"

# AHS = sum(component_score * weight)
# component_score: green=100, yellow=60, red=20
```
9.3 Python Implementation — Health Score Calculator
```python
"""Agent Health Score calculator for production monitoring."""
from dataclasses import dataclass, field
from enum import Enum
import time


class HealthLevel(Enum):
    GREEN = "green"    # score >= 80
    YELLOW = "yellow"  # score 60-79
    RED = "red"        # score < 60


@dataclass
class AgentMetrics:
    """Real-time metrics for one agent."""
    tasks_total: int = 0
    tasks_success: int = 0
    tool_calls_total: int = 0
    tool_calls_success: int = 0
    context_tokens_used: int = 0
    context_tokens_available: int = 200_000  # default 200K
    total_cost_usd: float = 0.0
    budget_per_task_usd: float = 0.50  # configurable
    latencies_ms: list = field(default_factory=list)
    escalations: int = 0

    @property
    def task_success_rate(self) -> float:
        return self.tasks_success / max(self.tasks_total, 1)

    @property
    def tool_success_rate(self) -> float:
        return self.tool_calls_success / max(self.tool_calls_total, 1)

    @property
    def context_utilisation(self) -> float:
        return self.context_tokens_used / max(self.context_tokens_available, 1)

    @property
    def cost_per_success(self) -> float:
        return self.total_cost_usd / max(self.tasks_success, 1)

    @property
    def latency_p95_ms(self) -> float:
        if not self.latencies_ms:
            return 0
        sorted_lat = sorted(self.latencies_ms)
        idx = int(len(sorted_lat) * 0.95)
        return sorted_lat[min(idx, len(sorted_lat) - 1)]

    @property
    def escalation_rate(self) -> float:
        return self.escalations / max(self.tasks_total, 1)


def calculate_health_score(metrics: AgentMetrics) -> dict:
    """Calculate the Agent Health Score (0-100)."""

    def _score(value: float, green_check, yellow_check) -> int:
        if green_check(value):
            return 100
        if yellow_check(value):
            return 60
        return 20

    components = {
        "task_success_rate": {
            "value": metrics.task_success_rate,
            "score": _score(metrics.task_success_rate,
                            lambda v: v >= 0.90, lambda v: v >= 0.70),
            "weight": 0.30,
        },
        "tool_success_rate": {
            "value": metrics.tool_success_rate,
            "score": _score(metrics.tool_success_rate,
                            lambda v: v >= 0.95, lambda v: v >= 0.85),
            "weight": 0.20,
        },
        "context_utilisation": {
            "value": metrics.context_utilisation,
            "score": _score(metrics.context_utilisation,
                            lambda v: 0.30 <= v <= 0.70,
                            lambda v: 0.10 <= v <= 0.90),
            "weight": 0.15,
        },
        "cost_efficiency": {
            "value": metrics.cost_per_success,
            "score": _score(metrics.cost_per_success,
                            lambda v: v < metrics.budget_per_task_usd,
                            lambda v: v < metrics.budget_per_task_usd * 1.5),
            "weight": 0.15,
        },
        "latency_p95": {
            "value": metrics.latency_p95_ms / 1000,  # to seconds
            "score": _score(metrics.latency_p95_ms / 1000,
                            lambda v: v < 10, lambda v: v < 30),
            "weight": 0.10,
        },
        "escalation_rate": {
            "value": metrics.escalation_rate,
            "score": _score(metrics.escalation_rate,
                            lambda v: v < 0.10, lambda v: v < 0.25),
            "weight": 0.10,
        },
    }

    ahs = sum(c["score"] * c["weight"] for c in components.values())
    level = (HealthLevel.GREEN if ahs >= 80
             else HealthLevel.YELLOW if ahs >= 60
             else HealthLevel.RED)

    return {
        "agent_health_score": round(ahs, 1),
        "level": level.value,
        "components": components,
        "timestamp": time.time(),
    }
```
9.4 Dashboard Panels — Agent Production Monitoring
```yaml
# Grafana panels specific to AI agents in production
panels:
  - title: "Agent Health Score (AHS)"
    type: gauge
    query: agent_health_score
    thresholds: [60, 80, 100]
    colors: ["red", "yellow", "green"]
    description: "Composite 0-100 metric. Target >= 80."

  - title: "Task Success Rate (rolling 1h)"
    type: stat
    query: |
      rate(agent_tasks_success_total[1h]) / rate(agent_tasks_total[1h])
    thresholds: [0.70, 0.90]
    format: "percent"

  - title: "Tool Call Success Rate by Tool"
    type: bar
    query: |
      sum by (tool_name) (rate(agent_tool_calls_success[1h]))
      / sum by (tool_name) (rate(agent_tool_calls_total[1h]))
    description: "Quickly identifies problematic tools"

  - title: "Context Window Utilisation"
    type: timeseries
    queries:
      - label: "tokens_used"
        query: agent_context_tokens_used
      - label: "70% threshold"
        query: agent_context_tokens_available * 0.70
      - label: "90% critical"
        query: agent_context_tokens_available * 0.90
    description: "Monitors context window saturation"

  - title: "Cost per Successful Task"
    type: timeseries
    query: |
      rate(agent_cost_total_usd[1h]) / rate(agent_tasks_success_total[1h])
    thresholds: [0.50, 0.75]  # budget_per_task * 1.0 / 1.5

  - title: "Agent Latency Distribution"
    type: heatmap
    query: |
      rate(agent_task_duration_seconds_bucket[5m])
    description: "End-to-end latency distribution"

  - title: "Escalation Rate (rolling 24h)"
    type: stat
    query: |
      increase(agent_escalations_total[24h]) / increase(agent_tasks_total[24h])
    thresholds: [0.10, 0.25]
    format: "percent"

  - title: "Tool Usage Heatmap (by hour)"
    type: heatmap
    query: |
      sum by (tool_name, hour) (increase(agent_tool_calls_total[1h]))
    description: "Tool usage patterns throughout the day"

  - title: "Agent Backtracking Rate"
    type: stat
    query: |
      rate(agent_backtrack_total[1h]) / rate(agent_steps_total[1h])
    thresholds: [0.15, 0.30]
    description: "Signals inefficiency in the agent's planning"
```
9.5 Production Alerts for Agents
```yaml
groups:
  - name: agent_health_alerts
    rules:
      - alert: AgentHealthScoreCritical
        expr: agent_health_score < 60
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent_id }} health score < 60 ({{ $value }})"
          runbook: "1) Check tool_success_rate 2) Check context_utilisation 3) Check cost"

      - alert: AgentHealthScoreWarning
        expr: agent_health_score < 80 and agent_health_score >= 60
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_id }} health degraded ({{ $value }})"

      - alert: ToolFailureSpike
        expr: |
          (rate(agent_tool_calls_total[5m]) - rate(agent_tool_calls_success[5m]))
          /
          rate(agent_tool_calls_total[5m]) > 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tool failure rate > 15% for agent {{ $labels.agent_id }}"
          action: "Check MCP server health, verify tool permissions"

      - alert: ContextWindowSaturation
        expr: agent_context_tokens_used / agent_context_tokens_available > 0.90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Context window > 90% for agent {{ $labels.agent_id }}"
          action: "Trigger /compact or start new session"

      - alert: AgentCostRunaway
        expr: |
          rate(agent_cost_total_usd[1h])
          >
          (agent_budget_per_task_usd * rate(agent_tasks_total[1h]) * 2.0)
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Agent cost 2x over budget"
          action: "Check for infinite loops, excessive tool calls, model routing"

      - alert: HighBacktrackingRate
        expr: |
          rate(agent_backtrack_total[1h]) / rate(agent_steps_total[1h]) > 0.30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Agent backtracking > 30% — possible planning inefficiency"
          action: "Review agent instructions, check context quality"

      - alert: EscalationRateHigh
        expr: |
          increase(agent_escalations_total[24h])
          /
          increase(agent_tasks_total[24h]) > 0.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Agent escalation rate > 25% in 24h"
          action: "Review failure patterns, consider expanding agent capabilities"
```

### 9.6 Production Runbook — Agent Degradation
```markdown
# RUNBOOK: Agent Health Score Degradation

## Quick Triage (< 5 min)

1. **Check AHS components** — which one is red?
   - task_success_rate low → go to §A
   - tool_success_rate low → go to §B
   - context_utilisation high → go to §C
   - cost_efficiency poor → go to §D
   - latency high → go to §E

## §A — Low Task Success Rate
- [ ] Review logs of failed tasks: `query_logs("agent", "error", "1h")`
- [ ] Check for recent changes to prompts/instructions
- [ ] Check whether input data changed (new schema, encoding)
- [ ] Run the golden evaluation dataset to detect regression
- [ ] Roll back if there was a recent deployment

## §B — Low Tool Success Rate
- [ ] Identify the problematic tool: "Tool Call Success by Tool" dashboard
- [ ] Verify MCP server health: `curl http://mcp-server/health`
- [ ] Check the tool's permissions and rate limits
- [ ] Check for changes in external APIs (breaking changes)
- [ ] Failover: disable the tool and use a fallback

## §C — Saturated Context Window
- [ ] Force compaction: trigger automatic `/compact`
- [ ] Check for giant tool outputs (untruncated logs)
- [ ] Verify that .claudeignore excludes heavy files
- [ ] Start a new session if > 90% persists

## §D — Degraded Cost Efficiency
- [ ] Review model routing: is the most expensive model being used unnecessarily?
- [ ] Check cache hit rate: should be > 30%
- [ ] Look for loops: agent repeating identical tool calls
- [ ] Review token usage: is the context window too large?

## §E — High Latency
- [ ] Review latency by component: model, tool calls, retrieval
- [ ] Check MCP server queues
- [ ] Check for provider throttling (Anthropic/OpenAI)
- [ ] Consider routing to a faster model
```

### 9.7 Metrics per Track
Sección titulada «9.7 Metricas por Track»| Metrica | Solo | Lean | Full |
|---|---|---|---|
| AHS | Manual check semanal | Dashboard basico | Dashboard real-time + alerts |
| Task success | Log review | Langfuse traces | Automated + golden dataset regression |
| Tool success | --verbose flag | MCP server logs | Per-tool dashboards + auto-failover |
| Context util | /context command | Session logging | Real-time gauge + auto-compact |
| Cost | Helicone free tier | Langfuse cost tracking | Full FinOps + anomaly detection |
| Latency | Manual timing | p50/p95 dashboard | Full distribution + SLO alerts |
| Escalation | Count in notes | Tag in tracker | Automated classification + trend analysis |
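The AHS roll-up used throughout this section can be sketched in a few lines. The weights come from §9.3; the flat 100/60/20 per-component scoring is an assumption for illustration only (`_score` in §9.3 defines the actual green/yellow bands):

```python
# Weights from §9.3. Assumed scoring: 100 (green band), 60 (yellow), 20 (red).
WEIGHTS = {
    "task_success_rate": 0.30,
    "tool_success_rate": 0.20,
    "context_utilisation": 0.15,
    "cost_efficiency": 0.15,
    "latency_p95": 0.10,
    "escalation_rate": 0.10,
}


def rollup_ahs(scores: dict[str, float]) -> tuple[float, str]:
    """Weighted sum of component scores (0-100), banded per the 60/80 cutoffs."""
    ahs = sum(scores[k] * w for k, w in WEIGHTS.items())
    level = "green" if ahs >= 80 else "yellow" if ahs >= 60 else "red"
    return round(ahs, 1), level
```

Because the weights sum to 1.0, a single yellow component (e.g. latency at 60) only drags the score down by a few points; it takes two or more degraded components to leave the green band.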
## 10. Agent Drift Detection & Behavioral Baseline

GAP-20: behavioral drift detection for long-running agents — behavioral baseline, context quality score, token efficiency, and regression detection.
### 10.1 What Is Agent Drift

Agent drift is the gradual degradation of an agent's behavior over long sessions or across multiple sessions. Unlike a discrete error, drift is silent — the agent keeps working, but its outputs progressively diverge from the expected behavior.

Main causes:

- Context pollution: irrelevant or contradictory information accumulated in the context window
- Model updates: changes in the underlying model that alter behavior
- Prompt rot: instructions that stop being effective as the project context changes
- Data drift: the distribution of input data shifts but the instructions are not adapted
- Tool evolution: external APIs change their response format
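The core detection idea is simple: compare a fresh output against a stored baseline output and check that the expected markers still appear. A minimal sketch, using character-level `difflib` similarity as a stand-in for the semantic similarity a production setup would use (`behavior_drift` is a hypothetical helper, not part of any framework):

```python
from difflib import SequenceMatcher


def behavior_drift(baseline_output: str, current_output: str,
                   expected_patterns: list[str]) -> dict:
    """Score one golden behavior: textual similarity to the stored baseline,
    plus a hard check that the expected patterns still appear."""
    similarity = SequenceMatcher(None, baseline_output, current_output).ratio()
    missing = [p for p in expected_patterns if p not in current_output]
    return {
        "similarity": round(similarity, 3),
        "patterns_ok": not missing,
        "missing_patterns": missing,
        # 0.85 threshold matches the baseline config in §10.2
        "alert": similarity < 0.85 or bool(missing),
    }
```

Pattern checks catch hard regressions (the agent stopped emitting `describe(` blocks at all), while the similarity score catches soft drift that keeps the right shape but changes the substance.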
### 10.2 Behavioral Baseline

Establish a golden behavior dataset: expected outputs for N standard inputs, executed periodically to detect deviations.

```yaml
behavioral_baseline:
  version: "1.0.0"
  schedule: "weekly"            # Execution frequency
  similarity_threshold: 0.85    # Alert if it drops below

  golden_behaviors:
    - id: "BB-001"
      description: "Generate a unit test for a utility function"
      input:
        task: "Write a test for a sum(a, b) function that returns the sum"
        context: "TypeScript project with Jest"
      expected_patterns:
        - "describe("
        - "expect("
        - "toBe("
      expected_structure:
        has_imports: true
        has_describe_block: true
        has_at_least_n_assertions: 2
      quality_score_baseline: 0.92

    - id: "BB-002"
      description: "Review code with a security issue"
      input:
        task: "Review this code: const query = `SELECT * FROM users WHERE id = ${userId}`"
        context: "Node.js project"
      expected_patterns:
        - "SQL injection"
        - "parameterized"
        - "prepared statement"
      quality_score_baseline: 0.95

    - id: "BB-003"
      description: "Create an ADR for an architecture decision"
      input:
        task: "Document the decision to use PostgreSQL instead of MongoDB"
        context: "Project with relational data and ACID transactions"
      expected_patterns:
        - "Context"
        - "Decision"
        - "Consequences"
      quality_score_baseline: 0.88

  # Compare current outputs vs baseline weekly
  comparison:
    method: "semantic_similarity + pattern_match"
    alert_if_similarity_below: 0.85
    alert_channels: ["slack:#agent-quality", "email:tech-lead"]
```

Weekly workflow:
1. Run the golden behaviors against the current agent
2. Compare outputs with the baseline (semantic similarity + patterns)
3. If similarity < 0.85 → alert + investigate the cause
4. If similarity > 0.95 → update the baseline (the agent improved)
5. Document results in .factory/drift-reports/

### 10.3 Context Quality Score (CQS)
Formula for measuring context quality in an active session:

```
CQS = (useful_tokens / total_tokens) × (1 - error_rate) × recency_factor
```

Where:

- useful_tokens: tokens that contributed to correct outputs (estimated via tool call success and task completion)
- total_tokens: total tokens consumed in the session
- error_rate: ratio of failed tool calls / total tool calls
- recency_factor: decay factor based on the session's age
"""Context Quality Score calculator."""import mathfrom dataclasses import dataclass
@dataclassclass SessionMetrics: """Metricas de una sesion activa.""" total_tokens: int useful_tokens: int # tokens en calls exitosas tool_calls_total: int tool_calls_failed: int session_age_minutes: int
def calculate_cqs(metrics: SessionMetrics) -> float: """Calcula el Context Quality Score (0.0 - 1.0).
Args: metrics: Metricas de la sesion activa.
Returns: CQS entre 0.0 y 1.0. Target >= 0.70. """ # Token efficiency ratio token_ratio = metrics.useful_tokens / max(metrics.total_tokens, 1)
# Error rate error_rate = metrics.tool_calls_failed / max(metrics.tool_calls_total, 1)
# Recency factor: decae exponencialmente despues de 120 minutos # 0-120 min: ~1.0, 240 min: ~0.5, 360 min: ~0.25 half_life_minutes = 120 recency_factor = math.exp( -0.693 * max(metrics.session_age_minutes - half_life_minutes, 0) / half_life_minutes ) recency_factor = min(recency_factor, 1.0)
cqs = token_ratio * (1 - error_rate) * recency_factor return round(cqs, 4)Umbrales de CQS:
| CQS | Status | Action |
|---|---|---|
| >= 0.80 | Green | Normal operation |
| 0.60 - 0.79 | Yellow | Consider compaction |
| 0.40 - 0.59 | Orange | Urgent compaction or relay |
| < 0.40 | Red | Start a new session immediately |
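A worked example of the formula against these bands, using hypothetical session numbers:

```python
import math


def classify_cqs(cqs: float) -> str:
    """Map a CQS value to the bands in the thresholds table."""
    if cqs >= 0.80:
        return "green"
    if cqs >= 0.60:
        return "yellow"
    if cqs >= 0.40:
        return "orange"
    return "red"


# Hypothetical 3-hour session (illustrative numbers, not real telemetry)
total_tokens, useful_tokens = 80_000, 52_000
tool_calls_total, tool_calls_failed = 40, 4
session_age_minutes, half_life = 180, 120

token_ratio = useful_tokens / total_tokens         # 0.65
error_rate = tool_calls_failed / tool_calls_total  # 0.10
recency = math.exp(-0.693 * max(session_age_minutes - half_life, 0) / half_life)

cqs = round(token_ratio * (1 - error_rate) * recency, 4)
band = classify_cqs(cqs)   # ~0.41 → "orange": urgent compaction or relay
```

Note how a healthy token ratio (0.65) still lands in the orange band once the session passes the 120-minute half-life — recency alone halves the score by the four-hour mark.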
### 10.4 Token Efficiency Analysis

Metrics to evaluate token consumption efficiency:

Tokens per successful action (TPA): the primary efficiency metric — lower is better.
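TPA and its bands can be computed directly from the two counters; `tpa_band` is a hypothetical helper using the green/yellow/red cutoffs from the TPA panel:

```python
def tpa_band(tokens_total: int, tasks_success: int) -> tuple[float, str]:
    """Tokens per successful action, banded per the panel thresholds
    (green < 5000, yellow < 15000, red >= 15000)."""
    tpa = tokens_total / max(tasks_success, 1)
    band = "green" if tpa < 5_000 else "yellow" if tpa < 15_000 else "red"
    return tpa, band
```

A rising TPA with a flat success rate usually means context bloat or tool-call loops, not harder tasks.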
panels: - title: "Tokens per Successful Action (TPA)" type: timeseries query: | rate(agent_tokens_total[1h]) / rate(agent_tasks_success_total[1h]) thresholds: green: "< 5000" yellow: "< 15000" red: ">= 15000" description: "Eficiencia: tokens consumidos por cada accion exitosa"
- title: "Context Utilization Curve" type: timeseries queries: - label: "CQS" query: agent_context_quality_score - label: "Token ratio" query: agent_useful_tokens / agent_total_tokens - label: "Compaction threshold" query: 0.60 description: "CQS a lo largo de la sesion — indica momento optimo para compaction"
- title: "Session Age vs CQS" type: scatter query: | agent_context_quality_score by (session_age_bucket) description: "Correlacion entre edad de sesion y calidad de contexto"Timing optimo de compaction basado en CQS:
```
Session start ──> CQS ~1.0
      │
      │ (120 min) CQS ~0.80 → monitor
      │
      │ (180 min) CQS ~0.65 → schedule compaction
      │
      │ (240 min) CQS ~0.50 → urgent compaction or relay
      │
      v (300 min) CQS < 0.40 → new session mandatory
```

### 10.5 Regression Detection
Run the golden behavior dataset weekly to detect regressions:

```bash
# Weekly run (CI job or cron)
# 1. Run the behavioral baseline
python3 baseline/scripts/fab-eval-builders.py \
  --golden-dataset observability/agent_behavioral_baseline.yaml \
  --output .factory/drift-reports/drift-$(date +%Y%m%d).json

# 2. Compare with the previous week
python3 baseline/scripts/compare-evals.py \
  --baseline .factory/drift-reports/drift-$(date -d "7 days ago" +%Y%m%d).json \
  --candidate .factory/drift-reports/drift-$(date +%Y%m%d).json \
  --threshold 0.05

# 3. If there is a regression, notify
# compare-evals.py returns exit code 1 if regression > threshold
```

Alert channels:
| Regression level | Channel | Action |
|---|---|---|
| 0-5% | Log only | Monitor the trend |
| 5-15% | Slack #agent-quality | Investigate cause, review prompts |
| 15-30% | Slack + email tech lead | Freeze changes, roll back if necessary |
| > 30% | PagerDuty | Incident response, immediate rollback |
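The escalation table maps directly onto a routing function; the behavior at exactly 5/15/30% is an assumption, since the table leaves the boundaries open:

```python
def regression_alert_channel(regression_pct: float) -> str:
    """Route a weekly regression (percent score drop) to an alert channel
    per the escalation table. Exact boundary handling is an assumption."""
    if regression_pct > 30:
        return "pagerduty"
    if regression_pct > 15:
        return "slack+email:tech-lead"
    if regression_pct > 5:
        return "slack:#agent-quality"
    return "log-only"
```

Keeping the routing in one pure function makes the policy trivially testable and keeps the CI job (which only knows the regression percentage) decoupled from the notification plumbing.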
### 10.6 Diagnostic Toolkit

Tools for diagnosing and resolving agent drift:

Proposed extension of fab-eval-builders.py:

```bash
# Drift check against the behavioral baseline
python3 baseline/scripts/fab-eval-builders.py --drift-check \
  --baseline observability/agent_behavioral_baseline.yaml \
  --period 7d

# Expected output:
# DRIFT REPORT — Last 7 days
# ──────────────────────────
# BB-001 (unit test gen):    0.93 → 0.91 (-2.1%) ✅ OK
# BB-002 (security review):  0.95 → 0.88 (-7.4%) ⚠️ WARN
# BB-003 (ADR creation):     0.88 → 0.85 (-3.4%) ✅ OK
# ──────────────────────────
# Overall drift: -4.3% ⚠️ INVESTIGATE BB-002
```

Grafana dashboard template for agent drift:
panels: - title: "Behavioral Drift Trend (weekly)" type: timeseries queries: - label: "Overall similarity" query: agent_behavioral_similarity_avg - label: "Threshold" query: 0.85 description: "Tendencia semanal de similitud con behavioral baseline"
- title: "CQS Distribution by Session" type: histogram query: agent_context_quality_score description: "Distribucion de CQS al final de cada sesion"
- title: "Drift by Behavior Category" type: bar query: | agent_behavioral_similarity by (behavior_id) description: "Similitud por categoria de comportamiento"
- title: "Token Efficiency Trend" type: timeseries query: | rate(agent_tokens_total[24h]) / rate(agent_tasks_success_total[24h]) description: "TPA trending — detectar degradacion de eficiencia"Umbrales de alerta:
| Metric | Warn | Critical | Action |
|---|---|---|---|
| Behavioral similarity | < 85% | < 70% | Investigate / roll back |
| Average CQS | < 0.65 | < 0.50 | Adjust compaction timing |
| TPA (tokens/action) | > 10K | > 20K | Review context loading |
| Weekly regression | > 15% | > 30% | Freeze changes |
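The drift report shown in §10.6 can be reproduced with a small helper. `drift_report` is hypothetical; the usage numbers mirror the example output above, modulo rounding:

```python
def drift_report(baseline: dict[str, float], current: dict[str, float],
                 warn_drop: float = 0.05) -> dict:
    """Relative week-over-week drift per golden behavior, flagging any score
    that dropped more than warn_drop (default 5%), plus the overall average."""
    behaviors, deltas = {}, []
    for bb_id, base in baseline.items():
        delta = (current[bb_id] - base) / base
        deltas.append(delta)
        behaviors[bb_id] = {
            "delta_pct": round(delta * 100, 1),
            "status": "WARN" if delta < -warn_drop else "OK",
        }
    overall = round(sum(deltas) / len(deltas) * 100, 1)
    return {"behaviors": behaviors, "overall_drift_pct": overall}


# Scores from the §10.6 example report
report = drift_report(
    {"BB-001": 0.93, "BB-002": 0.95, "BB-003": 0.88},
    {"BB-001": 0.91, "BB-002": 0.88, "BB-003": 0.85},
)
```

Flagging per behavior rather than only on the overall average matters: here the overall drift (about -4.3%) stays under the 5% warn line while BB-002 alone has regressed more than 7%.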
## 11. Implementation Checklist

### Quick Start (Day 1)

- Install Helicone proxy OR Langfuse
- Instrument the first LLM call with cost tracking
- Verify that spans carry the 10 mandatory attributes

### Production (Week 1)

- Configure the dashboard with the 8 essential panels
- Create alerts for budget and error rate
- Implement a basic semantic cache
- Configure cost attribution by feature

### Maturity (Month 1)

- Automatic model routing by complexity
- HANDOFF.md automation at the end of sessions
- Complete EU AI Act audit trail
- Monthly cost review with the team

### Agent Drift (Month 2)

- Create a golden behavior dataset with at least 5 cases
- Configure weekly execution of the behavioral baseline
- Implement the CQS calculator in agent sessions
- Configure the drift dashboard in Grafana
- Establish regression alerts (warn 15%, critical 30%)
- Integrate `compare-evals.py` into the weekly CI job
Document generated as an operational companion to the AI-First Engineering Framework v7.6.0 — GAP-18 (Tool Landscape), GAP-19 (Agent Production Monitoring), and GAP-20 (Agent Drift Detection) closed