
LLM Observability & FinOps Implementation Guide


Companion to the AI-First Engineering Framework v6.5

Without observability, AI applications are black boxes:

  • You don't know what each AI feature costs.
  • You don't detect quality degradation until a user reports it.
  • You can't optimize what you don't measure.
  • You can't comply with the EU AI Act (Aug 2026) without an audit trail.

“AI is an amplifier — it magnifies an organization’s existing strengths and weaknesses.” — DORA 2025

```text
┌─────────────────────────────────────────────────┐
│ DASHBOARDS                                      │
│ (Grafana / Datadog / Custom)                    │
├─────────────────────────────────────────────────┤
│ LLMOps PLATFORM                                 │
│ (Langfuse / LangSmith / Arize / Helicone)       │
├─────────────────────────────────────────────────┤
│ TRACING / METRICS                               │
│ (OpenTelemetry / Prometheus)                    │
├─────────────────────────────────────────────────┤
│ APPLICATION CODE                                │
│ (spans, token counters, cost attribution)       │
└─────────────────────────────────────────────────┘
```

2.2 Complete Tool Landscape (15+ Platforms, Mar 2026)


GAP-18: an exhaustive map of the ecosystem: not just the six leading platforms, but every relevant tool an enterprise team should know.

2.2.1 LLMOps Platforms (tracing + eval + cost)

| Platform | Best for | Tracing | Evals | Cost | OSS | Pricing (Mar 2026) |
|---|---|---|---|---|---|---|
| Langfuse | Startups, self-hosted | ✅ Full | ✅ Built-in | ✅ | ✅ Yes | Free tier + self-host |
| LangSmith | LangChain teams | ✅ Full | ✅ Built-in | ❌ | No | $39/seat/mo |
| Braintrust | Evals + observability | ✅ | ✅ Best-in-class | ❌ | No | Free tier, $25/seat pro |
| Arize Phoenix | ML teams, model monitoring | ✅ Full | ✅ LLM judges | ✅ | Partial | Free tier + enterprise |
| Helicone | Fast cost tracking | ✅ Proxy | ❌ Basic | ✅ Best | ✅ | Generous free tier |
| Weights & Biases (Weave) | ML teams already on W&B | ✅ Full | ✅ Built-in | ❌ | No | $50/seat/mo |
| Galileo | LLM quality scoring | ✅ | ✅ Guardrail Studio | ❌ | No | Enterprise pricing |
| HoneyHive | Pipeline debugging | ✅ Full | ✅ Data-driven | ❌ | No | Free tier + enterprise |

2.2.2 Infra-Observability Platforms (APM + AI)

| Platform | Best for | LLM Tracing | Native APM | AI Cost | Pricing |
|---|---|---|---|---|---|
| Datadog AI Observability | Enterprise, already on Datadog | ✅ LLM traces | ✅ Full APM | — | Included in DD plan |
| New Relic AI Monitoring | Enterprise, already on NR | ✅ LLM traces | ✅ Full APM | — | Included in NR plan |
| Dynatrace AI Observability | Enterprises wanting auto-instrumentation | ✅ Auto-detect | ✅ Full APM | — | Enterprise license |
| Elastic AI Observability | Self-hosted enterprise | ✅ LLM spans | ✅ Full APM | — | Free (self) / Cloud |

| Tool | Focus | Differentiator | OSS | Pricing |
|---|---|---|---|---|
| OpenLLMetry | OTel auto-instrumentation for LLMs | Drop-in SDK: one line instruments Anthropic/OpenAI/Cohere | ✅ Yes | Free |
| Traceloop | Developer experience for OTel LLM | Builds on OpenLLMetry, UI included | ✅ Core | Free + cloud |
| PromptLayer | Prompt versioning + monitoring | Prompt registry + A/B testing | ❌ No | Free tier |
| Log10 | LLM feedback + debugging | Integrated human feedback loops | ❌ No | Free tier |
| Portkey AI Gateway | AI gateway with built-in observability | Load balancing + caching + observability | ✅ Yes | Free tier + enterprise |
```text
Already using an enterprise APM (Datadog/NR/Dynatrace)?
├── YES → Enable the AI module of your existing APM
└── NO → Do you need self-hosting?
    ├── YES → Langfuse (self-hosted) + OpenLLMetry
    └── NO → What is the priority?
        ├── Cost → Helicone (proxy, 5-minute setup)
        ├── Evals → Braintrust (evals + traces)
        ├── Tracing → Langfuse Cloud (traces + dashboard)
        └── Everything → Langfuse + Portkey (gateway + traces)
```
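For team tooling, the same decision tree can be encoded as a small helper; a minimal sketch (the function and flag names are illustrative, not from any vendor SDK):

```python
def pick_observability_stack(has_enterprise_apm: bool,
                             needs_self_hosting: bool,
                             priority: str = "all") -> str:
    """Encodes the decision tree above. priority: 'cost', 'evals', 'tracing', or 'all'."""
    if has_enterprise_apm:
        return "Enable the AI module of your existing APM"
    if needs_self_hosting:
        return "Langfuse (self-hosted) + OpenLLMetry"
    return {
        "cost": "Helicone (proxy, 5-minute setup)",
        "evals": "Braintrust (evals + traces)",
        "tracing": "Langfuse Cloud (traces + dashboard)",
    }.get(priority, "Langfuse + Portkey (gateway + traces)")

print(pick_observability_stack(False, False, "cost"))  # → Helicone (proxy, 5-minute setup)
```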

2.2.5 Base Instrumentation: OpenLLMetry (One Line)

```python
# The fastest way to instrument ANY LLM app.
# Works with Langfuse, Datadog, Jaeger, or any OTLP backend.
# pip install openllmetry-sdk

from openllmetry import init_tracing

# This auto-instruments: Anthropic, OpenAI, Cohere,
# LangChain, LlamaIndex, Pinecone, ChromaDB, etc.
init_tracing(
    exporter_endpoint="http://localhost:4317",  # OTLP endpoint
    # or: exporter_endpoint="https://cloud.langfuse.com"
    # or: exporter_endpoint="https://api.helicone.ai"
)

# Your normal code: already instrumented automatically
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
# → Span created automatically with: model, tokens, cost, latency
```
| Track | Recommended | Rationale |
|---|---|---|
| Solo | Helicone (proxy) | 5-minute setup, free cost tracking |
| Lean | Langfuse (self-hosted) | Open source, full traces, free |
| Full | Langfuse + Datadog | AI traces + integrated infra monitoring |
```yaml
# Attributes that MUST be present on every LLM span
required_span_attributes:
  - agent_id           # Agent identifier
  - llm.system         # "anthropic", "openai", "google"
  - llm.model          # "claude-sonnet-4-5", "gpt-4o"
  - llm.tokens.input   # Input tokens
  - llm.tokens.output  # Output tokens
  - llm.tokens.total   # Total tokens
  - llm.cost.usd       # Cost in USD
  - llm.latency_ms     # Latency in ms
  - trace_id           # OpenTelemetry trace ID
  - tenant_id          # Tenant/project ID
  - feature_id         # Feature that originated the call
  - spec_version       # Spec/contract version
optional_span_attributes:
  - rag.recall_at_5    # RAG recall
  - rag.precision      # RAG precision
  - cache.hit          # Whether it was a cache hit
  - cache.similarity   # Semantic similarity score
  - user_id            # For per-user cost attribution
  - session_id         # Agent session
```
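The required list can be enforced in CI or at runtime with a tiny validator; a minimal sketch (the function name and set-based approach are illustrative, not part of any SDK):

```python
# Required attributes, mirroring the YAML list above
REQUIRED_SPAN_ATTRIBUTES = {
    "agent_id", "llm.system", "llm.model",
    "llm.tokens.input", "llm.tokens.output", "llm.tokens.total",
    "llm.cost.usd", "llm.latency_ms",
    "trace_id", "tenant_id", "feature_id", "spec_version",
}

def missing_span_attributes(span_attrs: dict) -> set[str]:
    """Return the required attributes absent from a span's attribute dict."""
    return REQUIRED_SPAN_ATTRIBUTES - span_attrs.keys()

# An incomplete span fails the check with a precise diff:
span = {"agent_id": "search-agent", "llm.system": "anthropic",
        "llm.model": "claude-sonnet-4-5", "trace_id": "abc123"}
print(sorted(missing_span_attributes(span)))
```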
observability/tracing.py
"""Instrumentación OpenTelemetry para LLM calls."""
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import time
# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(
OTLPSpanExporter(endpoint="http://localhost:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-first-framework")
# Pricing por modelo (USD per 1K tokens, Mar 2026)
MODEL_PRICING = {
"claude-sonnet-4-5": {"input": 0.003, "output": 0.015},
"claude-haiku-3-5": {"input": 0.00025, "output": 0.00125},
"gpt-4o": {"input": 0.0025, "output": 0.01},
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"gemini-2.5-pro": {"input": 0.00125, "output": 0.005},
"gemini-2.0-flash": {"input": 0.0001, "output": 0.0004},
}
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
"""Calcula costo en USD."""
prices = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})
return (input_tokens * prices["input"] / 1000 +
output_tokens * prices["output"] / 1000)
def traced_llm_call(func):
"""Decorator para instrumentar llamadas LLM."""
def wrapper(*args, **kwargs):
with tracer.start_as_current_span("llm.call") as span:
model = kwargs.get("model", "unknown")
span.set_attribute("llm.model", model)
span.set_attribute("llm.system", _get_system(model))
span.set_attribute("agent_id", kwargs.get("agent_id", "default"))
span.set_attribute("feature_id", kwargs.get("feature_id", "unknown"))
start = time.time()
result = func(*args, **kwargs)
latency = (time.time() - start) * 1000
# Extract token usage from response
usage = getattr(result, "usage", None)
if usage:
input_tokens = getattr(usage, "input_tokens", 0)
output_tokens = getattr(usage, "output_tokens", 0)
span.set_attribute("llm.tokens.input", input_tokens)
span.set_attribute("llm.tokens.output", output_tokens)
span.set_attribute("llm.tokens.total", input_tokens + output_tokens)
span.set_attribute("llm.cost.usd",
calculate_cost(model, input_tokens, output_tokens))
span.set_attribute("llm.latency_ms", latency)
return result
return wrapper
def _get_system(model: str) -> str:
if "claude" in model: return "anthropic"
if "gpt" in model: return "openai"
if "gemini" in model: return "google"
return "unknown"
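To sanity-check the pricing table, the cost formula can be exercised standalone (the price entry is duplicated from MODEL_PRICING above purely for illustration):

```python
# One row of the pricing table, USD per 1K tokens (duplicated for illustration)
PRICING = {"claude-sonnet-4-5": {"input": 0.003, "output": 0.015}}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Same formula as calculate_cost: tokens / 1000 * price-per-1K."""
    p = PRICING[model]
    return input_tokens * p["input"] / 1000 + output_tokens * p["output"] / 1000

# A call with 1,000 input and 500 output tokens on claude-sonnet-4-5:
print(round(cost_usd("claude-sonnet-4-5", 1000, 500), 4))  # → 0.0105 (= 0.003 + 0.0075)
```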
observability/langfuse_setup.py
"""Setup Langfuse para observabilidad LLM."""
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
# Inicialización
langfuse = Langfuse(
public_key="pk-...",
secret_key="sk-...",
host="https://cloud.langfuse.com" # o self-hosted
)
@observe(as_type="generation")
def call_llm(prompt: str, model: str = "claude-sonnet-4-5"):
"""Llamada LLM instrumentada con Langfuse."""
import anthropic
client = anthropic.Anthropic()
langfuse_context.update_current_observation(
model=model,
metadata={"feature": "search", "version": "1.0"}
)
response = client.messages.create(
model=model,
max_tokens=4096,
messages=[{"role": "user", "content": prompt}]
)
langfuse_context.update_current_observation(
usage={
"input": response.usage.input_tokens,
"output": response.usage.output_tokens,
}
)
return response.content[0].text
@observe()
def rag_pipeline(query: str):
"""Pipeline RAG instrumentada."""
# Retrieve
langfuse_context.update_current_trace(
tags=["rag", "search"],
metadata={"tenant_id": "project-x"}
)
with langfuse_context.observe(name="retrieval") as span:
docs = retrieve_documents(query)
span.update(metadata={"doc_count": len(docs)})
# Generate
context = "\n".join(docs)
result = call_llm(f"Context: {context}\n\nQuestion: {query}")
return result
```yaml
# project/project-config.yaml (finops section)
finops:
  daily_budget_usd: 100
  alert_threshold_pct: 80
  cost_allocation:
    by_feature: true       # Cost per feature/endpoint
    by_user: true          # Cost per user
    by_model: true         # Cost per model
    by_environment: true   # dev vs staging vs prod
  optimization:
    semantic_cache:
      enabled: true
      similarity_threshold: 0.92
      ttl_seconds:
        faq: 86400            # 24h
        documentation: 43200  # 12h
        dynamic: 3600         # 1h
      target_hit_rate: 0.30
      target_cost_savings: 0.40
    model_routing:
      enabled: true
      rules:
        - condition: "token_count < 500 AND complexity == 'low'"
          model: "claude-haiku-3-5"
        - condition: "task_type == 'code_review'"
          model: "claude-sonnet-4-5"
        - condition: "task_type == 'summarization'"
          model: "gemini-2.0-flash"
```
observability/semantic_cache.py
"""Semantic cache para reducir costos de LLM calls."""
import numpy as np
from typing import Optional
class SemanticCache:
def __init__(self, similarity_threshold: float = 0.92, ttl: int = 3600):
self.threshold = similarity_threshold
self.ttl = ttl
self.cache = {} # En prod: Redis + vector store
def get(self, query: str, embedding: list[float]) -> Optional[str]:
"""Busca respuesta similar en cache."""
for key, entry in self.cache.items():
similarity = self._cosine_similarity(embedding, entry["embedding"])
if similarity >= self.threshold:
if not self._is_expired(entry):
return entry["response"]
return None
def set(self, query: str, embedding: list[float], response: str):
"""Almacena respuesta en cache."""
import time
self.cache[hash(query)] = {
"embedding": embedding,
"response": response,
"timestamp": time.time(),
"ttl": self.ttl
}
def _cosine_similarity(self, a, b) -> float:
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
def _is_expired(self, entry) -> bool:
import time
return (time.time() - entry["timestamp"]) > entry["ttl"]
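To make the similarity_threshold concrete, here is a dependency-free cosine similarity check; the vectors are toy stand-ins for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Plain-Python cosine similarity (no numpy), for illustration."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two near-duplicate "embeddings" clear the 0.92 threshold...
print(cosine([0.9, 0.1, 0.4], [0.85, 0.15, 0.42]) >= 0.92)  # → True
# ...while an unrelated one does not.
print(cosine([0.9, 0.1, 0.4], [0.1, 0.9, 0.0]) >= 0.92)     # → False
```

A higher threshold trades cache hit rate for answer fidelity, which is why the config above targets a 0.30 hit rate rather than maximizing hits.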
observability/session_cost.py
"""Tracking de costo por sesión de agente."""
class SessionCostTracker:
"""Tracks cost per agent session."""
# Thresholds (USD)
THRESHOLDS = {
"warning": 5.0, # Alerta amarilla
"critical": 20.0, # Alerta roja, requiere review
"hard_limit": 50.0, # Detiene la sesión
}
def __init__(self, session_id: str):
self.session_id = session_id
self.calls = []
self.total_cost = 0.0
def record(self, model: str, input_tokens: int,
output_tokens: int, cost: float):
self.calls.append({
"model": model,
"input_tokens": input_tokens,
"output_tokens": output_tokens,
"cost": cost
})
self.total_cost += cost
self._check_thresholds()
def _check_thresholds(self):
if self.total_cost >= self.THRESHOLDS["hard_limit"]:
raise SessionBudgetExceeded(
f"Session {self.session_id} exceeded hard limit: "
f"${self.total_cost:.2f}"
)
elif self.total_cost >= self.THRESHOLDS["critical"]:
print(f"🔴 CRITICAL: Session cost ${self.total_cost:.2f}")
elif self.total_cost >= self.THRESHOLDS["warning"]:
print(f"🟡 WARNING: Session cost ${self.total_cost:.2f}")
def report(self) -> dict:
return {
"session_id": self.session_id,
"total_cost_usd": round(self.total_cost, 4),
"total_calls": len(self.calls),
"by_model": self._group_by_model(),
"avg_cost_per_call": round(
self.total_cost / max(len(self.calls), 1), 4
),
}
def _group_by_model(self) -> dict:
groups = {}
for call in self.calls:
m = call["model"]
if m not in groups:
groups[m] = {"calls": 0, "cost": 0, "tokens": 0}
groups[m]["calls"] += 1
groups[m]["cost"] += call["cost"]
groups[m]["tokens"] += call["input_tokens"] + call["output_tokens"]
return groups
class SessionBudgetExceeded(Exception):
pass

5.1 Essential Metrics (Grafana/Prometheus)

observability/dashboards/llm_metrics.yaml
```yaml
# Prometheus queries for the LLM dashboard
panels:
  - title: "Cost per Day"
    query: sum(increase(llm_cost_usd_total[24h]))
    type: stat
    thresholds: [50, 80, 100]
  - title: "Tokens per Hour"
    query: sum(increase(llm_tokens_total[1h]))
    type: graph
  - title: "Latency p50/p95/p99"
    queries:
      - histogram_quantile(0.50, rate(llm_latency_seconds_bucket[5m]))
      - histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m]))
      - histogram_quantile(0.99, rate(llm_latency_seconds_bucket[5m]))
    type: graph
  - title: "Cache Hit Rate"
    query: |
      rate(semantic_cache_hits_total[1h]) /
      (rate(semantic_cache_hits_total[1h]) + rate(semantic_cache_misses_total[1h]))
    type: gauge
    thresholds: [0.20, 0.30, 0.50]
  - title: "Cost by Feature"
    query: sum by (feature_id) (increase(llm_cost_usd_total[24h]))
    type: bar
  - title: "Error Rate"
    query: |
      rate(llm_errors_total[5m]) / rate(llm_calls_total[5m])
    type: stat
    thresholds: [0.01, 0.05, 0.10]
  - title: "RAG Recall@5"
    query: avg(rag_recall_at_5)
    type: gauge
    thresholds: [0.70, 0.85, 0.95]
  - title: "Model Distribution"
    query: sum by (llm_model) (increase(llm_calls_total[24h]))
    type: pie
```

Note: the totals panels use `increase()` rather than `rate()` so that the stat reads as dollars per day (or tokens per hour), matching the thresholds; `rate()` would report per-second values.
observability/alerts/llm_alerts.yaml
```yaml
groups:
  - name: llm_cost_alerts
    rules:
      - alert: DailyBudgetWarning
        expr: sum(increase(llm_cost_usd_total[24h])) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Daily LLM cost approaching budget (>$80)"
      - alert: DailyBudgetCritical
        expr: sum(increase(llm_cost_usd_total[24h])) > 100
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Daily LLM budget exceeded ($100)"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
      - alert: HighErrorRate
        expr: rate(llm_errors_total[5m]) / rate(llm_calls_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: LowCacheHitRate
        expr: |
          rate(semantic_cache_hits_total[1h]) /
          (rate(semantic_cache_hits_total[1h]) + rate(semantic_cache_misses_total[1h])) < 0.15
        for: 30m
        labels:
          severity: warning
```

Source: OpenAI “Harness Engineering” (Feb 2026): agents that query metrics and logs directly for autonomous debugging.

Agent-Legible Observability is the practice of exposing logs, metrics, and traces directly to agents so they can reason about system behavior, diagnose problems, and make informed decisions without human intervention.

“We exposed LogQL and PromQL to the agents. When a test fails, the agent queries the logs, identifies the error, correlates it with metrics, and proposes a fix, all autonomously.” — OpenAI Harness Engineering, Feb 2026

Expose observability as MCP tools or CLI wrappers that agents can invoke:

mcp_servers/observability_tools.py
"""MCP Server que expone metricas y logs a agentes."""
from mcp.server.fastmcp import FastMCP
mcp = FastMCP("observability-tools", version="1.0.0")
@mcp.tool()
async def query_logs(service: str, level: str = "error",
duration: str = "1h", limit: int = 20) -> str:
"""Consulta logs del servicio. Retorna logs recientes filtrados.
Args:
service: Nombre del servicio (ej: 'api', 'worker', 'frontend')
level: Nivel minimo (debug, info, warn, error)
duration: Ventana de tiempo (ej: '1h', '30m', '24h')
limit: Maximo de entradas a retornar
"""
# LogQL query
query = f'{{app="{service}"}} |= "{level}" | json | line_format "{{.timestamp}} {{.level}} {{.message}}"'
# En produccion: llamar a Loki API
return f"# Logs for {service} (last {duration}, level>={level})\n..."
@mcp.tool()
async def query_metrics(metric: str, duration: str = "5m",
quantile: float = 0.95) -> str:
"""Consulta metricas de Prometheus. Retorna valores actuales.
Args:
metric: Nombre de la metrica (ej: 'http_duration_seconds', 'llm_cost_usd_total')
duration: Ventana para rate() (ej: '5m', '1h')
quantile: Percentil para histogramas (0.50, 0.95, 0.99)
"""
# PromQL query
query = f'histogram_quantile({quantile}, rate({metric}_bucket[{duration}]))'
return f"# Metric: {metric} p{int(quantile*100)} over {duration}\n..."
@mcp.tool()
async def check_golden_principles() -> str:
"""Verifica el estado de todos los Golden Principles del proyecto.
Retorna un reporte de cumplimiento."""
checks = [
{"id": "GP-01", "name": "Startup < 800ms", "status": "PASS", "value": "623ms"},
{"id": "GP-02", "name": "No span > 2s", "status": "PASS", "value": "max 1.2s"},
{"id": "GP-03", "name": "No direct DB from handlers", "status": "PASS"},
]
report = "# Golden Principles Status\n"
for c in checks:
emoji = "" if c["status"] == "PASS" else ""
report += f"{emoji} {c['id']}: {c['name']}{c['status']}\n"
return report
@mcp.tool()
async def get_cost_report(period: str = "today") -> str:
"""Obtiene reporte de costos LLM del periodo.
Args:
period: 'today', 'week', 'month'
"""
return f"# Cost Report ({period})\nTotal: $47.23\nBy model: claude-sonnet-4-5: $32.10, gpt-4o: $15.13\n..."
if __name__ == "__main__":
mcp.run(transport="stdio")
```json
{
  "mcpServers": {
    "observability-tools": {
      "command": "python",
      "args": ["mcp_servers/observability_tools.py"],
      "env": {
        "LOKI_URL": "http://localhost:3100",
        "PROMETHEUS_URL": "http://localhost:9090",
        "LANGFUSE_PUBLIC_KEY": "pk-...",
        "LANGFUSE_HOST": "https://cloud.langfuse.com"
      }
    }
  }
}
```

For teams not using MCP, expose the same capabilities as scripts the agent can execute:

```bash
#!/bin/bash
# scripts/agent-query-logs.sh: wrapper for agents
# Usage: ./scripts/agent-query-logs.sh <service> [level] [duration]
SERVICE=${1:?"Usage: agent-query-logs.sh <service> [level] [duration]"}
LEVEL=${2:-"error"}
DURATION=${3:-"1h"}
curl -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode "query={app=\"$SERVICE\"} |= \"$LEVEL\"" \
  --data-urlencode "start=$(date -d "-$DURATION" +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode "limit=20" | \
  jq -r '.data.result[].values[][1]' | head -30
```

```bash
#!/bin/bash
# scripts/agent-query-metrics.sh: wrapper for agents
# Usage: ./scripts/agent-query-metrics.sh <promql_query>
QUERY=${1:?"Usage: agent-query-metrics.sh '<promql_query>'"}
curl -s "http://localhost:9090/api/v1/query" \
  --data-urlencode "query=$QUERY" | \
  jq -r '.data.result[] | "\(.metric | to_entries | map("\(.key)=\(.value)") | join(", ")): \(.value[1])"'
```
| Scenario | Agent query | Autonomous action |
|---|---|---|
| Test fails with timeout | `query_logs("api", "error", "5m")` | Identifies the slow query, suggests an index |
| Daily cost climbing | `get_cost_report("today")` | Detects the feature with excessive usage, suggests caching |
| Degraded p95 latency | `query_metrics("http_duration_seconds", "10m", 0.95)` | Correlates it with a recent deployment |
| Golden Principle violation | `check_golden_principles()` | Identifies the span > 2s, opens a fix |
| Low RAG recall | `query_metrics("rag_recall_at_5", "1h")` | Reviews embeddings, suggests re-indexing |
| Track | Implementation | Effort |
|---|---|---|
| Solo | CLI wrappers (scripts/agent-*.sh) | 1 hour |
| Lean | Basic MCP server (3-4 tools) | 1 day |
| Full | Full MCP server + Langfuse API + proactive alerts | 1 week |
| Anti-pattern | Problem | Solution |
|---|---|---|
| Exposing all logs unfiltered | Floods the context window | Limit to 20-30 lines, relevant level only |
| Queries without timeouts | Agent waits indefinitely | 5s timeout on every query |
| Metrics without context | Agent doesn't know what “normal” is | Include thresholds and baselines in the response |
| No rate limiting | Agent loops, querying 100x | Max 10 queries per session |
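The mitigations in this table can be bundled into one guard around any observability tool; a minimal sketch (the decorator name, `max_lines`, and `max_queries` parameters are illustrative, not from the source):

```python
import functools

class QueryBudgetExceeded(Exception):
    pass

def guarded_tool(max_lines: int = 30, max_queries: int = 10, timeout_s: float = 5.0):
    """Wrap an observability tool: truncate output and cap queries per session.
    timeout_s is forwarded to the tool, which is expected to enforce it
    (e.g. as an HTTP client timeout)."""
    def decorator(func):
        calls = {"count": 0}  # per-session budget for this tool

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if calls["count"] >= max_queries:
                raise QueryBudgetExceeded(
                    f"{func.__name__}: max {max_queries} queries per session")
            calls["count"] += 1
            result = func(*args, timeout=timeout_s, **kwargs)
            lines = result.splitlines()
            if len(lines) > max_lines:
                result = ("\n".join(lines[:max_lines]) +
                          f"\n... (+{len(lines) - max_lines} lines truncated)")
            return result
        return wrapper
    return decorator

@guarded_tool(max_lines=3, max_queries=2)
def fake_query_logs(service: str, timeout: float) -> str:
    # Stand-in for a real Loki/Prometheus client call
    return "\n".join(f"line {i}" for i in range(10))

print(fake_query_logs("api"))  # first call: truncated to 3 lines plus a marker
```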

| Strategy | When to use | How |
|---|---|---|
| Tool result clearing | Always | Clear old tool outputs from the history |
| Summarize & restart | Context > 80% capacity | Summarize the conversation, restart with the summary |
| HANDOFF.md automation | Session/agent change | Write state to a file, the next agent reads it |
| Progressive pruning | Long sessions (>1h) | Drop intermediate messages, keep decisions |
```markdown
<!-- HANDOFF.md — Auto-generated at session end -->
# Session Handoff

## Timestamp
2026-03-19T15:30:00Z

## Current State
- Phase: F06_Build
- Task: Implementing search endpoint
- Branch: feature/ai-search
- Tests: 14/16 passing (2 pending for edge cases)

## Key Decisions Made
1. Chose vector search over keyword (ADR-002)
2. Using pgvector extension for embeddings
3. Batch indexing every 15 min (not real-time)

## Open Issues
- [ ] Edge case: empty query string returns 500
- [ ] Performance test needed for >10K documents
- [ ] Cache invalidation strategy TBD

## Files Modified
- src/api/search.ts (new)
- src/services/embedding.ts (new)
- tests/search.test.ts (14 tests)
- project-config.yaml (updated phase)

## Context for Next Session
The search API is functional but needs edge case handling.
Start by fixing the empty query bug in search.ts:42.
Then run the full test suite: `npm test -- --grep search`
```
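A file like this can be rendered automatically when a session ends; a minimal sketch, where the `SessionState` fields are assumptions mirroring the template rather than a fixed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SessionState:
    """Illustrative session state; fields mirror the HANDOFF.md template."""
    phase: str
    task: str
    branch: str
    tests: str
    decisions: list = field(default_factory=list)
    open_issues: list = field(default_factory=list)
    files_modified: list = field(default_factory=list)
    next_steps: str = ""

def render_handoff(state: SessionState) -> str:
    """Render session state as a HANDOFF.md document."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        "# Session Handoff",
        f"## Timestamp\n{ts}",
        "## Current State",
        f"- Phase: {state.phase}",
        f"- Task: {state.task}",
        f"- Branch: {state.branch}",
        f"- Tests: {state.tests}",
        "## Key Decisions Made",
        *[f"{i}. {d}" for i, d in enumerate(state.decisions, 1)],
        "## Open Issues",
        *[f"- [ ] {issue}" for issue in state.open_issues],
        "## Files Modified",
        *[f"- {path}" for path in state.files_modified],
        "## Context for Next Session",
        state.next_steps,
    ]
    return "\n".join(lines)

state = SessionState(
    phase="F06_Build", task="Implementing search endpoint",
    branch="feature/ai-search", tests="14/16 passing",
    decisions=["Chose vector search over keyword (ADR-002)"],
    open_issues=["Edge case: empty query string returns 500"],
    files_modified=["src/api/search.ts"],
    next_steps="Fix the empty query bug first.",
)
print(render_handoff(state))
```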

7.3 Compact Instructions (Preservation Priority)


When the agent runs /compact, preserve content in this order:

  1. 🔴 Never summarize: ADRs, architecture decisions, stack choices
  2. 🟠 Keep key facts: modified files, test status, open errors
  3. 🟡 Summarize: tool outputs, intermediate logs, explorations
  4. 🟢 Discard: casual conversation, failed attempts already resolved
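A compaction pass can classify history items against these tiers before deciding what to keep; a minimal sketch (the category labels and `kind` field are illustrative):

```python
# Preservation tiers for /compact, mirroring the ordered list above
PRESERVATION_PRIORITY = {
    "adr": 1, "architecture_decision": 1, "stack_choice": 1,    # never summarize
    "file_modified": 2, "test_status": 2, "open_error": 2,      # keep key facts
    "tool_output": 3, "intermediate_log": 3, "exploration": 3,  # summarize
    "casual": 4, "resolved_attempt": 4,                         # discard
}

def compact(history: list[dict]) -> list[dict]:
    """Keep tiers 1-2 verbatim, flag tier 3 for summarization, drop tier 4."""
    kept = []
    for item in history:
        tier = PRESERVATION_PRIORITY.get(item["kind"], 3)  # unknown kinds: summarize
        if tier <= 2:
            kept.append(item)
        elif tier == 3:
            kept.append({**item, "summarize": True})
        # tier 4 items are dropped
    return kept

history = [
    {"kind": "adr", "text": "ADR-002: vector search"},
    {"kind": "tool_output", "text": "500 lines of logs"},
    {"kind": "casual", "text": "thanks!"},
]
print([h["kind"] for h in compact(history)])  # → ['adr', 'tool_output']
```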

For the full classification and conformity assessment workflow, see the OWASP Agentic Security Guide §8.

For high-risk AI systems, the EU AI Act requires:

| Requirement | Implementation | Article |
|---|---|---|
| Audit trail | Langfuse traces with full context | Art. 12 |
| Decision explainability | Store reasoning in span attributes | Art. 13 |
| Data provenance | Source tag on every RAG document | Art. 10 |
| Human oversight records | Log of human approvals/rejections | Art. 14 |
| Performance monitoring | Dashboards with continuous metrics | Art. 15 |
| Incident reporting | Alerts + documented runbook | Art. 62 |
| AI literacy training records | Record of completed trainings | Art. 4 |
```yaml
# Mapping from our framework to ISO 42001
iso_42001_mapping:
  "4.1 Understanding context": F01_Strategy
  "5.1 Leadership": project-config.yaml governance
  "6.1 Risk assessment": F08_Security compliance_matrix
  "7.1 Resources": finops budget tracking
  "8.1 AI system lifecycle": F01-F10 phases
  "9.1 Monitoring": Observability dashboards
  "10.1 Continual improvement": F10_Evolution
```

9. Agent Production Monitoring — Health Score & KPIs


GAP-19: production metrics specific to AI agents: a composite health score, tool success rate, context utilisation, and cost efficiency.

Monitoring an AI agent in production is different from monitoring an HTTP service:

  • Latency/error rate is not enough: an agent can “work” (200 OK) yet give terrible answers.
  • Cost is variable and can explode without warning.
  • Tools can fail silently.
  • The context window can saturate, causing gradual degradation (not errors).

9.2 Agent Health Score (AHS) — Composite Metric


A single number (0-100) indicating an agent's overall health in production:

observability/agent_health_score.yaml
```yaml
agent_health_score:
  name: "Agent Health Score (AHS)"
  range: 0-100
  target: ">= 80 (Green), 60-79 (Yellow), < 60 (Red)"
  components:
    task_success_rate:
      weight: 0.30
      description: "% of tasks completed correctly"
      metric: "agent_tasks_success_total / agent_tasks_total"
      thresholds:
        green: ">= 0.90"
        yellow: ">= 0.70"
        red: "< 0.70"
    tool_success_rate:
      weight: 0.20
      description: "% of successful tool calls"
      metric: "agent_tool_calls_success / agent_tool_calls_total"
      thresholds:
        green: ">= 0.95"
        yellow: ">= 0.85"
        red: "< 0.85"
    context_utilisation:
      weight: 0.15
      description: "Context window usage efficiency"
      metric: "agent_context_tokens_used / agent_context_tokens_available"
      thresholds:
        green: "0.30 - 0.70"    # sweet spot
        yellow: "0.70 - 0.90"   # approaching the limit
        red: "> 0.90 OR < 0.10" # saturated or underused
    cost_efficiency:
      weight: 0.15
      description: "Cost per successful task (USD)"
      metric: "agent_cost_total_usd / agent_tasks_success_total"
      thresholds:
        green: "< budget_per_task * 1.0"
        yellow: "< budget_per_task * 1.5"
        red: ">= budget_per_task * 1.5"
    latency_p95:
      weight: 0.10
      description: "End-to-end agent p95 latency"
      metric: "histogram_quantile(0.95, agent_task_duration_seconds)"
      thresholds:
        green: "< 10s"
        yellow: "< 30s"
        red: ">= 30s"
    error_escalation_rate:
      weight: 0.10
      description: "% of tasks that required human escalation"
      metric: "agent_escalations_total / agent_tasks_total"
      thresholds:
        green: "< 0.10"
        yellow: "< 0.25"
        red: ">= 0.25"

# AHS = sum(component_score * weight)
# component_score: green=100, yellow=60, red=20
```

9.3 Python Implementation — Health Score Calculator

observability/agent_health.py
"""Agent Health Score calculator for production monitoring."""
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
class HealthLevel(Enum):
GREEN = "green" # score >= 80
YELLOW = "yellow" # score 60-79
RED = "red" # score < 60
@dataclass
class AgentMetrics:
"""Metricas en tiempo real de un agente."""
tasks_total: int = 0
tasks_success: int = 0
tool_calls_total: int = 0
tool_calls_success: int = 0
context_tokens_used: int = 0
context_tokens_available: int = 200_000 # default 200K
total_cost_usd: float = 0.0
budget_per_task_usd: float = 0.50 # configurable
latencies_ms: list = field(default_factory=list)
escalations: int = 0
@property
def task_success_rate(self) -> float:
return self.tasks_success / max(self.tasks_total, 1)
@property
def tool_success_rate(self) -> float:
return self.tool_calls_success / max(self.tool_calls_total, 1)
@property
def context_utilisation(self) -> float:
return self.context_tokens_used / max(self.context_tokens_available, 1)
@property
def cost_per_success(self) -> float:
return self.total_cost_usd / max(self.tasks_success, 1)
@property
def latency_p95_ms(self) -> float:
if not self.latencies_ms:
return 0
sorted_lat = sorted(self.latencies_ms)
idx = int(len(sorted_lat) * 0.95)
return sorted_lat[min(idx, len(sorted_lat) - 1)]
@property
def escalation_rate(self) -> float:
return self.escalations / max(self.tasks_total, 1)
def calculate_health_score(metrics: AgentMetrics) -> dict:
"""Calcula el Agent Health Score (0-100)."""
def _score(value: float, green_check, yellow_check) -> int:
if green_check(value): return 100
if yellow_check(value): return 60
return 20
components = {
"task_success_rate": {
"value": metrics.task_success_rate,
"score": _score(metrics.task_success_rate,
lambda v: v >= 0.90, lambda v: v >= 0.70),
"weight": 0.30,
},
"tool_success_rate": {
"value": metrics.tool_success_rate,
"score": _score(metrics.tool_success_rate,
lambda v: v >= 0.95, lambda v: v >= 0.85),
"weight": 0.20,
},
"context_utilisation": {
"value": metrics.context_utilisation,
"score": _score(metrics.context_utilisation,
lambda v: 0.30 <= v <= 0.70,
lambda v: 0.10 <= v <= 0.90),
"weight": 0.15,
},
"cost_efficiency": {
"value": metrics.cost_per_success,
"score": _score(metrics.cost_per_success,
lambda v: v < metrics.budget_per_task_usd,
lambda v: v < metrics.budget_per_task_usd * 1.5),
"weight": 0.15,
},
"latency_p95": {
"value": metrics.latency_p95_ms / 1000, # to seconds
"score": _score(metrics.latency_p95_ms / 1000,
lambda v: v < 10, lambda v: v < 30),
"weight": 0.10,
},
"escalation_rate": {
"value": metrics.escalation_rate,
"score": _score(metrics.escalation_rate,
lambda v: v < 0.10, lambda v: v < 0.25),
"weight": 0.10,
},
}
ahs = sum(c["score"] * c["weight"] for c in components.values())
level = (HealthLevel.GREEN if ahs >= 80
else HealthLevel.YELLOW if ahs >= 60
else HealthLevel.RED)
return {
"agent_health_score": round(ahs, 1),
"level": level.value,
"components": components,
"timestamp": time.time(),
}
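As a worked example of the weighting: with every component green except a yellow tool_success_rate, the score drops from 100 to 92, because the yellow component loses 40 points at a 0.20 weight. A standalone check of just the arithmetic:

```python
# Weights from the AHS spec; component scores: green=100, yellow=60, red=20
weights = {
    "task_success_rate": 0.30, "tool_success_rate": 0.20,
    "context_utilisation": 0.15, "cost_efficiency": 0.15,
    "latency_p95": 0.10, "escalation_rate": 0.10,
}
scores = dict.fromkeys(weights, 100)  # all components green...
scores["tool_success_rate"] = 60      # ...except tools are yellow

ahs = sum(scores[k] * w for k, w in weights.items())
print(round(ahs, 1))  # → 92.0
```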

9.4 Dashboard Panels — Agent Production Monitoring

observability/dashboards/agent_monitoring.yaml
```yaml
# Grafana panels specific to AI agents in production
panels:
  - title: "Agent Health Score (AHS)"
    type: gauge
    query: agent_health_score
    thresholds: [60, 80, 100]
    colors: ["red", "yellow", "green"]
    description: "Composite 0-100 metric. Target >= 80."
  - title: "Task Success Rate (rolling 1h)"
    type: stat
    query: |
      rate(agent_tasks_success_total[1h]) /
      rate(agent_tasks_total[1h])
    thresholds: [0.70, 0.90]
    format: "percent"
  - title: "Tool Call Success Rate by Tool"
    type: bar
    query: |
      sum by (tool_name) (rate(agent_tool_calls_success[1h])) /
      sum by (tool_name) (rate(agent_tool_calls_total[1h]))
    description: "Quickly identifies problematic tools"
  - title: "Context Window Utilisation"
    type: timeseries
    queries:
      - label: "tokens_used"
        query: agent_context_tokens_used
      - label: "70% threshold"
        query: agent_context_tokens_available * 0.70
      - label: "90% critical"
        query: agent_context_tokens_available * 0.90
    description: "Monitors context window saturation"
  - title: "Cost per Successful Task"
    type: timeseries
    query: |
      rate(agent_cost_total_usd[1h]) /
      rate(agent_tasks_success_total[1h])
    thresholds: [0.50, 0.75]  # budget_per_task * 1.0 / 1.5
  - title: "Agent Latency Distribution"
    type: heatmap
    query: |
      rate(agent_task_duration_seconds_bucket[5m])
    description: "End-to-end latency distribution"
  - title: "Escalation Rate (rolling 24h)"
    type: stat
    query: |
      increase(agent_escalations_total[24h]) /
      increase(agent_tasks_total[24h])
    thresholds: [0.10, 0.25]
    format: "percent"
  - title: "Tool Usage Heatmap (by hour)"
    type: heatmap
    query: |
      sum by (tool_name, hour) (increase(agent_tool_calls_total[1h]))
    description: "Tool usage patterns throughout the day"
  - title: "Agent Backtracking Rate"
    type: stat
    query: |
      rate(agent_backtrack_total[1h]) /
      rate(agent_steps_total[1h])
    thresholds: [0.15, 0.30]
    description: "Indicates inefficiency in the agent's planning"
```
observability/alerts/agent_production_alerts.yaml
groups:
  - name: agent_health_alerts
    rules:
      - alert: AgentHealthScoreCritical
        expr: agent_health_score < 60
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent_id }} health score < 60 ({{ $value }})"
          runbook: "1) Check tool_success_rate 2) Check context_utilisation 3) Check cost"
      - alert: AgentHealthScoreWarning
        expr: agent_health_score < 80 and agent_health_score >= 60
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_id }} health degraded ({{ $value }})"
      - alert: ToolFailureSpike
        expr: |
          (rate(agent_tool_calls_total[5m]) - rate(agent_tool_calls_success[5m])) /
          rate(agent_tool_calls_total[5m]) > 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tool failure rate > 15% for agent {{ $labels.agent_id }}"
          action: "Check MCP server health, verify tool permissions"
      - alert: ContextWindowSaturation
        expr: agent_context_tokens_used / agent_context_tokens_available > 0.90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Context window > 90% for agent {{ $labels.agent_id }}"
          action: "Trigger /compact or start new session"
      - alert: AgentCostRunaway
        expr: |
          rate(agent_cost_total_usd[1h]) >
          (agent_budget_per_task_usd * rate(agent_tasks_total[1h]) * 2.0)
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Agent cost 2x over budget"
          action: "Check for infinite loops, excessive tool calls, model routing"
      - alert: HighBacktrackingRate
        expr: |
          rate(agent_backtrack_total[1h]) /
          rate(agent_steps_total[1h]) > 0.30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Agent backtracking > 30% — possible planning inefficiency"
          action: "Review agent instructions, check context quality"
      - alert: EscalationRateHigh
        expr: |
          increase(agent_escalations_total[24h]) /
          increase(agent_tasks_total[24h]) > 0.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Agent escalation rate > 25% in 24h"
          action: "Review failure patterns, consider expanding agent capabilities"
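
The AgentCostRunaway condition above can be sanity-checked offline before wiring it into Prometheus. A minimal Python sketch that reproduces the PromQL logic; all numbers are hypothetical:

```python
# Sketch: reproduce the AgentCostRunaway condition from the alert above.
# The numbers are hypothetical; in production they come from Prometheus.

def cost_runaway(cost_per_hour_usd: float,
                 budget_per_task_usd: float,
                 tasks_per_hour: float,
                 factor: float = 2.0) -> bool:
    """True when hourly spend exceeds `factor` x the budgeted hourly spend."""
    budgeted_per_hour = budget_per_task_usd * tasks_per_hour
    return cost_per_hour_usd > budgeted_per_hour * factor

# Example: $0.50 budget/task at 20 tasks/hour -> $10/h budget, $20/h trigger.
print(cost_runaway(cost_per_hour_usd=25.0, budget_per_task_usd=0.50, tasks_per_hour=20))  # True
print(cost_runaway(cost_per_hour_usd=15.0, budget_per_task_usd=0.50, tasks_per_hour=20))  # False
```

The `for: 15m` clause in the alert adds hysteresis on top of this instantaneous check, so short spikes do not page anyone.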

9.6 Production Runbook — Agent Degradation

# RUNBOOK: Agent Health Score Degradation

## Quick Triage (< 5 min)

1. **Check AHS components** — which one is in the red?
   - task_success_rate low → go to §A
   - tool_success_rate low → go to §B
   - context_utilisation high → go to §C
   - cost_efficiency poor → go to §D
   - latency high → go to §E

## §A — Low Task Success Rate

- [ ] Review logs of failed tasks: `query_logs("agent", "error", "1h")`
- [ ] Check for recent changes to prompts/instructions
- [ ] Check whether input data changed (new schema, encoding)
- [ ] Run the golden evaluation dataset to detect regressions
- [ ] Roll back if there was a recent deployment

## §B — Low Tool Success Rate

- [ ] Identify the problematic tool: "Tool Call Success by Tool" dashboard
- [ ] Verify MCP server health: `curl http://mcp-server/health`
- [ ] Check the tool's permissions and rate limits
- [ ] Check for breaking changes in external APIs
- [ ] Failover: disable the tool and use a fallback

## §C — Saturated Context Window

- [ ] Force compaction: trigger automatic `/compact`
- [ ] Check for oversized tool outputs (untruncated logs)
- [ ] Verify that .claudeignore excludes heavy files
- [ ] Start a new session if it stays above 90%

## §D — Degraded Cost Efficiency

- [ ] Review model routing: is the most expensive model being used unnecessarily?
- [ ] Check the cache hit rate: it should be > 30%
- [ ] Look for loops: the agent repeating identical tool calls
- [ ] Review token usage: is the context window too large?

## §E — High Latency

- [ ] Break down latency by component: model, tool calls, retrieval
- [ ] Check MCP server queues
- [ ] Check for provider throttling (Anthropic/OpenAI)
- [ ] Consider routing to a faster model
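
The five AHS components triaged above can be combined into one 0-100 score. A minimal sketch; the weights are illustrative assumptions, not values prescribed by the framework:

```python
# Sketch: composite Agent Health Score from the five runbook components.
# Weights are illustrative assumptions, not framework-prescribed values.

AHS_WEIGHTS = {
    "task_success_rate": 0.30,  # §A
    "tool_success_rate": 0.25,  # §B
    "context_headroom": 0.15,   # §C: 1 - context_utilisation
    "cost_efficiency": 0.15,    # §D
    "latency_score": 0.15,      # §E: normalized against the latency SLO
}

def agent_health_score(components: dict) -> float:
    """Weighted sum of components (each 0.0-1.0), scaled to 0-100."""
    score = sum(AHS_WEIGHTS[name] * components[name] for name in AHS_WEIGHTS)
    return round(score * 100, 1)

healthy = agent_health_score({
    "task_success_rate": 0.95, "tool_success_rate": 0.97,
    "context_headroom": 0.60, "cost_efficiency": 0.85, "latency_score": 0.90,
})
print(healthy)  # 88.0 -> above the 80 warning threshold
```

With this shape, the alert thresholds from §9.5 (warning below 80, critical below 60) map directly onto the returned value.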

| Metric | Solo | Lean | Full |
| --- | --- | --- | --- |
| AHS | Weekly manual check | Basic dashboard | Real-time dashboard + alerts |
| Task success | Log review | Langfuse traces | Automated + golden dataset regression |
| Tool success | `--verbose` flag | MCP server logs | Per-tool dashboards + auto-failover |
| Context util | `/context` command | Session logging | Real-time gauge + auto-compact |
| Cost | Helicone free tier | Langfuse cost tracking | Full FinOps + anomaly detection |
| Latency | Manual timing | p50/p95 dashboard | Full distribution + SLO alerts |
| Escalation | Count in notes | Tag in tracker | Automated classification + trend analysis |

10. AGENT DRIFT DETECTION & BEHAVIORAL BASELINE


GAP-20: behavioral drift detection for long-running agents — behavioral baseline, context quality score, token efficiency, and regression detection.

Agent drift is the gradual degradation of an agent's behavior during long sessions or across multiple sessions. Unlike a discrete error, drift is silent: the agent keeps working, but its outputs progressively diverge from the expected behavior.

Main causes:

  • Context pollution: irrelevant or contradictory information accumulating in the context window
  • Model updates: changes in the underlying model that alter its behavior
  • Prompt rot: instructions that stop being effective as the project context changes
  • Data drift: the distribution of input data shifts but the instructions do not adapt
  • Tool evolution: external APIs change their response format

Establish a golden behavior dataset: expected outputs for N standard inputs, run periodically to detect deviations.

observability/agent_behavioral_baseline.yaml
behavioral_baseline:
  version: "1.0.0"
  schedule: "weekly"           # Run frequency
  similarity_threshold: 0.85   # Alert if similarity drops below this
  golden_behaviors:
    - id: "BB-001"
      description: "Generate a unit test for a utility function"
      input:
        task: "Write a test for a sum(a, b) function that returns the sum"
        context: "TypeScript project with Jest"
      expected_patterns:
        - "describe("
        - "expect("
        - "toBe("
      expected_structure:
        has_imports: true
        has_describe_block: true
        has_at_least_n_assertions: 2
      quality_score_baseline: 0.92
    - id: "BB-002"
      description: "Review code with a security issue"
      input:
        task: "Review this code: const query = `SELECT * FROM users WHERE id = ${userId}`"
        context: "Node.js project"
      expected_patterns:
        - "SQL injection"
        - "parameterized"
        - "prepared statement"
      quality_score_baseline: 0.95
    - id: "BB-003"
      description: "Create an ADR for an architecture decision"
      input:
        task: "Document the decision to use PostgreSQL instead of MongoDB"
        context: "Project with relational data and ACID transactions"
      expected_patterns:
        - "Context"
        - "Decision"
        - "Consequences"
      quality_score_baseline: 0.88
  # Compare current outputs against the baseline weekly
  comparison:
    method: "semantic_similarity + pattern_match"
    alert_if_similarity_below: 0.85
    alert_channels: ["slack:#agent-quality", "email:tech-lead"]

Weekly workflow:

1. Run the golden behaviors against the current agent
2. Compare outputs with the baseline (semantic similarity + patterns)
3. If similarity < 0.85 → alert + investigate the cause
4. If similarity > 0.95 → update the baseline (the agent improved)
5. Document results in .factory/drift-reports/
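
The decision in steps 3-4 can be sketched as follows; thresholds mirror the `behavioral_baseline` config, and the function name is illustrative:

```python
# Sketch of the weekly drift decision (steps 3-4 above).
# Thresholds mirror the behavioral_baseline config; names are illustrative.

ALERT_BELOW = 0.85    # step 3: alert and investigate
PROMOTE_ABOVE = 0.95  # step 4: agent improved, refresh the baseline

def drift_decision(similarity: float) -> str:
    if similarity < ALERT_BELOW:
        return "alert"            # notify #agent-quality, investigate cause
    if similarity > PROMOTE_ABOVE:
        return "update_baseline"  # current outputs become the new baseline
    return "ok"                   # within tolerance: log and continue

print(drift_decision(0.80))  # alert
print(drift_decision(0.97))  # update_baseline
print(drift_decision(0.90))  # ok
```

Note the asymmetric band (0.85-0.95): it prevents the baseline from being refreshed on every marginal fluctuation, which would mask slow drift.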

Formula for measuring context quality in an active session:

CQS = (useful_tokens / total_tokens) × (1 - error_rate) × recency_factor

Where:

  • useful_tokens: tokens that contributed to correct outputs (estimated via tool call success and task completion)
  • total_tokens: total tokens consumed in the session
  • error_rate: ratio of failed tool calls to total tool calls
  • recency_factor: decay factor based on the session's age
observability/context_quality.py
"""Context Quality Score calculator."""
import math
from dataclasses import dataclass


@dataclass
class SessionMetrics:
    """Metrics for an active session."""
    total_tokens: int
    useful_tokens: int  # tokens in successful calls
    tool_calls_total: int
    tool_calls_failed: int
    session_age_minutes: int


def calculate_cqs(metrics: SessionMetrics) -> float:
    """Compute the Context Quality Score (0.0 - 1.0).

    Args:
        metrics: Metrics for the active session.

    Returns:
        CQS between 0.0 and 1.0. Target >= 0.70.
    """
    # Token efficiency ratio
    token_ratio = metrics.useful_tokens / max(metrics.total_tokens, 1)
    # Error rate
    error_rate = metrics.tool_calls_failed / max(metrics.tool_calls_total, 1)
    # Recency factor: decays exponentially after 120 minutes
    # 0-120 min: ~1.0, 240 min: ~0.5, 360 min: ~0.25
    half_life_minutes = 120
    recency_factor = math.exp(
        -0.693 * max(metrics.session_age_minutes - half_life_minutes, 0)
        / half_life_minutes
    )
    recency_factor = min(recency_factor, 1.0)
    cqs = token_ratio * (1 - error_rate) * recency_factor
    return round(cqs, 4)
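
A quick hand-check of the formula for an illustrative session; all the numbers are hypothetical:

```python
import math

# Hand-check of CQS for a hypothetical session:
# 120k tokens total, 84k useful, 5 of 50 tool calls failed, 60 minutes old.
token_ratio = 84_000 / 120_000  # 0.70
error_rate = 5 / 50             # 0.10
age, half_life = 60, 120
# Session is younger than the 120-minute half-life, so no decay yet.
recency = min(math.exp(-0.693 * max(age - half_life, 0) / half_life), 1.0)  # 1.0
cqs = round(token_ratio * (1 - error_rate) * recency, 4)
print(cqs)  # 0.63 -> yellow band: consider compaction
```

The same session at 240 minutes would land around 0.315 (recency ≈ 0.5), which is why session age dominates the score in long runs.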

CQS thresholds:

| CQS | Status | Action |
| --- | --- | --- |
| >= 0.80 | Green | Normal operation |
| 0.60 - 0.79 | Yellow | Consider compaction |
| 0.40 - 0.59 | Orange | Urgent compaction or relay |
| < 0.40 | Red | Start a new session immediately |

Metrics for evaluating token consumption efficiency:

Tokens per successful action (TPA): the primary efficiency metric; lower is better.

observability/dashboards/token_efficiency.yaml
panels:
  - title: "Tokens per Successful Action (TPA)"
    type: timeseries
    query: |
      rate(agent_tokens_total[1h]) /
      rate(agent_tasks_success_total[1h])
    thresholds:
      green: "< 5000"
      yellow: "< 15000"
      red: ">= 15000"
    description: "Efficiency: tokens consumed per successful action"
  - title: "Context Utilization Curve"
    type: timeseries
    queries:
      - label: "CQS"
        query: agent_context_quality_score
      - label: "Token ratio"
        query: agent_useful_tokens / agent_total_tokens
      - label: "Compaction threshold"
        query: 0.60
    description: "CQS over the course of the session; indicates the optimal moment for compaction"
  - title: "Session Age vs CQS"
    type: scatter
    query: |
      agent_context_quality_score by (session_age_bucket)
    description: "Correlation between session age and context quality"

Optimal compaction timing based on CQS:

Session start ──> CQS ~1.0
  │ (120 min) CQS ~0.80 → monitor
  │ (180 min) CQS ~0.65 → schedule compaction
  │ (240 min) CQS ~0.50 → urgent compaction or relay
  v (300 min) CQS < 0.40 → new session mandatory
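
Under the CQS formula above, the compaction point can be estimated by inverting the recency decay. A sketch that assumes the token ratio and error rate stay roughly constant over the session:

```python
import math

# Estimate when CQS will cross a threshold, assuming token_ratio and
# error_rate stay roughly constant so only the recency factor decays.
def minutes_until_cqs(target: float, token_ratio: float, error_rate: float,
                      half_life: float = 120.0) -> float:
    base = token_ratio * (1 - error_rate)  # CQS before any recency decay
    if base <= target:
        return 0.0  # already at or below the threshold
    # Solve base * exp(-0.693 * (t - half_life) / half_life) = target for t.
    return half_life + half_life * math.log(base / target) / 0.693

# With token_ratio 0.90 and error_rate 0.05 (base CQS ~0.855),
# schedule compaction for when CQS would fall to 0.60:
print(round(minutes_until_cqs(0.60, 0.90, 0.05)))  # 181 minutes
```

This matches the timeline above: with a healthy base score, the yellow-to-orange transition lands around the 180-minute mark.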

Run the golden behavior dataset weekly to detect regressions:

Terminal window
# Weekly run (CI job or cron)
# 1. Run the behavioral baseline
python3 baseline/scripts/fab-eval-builders.py \
  --golden-dataset observability/agent_behavioral_baseline.yaml \
  --output .factory/drift-reports/drift-$(date +%Y%m%d).json

# 2. Compare with the previous week
python3 baseline/scripts/compare-evals.py \
  --baseline .factory/drift-reports/drift-$(date -d "7 days ago" +%Y%m%d).json \
  --candidate .factory/drift-reports/drift-$(date +%Y%m%d).json \
  --threshold 0.05

# 3. Notify if there is a regression
# compare-evals.py returns exit code 1 if regression > threshold

Alert channels:

| Regression level | Channel | Action |
| --- | --- | --- |
| 0-5% | Log only | Monitor the trend |
| 5-15% | Slack #agent-quality | Investigate the cause, review prompts |
| 15-30% | Slack + email tech lead | Freeze changes, roll back if needed |
| > 30% | PagerDuty | Incident response, immediate rollback |
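
The routing in the table can be sketched as follows; channel names come from the table, and the function name is illustrative:

```python
# Sketch: route a weekly regression level to the channels in the table above.
def regression_channels(regression_pct: float) -> list:
    if regression_pct > 30:
        return ["pagerduty"]                                # incident response
    if regression_pct > 15:
        return ["slack:#agent-quality", "email:tech-lead"]  # freeze changes
    if regression_pct > 5:
        return ["slack:#agent-quality"]                     # investigate cause
    return ["log"]                                          # monitor the trend

print(regression_channels(8))   # ['slack:#agent-quality']
print(regression_channels(40))  # ['pagerduty']
```

A wrapper like this can sit behind the exit-code check of compare-evals.py so the same regression value drives both CI failure and notification fan-out.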

Tools for diagnosing and resolving agent drift:

Proposed extension of fab-eval-builders.py:

Terminal window
# Drift check against the behavioral baseline
python3 baseline/scripts/fab-eval-builders.py --drift-check \
  --baseline observability/agent_behavioral_baseline.yaml \
  --period 7d

# Expected output:
# DRIFT REPORT — Last 7 days
# ──────────────────────────
# BB-001 (unit test gen):   0.93 → 0.91 (-2.1%) ✅ OK
# BB-002 (security review): 0.95 → 0.88 (-7.4%) ⚠️ WARN
# BB-003 (ADR creation):    0.88 → 0.85 (-3.4%) ✅ OK
# ──────────────────────────
# Overall drift: -4.3% ⚠️ INVESTIGATE BB-002

Grafana dashboard template for agent drift:

observability/dashboards/agent_drift.yaml
panels:
  - title: "Behavioral Drift Trend (weekly)"
    type: timeseries
    queries:
      - label: "Overall similarity"
        query: agent_behavioral_similarity_avg
      - label: "Threshold"
        query: 0.85
    description: "Weekly trend of similarity to the behavioral baseline"
  - title: "CQS Distribution by Session"
    type: histogram
    query: agent_context_quality_score
    description: "Distribution of CQS at the end of each session"
  - title: "Drift by Behavior Category"
    type: bar
    query: |
      agent_behavioral_similarity by (behavior_id)
    description: "Similarity by behavior category"
  - title: "Token Efficiency Trend"
    type: timeseries
    query: |
      rate(agent_tokens_total[24h]) /
      rate(agent_tasks_success_total[24h])
    description: "TPA trend; detects efficiency degradation"

Alert thresholds:

| Metric | Warn | Critical | Action |
| --- | --- | --- | --- |
| Behavioral similarity | < 85% | < 70% | Investigate / rollback |
| Average CQS | < 0.65 | < 0.50 | Adjust compaction timing |
| TPA (tokens/action) | > 10K | > 20K | Review context loading |
| Weekly regression | > 15% | > 30% | Freeze changes |

  • Install the Helicone proxy OR Langfuse
  • Instrument the first LLM call with cost tracking
  • Verify that spans carry the 10 mandatory attributes
  • Configure a dashboard with the 8 essential panels
  • Create alerts for budget and error rate
  • Implement a basic semantic cache
  • Configure cost attribution by feature
  • Automatic model routing by complexity
  • HANDOFF.md automation at the end of sessions
  • Complete EU AI Act audit trail
  • Monthly cost review with the team
  • Create a golden behavior dataset with at least 5 cases
  • Configure weekly execution of the behavioral baseline
  • Implement the CQS calculator in agent sessions
  • Configure the drift dashboard in Grafana
  • Set up regression alerts (warn 15%, critical 30%)
  • Integrate compare-evals.py into a weekly CI job

Document generated as an operational companion to the AI-First Engineering Framework v7.6.0 — GAP-18 (Tool Landscape), GAP-19 (Agent Production Monitoring), and GAP-20 (Agent Drift Detection) closed