LLM Observability & FinOps Implementation Guide
Companion to the AI-First Engineering Framework v6.5
1. Why Observe
Without observability, AI applications are black boxes:
- You don't know what each AI feature costs.
- You don't detect quality degradation until a user reports it.
- You can't optimize what you don't measure.
- You can't comply with the EU AI Act (Aug 2026) without an audit trail.
“AI is an amplifier — it magnifies an organization’s existing strengths and weaknesses.” — DORA 2025
2. Observability Stack
2.1 Layers
```
┌─────────────────────────────────────────────────┐
│                   DASHBOARDS                    │
│          (Grafana / Datadog / Custom)           │
├─────────────────────────────────────────────────┤
│                 LLMOps PLATFORM                 │
│   (Langfuse / LangSmith / Arize / Helicone)     │
├─────────────────────────────────────────────────┤
│                TRACING / METRICS                │
│         (OpenTelemetry / Prometheus)            │
├─────────────────────────────────────────────────┤
│                APPLICATION CODE                 │
│   (spans, token counters, cost attribution)     │
└─────────────────────────────────────────────────┘
```
2.2 Complete Tool Landscape (15+ platforms, Mar 2026)
GAP-18: an exhaustive map of the ecosystem: not just the 6 main platforms, but every relevant tool an enterprise team should know.
2.2.1 LLMOps Platforms (tracing + eval + cost)
| Platform | Best for | Tracing | Evals | Cost | OSS | Pricing (Mar 2026) |
|---|---|---|---|---|---|---|
| Langfuse | Startups, self-hosted | ✅ Full | ✅ Built-in | ✅ | ✅ Yes | Free tier + self-host |
| LangSmith | LangChain teams | ✅ Full | ✅ Built-in | ✅ | ❌ No | $39/seat/mo |
| Braintrust | Evals + observability | ✅ | ✅ Best-in-class | ✅ | ❌ No | Free tier, $25/seat pro |
| Arize Phoenix | ML teams, model monitoring | ✅ Full | ✅ LLM judges | ✅ | ✅ Partial | Free tier + enterprise |
| Helicone | Fast cost tracking | ✅ Proxy | ❌ Basic | ✅ Best | ✅ Proxy | Generous free tier |
| Weights & Biases (Weave) | ML teams already on W&B | ✅ Full | ✅ Built-in | ✅ | ❌ No | $50/seat/mo |
| Galileo | LLM quality scoring | ✅ | ✅ Guardrail Studio | ✅ | ❌ No | Enterprise pricing |
| HoneyHive | Pipeline debugging | ✅ Full | ✅ Data-driven | ✅ | ❌ No | Free tier + enterprise |
2.2.2 Infra-Observability Platforms (APM + AI)
| Platform | Best for | LLM Tracing | Native APM | AI Cost | Pricing |
|---|---|---|---|---|---|
| Datadog AI Observability | Enterprise, already on Datadog | ✅ LLM traces | ✅ Full APM | ✅ | Included in DD plan |
| New Relic AI Monitoring | Enterprise, already on NR | ✅ LLM traces | ✅ Full APM | ✅ | Included in NR plan |
| Dynatrace AI Observability | Enterprises wanting auto-instrumentation | ✅ Auto-detect | ✅ Full APM | ✅ | Enterprise license |
| Elastic AI Observability | Self-hosted enterprise | ✅ LLM spans | ✅ Full APM | ✅ | Free (self) / Cloud |
2.2.3 Specialized Tools
| Tool | Focus | Differentiator | OSS | Pricing |
|---|---|---|---|---|
| OpenLLMetry | OTel auto-instrumentation for LLMs | Drop-in SDK: one line instruments Anthropic/OpenAI/Cohere | ✅ Yes | Free |
| Traceloop | Developer experience for OTel LLM | Builds on OpenLLMetry, UI included | ✅ Core | Free + cloud |
| PromptLayer | Prompt versioning + monitoring | Prompt registry + A/B testing | ❌ No | Free tier |
| Log10 | LLM feedback + debugging | Integrated human feedback loops | ❌ No | Free tier |
| Portkey AI Gateway | AI gateway with built-in observability | Load balancing + caching + observability | ✅ Yes | Free tier + enterprise |
2.2.4 Quick Decision Matrix
```
Already using an enterprise APM (Datadog/NR/Dynatrace)?
├── YES → Enable the AI module of your existing APM
└── NO → Do you need self-hosting?
    ├── YES → Langfuse (self-hosted) + OpenLLMetry
    └── NO → What's the priority?
        ├── Cost → Helicone (proxy, 5 min setup)
        ├── Evals → Braintrust (evals + traces)
        ├── Tracing → Langfuse Cloud (traces + dashboard)
        └── Everything → Langfuse + Portkey (gateway + traces)
```
2.2.5 Base Instrumentation: OpenLLMetry (one line)
```python
# The fastest way to instrument ANY LLM app.
# Works with Langfuse, Datadog, Jaeger, or any OTLP backend.
# pip install openllmetry-sdk
from openllmetry import init_tracing

# This automatically instruments: Anthropic, OpenAI, Cohere,
# LangChain, LlamaIndex, Pinecone, ChromaDB, etc.
init_tracing(
    exporter_endpoint="http://localhost:4317",  # OTLP endpoint
    # or: exporter_endpoint="https://cloud.langfuse.com"
    # or: exporter_endpoint="https://api.helicone.ai"
)

# Your normal code is now instrumented automatically
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
# → Span created automatically with: model, tokens, cost, latency
```
2.3 Recommendation by Track
| Track | Recommended | Reason |
|---|---|---|
| Solo | Helicone (proxy) | 5-minute setup, free cost tracking |
| Lean | Langfuse (self-hosted) | Open source, full traces, free |
| Full | Langfuse + Datadog | AI traces + integrated infra monitoring |
3. Instrumentation with OpenTelemetry
3.1 Mandatory Span Attributes
```yaml
# Attributes that MUST be present on every LLM span
required_span_attributes:
  - agent_id            # Agent identifier
  - llm.system          # "anthropic", "openai", "google"
  - llm.model           # "claude-sonnet-4-5", "gpt-4o"
  - llm.tokens.input    # Input tokens
  - llm.tokens.output   # Output tokens
  - llm.tokens.total    # Total tokens
  - llm.cost.usd        # Cost in USD
  - llm.latency_ms      # Latency in ms
  - trace_id            # OpenTelemetry trace ID
  - tenant_id           # Tenant/project ID
  - feature_id          # Feature that originated the call
  - spec_version        # Spec/contract version

optional_span_attributes:
  - rag.recall_at_5     # RAG recall
  - rag.precision       # RAG precision
  - cache.hit           # Whether it was a cache hit
  - cache.similarity    # Semantic similarity score
  - user_id             # For per-user cost attribution
  - session_id          # Agent session
```
3.2 Python Implementation
```python
"""OpenTelemetry instrumentation for LLM calls."""
import functools
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-first-framework")

# Per-model pricing (USD per 1K tokens, Mar 2026)
MODEL_PRICING = {
    "claude-sonnet-4-5": {"input": 0.003, "output": 0.015},
    "claude-haiku-3-5": {"input": 0.00025, "output": 0.00125},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gemini-2.5-pro": {"input": 0.00125, "output": 0.005},
    "gemini-2.0-flash": {"input": 0.0001, "output": 0.0004},
}


def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Calculate cost in USD."""
    prices = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})
    return (input_tokens * prices["input"] / 1000
            + output_tokens * prices["output"] / 1000)


def traced_llm_call(func):
    """Decorator that instruments LLM calls."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span("llm.call") as span:
            model = kwargs.get("model", "unknown")
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.system", _get_system(model))
            span.set_attribute("agent_id", kwargs.get("agent_id", "default"))
            span.set_attribute("feature_id", kwargs.get("feature_id", "unknown"))

            start = time.time()
            result = func(*args, **kwargs)
            latency = (time.time() - start) * 1000

            # Extract token usage from the response
            usage = getattr(result, "usage", None)
            if usage:
                input_tokens = getattr(usage, "input_tokens", 0)
                output_tokens = getattr(usage, "output_tokens", 0)
                span.set_attribute("llm.tokens.input", input_tokens)
                span.set_attribute("llm.tokens.output", output_tokens)
                span.set_attribute("llm.tokens.total", input_tokens + output_tokens)
                span.set_attribute("llm.cost.usd",
                                   calculate_cost(model, input_tokens, output_tokens))

            span.set_attribute("llm.latency_ms", latency)
            return result
    return wrapper


def _get_system(model: str) -> str:
    if "claude" in model:
        return "anthropic"
    if "gpt" in model:
        return "openai"
    if "gemini" in model:
        return "google"
    return "unknown"
```
3.3 Implementation with Langfuse
```python
"""Langfuse setup for LLM observability."""
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

# Initialization
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"  # or self-hosted
)


@observe(as_type="generation")
def call_llm(prompt: str, model: str = "claude-sonnet-4-5"):
    """LLM call instrumented with Langfuse."""
    import anthropic
    client = anthropic.Anthropic()

    langfuse_context.update_current_observation(
        model=model,
        metadata={"feature": "search", "version": "1.0"}
    )

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}]
    )

    langfuse_context.update_current_observation(
        usage={
            "input": response.usage.input_tokens,
            "output": response.usage.output_tokens,
        }
    )
    return response.content[0].text


@observe()
def rag_pipeline(query: str):
    """Instrumented RAG pipeline."""
    # Retrieve
    langfuse_context.update_current_trace(
        tags=["rag", "search"],
        metadata={"tenant_id": "project-x"}
    )

    with langfuse_context.observe(name="retrieval") as span:
        docs = retrieve_documents(query)
        span.update(metadata={"doc_count": len(docs)})

    # Generate
    context = "\n".join(docs)
    result = call_llm(f"Context: {context}\n\nQuestion: {query}")

    return result
```
4. FinOps: Cost Control
4.1 Budget Model
```yaml
# project/project-config.yaml (finops section)
finops:
  daily_budget_usd: 100
  alert_threshold_pct: 80

  cost_allocation:
    by_feature: true       # Cost per feature/endpoint
    by_user: true          # Cost per user
    by_model: true         # Cost per model
    by_environment: true   # dev vs staging vs prod

  optimization:
    semantic_cache:
      enabled: true
      similarity_threshold: 0.92
      ttl_seconds:
        faq: 86400            # 24h
        documentation: 43200  # 12h
        dynamic: 3600         # 1h
      target_hit_rate: 0.30
      target_cost_savings: 0.40

    model_routing:
      enabled: true
      rules:
        - condition: "token_count < 500 AND complexity == 'low'"
          model: "claude-haiku-3-5"
        - condition: "task_type == 'code_review'"
          model: "claude-sonnet-4-5"
        - condition: "task_type == 'summarization'"
          model: "gemini-2.0-flash"
```
4.2 Semantic Caching
```python
"""Semantic cache to cut LLM call costs."""
import time
from typing import Optional

import numpy as np


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92, ttl: int = 3600):
        self.threshold = similarity_threshold
        self.ttl = ttl
        self.cache = {}  # In prod: Redis + vector store

    def get(self, query: str, embedding: list[float]) -> Optional[str]:
        """Look up a similar cached response."""
        for key, entry in self.cache.items():
            similarity = self._cosine_similarity(embedding, entry["embedding"])
            if similarity >= self.threshold and not self._is_expired(entry):
                return entry["response"]
        return None

    def set(self, query: str, embedding: list[float], response: str):
        """Store a response in the cache."""
        self.cache[hash(query)] = {
            "embedding": embedding,
            "response": response,
            "timestamp": time.time(),
            "ttl": self.ttl,
        }

    def _cosine_similarity(self, a, b) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def _is_expired(self, entry) -> bool:
        return (time.time() - entry["timestamp"]) > entry["ttl"]
```
4.3 Cost-per-Session Tracking
```python
"""Per-session agent cost tracking."""


class SessionCostTracker:
    """Tracks cost per agent session."""

    # Thresholds (USD)
    THRESHOLDS = {
        "warning": 5.0,      # Yellow alert
        "critical": 20.0,    # Red alert, requires review
        "hard_limit": 50.0,  # Stops the session
    }

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.calls = []
        self.total_cost = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int, cost: float):
        self.calls.append({
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
        })
        self.total_cost += cost
        self._check_thresholds()

    def _check_thresholds(self):
        if self.total_cost >= self.THRESHOLDS["hard_limit"]:
            raise SessionBudgetExceeded(
                f"Session {self.session_id} exceeded hard limit: "
                f"${self.total_cost:.2f}"
            )
        elif self.total_cost >= self.THRESHOLDS["critical"]:
            print(f"🔴 CRITICAL: Session cost ${self.total_cost:.2f}")
        elif self.total_cost >= self.THRESHOLDS["warning"]:
            print(f"🟡 WARNING: Session cost ${self.total_cost:.2f}")

    def report(self) -> dict:
        return {
            "session_id": self.session_id,
            "total_cost_usd": round(self.total_cost, 4),
            "total_calls": len(self.calls),
            "by_model": self._group_by_model(),
            "avg_cost_per_call": round(
                self.total_cost / max(len(self.calls), 1), 4
            ),
        }

    def _group_by_model(self) -> dict:
        groups = {}
        for call in self.calls:
            m = call["model"]
            if m not in groups:
                groups[m] = {"calls": 0, "cost": 0, "tokens": 0}
            groups[m]["calls"] += 1
            groups[m]["cost"] += call["cost"]
            groups[m]["tokens"] += call["input_tokens"] + call["output_tokens"]
        return groups


class SessionBudgetExceeded(Exception):
    pass
```
5. Dashboard Templates
5.1 Essential Metrics (Grafana/Prometheus)
```yaml
# Prometheus queries for the LLM dashboard
panels:
  - title: "Cost per Day"
    query: sum(rate(llm_cost_usd_total[24h]))
    type: stat
    thresholds: [50, 80, 100]

  - title: "Tokens per Hour"
    query: sum(rate(llm_tokens_total[1h]))
    type: graph

  - title: "Latency p50/p95/p99"
    queries:
      - histogram_quantile(0.50, rate(llm_latency_seconds_bucket[5m]))
      - histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m]))
      - histogram_quantile(0.99, rate(llm_latency_seconds_bucket[5m]))
    type: graph

  - title: "Cache Hit Rate"
    query: |
      rate(semantic_cache_hits_total[1h]) /
      (rate(semantic_cache_hits_total[1h]) + rate(semantic_cache_misses_total[1h]))
    type: gauge
    thresholds: [0.20, 0.30, 0.50]

  - title: "Cost by Feature"
    query: sum by (feature_id) (rate(llm_cost_usd_total[24h]))
    type: bar

  - title: "Error Rate"
    query: |
      rate(llm_errors_total[5m]) / rate(llm_calls_total[5m])
    type: stat
    thresholds: [0.01, 0.05, 0.10]

  - title: "RAG Recall@5"
    query: avg(rag_recall_at_5)
    type: gauge
    thresholds: [0.70, 0.85, 0.95]

  - title: "Model Distribution"
    query: sum by (llm_model) (rate(llm_calls_total[24h]))
    type: pie
```
5.2 Alerts
```yaml
groups:
  - name: llm_cost_alerts
    rules:
      - alert: DailyBudgetWarning
        expr: sum(increase(llm_cost_usd_total[24h])) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Daily LLM cost approaching budget (>$80)"

      - alert: DailyBudgetCritical
        expr: sum(increase(llm_cost_usd_total[24h])) > 100
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Daily LLM budget exceeded ($100)"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning

      - alert: HighErrorRate
        expr: rate(llm_errors_total[5m]) / rate(llm_calls_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical

      - alert: LowCacheHitRate
        expr: |
          rate(semantic_cache_hits_total[1h]) /
          (rate(semantic_cache_hits_total[1h]) + rate(semantic_cache_misses_total[1h]))
          < 0.15
        for: 30m
        labels:
          severity: warning
```
6. Agent-Legible Observability
Source: OpenAI “Harness Engineering” (Feb 2026): agents that query metrics and logs directly for autonomous debugging.
6.1 The Concept
Agent-Legible Observability is the practice of exposing logs, metrics, and traces directly to agents so they can reason about system behavior, diagnose problems, and make informed decisions without human intervention.
“We exposed LogQL and PromQL to the agents. When a test fails, the agent queries the logs, identifies the error, correlates it with metrics, and proposes a fix — all autonomously.” — OpenAI Harness Engineering, Feb 2026
6.2 Pattern: Observability-as-Tool
Expose observability as MCP tools or CLI wrappers that agents can invoke:
```python
"""MCP server exposing metrics and logs to agents."""
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("observability-tools", version="1.0.0")


@mcp.tool()
async def query_logs(service: str, level: str = "error",
                     duration: str = "1h", limit: int = 20) -> str:
    """Query service logs. Returns recent filtered log entries.

    Args:
        service: Service name (e.g. 'api', 'worker', 'frontend')
        level: Minimum level (debug, info, warn, error)
        duration: Time window (e.g. '1h', '30m', '24h')
        limit: Maximum entries to return
    """
    # LogQL query
    query = f'{{app="{service}"}} |= "{level}" | json | line_format "{{.timestamp}} {{.level}} {{.message}}"'
    # In production: call the Loki API
    return f"# Logs for {service} (last {duration}, level>={level})\n..."


@mcp.tool()
async def query_metrics(metric: str, duration: str = "5m",
                        quantile: float = 0.95) -> str:
    """Query Prometheus metrics. Returns current values.

    Args:
        metric: Metric name (e.g. 'http_duration_seconds', 'llm_cost_usd_total')
        duration: Window for rate() (e.g. '5m', '1h')
        quantile: Percentile for histograms (0.50, 0.95, 0.99)
    """
    # PromQL query
    query = f'histogram_quantile({quantile}, rate({metric}_bucket[{duration}]))'
    return f"# Metric: {metric} p{int(quantile * 100)} over {duration}\n..."


@mcp.tool()
async def check_golden_principles() -> str:
    """Check the status of all the project's Golden Principles.
    Returns a compliance report."""
    checks = [
        {"id": "GP-01", "name": "Startup < 800ms", "status": "PASS", "value": "623ms"},
        {"id": "GP-02", "name": "No span > 2s", "status": "PASS", "value": "max 1.2s"},
        {"id": "GP-03", "name": "No direct DB from handlers", "status": "PASS"},
    ]
    report = "# Golden Principles Status\n"
    for c in checks:
        emoji = "✅" if c["status"] == "PASS" else "❌"
        report += f"{emoji} {c['id']}: {c['name']} — {c['status']}\n"
    return report


@mcp.tool()
async def get_cost_report(period: str = "today") -> str:
    """Get the LLM cost report for the period.

    Args:
        period: 'today', 'week', 'month'
    """
    return f"# Cost Report ({period})\nTotal: $47.23\nBy model: claude-sonnet-4-5: $32.10, gpt-4o: $15.13\n..."


if __name__ == "__main__":
    mcp.run(transport="stdio")
```
6.3 Claude Code Configuration
```json
{
  "mcpServers": {
    "observability-tools": {
      "command": "python",
      "args": ["mcp_servers/observability_tools.py"],
      "env": {
        "LOKI_URL": "http://localhost:3100",
        "PROMETHEUS_URL": "http://localhost:9090",
        "LANGFUSE_PUBLIC_KEY": "pk-...",
        "LANGFUSE_HOST": "https://cloud.langfuse.com"
      }
    }
  }
}
```
6.4 CLI Wrappers (Alternative without MCP)
For teams not using MCP, expose the same queries as scripts the agent can execute:
```bash
#!/bin/bash
# scripts/agent-query-logs.sh — Wrapper for agents
# Usage: ./scripts/agent-query-logs.sh <service> [level] [duration]
SERVICE=${1:?"Usage: agent-query-logs.sh <service> [level] [duration]"}
LEVEL=${2:-"error"}
DURATION=${3:-"1h"}

curl -s "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode "query={app=\"$SERVICE\"} |= \"$LEVEL\"" \
  --data-urlencode "start=$(date -d "-$DURATION" +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  --data-urlencode "limit=20" | \
  jq -r '.data.result[].values[][1]' | head -30
```
```bash
#!/bin/bash
# scripts/agent-query-metrics.sh — Wrapper for agents
# Usage: ./scripts/agent-query-metrics.sh <promql_query>
QUERY=${1:?"Usage: agent-query-metrics.sh '<promql_query>'"}

curl -s "http://localhost:9090/api/v1/query" \
  --data-urlencode "query=$QUERY" | \
  jq -r '.data.result[] | "\(.metric | to_entries | map("\(.key)=\(.value)") | join(", ")): \(.value[1])"'
```
6.5 Use Cases
| Scenario | Agent query | Autonomous action |
|---|---|---|
| Test fails with timeout | query_logs("api", "error", "5m") | Identifies slow query, suggests an index |
| Daily cost climbing | get_cost_report("today") | Detects feature with excessive usage, suggests caching |
| Degraded p95 latency | query_metrics("http_duration_seconds", "10m", 0.95) | Correlates with a recent deployment |
| Golden Principle violation | check_golden_principles() | Identifies span > 2s, opens a fix |
| Low RAG recall | query_metrics("rag_recall_at_5", "1h") | Reviews embeddings, suggests re-indexing |
6.6 Recommendation by Track
| Track | Implementation | Effort |
|---|---|---|
| Solo | CLI wrappers (scripts/agent-*.sh) | 1 hour |
| Lean | Basic MCP server (3-4 tools) | 1 day |
| Full | Full MCP server + Langfuse API + proactive alerts | 1 week |
6.7 Anti-patterns
| Anti-pattern | Problem | Solution |
|---|---|---|
| Exposing all logs unfiltered | Floods the context window | Limit to 20-30 lines, relevant level only |
| Queries without timeout | Agent waits indefinitely | 5s timeout on all queries |
| Metrics without context | Agent doesn't know what “normal” is | Include thresholds and baselines in the response |
| No rate limiting | Agent in a loop queries 100x | Max 10 queries per session |
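The mitigations in the last three rows can be combined in a single wrapper around any observability tool. A minimal sketch, assuming an in-memory per-session counter; the helper name `guarded_query` and the constants are illustrative, not part of any SDK:

```python
"""Guardrails for agent-facing observability queries: output cap,
per-session quota, and a hard timeout (hypothetical helper)."""
import concurrent.futures

MAX_QUERIES_PER_SESSION = 10  # stop agents stuck in query loops
MAX_OUTPUT_LINES = 30         # don't flood the context window
QUERY_TIMEOUT_S = 5           # never wait indefinitely

_session_query_count: dict = {}


def guarded_query(session_id: str, query_fn, *args) -> str:
    """Run query_fn(*args) with rate limit, timeout, and output cap."""
    count = _session_query_count.get(session_id, 0)
    if count >= MAX_QUERIES_PER_SESSION:
        return "ERROR: query quota exhausted for this session (max 10 queries)"
    _session_query_count[session_id] = count + 1

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(query_fn, *args)
        raw = future.result(timeout=QUERY_TIMEOUT_S)
    except concurrent.futures.TimeoutError:
        # Note: the thread itself can't be killed; the caller just moves on.
        return f"ERROR: query timed out after {QUERY_TIMEOUT_S}s"
    finally:
        pool.shutdown(wait=False)

    lines = raw.splitlines()
    if len(lines) > MAX_OUTPUT_LINES:
        total = len(lines)
        lines = lines[:MAX_OUTPUT_LINES] + [f"... truncated ({total} lines total)"]
    return "\n".join(lines)
```

The same wrapper can sit behind both the MCP tools and the CLI scripts, so the limits are enforced once regardless of transport.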
7. Compaction & Context Window Management
7.1 Compaction Strategies
| Strategy | When to use | How |
|---|---|---|
| Tool result clearing | Always | Clear old tool outputs from the history |
| Summarize & restart | Context > 80% capacity | Summarize the conversation, restart with the summary |
| HANDOFF.md automation | Session/agent handover | Write state to a file; the new agent reads it |
| Progressive pruning | Long sessions (>1h) | Drop intermediate messages, keep decisions |
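The trigger logic behind this table can be sketched as one dispatch function. The thresholds mirror the rows above; the function name and return strings are our own illustration:

```python
"""Pick a compaction strategy from current context usage and session age."""


def pick_compaction_strategy(tokens_used: int, tokens_available: int,
                             session_minutes: float) -> str:
    """Return the compaction action to apply, mirroring the table above."""
    utilisation = tokens_used / tokens_available
    if utilisation > 0.80:
        return "summarize_and_restart"  # context > 80% capacity
    if session_minutes > 60:
        return "progressive_pruning"    # long session: drop intermediate messages
    return "tool_result_clearing"       # always-on default
```

For example, `pick_compaction_strategy(170_000, 200_000, 10)` is at 85% utilisation, so it returns `"summarize_and_restart"` even though the session is young.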
7.2 Automated HANDOFF.md
```markdown
<!-- HANDOFF.md — Auto-generated at session end -->
# Session Handoff

## Timestamp
2026-03-19T15:30:00Z

## Current State
- Phase: F06_Build
- Task: Implementing search endpoint
- Branch: feature/ai-search
- Tests: 14/16 passing (2 pending for edge cases)

## Key Decisions Made
1. Chose vector search over keyword (ADR-002)
2. Using pgvector extension for embeddings
3. Batch indexing every 15 min (not real-time)

## Open Issues
- [ ] Edge case: empty query string returns 500
- [ ] Performance test needed for >10K documents
- [ ] Cache invalidation strategy TBD

## Files Modified
- src/api/search.ts (new)
- src/services/embedding.ts (new)
- tests/search.test.ts (14 tests)
- project-config.yaml (updated phase)

## Context for Next Session
The search API is functional but needs edge case handling.
Start by fixing the empty query bug in search.ts:42.
Then run the full test suite: `npm test -- --grep search`
```
7.3 Compact Instructions (Preservation Priority)
When the agent runs /compact, preserve content in this order:
- 🔴 Never summarize: ADRs, architecture decisions, stack choices
- 🟠 Keep key facts: modified files, test status, open errors
- 🟡 Summarize: tool outputs, intermediate logs, explorations
- 🟢 Discard: casual conversation, failed attempts already resolved
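A hedged sketch of how these tiers might drive a /compact pass. The `kind` labels and tier names below are assumptions for illustration, not framework APIs:

```python
"""Tag each history entry with a preservation tier for compaction."""

PRESERVE_VERBATIM = "never_summarize"  # 🔴 ADRs, architecture decisions
KEEP_FACTS = "keep_key_facts"          # 🟠 modified files, test status
SUMMARIZE = "summarize"                # 🟡 tool outputs, logs
DISCARD = "discard"                    # 🟢 casual chat, resolved dead ends


def classify_for_compaction(entry: dict) -> str:
    """Map a history entry to its preservation tier."""
    kind = entry.get("kind", "chat")
    if kind in ("adr", "architecture_decision", "stack_choice"):
        return PRESERVE_VERBATIM
    if kind in ("file_change", "test_result", "open_error"):
        return KEEP_FACTS
    if kind in ("tool_output", "log", "exploration"):
        return SUMMARIZE
    return DISCARD
```

A compactor would then keep 🔴 entries verbatim, compress 🟠 into one-line facts, replace 🟡 runs with a summary, and drop 🟢 entirely.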
8. EU AI Act Compliance (Aug 2026)
For the complete classification and conformity assessment workflow, see the OWASP Agentic Security Guide §8.
8.1 Logging Requirements
For high-risk AI systems, the EU AI Act requires:
| Requirement | Implementation | Article |
|---|---|---|
| Audit trail | Langfuse traces with full context | Art. 12 |
| Decision explainability | Store reasoning in span attributes | Art. 13 |
| Data provenance | Source tag on every RAG document | Art. 10 |
| Human oversight records | Log of human approvals/rejections | Art. 14 |
| Performance monitoring | Dashboards with continuous metrics | Art. 15 |
| Incident reporting | Alerts + documented runbook | Art. 62 |
| AI literacy training records | Record of completed trainings | Art. 4 |
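As a hedged illustration of the Art. 10/13/14 rows, a helper that builds the compliance-related span attributes in one place. The `compliance.*` attribute names are our own convention, not an official schema:

```python
"""Build compliance span attributes for high-risk AI system logging."""


def compliance_attributes(reasoning: str, sources: list,
                          human_decision: str = None) -> dict:
    """Assemble the attributes required by the table above."""
    attrs = {
        "compliance.reasoning": reasoning,             # Art. 13 explainability
        "compliance.data_sources": ",".join(sources),  # Art. 10 provenance
    }
    if human_decision is not None:                     # Art. 14 human oversight
        attrs["compliance.human_decision"] = human_decision
    return attrs


# Attach to the current OpenTelemetry span, e.g.:
#   for key, value in compliance_attributes(...).items():
#       span.set_attribute(key, value)
```

Because the attributes travel on the same traces Langfuse already stores (Art. 12), no separate audit store is needed.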
8.2 ISO 42001 Alignment
```yaml
# Mapping our framework to ISO 42001
iso_42001_mapping:
  "4.1 Understanding context": F01_Strategy
  "5.1 Leadership": project-config.yaml governance
  "6.1 Risk assessment": F08_Security compliance_matrix
  "7.1 Resources": finops budget tracking
  "8.1 AI system lifecycle": F01-F10 phases
  "9.1 Monitoring": Observability dashboards
  "10.1 Continual improvement": F10_Evolution
```
9. Agent Production Monitoring — Health Score & KPIs
GAP-19: production metrics specific to AI agents: composite health score, tool success rate, context utilisation, cost efficiency.
9.1 The Problem
Monitoring an AI agent in production is different from monitoring an HTTP service:
- Latency and error rate are not enough: an agent can “work” (200 OK) yet give terrible answers.
- Cost is variable and can explode without warning.
- Tools can fail silently.
- The context window can saturate, causing gradual degradation (no errors).
9.2 Agent Health Score (AHS) — Composite Metric
A single number (0-100) that indicates an agent's overall health in production:
```yaml
agent_health_score:
  name: "Agent Health Score (AHS)"
  range: 0-100
  target: ">= 80 (Green), 60-79 (Yellow), < 60 (Red)"

  components:
    task_success_rate:
      weight: 0.30
      description: "% of tasks completed correctly"
      metric: "agent_tasks_success_total / agent_tasks_total"
      thresholds:
        green: ">= 0.90"
        yellow: ">= 0.70"
        red: "< 0.70"

    tool_success_rate:
      weight: 0.20
      description: "% of successful tool calls"
      metric: "agent_tool_calls_success / agent_tool_calls_total"
      thresholds:
        green: ">= 0.95"
        yellow: ">= 0.85"
        red: "< 0.85"

    context_utilisation:
      weight: 0.15
      description: "Efficiency of context window usage"
      metric: "agent_context_tokens_used / agent_context_tokens_available"
      thresholds:
        green: "0.30 - 0.70"     # sweet spot
        yellow: "0.70 - 0.90"    # approaching the limit
        red: "> 0.90 OR < 0.10"  # saturated or underused

    cost_efficiency:
      weight: 0.15
      description: "Cost per successful task (USD)"
      metric: "agent_cost_total_usd / agent_tasks_success_total"
      thresholds:
        green: "< budget_per_task * 1.0"
        yellow: "< budget_per_task * 1.5"
        red: ">= budget_per_task * 1.5"

    latency_p95:
      weight: 0.10
      description: "Agent end-to-end p95 latency"
      metric: "histogram_quantile(0.95, agent_task_duration_seconds)"
      thresholds:
        green: "< 10s"
        yellow: "< 30s"
        red: ">= 30s"

    error_escalation_rate:
      weight: 0.10
      description: "% of tasks that required human escalation"
      metric: "agent_escalations_total / agent_tasks_total"
      thresholds:
        green: "< 0.10"
        yellow: "< 0.25"
        red: ">= 0.25"

# AHS = sum(component_score * weight)
# component_score: green=100, yellow=60, red=20
```
9.3 Python Implementation — Health Score Calculator
```python
"""Agent Health Score calculator for production monitoring."""
from dataclasses import dataclass, field
from enum import Enum
import time


class HealthLevel(Enum):
    GREEN = "green"    # score >= 80
    YELLOW = "yellow"  # score 60-79
    RED = "red"        # score < 60


@dataclass
class AgentMetrics:
    """Real-time metrics for one agent."""
    tasks_total: int = 0
    tasks_success: int = 0
    tool_calls_total: int = 0
    tool_calls_success: int = 0
    context_tokens_used: int = 0
    context_tokens_available: int = 200_000  # default 200K
    total_cost_usd: float = 0.0
    budget_per_task_usd: float = 0.50  # configurable
    latencies_ms: list = field(default_factory=list)
    escalations: int = 0

    @property
    def task_success_rate(self) -> float:
        return self.tasks_success / max(self.tasks_total, 1)

    @property
    def tool_success_rate(self) -> float:
        return self.tool_calls_success / max(self.tool_calls_total, 1)

    @property
    def context_utilisation(self) -> float:
        return self.context_tokens_used / max(self.context_tokens_available, 1)

    @property
    def cost_per_success(self) -> float:
        return self.total_cost_usd / max(self.tasks_success, 1)

    @property
    def latency_p95_ms(self) -> float:
        if not self.latencies_ms:
            return 0
        sorted_lat = sorted(self.latencies_ms)
        idx = int(len(sorted_lat) * 0.95)
        return sorted_lat[min(idx, len(sorted_lat) - 1)]

    @property
    def escalation_rate(self) -> float:
        return self.escalations / max(self.tasks_total, 1)


def calculate_health_score(metrics: AgentMetrics) -> dict:
    """Calculate the Agent Health Score (0-100)."""

    def _score(value: float, green_check, yellow_check) -> int:
        if green_check(value):
            return 100
        if yellow_check(value):
            return 60
        return 20

    components = {
        "task_success_rate": {
            "value": metrics.task_success_rate,
            "score": _score(metrics.task_success_rate,
                            lambda v: v >= 0.90, lambda v: v >= 0.70),
            "weight": 0.30,
        },
        "tool_success_rate": {
            "value": metrics.tool_success_rate,
            "score": _score(metrics.tool_success_rate,
                            lambda v: v >= 0.95, lambda v: v >= 0.85),
            "weight": 0.20,
        },
        "context_utilisation": {
            "value": metrics.context_utilisation,
            "score": _score(metrics.context_utilisation,
                            lambda v: 0.30 <= v <= 0.70,
                            lambda v: 0.10 <= v <= 0.90),
            "weight": 0.15,
        },
        "cost_efficiency": {
            "value": metrics.cost_per_success,
            "score": _score(metrics.cost_per_success,
                            lambda v: v < metrics.budget_per_task_usd,
                            lambda v: v < metrics.budget_per_task_usd * 1.5),
            "weight": 0.15,
        },
        "latency_p95": {
            "value": metrics.latency_p95_ms / 1000,  # to seconds
            "score": _score(metrics.latency_p95_ms / 1000,
                            lambda v: v < 10, lambda v: v < 30),
            "weight": 0.10,
        },
        "escalation_rate": {
            "value": metrics.escalation_rate,
            "score": _score(metrics.escalation_rate,
                            lambda v: v < 0.10, lambda v: v < 0.25),
            "weight": 0.10,
        },
    }

    ahs = sum(c["score"] * c["weight"] for c in components.values())
    level = (HealthLevel.GREEN if ahs >= 80
             else HealthLevel.YELLOW if ahs >= 60
             else HealthLevel.RED)

    return {
        "agent_health_score": round(ahs, 1),
        "level": level.value,
        "components": components,
        "timestamp": time.time(),
    }
```
9.4 Dashboard Panels — Agent Production Monitoring
```yaml
# Grafana panels specific to AI agents in production
panels:
  - title: "Agent Health Score (AHS)"
    type: gauge
    query: agent_health_score
    thresholds: [60, 80, 100]
    colors: ["red", "yellow", "green"]
    description: "Composite 0-100 metric. Target >= 80."

  - title: "Task Success Rate (rolling 1h)"
    type: stat
    query: |
      rate(agent_tasks_success_total[1h]) / rate(agent_tasks_total[1h])
    thresholds: [0.70, 0.90]
    format: "percent"

  - title: "Tool Call Success Rate by Tool"
    type: bar
    query: |
      sum by (tool_name) (rate(agent_tool_calls_success[1h]))
      / sum by (tool_name) (rate(agent_tool_calls_total[1h]))
    description: "Quickly identifies problematic tools"

  - title: "Context Window Utilisation"
    type: timeseries
    queries:
      - label: "tokens_used"
        query: agent_context_tokens_used
      - label: "70% threshold"
        query: agent_context_tokens_available * 0.70
      - label: "90% critical"
        query: agent_context_tokens_available * 0.90
    description: "Monitors context window saturation"

  - title: "Cost per Successful Task"
    type: timeseries
    query: |
      rate(agent_cost_total_usd[1h]) / rate(agent_tasks_success_total[1h])
    thresholds: [0.50, 0.75]  # budget_per_task * 1.0 / 1.5

  - title: "Agent Latency Distribution"
    type: heatmap
    query: |
      rate(agent_task_duration_seconds_bucket[5m])
    description: "End-to-end latency distribution"

  - title: "Escalation Rate (rolling 24h)"
    type: stat
    query: |
      increase(agent_escalations_total[24h]) / increase(agent_tasks_total[24h])
    thresholds: [0.10, 0.25]
    format: "percent"

  - title: "Tool Usage Heatmap (by hour)"
    type: heatmap
    query: |
      sum by (tool_name, hour) (increase(agent_tool_calls_total[1h]))
    description: "Tool usage patterns throughout the day"

  - title: "Agent Backtracking Rate"
    type: stat
    query: |
      rate(agent_backtrack_total[1h]) / rate(agent_steps_total[1h])
    thresholds: [0.15, 0.30]
    description: "Signals inefficiency in the agent's planning"
```
9.5 Production Alerts for Agents
```yaml
groups:
  - name: agent_health_alerts
    rules:
      - alert: AgentHealthScoreCritical
        expr: agent_health_score < 60
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent_id }} health score < 60 ({{ $value }})"
          runbook: "1) Check tool_success_rate 2) Check context_utilisation 3) Check cost"

      - alert: AgentHealthScoreWarning
        expr: agent_health_score < 80 and agent_health_score >= 60
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_id }} health degraded ({{ $value }})"

      - alert: ToolFailureSpike
        expr: |
          (rate(agent_tool_calls_total[5m]) - rate(agent_tool_calls_success[5m]))
          /
          rate(agent_tool_calls_total[5m]) > 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tool failure rate > 15% for agent {{ $labels.agent_id }}"
          action: "Check MCP server health, verify tool permissions"

      - alert: ContextWindowSaturation
        expr: agent_context_tokens_used / agent_context_tokens_available > 0.90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Context window > 90% for agent {{ $labels.agent_id }}"
          action: "Trigger /compact or start new session"

      - alert: AgentCostRunaway
        expr: |
          rate(agent_cost_total_usd[1h])
          >
          (agent_budget_per_task_usd * rate(agent_tasks_total[1h]) * 2.0)
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Agent cost 2x over budget"
          action: "Check for infinite loops, excessive tool calls, model routing"

      - alert: HighBacktrackingRate
        expr: |
          rate(agent_backtrack_total[1h]) / rate(agent_steps_total[1h]) > 0.30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Agent backtracking > 30% — possible planning inefficiency"
          action: "Review agent instructions, check context quality"

      - alert: EscalationRateHigh
        expr: |
          increase(agent_escalations_total[24h])
          /
          increase(agent_tasks_total[24h]) > 0.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Agent escalation rate > 25% in 24h"
          action: "Review failure patterns, consider expanding agent capabilities"
```

### 9.6 Production Runbook — Agent Degradation
```markdown
# RUNBOOK: Agent Health Score Degradation

## Quick Triage (< 5 min)

1. **Check AHS components** — which one is red?
   - task_success_rate low → go to §A
   - tool_success_rate low → go to §B
   - context_utilisation high → go to §C
   - cost_efficiency poor → go to §D
   - latency high → go to §E

## §A — Low Task Success Rate
- [ ] Review logs of failed tasks: `query_logs("agent", "error", "1h")`
- [ ] Check for recent changes to prompts/instructions
- [ ] Check whether input data changed (new schema, encoding)
- [ ] Run the golden evaluation dataset to detect regression
- [ ] Roll back if there was a recent deployment

## §B — Low Tool Success Rate
- [ ] Identify the problematic tool: "Tool Call Success by Tool" dashboard
- [ ] Verify MCP server health: `curl http://mcp-server/health`
- [ ] Check the tool's permissions and rate limits
- [ ] Check for changes in external APIs (breaking changes)
- [ ] Failover: disable the tool and use a fallback

## §C — Saturated Context Window
- [ ] Force compaction: trigger automatic `/compact`
- [ ] Check for giant tool outputs (untruncated logs)
- [ ] Verify that .claudeignore excludes heavy files
- [ ] Start a new session if > 90% persists

## §D — Degraded Cost Efficiency
- [ ] Review model routing: is the most expensive model being used unnecessarily?
- [ ] Check cache hit rate: should be > 30%
- [ ] Look for loops: agent repeating identical tool calls
- [ ] Review token usage: is the context window too large?

## §E — High Latency
- [ ] Review latency by component: model, tool calls, retrieval
- [ ] Check MCP server queues
- [ ] Check for provider throttling (Anthropic/OpenAI)
- [ ] Consider routing to a faster model
```

### 9.7 Metrics per Track
Sección titulada «9.7 Metricas por Track»| Metrica | Solo | Lean | Full |
|---|---|---|---|
| AHS | Manual check semanal | Dashboard basico | Dashboard real-time + alerts |
| Task success | Log review | Langfuse traces | Automated + golden dataset regression |
| Tool success | --verbose flag | MCP server logs | Per-tool dashboards + auto-failover |
| Context util | /context command | Session logging | Real-time gauge + auto-compact |
| Cost | Helicone free tier | Langfuse cost tracking | Full FinOps + anomaly detection |
| Latency | Manual timing | p50/p95 dashboard | Full distribution + SLO alerts |
| Escalation | Count in notes | Tag in tracker | Automated classification + trend analysis |
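The AHS roll-up used throughout this section can be sketched in a few lines. The weights come from §9.3; the flat 100/60/20 per-component scoring is an assumption for illustration only (`_score` in §9.3 defines the actual green/yellow bands):

```python
# Weights from §9.3. Assumed scoring: 100 (green band), 60 (yellow), 20 (red).
WEIGHTS = {
    "task_success_rate": 0.30,
    "tool_success_rate": 0.20,
    "context_utilisation": 0.15,
    "cost_efficiency": 0.15,
    "latency_p95": 0.10,
    "escalation_rate": 0.10,
}


def rollup_ahs(scores: dict[str, float]) -> tuple[float, str]:
    """Weighted sum of component scores (0-100), banded per the 60/80 cutoffs."""
    ahs = sum(scores[k] * w for k, w in WEIGHTS.items())
    level = "green" if ahs >= 80 else "yellow" if ahs >= 60 else "red"
    return round(ahs, 1), level
```

Because the weights sum to 1.0, a single yellow component (e.g. latency at 60) only drags the score down by a few points; it takes two or more degraded components to leave the green band.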
## 10. Agent Drift Detection & Behavioral Baseline

GAP-20: behavioral drift detection for long-running agents — behavioral baseline, context quality score, token efficiency, and regression detection.
### 10.1 What Is Agent Drift

Agent drift is the gradual degradation of an agent's behavior over long sessions or across multiple sessions. Unlike a discrete error, drift is silent — the agent keeps working, but its outputs progressively diverge from the expected behavior.

Main causes:

- Context pollution: irrelevant or contradictory information accumulated in the context window
- Model updates: changes in the underlying model that alter behavior
- Prompt rot: instructions that stop being effective as the project context changes
- Data drift: the distribution of input data shifts but the instructions are not adapted
- Tool evolution: external APIs change their response format
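The core detection idea is simple: compare a fresh output against a stored baseline output and check that the expected markers still appear. A minimal sketch, using character-level `difflib` similarity as a stand-in for the semantic similarity a production setup would use (`behavior_drift` is a hypothetical helper, not part of any framework):

```python
from difflib import SequenceMatcher


def behavior_drift(baseline_output: str, current_output: str,
                   expected_patterns: list[str]) -> dict:
    """Score one golden behavior: textual similarity to the stored baseline,
    plus a hard check that the expected patterns still appear."""
    similarity = SequenceMatcher(None, baseline_output, current_output).ratio()
    missing = [p for p in expected_patterns if p not in current_output]
    return {
        "similarity": round(similarity, 3),
        "patterns_ok": not missing,
        "missing_patterns": missing,
        # 0.85 threshold matches the baseline config in §10.2
        "alert": similarity < 0.85 or bool(missing),
    }
```

Pattern checks catch hard regressions (the agent stopped emitting `describe(` blocks at all), while the similarity score catches soft drift that keeps the right shape but changes the substance.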
### 10.2 Behavioral Baseline

Establish a golden behavior dataset: expected outputs for N standard inputs, executed periodically to detect deviations.

```yaml
behavioral_baseline:
  version: "1.0.0"
  schedule: "weekly"            # Execution frequency
  similarity_threshold: 0.85    # Alert if it drops below

  golden_behaviors:
    - id: "BB-001"
      description: "Generate a unit test for a utility function"
      input:
        task: "Write a test for a sum(a, b) function that returns the sum"
        context: "TypeScript project with Jest"
      expected_patterns:
        - "describe("
        - "expect("
        - "toBe("
      expected_structure:
        has_imports: true
        has_describe_block: true
        has_at_least_n_assertions: 2
      quality_score_baseline: 0.92

    - id: "BB-002"
      description: "Review code with a security issue"
      input:
        task: "Review this code: const query = `SELECT * FROM users WHERE id = ${userId}`"
        context: "Node.js project"
      expected_patterns:
        - "SQL injection"
        - "parameterized"
        - "prepared statement"
      quality_score_baseline: 0.95

    - id: "BB-003"
      description: "Create an ADR for an architecture decision"
      input:
        task: "Document the decision to use PostgreSQL instead of MongoDB"
        context: "Project with relational data and ACID transactions"
      expected_patterns:
        - "Context"
        - "Decision"
        - "Consequences"
      quality_score_baseline: 0.88

  # Compare current outputs vs baseline weekly
  comparison:
    method: "semantic_similarity + pattern_match"
    alert_if_similarity_below: 0.85
    alert_channels: ["slack:#agent-quality", "email:tech-lead"]
```

Weekly workflow:
1. Run the golden behaviors against the current agent
2. Compare outputs with the baseline (semantic similarity + patterns)
3. If similarity < 0.85 → alert + investigate the cause
4. If similarity > 0.95 → update the baseline (the agent improved)
5. Document results in .factory/drift-reports/

### 10.3 Context Quality Score (CQS)
Formula for measuring context quality in an active session:

```
CQS = (useful_tokens / total_tokens) × (1 - error_rate) × recency_factor
```

Where:

- useful_tokens: tokens that contributed to correct outputs (estimated via tool call success and task completion)
- total_tokens: total tokens consumed in the session
- error_rate: ratio of failed tool calls / total tool calls
- recency_factor: decay factor based on the session's age
"""Context Quality Score calculator."""import mathfrom dataclasses import dataclass
@dataclassclass SessionMetrics: """Metricas de una sesion activa.""" total_tokens: int useful_tokens: int # tokens en calls exitosas tool_calls_total: int tool_calls_failed: int session_age_minutes: int
def calculate_cqs(metrics: SessionMetrics) -> float: """Calcula el Context Quality Score (0.0 - 1.0).
Args: metrics: Metricas de la sesion activa.
Returns: CQS entre 0.0 y 1.0. Target >= 0.70. """ # Token efficiency ratio token_ratio = metrics.useful_tokens / max(metrics.total_tokens, 1)
# Error rate error_rate = metrics.tool_calls_failed / max(metrics.tool_calls_total, 1)
# Recency factor: decae exponencialmente despues de 120 minutos # 0-120 min: ~1.0, 240 min: ~0.5, 360 min: ~0.25 half_life_minutes = 120 recency_factor = math.exp( -0.693 * max(metrics.session_age_minutes - half_life_minutes, 0) / half_life_minutes ) recency_factor = min(recency_factor, 1.0)
cqs = token_ratio * (1 - error_rate) * recency_factor return round(cqs, 4)Umbrales de CQS:
| CQS | Status | Action |
|---|---|---|
| >= 0.80 | Green | Normal operation |
| 0.60 - 0.79 | Yellow | Consider compaction |
| 0.40 - 0.59 | Orange | Urgent compaction or relay |
| < 0.40 | Red | Start a new session immediately |
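A worked example of the formula against these bands, using hypothetical session numbers:

```python
import math


def classify_cqs(cqs: float) -> str:
    """Map a CQS value to the bands in the thresholds table."""
    if cqs >= 0.80:
        return "green"
    if cqs >= 0.60:
        return "yellow"
    if cqs >= 0.40:
        return "orange"
    return "red"


# Hypothetical 3-hour session (illustrative numbers, not real telemetry)
total_tokens, useful_tokens = 80_000, 52_000
tool_calls_total, tool_calls_failed = 40, 4
session_age_minutes, half_life = 180, 120

token_ratio = useful_tokens / total_tokens         # 0.65
error_rate = tool_calls_failed / tool_calls_total  # 0.10
recency = math.exp(-0.693 * max(session_age_minutes - half_life, 0) / half_life)

cqs = round(token_ratio * (1 - error_rate) * recency, 4)
band = classify_cqs(cqs)   # ~0.41 → "orange": urgent compaction or relay
```

Note how a healthy token ratio (0.65) still lands in the orange band once the session passes the 120-minute half-life — recency alone halves the score by the four-hour mark.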
### 10.4 Token Efficiency Analysis

Metrics to evaluate token consumption efficiency:

Tokens per successful action (TPA): the primary efficiency metric — lower is better.
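TPA and its bands can be computed directly from the two counters; `tpa_band` is a hypothetical helper using the green/yellow/red cutoffs from the TPA panel:

```python
def tpa_band(tokens_total: int, tasks_success: int) -> tuple[float, str]:
    """Tokens per successful action, banded per the panel thresholds
    (green < 5000, yellow < 15000, red >= 15000)."""
    tpa = tokens_total / max(tasks_success, 1)
    band = "green" if tpa < 5_000 else "yellow" if tpa < 15_000 else "red"
    return tpa, band
```

A rising TPA with a flat success rate usually means context bloat or tool-call loops, not harder tasks.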
panels: - title: "Tokens per Successful Action (TPA)" type: timeseries query: | rate(agent_tokens_total[1h]) / rate(agent_tasks_success_total[1h]) thresholds: green: "< 5000" yellow: "< 15000" red: ">= 15000" description: "Eficiencia: tokens consumidos por cada accion exitosa"
- title: "Context Utilization Curve" type: timeseries queries: - label: "CQS" query: agent_context_quality_score - label: "Token ratio" query: agent_useful_tokens / agent_total_tokens - label: "Compaction threshold" query: 0.60 description: "CQS a lo largo de la sesion — indica momento optimo para compaction"
- title: "Session Age vs CQS" type: scatter query: | agent_context_quality_score by (session_age_bucket) description: "Correlacion entre edad de sesion y calidad de contexto"Timing optimo de compaction basado en CQS:
```
Session start ──> CQS ~1.0
      │
      │ (120 min) CQS ~0.80 → monitor
      │
      │ (180 min) CQS ~0.65 → schedule compaction
      │
      │ (240 min) CQS ~0.50 → urgent compaction or relay
      │
      v (300 min) CQS < 0.40 → new session mandatory
```

### 10.5 Regression Detection
Run the golden behavior dataset weekly to detect regressions:

```bash
# Weekly run (CI job or cron)
# 1. Run the behavioral baseline
python3 baseline/scripts/fab-eval-builders.py \
  --golden-dataset observability/agent_behavioral_baseline.yaml \
  --output .factory/drift-reports/drift-$(date +%Y%m%d).json

# 2. Compare with the previous week
python3 baseline/scripts/compare-evals.py \
  --baseline .factory/drift-reports/drift-$(date -d "7 days ago" +%Y%m%d).json \
  --candidate .factory/drift-reports/drift-$(date +%Y%m%d).json \
  --threshold 0.05

# 3. If there is a regression, notify
# compare-evals.py returns exit code 1 if regression > threshold
```

Alert channels:
| Regression level | Channel | Action |
|---|---|---|
| 0-5% | Log only | Monitor the trend |
| 5-15% | Slack #agent-quality | Investigate cause, review prompts |
| 15-30% | Slack + email tech lead | Freeze changes, roll back if necessary |
| > 30% | PagerDuty | Incident response, immediate rollback |
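The escalation table maps directly onto a routing function; the behavior at exactly 5/15/30% is an assumption, since the table leaves the boundaries open:

```python
def regression_alert_channel(regression_pct: float) -> str:
    """Route a weekly regression (percent score drop) to an alert channel
    per the escalation table. Exact boundary handling is an assumption."""
    if regression_pct > 30:
        return "pagerduty"
    if regression_pct > 15:
        return "slack+email:tech-lead"
    if regression_pct > 5:
        return "slack:#agent-quality"
    return "log-only"
```

Keeping the routing in one pure function makes the policy trivially testable and keeps the CI job (which only knows the regression percentage) decoupled from the notification plumbing.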
### 10.6 Diagnostic Toolkit

Tools for diagnosing and resolving agent drift:

Proposed extension of fab-eval-builders.py:

```bash
# Drift check against the behavioral baseline
python3 baseline/scripts/fab-eval-builders.py --drift-check \
  --baseline observability/agent_behavioral_baseline.yaml \
  --period 7d

# Expected output:
# DRIFT REPORT — Last 7 days
# ──────────────────────────
# BB-001 (unit test gen):    0.93 → 0.91 (-2.1%) ✅ OK
# BB-002 (security review):  0.95 → 0.88 (-7.4%) ⚠️ WARN
# BB-003 (ADR creation):     0.88 → 0.85 (-3.4%) ✅ OK
# ──────────────────────────
# Overall drift: -4.3% ⚠️ INVESTIGATE BB-002
```

Grafana dashboard template for agent drift:
panels: - title: "Behavioral Drift Trend (weekly)" type: timeseries queries: - label: "Overall similarity" query: agent_behavioral_similarity_avg - label: "Threshold" query: 0.85 description: "Tendencia semanal de similitud con behavioral baseline"
- title: "CQS Distribution by Session" type: histogram query: agent_context_quality_score description: "Distribucion de CQS al final de cada sesion"
- title: "Drift by Behavior Category" type: bar query: | agent_behavioral_similarity by (behavior_id) description: "Similitud por categoria de comportamiento"
- title: "Token Efficiency Trend" type: timeseries query: | rate(agent_tokens_total[24h]) / rate(agent_tasks_success_total[24h]) description: "TPA trending — detectar degradacion de eficiencia"Umbrales de alerta:
| Metric | Warn | Critical | Action |
|---|---|---|---|
| Behavioral similarity | < 85% | < 70% | Investigate / roll back |
| Average CQS | < 0.65 | < 0.50 | Adjust compaction timing |
| TPA (tokens/action) | > 10K | > 20K | Review context loading |
| Weekly regression | > 15% | > 30% | Freeze changes |
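The drift report shown in §10.6 can be reproduced with a small helper. `drift_report` is hypothetical; the usage numbers mirror the example output above, modulo rounding:

```python
def drift_report(baseline: dict[str, float], current: dict[str, float],
                 warn_drop: float = 0.05) -> dict:
    """Relative week-over-week drift per golden behavior, flagging any score
    that dropped more than warn_drop (default 5%), plus the overall average."""
    behaviors, deltas = {}, []
    for bb_id, base in baseline.items():
        delta = (current[bb_id] - base) / base
        deltas.append(delta)
        behaviors[bb_id] = {
            "delta_pct": round(delta * 100, 1),
            "status": "WARN" if delta < -warn_drop else "OK",
        }
    overall = round(sum(deltas) / len(deltas) * 100, 1)
    return {"behaviors": behaviors, "overall_drift_pct": overall}


# Scores from the §10.6 example report
report = drift_report(
    {"BB-001": 0.93, "BB-002": 0.95, "BB-003": 0.88},
    {"BB-001": 0.91, "BB-002": 0.88, "BB-003": 0.85},
)
```

Flagging per behavior rather than only on the overall average matters: here the overall drift (about -4.3%) stays under the 5% warn line while BB-002 alone has regressed more than 7%.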
## 11. Implementation Checklist

### Quick Start (Day 1)

- Install Helicone proxy OR Langfuse
- Instrument the first LLM call with cost tracking
- Verify that spans carry the 10 mandatory attributes

### Production (Week 1)

- Configure the dashboard with the 8 essential panels
- Create alerts for budget and error rate
- Implement a basic semantic cache
- Configure cost attribution by feature

### Maturity (Month 1)

- Automatic model routing by complexity
- HANDOFF.md automation at the end of sessions
- Complete EU AI Act audit trail
- Monthly cost review with the team

### Agent Drift (Month 2)

- Create a golden behavior dataset with at least 5 cases
- Configure weekly execution of the behavioral baseline
- Implement the CQS calculator in agent sessions
- Configure the drift dashboard in Grafana
- Establish regression alerts (warn 15%, critical 30%)
- Integrate `compare-evals.py` into the weekly CI job
Document generated as an operational companion to the AI-First Engineering Framework v7.6.0 — GAP-18 (Tool Landscape), GAP-19 (Agent Production Monitoring), and GAP-20 (Agent Drift Detection) closed