
F09 — Deploy & Operations

AI-First Engineering Framework — Baseline v6.5


Version: 3.0.0 | Status: Active | Date: March 2026 | Type: Framework Baseline


┌───────────────────────────────────────────────────────────────┐
│ PILLAR 1: STRUCTURED LOGS                                     │
│ JSON structured logs with: timestamp, level, trace_id,        │
│ tenant_id, agent_id, session_id, event_type                   │
├───────────────────────────────────────────────────────────────┤
│ PILLAR 2: DISTRIBUTED TRACING                                 │
│ OpenTelemetry to correlate: API request → Agent call          │
│ → Tool call → LLM call → Response                             │
├───────────────────────────────────────────────────────────────┤
│ PILLAR 3: METRICS                                             │
│ Metrics platform (e.g., Prometheus, Datadog, CloudWatch)      │
│ for: latency, tokens, cost, errors,                           │
│ cache hits, hallucination rate, RAG recall                    │
├───────────────────────────────────────────────────────────────┤
│ PILLAR 4: LLMOps (e.g., Langfuse, LangSmith, Arize)           │
│ AI-specific traceability: prompts, completions,               │
│ evals, scores, user feedback, prompt A/B tests                │
└───────────────────────────────────────────────────────────────┘
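Pillar 1 (structured logs) can be sketched with the standard library alone. This is a minimal illustration, not the framework's logging module: the `JsonLogFormatter` name is made up for this example, but the field list matches the one in the box above.

```python
import json
import logging
import sys
import time
import uuid


class JsonLogFormatter(logging.Formatter):
    """Emits each log record as one JSON object with the Pillar 1 fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            # Context fields are attached via `extra=`; missing ones stay null
            "trace_id": getattr(record, "trace_id", None),
            "tenant_id": getattr(record, "tenant_id", None),
            "agent_id": getattr(record, "agent_id", None),
            "session_id": getattr(record, "session_id", None),
            "event_type": getattr(record, "event_type", None),
            "message": record.getMessage(),
        })


logger = logging.getLogger("agent")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLogFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "tool call completed",
    extra={
        "trace_id": str(uuid.uuid4()),
        "tenant_id": "tenant-a",
        "agent_id": "support-agent",
        "session_id": "sess-123",
        "event_type": "tool.completed",
    },
)
```

Because every line is a single JSON object, the logs can be ingested and filtered by `tenant_id` or `trace_id` without any parsing heuristics.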

Note: this example uses Langfuse. Adapt it to the tool selected in project-config.yaml. Alternatives: LangSmith, Arize, Phoenix. The patterns (tracing, scoring, feedback) are universal.

observability/langfuse_tracer.py

from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe
import os

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
)

class LangfuseTracer:
    @observe(name="agent.run")
    async def trace_agent_run(
        self,
        message: str,
        session_id: str,
        tenant_id: str,
        agent_fn
    ):
        langfuse_context.update_current_observation(
            user_id=session_id,
            session_id=session_id,
            metadata={"tenant_id": tenant_id}
        )
        result = await agent_fn(message)
        langfuse_context.update_current_observation(
            output=result.text,
            usage={
                "input": result.input_tokens,
                "output": result.output_tokens,
                "total": result.total_tokens
            }
        )
        return result

    def record_user_feedback(
        self,
        trace_id: str,
        score: float,
        comment: str | None = None
    ):
        """Records user feedback (thumbs up/down, 1-5 stars)."""
        langfuse.score(
            trace_id=trace_id,
            name="user_feedback",
            value=score,  # 1.0 = positive, 0.0 = negative
            comment=comment
        )

# retriever, llm, build_rag_prompt and evaluate_faithfulness are
# provided by the application's RAG layer.
@observe(name="rag.pipeline")
async def traced_rag(query: str, tenant_id: str):
    # Trace the retrieval step
    langfuse_context.update_current_observation(
        name="rag.retrieval",
        metadata={"query": query, "tenant_id": tenant_id}
    )
    chunks = await retriever.search(query, tenant_id)
    langfuse_context.update_current_observation(
        output=chunks,
        metadata={
            "chunks_retrieved": len(chunks),
            "top_score": chunks[0].score if chunks else 0,
            "sources": [c.source for c in chunks]
        }
    )
    # Trace the generation step
    response = await llm.generate(
        prompt=build_rag_prompt(query, chunks)
    )
    # Automatically score response quality
    langfuse.score(
        trace_id=langfuse_context.get_current_trace_id(),
        name="faithfulness",
        value=await evaluate_faithfulness(response, chunks)
    )
    return response

observability/metrics.py

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counters
llm_requests_total = Counter(
    'llm_requests_total',
    'Total LLM calls',
    ['agent_id', 'model', 'tenant_id', 'status']
)
llm_tokens_total = Counter(
    'llm_tokens_total',
    'Total tokens consumed',
    ['agent_id', 'model', 'token_type']  # token_type: input/output
)
llm_cost_usd_total = Counter(
    'llm_cost_usd_total',
    'Total cost in USD',
    ['agent_id', 'model', 'tenant_id']
)
tool_executions_total = Counter(
    'tool_executions_total',
    'Total MCP tool executions',
    ['tool_name', 'agent_id', 'status']
)
cache_hits_total = Counter(
    'cache_hits_total',
    'Semantic cache hits',
    ['cache_type']  # semantic, exact
)

# Histograms (latency)
llm_latency_ms = Histogram(
    'llm_latency_ms',
    'LLM call latency in ms',
    ['model', 'agent_id'],
    buckets=[100, 300, 500, 1000, 2000, 5000, 10000]
)
rag_retrieval_latency_ms = Histogram(
    'rag_retrieval_latency_ms',
    'RAG search latency in ms',
    buckets=[10, 50, 100, 200, 500, 1000]
)

# Gauges (current values)
active_sessions = Gauge(
    'active_sessions',
    'Active agent sessions',
    ['agent_id']
)
daily_budget_remaining_usd = Gauge(
    'daily_budget_remaining_usd',
    'Remaining daily budget in USD',
    ['tenant_id']
)
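As a sketch of how these metrics get updated around a single LLM call, the wrapper below times the call and records request status, latency, and token usage. It is an illustration, not framework code: `instrumented_llm_call` and the result-dict shape are assumptions, and the metric objects are passed in as parameters so the same code works with the Prometheus counters and histograms defined above.

```python
import time
from typing import Any, Callable


def instrumented_llm_call(
    call_llm: Callable[[str], dict],
    prompt: str,
    *,
    agent_id: str,
    model: str,
    tenant_id: str,
    requests_counter: Any,   # e.g. llm_requests_total
    tokens_counter: Any,     # e.g. llm_tokens_total
    latency_histogram: Any,  # e.g. llm_latency_ms
) -> dict:
    """Run one LLM call and record request count, latency and token usage."""
    start = time.monotonic()
    status = "error"
    try:
        result = call_llm(prompt)
        status = "success"
        return result
    finally:
        # Request count and latency are recorded on both success and error
        requests_counter.labels(
            agent_id=agent_id, model=model, tenant_id=tenant_id, status=status
        ).inc()
        latency_histogram.labels(model=model, agent_id=agent_id).observe(
            (time.monotonic() - start) * 1000
        )
        # Token counters only apply when the call succeeded
        if status == "success":
            tokens_counter.labels(
                agent_id=agent_id, model=model, token_type="input"
            ).inc(result["input_tokens"])
            tokens_counter.labels(
                agent_id=agent_id, model=model, token_type="output"
            ).inc(result["output_tokens"])
```

Recording metrics in a `finally` block guarantees that failed calls still show up in the error-rate and latency series used by the alerts later in this section.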
Metric                    Description                                Alert threshold
hallucination_rate        % of responses unsupported by context      > 10%
faithfulness_avg          Average faithfulness of RAG responses      < 0.8
rag_recall_at_5           Recall@5 of the RAG system                 < 85%
user_satisfaction         Average user feedback score                < 3.5/5
prompt_injection_blocked  Blocked injection attempts                 > 10/hour
cost_per_session_usd      Average cost per session                   > $0.50
token_efficiency          Output tokens / input tokens               < 0.3
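Two of the quality metrics above are simple ratios. The sketch below shows one way to compute them; the function names are illustrative, and `relevant_ids` is assumed to come from a labeled evaluation set.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents found among the top-k retrieved."""
    if not relevant_ids:
        return 1.0  # nothing to find counts as a perfect retrieval
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def token_efficiency(output_tokens: int, input_tokens: int) -> float:
    """Output/input token ratio; values below 0.3 suggest oversized prompts."""
    return output_tokens / input_tokens if input_tokens else 0.0
```

For example, retrieving `["a", "b", "c"]` when the relevant set is `{"a", "z"}` gives a recall@5 of 0.5, and a 1,000-token prompt that produces 300 output tokens sits exactly at the 0.3 efficiency threshold.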

finops/model_router.py

from dataclasses import dataclass
from datetime import datetime, timezone

import redis.asyncio as redis

@dataclass
class ModelOption:
    name: str
    cost_per_1k_input: float
    cost_per_1k_output: float
    max_context: int
    capabilities: list[str]

@dataclass
class TaskContext:
    type: str
    data_sensitivity: str
    estimated_tokens: int
    requires_tools: bool

@dataclass
class CostCheckResult:
    allowed: bool
    reason: str | None = None
    remaining: float | None = None

MODELS = {
    "claude-haiku-3.5": ModelOption(
        name="claude-haiku-3.5",
        cost_per_1k_input=0.00025,
        cost_per_1k_output=0.00125,
        max_context=200000,
        capabilities=["fast", "simple_tasks", "classification"]
    ),
    "claude-sonnet-4.5": ModelOption(
        name="claude-sonnet-4.5",
        cost_per_1k_input=0.003,
        cost_per_1k_output=0.015,
        max_context=200000,
        capabilities=["complex_reasoning", "code", "analysis", "agentic"]
    ),
    "gpt-4o-mini": ModelOption(
        name="gpt-4o-mini",
        cost_per_1k_input=0.00015,
        cost_per_1k_output=0.0006,
        max_context=128000,
        capabilities=["fast", "simple_tasks", "classification", "extraction"]
    ),
    "llama-local": ModelOption(
        name="ollama/llama3.3:70b",
        cost_per_1k_input=0.0,  # On-premise, compute cost only
        cost_per_1k_output=0.0,
        max_context=131072,
        capabilities=["any", "private_data"]
    )
}

class IntelligentModelRouter:
    def select_model(self, task: TaskContext) -> str:
        # Sensitive data → always on-premise
        if task.data_sensitivity == "high":
            return "llama-local"
        # Simple classification → cheapest model
        if task.type in ["classification", "extraction", "simple_qa"]:
            return "gpt-4o-mini"
        # Very long context → Claude (larger window)
        if task.estimated_tokens > 100000:
            return "claude-sonnet-4.5"
        # Agentic task with tools → Claude (stronger agentic reasoning)
        if task.requires_tools:
            return "claude-sonnet-4.5"
        # Default: Haiku for cost savings
        return "claude-haiku-3.5"

    def estimate_cost(self, model_name: str, input_tokens: int, output_tokens: int) -> float:
        model = MODELS[model_name]
        input_cost = (input_tokens / 1000) * model.cost_per_1k_input
        output_cost = (output_tokens / 1000) * model.cost_per_1k_output
        return input_cost + output_cost

class CostGuard:
    def __init__(self):
        self.redis = redis.Redis()
        self.DAILY_LIMITS = {
            "starter": 5.0,       # $5/day
            "pro": 50.0,          # $50/day
            "enterprise": 500.0   # $500/day
        }

    async def check_and_record(
        self,
        tenant_id: str,
        tenant_plan: str,
        estimated_cost: float
    ) -> CostCheckResult:
        today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        key = f"cost:{tenant_id}:{today}"
        limit = self.DAILY_LIMITS[tenant_plan]
        current = float(await self.redis.get(key) or 0)
        if current + estimated_cost > limit:
            return CostCheckResult(
                allowed=False,
                reason=f"Daily limit reached: ${current:.2f}/${limit:.2f}"
            )
        # Record the cost
        await self.redis.incrbyfloat(key, estimated_cost)
        await self.redis.expire(key, 86400)
        # Alert at 80% of the limit
        if (current + estimated_cost) / limit > 0.8:
            await self._send_budget_alert(tenant_id, current + estimated_cost, limit)
        return CostCheckResult(allowed=True, remaining=limit - current - estimated_cost)
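To make the routing trade-off concrete, here is the pricing arithmetic from estimate_cost worked through for one request. The standalone `PRICES` dict below just mirrors two entries of the MODELS table so the numbers can be checked in isolation.

```python
# (cost_per_1k_input, cost_per_1k_output) in USD, mirroring MODELS above
PRICES = {
    "gpt-4o-mini": (0.00015, 0.0006),
    "claude-sonnet-4.5": (0.003, 0.015),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Same formula as IntelligentModelRouter.estimate_cost."""
    cost_in, cost_out = PRICES[model]
    return (input_tokens / 1000) * cost_in + (output_tokens / 1000) * cost_out


# A 2,000-input / 500-output-token classification request:
cheap = estimate_cost("gpt-4o-mini", 2000, 500)         # 0.0006 USD
strong = estimate_cost("claude-sonnet-4.5", 2000, 500)  # 0.0135 USD
```

Routing that request to gpt-4o-mini instead of claude-sonnet-4.5 cuts its cost by more than 20x, which is why the router reserves the stronger model for long-context and tool-using tasks.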
// Panel: daily cost by model
{
"title": "Daily LLM Cost by Model",
"type": "timeseries",
"query": "sum by (model) (increase(llm_cost_usd_total[$__interval]))",
"unit": "currencyUSD"
}
// Panel: Cache savings
{
"title": "Cost Saved by Semantic Cache",
"type": "stat",
"query": "increase(cache_hits_total[24h]) * 0.003", // Average cost avoided per hit
"unit": "currencyUSD",
"color": "green"
}
// Panel: Token efficiency
{
"title": "Token Efficiency (Output/Input ratio)",
"type": "gauge",
"query": "rate(llm_tokens_total{token_type='output'}[1h]) / rate(llm_tokens_total{token_type='input'}[1h])",
"thresholds": [{"value": 0.3, "color": "red"}, {"value": 0.5, "color": "yellow"}]
}

monitoring/alerts.yml

groups:
  - name: ai_critical
    rules:
      # Excessive cost
      - alert: LLMCostSpikeHigh
        expr: increase(llm_cost_usd_total[1h]) > 50
        for: 5m
        annotations:
          summary: "LLM cost spike > $50/hour"
          runbook: "${WIKI_URL}/runbooks/cost-spike"
        labels:
          severity: critical
      # High error rate
      - alert: AgentErrorRateHigh
        expr: rate(llm_requests_total{status="error"}[5m]) / rate(llm_requests_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: "Agent error rate > 10%"
        labels:
          severity: critical
      # High latency
      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m])) > 5000
        for: 10m
        annotations:
          summary: "LLM p95 latency > 5 seconds"
        labels:
          severity: warning
  - name: ai_quality
    rules:
      # Degraded RAG quality
      - alert: RAGQualityDegraded
        expr: avg_over_time(rag_faithfulness[1h]) < 0.75
        for: 30m
        annotations:
          summary: "RAG quality degraded (faithfulness < 0.75)"
        labels:
          severity: warning
      # High hallucination rate
      - alert: HallucinationRateHigh
        expr: rate(hallucinations_total[1h]) / rate(llm_requests_total[1h]) > 0.15
        for: 15m
        annotations:
          summary: "Hallucination rate > 15%"
        labels:
          severity: critical
      # Injection attempts
      - alert: PromptInjectionSurge
        expr: increase(tool_executions_total{status="blocked_injection"}[1h]) > 50
        for: 1m
        annotations:
          summary: "Surge of prompt injection attempts (>50/hour)"
        labels:
          severity: critical
## Runbook: Cost Spike (LLMCostSpikeHigh)

### Symptoms
- Alert: llm_cost_usd_total > $50/hour
- Grafana dashboard: Cost panel in red

### Diagnosis (< 5 min)
1. Identify the tenant with the highest consumption:
   `SELECT tenant_id, SUM(cost_usd) FROM conversation_messages WHERE created_at > NOW() - INTERVAL '1h' GROUP BY 1 ORDER BY 2 DESC LIMIT 5`
2. Check for an agent loop:
   `SELECT session_id, COUNT(*) FROM conversation_messages WHERE created_at > NOW() - INTERVAL '1h' GROUP BY 1 HAVING COUNT(*) > 100`
3. Review the model used in Langfuse

### Actions
- Loop detected: pause the affected session via the API
- Malicious tenant: block in Redis (`SET blocked:tenant:{id} 1`)
- Bug in the agent: roll back to the previous release

### Escalation
If not resolved within 15 min: alert ${ON_CALL_CHANNEL}
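The "block in Redis" action above can be wrapped in two small helpers so the gateway enforces it automatically. This is a minimal sketch, assuming a Redis-style client with `set`/`exists` (redis.Redis in production); the key layout matches the `SET blocked:tenant:{id} 1` command, and the helper names are illustrative.

```python
def block_tenant(client, tenant_id: str, ttl_seconds: int = 3600) -> None:
    """Block a tenant for ttl_seconds; the TTL makes the block self-expiring."""
    client.set(f"blocked:tenant:{tenant_id}", 1, ex=ttl_seconds)


def is_blocked(client, tenant_id: str) -> bool:
    """Gateway check to run before serving any request for this tenant."""
    return bool(client.exists(f"blocked:tenant:{tenant_id}"))
```

Using a TTL instead of a bare `SET` means an on-call engineer who forgets to unblock the tenant does not leave it blocked forever.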

Row 1: HEALTH
├── API Response Time p95 (< 800ms)
├── Error Rate (< 1%)
├── Active Sessions
└── LLM Availability (%)
Row 2: COSTS
├── Daily Spend by Model ($ timeseries)
├── Cost per Tenant (table)
├── Cache Hit Rate (%)
└── Tokens Used Today
Row 3: AI QUALITY
├── Hallucination Rate (%)
├── User Satisfaction Score (avg)
├── RAG Faithfulness (avg)
└── Prompt Injections Blocked
Row 4: OPERATIONS
├── Active Agents
├── Tool Executions/min
├── HITL Pending Approvals
└── DB Connection Pool Usage
