
F09 — Deploy & Operations

AI-First Engineering Framework — Baseline v6.5


Version: 3.0.0 | Status: Active | Date: March 2026 | Type: Framework Baseline


┌───────────────────────────────────────────────────────────────┐
│ PILLAR 1: STRUCTURED LOGS                                     │
│ JSON structured logs with: timestamp, level, trace_id,        │
│ tenant_id, agent_id, session_id, event_type                   │
├───────────────────────────────────────────────────────────────┤
│ PILLAR 2: DISTRIBUTED TRACING                                 │
│ OpenTelemetry to correlate: API request → Agent call          │
│ → Tool call → LLM call → Response                             │
├───────────────────────────────────────────────────────────────┤
│ PILLAR 3: METRICS                                             │
│ Metrics platform (e.g., Prometheus, Datadog, CloudWatch)      │
│ for: latency, tokens, cost, errors,                           │
│ cache hits, hallucination rate, RAG recall                    │
├───────────────────────────────────────────────────────────────┤
│ PILLAR 4: LLMOps (e.g., Langfuse, LangSmith, Arize)           │
│ AI-specific traceability: prompts, completions,               │
│ evals, scores, user feedback, prompt A/B tests                │
└───────────────────────────────────────────────────────────────┘
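Pillar 1 (structured logs) can be sketched with the standard library alone. This is a minimal illustration, not the framework's logging module: the `JsonLogFormatter` name is made up for this example, but the field list matches the one in the box above.

```python
import json
import logging
import sys
import time
import uuid


class JsonLogFormatter(logging.Formatter):
    """Emits each log record as one JSON object with the Pillar 1 fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            # Context fields are attached via `extra=`; missing ones stay null
            "trace_id": getattr(record, "trace_id", None),
            "tenant_id": getattr(record, "tenant_id", None),
            "agent_id": getattr(record, "agent_id", None),
            "session_id": getattr(record, "session_id", None),
            "event_type": getattr(record, "event_type", None),
            "message": record.getMessage(),
        })


logger = logging.getLogger("agent")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLogFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "tool call completed",
    extra={
        "trace_id": str(uuid.uuid4()),
        "tenant_id": "tenant-a",
        "agent_id": "support-agent",
        "session_id": "sess-123",
        "event_type": "tool.completed",
    },
)
```

Because every line is a single JSON object, the logs can be ingested and filtered by `tenant_id` or `trace_id` without any parsing heuristics.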

Note: this example uses Langfuse. Adapt it to the tool selected in project-config.yaml. Alternatives: LangSmith, Arize, Phoenix. The patterns (tracing, scoring, feedback) are universal.

observability/langfuse_tracer.py

from langfuse import Langfuse
from langfuse.decorators import langfuse_context, observe
import os

langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
)

class LangfuseTracer:
    @observe(name="agent.run")
    async def trace_agent_run(
        self,
        message: str,
        session_id: str,
        tenant_id: str,
        agent_fn
    ):
        langfuse_context.update_current_observation(
            user_id=session_id,
            session_id=session_id,
            metadata={"tenant_id": tenant_id}
        )
        result = await agent_fn(message)
        langfuse_context.update_current_observation(
            output=result.text,
            usage={
                "input": result.input_tokens,
                "output": result.output_tokens,
                "total": result.total_tokens
            }
        )
        return result

    def record_user_feedback(
        self,
        trace_id: str,
        score: float,
        comment: str | None = None
    ):
        """Records user feedback (thumbs up/down, 1-5 stars)."""
        langfuse.score(
            trace_id=trace_id,
            name="user_feedback",
            value=score,  # 1.0 = positive, 0.0 = negative
            comment=comment
        )

# retriever, llm, build_rag_prompt and evaluate_faithfulness are
# provided by the application's RAG layer.
@observe(name="rag.pipeline")
async def traced_rag(query: str, tenant_id: str):
    # Trace the retrieval step
    langfuse_context.update_current_observation(
        name="rag.retrieval",
        metadata={"query": query, "tenant_id": tenant_id}
    )
    chunks = await retriever.search(query, tenant_id)
    langfuse_context.update_current_observation(
        output=chunks,
        metadata={
            "chunks_retrieved": len(chunks),
            "top_score": chunks[0].score if chunks else 0,
            "sources": [c.source for c in chunks]
        }
    )
    # Trace the generation step
    response = await llm.generate(
        prompt=build_rag_prompt(query, chunks)
    )
    # Automatically score response quality
    langfuse.score(
        trace_id=langfuse_context.get_current_trace_id(),
        name="faithfulness",
        value=await evaluate_faithfulness(response, chunks)
    )
    return response

observability/metrics.py

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counters
llm_requests_total = Counter(
    'llm_requests_total',
    'Total LLM calls',
    ['agent_id', 'model', 'tenant_id', 'status']
)
llm_tokens_total = Counter(
    'llm_tokens_total',
    'Total tokens consumed',
    ['agent_id', 'model', 'token_type']  # token_type: input/output
)
llm_cost_usd_total = Counter(
    'llm_cost_usd_total',
    'Total cost in USD',
    ['agent_id', 'model', 'tenant_id']
)
tool_executions_total = Counter(
    'tool_executions_total',
    'Total MCP tool executions',
    ['tool_name', 'agent_id', 'status']
)
cache_hits_total = Counter(
    'cache_hits_total',
    'Semantic cache hits',
    ['cache_type']  # semantic, exact
)

# Histograms (latency)
llm_latency_ms = Histogram(
    'llm_latency_ms',
    'LLM call latency in ms',
    ['model', 'agent_id'],
    buckets=[100, 300, 500, 1000, 2000, 5000, 10000]
)
rag_retrieval_latency_ms = Histogram(
    'rag_retrieval_latency_ms',
    'RAG search latency in ms',
    buckets=[10, 50, 100, 200, 500, 1000]
)

# Gauges (current values)
active_sessions = Gauge(
    'active_sessions',
    'Active agent sessions',
    ['agent_id']
)
daily_budget_remaining_usd = Gauge(
    'daily_budget_remaining_usd',
    'Remaining daily budget in USD',
    ['tenant_id']
)
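As a sketch of how these metrics get updated around a single LLM call, the wrapper below times the call and records request status, latency, and token usage. It is an illustration, not framework code: `instrumented_llm_call` and the result-dict shape are assumptions, and the metric objects are passed in as parameters so the same code works with the Prometheus counters and histograms defined above.

```python
import time
from typing import Any, Callable


def instrumented_llm_call(
    call_llm: Callable[[str], dict],
    prompt: str,
    *,
    agent_id: str,
    model: str,
    tenant_id: str,
    requests_counter: Any,   # e.g. llm_requests_total
    tokens_counter: Any,     # e.g. llm_tokens_total
    latency_histogram: Any,  # e.g. llm_latency_ms
) -> dict:
    """Run one LLM call and record request count, latency and token usage."""
    start = time.monotonic()
    status = "error"
    try:
        result = call_llm(prompt)
        status = "success"
        return result
    finally:
        # Request count and latency are recorded on both success and error
        requests_counter.labels(
            agent_id=agent_id, model=model, tenant_id=tenant_id, status=status
        ).inc()
        latency_histogram.labels(model=model, agent_id=agent_id).observe(
            (time.monotonic() - start) * 1000
        )
        # Token counters only apply when the call succeeded
        if status == "success":
            tokens_counter.labels(
                agent_id=agent_id, model=model, token_type="input"
            ).inc(result["input_tokens"])
            tokens_counter.labels(
                agent_id=agent_id, model=model, token_type="output"
            ).inc(result["output_tokens"])
```

Recording metrics in a `finally` block guarantees that failed calls still show up in the error-rate and latency series used by the alerts later in this section.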
Metric                    Description                                Alert threshold
hallucination_rate        % of responses unsupported by context      > 10%
faithfulness_avg          Average faithfulness of RAG responses      < 0.8
rag_recall_at_5           Recall@5 of the RAG system                 < 85%
user_satisfaction         Average user feedback score                < 3.5/5
prompt_injection_blocked  Blocked injection attempts                 > 10/hour
cost_per_session_usd      Average cost per session                   > $0.50
token_efficiency          Output tokens / input tokens               < 0.3
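Two of the quality metrics above are simple ratios. The sketch below shows one way to compute them; the function names are illustrative, and `relevant_ids` is assumed to come from a labeled evaluation set.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents found among the top-k retrieved."""
    if not relevant_ids:
        return 1.0  # nothing to find counts as a perfect retrieval
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def token_efficiency(output_tokens: int, input_tokens: int) -> float:
    """Output/input token ratio; values below 0.3 suggest oversized prompts."""
    return output_tokens / input_tokens if input_tokens else 0.0
```

For example, retrieving `["a", "b", "c"]` when the relevant set is `{"a", "z"}` gives a recall@5 of 0.5, and a 1,000-token prompt that produces 300 output tokens sits exactly at the 0.3 efficiency threshold.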

finops/model_router.py

from dataclasses import dataclass
from datetime import datetime, timezone

import redis.asyncio as redis

@dataclass
class ModelOption:
    name: str
    cost_per_1k_input: float
    cost_per_1k_output: float
    max_context: int
    capabilities: list[str]

@dataclass
class TaskContext:
    type: str
    data_sensitivity: str
    estimated_tokens: int
    requires_tools: bool

@dataclass
class CostCheckResult:
    allowed: bool
    reason: str | None = None
    remaining: float | None = None

MODELS = {
    "claude-haiku-3.5": ModelOption(
        name="claude-haiku-3.5",
        cost_per_1k_input=0.00025,
        cost_per_1k_output=0.00125,
        max_context=200000,
        capabilities=["fast", "simple_tasks", "classification"]
    ),
    "claude-sonnet-4.5": ModelOption(
        name="claude-sonnet-4.5",
        cost_per_1k_input=0.003,
        cost_per_1k_output=0.015,
        max_context=200000,
        capabilities=["complex_reasoning", "code", "analysis", "agentic"]
    ),
    "gpt-4o-mini": ModelOption(
        name="gpt-4o-mini",
        cost_per_1k_input=0.00015,
        cost_per_1k_output=0.0006,
        max_context=128000,
        capabilities=["fast", "simple_tasks", "classification", "extraction"]
    ),
    "llama-local": ModelOption(
        name="ollama/llama3.3:70b",
        cost_per_1k_input=0.0,  # On-premise, compute cost only
        cost_per_1k_output=0.0,
        max_context=131072,
        capabilities=["any", "private_data"]
    )
}

class IntelligentModelRouter:
    def select_model(self, task: TaskContext) -> str:
        # Sensitive data → always on-premise
        if task.data_sensitivity == "high":
            return "llama-local"
        # Simple classification → cheapest model
        if task.type in ["classification", "extraction", "simple_qa"]:
            return "gpt-4o-mini"
        # Very long context → Claude (larger window)
        if task.estimated_tokens > 100000:
            return "claude-sonnet-4.5"
        # Agentic task with tools → Claude (stronger agentic reasoning)
        if task.requires_tools:
            return "claude-sonnet-4.5"
        # Default: Haiku for cost savings
        return "claude-haiku-3.5"

    def estimate_cost(self, model_name: str, input_tokens: int, output_tokens: int) -> float:
        model = MODELS[model_name]
        input_cost = (input_tokens / 1000) * model.cost_per_1k_input
        output_cost = (output_tokens / 1000) * model.cost_per_1k_output
        return input_cost + output_cost

class CostGuard:
    def __init__(self):
        self.redis = redis.Redis()
        self.DAILY_LIMITS = {
            "starter": 5.0,       # $5/day
            "pro": 50.0,          # $50/day
            "enterprise": 500.0   # $500/day
        }

    async def check_and_record(
        self,
        tenant_id: str,
        tenant_plan: str,
        estimated_cost: float
    ) -> CostCheckResult:
        today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        key = f"cost:{tenant_id}:{today}"
        limit = self.DAILY_LIMITS[tenant_plan]
        current = float(await self.redis.get(key) or 0)
        if current + estimated_cost > limit:
            return CostCheckResult(
                allowed=False,
                reason=f"Daily limit reached: ${current:.2f}/${limit:.2f}"
            )
        # Record the cost
        await self.redis.incrbyfloat(key, estimated_cost)
        await self.redis.expire(key, 86400)
        # Alert at 80% of the limit
        if (current + estimated_cost) / limit > 0.8:
            await self._send_budget_alert(tenant_id, current + estimated_cost, limit)
        return CostCheckResult(allowed=True, remaining=limit - current - estimated_cost)
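To make the routing trade-off concrete, here is the pricing arithmetic from estimate_cost worked through for one request. The standalone `PRICES` dict below just mirrors two entries of the MODELS table so the numbers can be checked in isolation.

```python
# (cost_per_1k_input, cost_per_1k_output) in USD, mirroring MODELS above
PRICES = {
    "gpt-4o-mini": (0.00015, 0.0006),
    "claude-sonnet-4.5": (0.003, 0.015),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Same formula as IntelligentModelRouter.estimate_cost."""
    cost_in, cost_out = PRICES[model]
    return (input_tokens / 1000) * cost_in + (output_tokens / 1000) * cost_out


# A 2,000-input / 500-output-token classification request:
cheap = estimate_cost("gpt-4o-mini", 2000, 500)         # 0.0006 USD
strong = estimate_cost("claude-sonnet-4.5", 2000, 500)  # 0.0135 USD
```

Routing that request to gpt-4o-mini instead of claude-sonnet-4.5 cuts its cost by more than 20x, which is why the router reserves the stronger model for long-context and tool-using tasks.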
// Panel: daily cost by model
{
"title": "Daily LLM Cost by Model",
"type": "timeseries",
"query": "sum by (model) (increase(llm_cost_usd_total[$__interval]))",
"unit": "currencyUSD"
}
// Panel: Cache savings
{
"title": "Cost Saved by Semantic Cache",
"type": "stat",
"query": "increase(cache_hits_total[24h]) * 0.003", // Average cost avoided per hit
"unit": "currencyUSD",
"color": "green"
}
// Panel: Token efficiency
{
"title": "Token Efficiency (Output/Input ratio)",
"type": "gauge",
"query": "rate(llm_tokens_total{token_type='output'}[1h]) / rate(llm_tokens_total{token_type='input'}[1h])",
"thresholds": [{"value": 0.3, "color": "red"}, {"value": 0.5, "color": "yellow"}]
}

monitoring/alerts.yml

groups:
  - name: ai_critical
    rules:
      # Excessive cost
      - alert: LLMCostSpikeHigh
        expr: increase(llm_cost_usd_total[1h]) > 50
        for: 5m
        annotations:
          summary: "LLM cost spike > $50/hour"
          runbook: "${WIKI_URL}/runbooks/cost-spike"
        labels:
          severity: critical
      # High error rate
      - alert: AgentErrorRateHigh
        expr: rate(llm_requests_total{status="error"}[5m]) / rate(llm_requests_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: "Agent error rate > 10%"
        labels:
          severity: critical
      # High latency
      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.95, rate(llm_latency_ms_bucket[5m])) > 5000
        for: 10m
        annotations:
          summary: "LLM p95 latency > 5 seconds"
        labels:
          severity: warning
  - name: ai_quality
    rules:
      # Degraded RAG quality
      - alert: RAGQualityDegraded
        expr: avg_over_time(rag_faithfulness[1h]) < 0.75
        for: 30m
        annotations:
          summary: "RAG quality degraded (faithfulness < 0.75)"
        labels:
          severity: warning
      # High hallucination rate
      - alert: HallucinationRateHigh
        expr: rate(hallucinations_total[1h]) / rate(llm_requests_total[1h]) > 0.15
        for: 15m
        annotations:
          summary: "Hallucination rate > 15%"
        labels:
          severity: critical
      # Injection attempts
      - alert: PromptInjectionSurge
        expr: increase(tool_executions_total{status="blocked_injection"}[1h]) > 50
        for: 1m
        annotations:
          summary: "Surge of prompt injection attempts (>50/hour)"
        labels:
          severity: critical
## Runbook: Cost Spike (LLMCostSpikeHigh)

### Symptoms
- Alert: llm_cost_usd_total > $50/hour
- Grafana dashboard: Cost panel in red

### Diagnosis (< 5 min)
1. Identify the tenant with the highest consumption:
   `SELECT tenant_id, SUM(cost_usd) FROM conversation_messages WHERE created_at > NOW() - INTERVAL '1h' GROUP BY 1 ORDER BY 2 DESC LIMIT 5`
2. Check for an agent loop:
   `SELECT session_id, COUNT(*) FROM conversation_messages WHERE created_at > NOW() - INTERVAL '1h' GROUP BY 1 HAVING COUNT(*) > 100`
3. Review the model used in Langfuse

### Actions
- Loop detected: pause the affected session via the API
- Malicious tenant: block in Redis (`SET blocked:tenant:{id} 1`)
- Bug in the agent: roll back to the previous release

### Escalation
If not resolved within 15 min: alert ${ON_CALL_CHANNEL}
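The "block in Redis" action above can be wrapped in two small helpers so the gateway enforces it automatically. This is a minimal sketch, assuming a Redis-style client with `set`/`exists` (redis.Redis in production); the key layout matches the `SET blocked:tenant:{id} 1` command, and the helper names are illustrative.

```python
def block_tenant(client, tenant_id: str, ttl_seconds: int = 3600) -> None:
    """Block a tenant for ttl_seconds; the TTL makes the block self-expiring."""
    client.set(f"blocked:tenant:{tenant_id}", 1, ex=ttl_seconds)


def is_blocked(client, tenant_id: str) -> bool:
    """Gateway check to run before serving any request for this tenant."""
    return bool(client.exists(f"blocked:tenant:{tenant_id}"))
```

Using a TTL instead of a bare `SET` means an on-call engineer who forgets to unblock the tenant does not leave it blocked forever.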

Row 1: HEALTH
├── API Response Time p95 (< 800ms)
├── Error Rate (< 1%)
├── Active Sessions
└── LLM Availability (%)
Row 2: COSTS
├── Daily Spend by Model ($ timeseries)
├── Cost per Tenant (table)
├── Cache Hit Rate (%)
└── Tokens Used Today
Row 3: AI QUALITY
├── Hallucination Rate (%)
├── User Satisfaction Score (avg)
├── RAG Faithfulness (avg)
└── Prompt Injections Blocked
Row 4: OPERATIONS
├── Active Agents
├── Tool Executions/min
├── HITL Pending Approvals
└── DB Connection Pool Usage
