
Chapter 8: Health Checks and Probes

December 17, 2024 By: Artiko

kubernetes, health-checks, probes, kubectl, fastapi

Why Probes?

Without probes, Kubernetes cannot tell whether your app:

Started correctly
Is ready to receive traffic
Has hung and needs a restart

The 3 Types of Probe

Probe	Question	If it fails
Startup	Has it started yet?	Liveness and readiness checks stay on hold; the container is restarted if the probe never passes
Readiness	Can it receive traffic?	Removes the pod from the Service
Liveness	Is it still alive?	Restarts the container

Endpoints in FastAPI

from fastapi import FastAPI
import time

app = FastAPI()
start_time = time.time()

@app.get("/health/live")
def liveness():
    """Is the process alive?"""
    return {"status": "alive"}

@app.get("/health/ready")
def readiness():
    """Can it receive traffic?"""
    # This is where you could check the connection to the DB, cache, etc.
    return {"status": "ready", "uptime": time.time() - start_time}

@app.get("/health/startup")
def startup():
    """Has initialization finished?"""
    return {"status": "started"}
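
To try these endpoints the way a probe would, a small standard-library script can request each path and apply the same pass/fail rule as an HTTP probe: a status from 200 to 399, received within the timeout. This is only a sketch. The `probe` helper and the `http://localhost:8000` address are assumptions for an app running locally, not part of the chapter's code.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 1.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx) is a URLError subclass; refused
        # connections and timeouts also end up here.
        return False


if __name__ == "__main__":
    base = "http://localhost:8000"  # assumed address of the local app
    for path in ("/health/startup", "/health/ready", "/health/live"):
        print(path, "ok" if probe(base + path) else "FAILED")
```

Running it while the app is down reports every path as FAILED, which is roughly what the kubelet sees while a container is still booting.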

Configuring Probes in the Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fastapi-app
  namespace: dev
  labels:
    app: fastapi-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fastapi-app
  template:
    metadata:
      labels:
        app: fastapi-app
    spec:
      containers:
        - name: fastapi
          image: fastapi-k8s:v1
          ports:
            - containerPort: 8000
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8000
            failureThreshold: 10
            periodSeconds: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          envFrom:
            - configMapRef:
                name: fastapi-config
            - secretRef:
                name: fastapi-secrets
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"

Parameters Explained

initialDelaySeconds: wait N seconds after the container starts before the first check (default: 0)
periodSeconds: how often to run the check (default: 10)
failureThreshold: how many consecutive failures before Kubernetes acts (default: 3)
timeoutSeconds: timeout for each probe request (default: 1s)
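
These parameters combine into a time budget. As a rough estimate that ignores timeoutSeconds and scheduling jitter, a probe that never passes triggers its action after about initialDelaySeconds + periodSeconds × failureThreshold. The helper below is a hypothetical illustration of that arithmetic, applied to the values from the Deployment above.

```python
def failure_window_seconds(period_seconds: int, failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate seconds until a probe that never passes triggers its action."""
    return initial_delay_seconds + period_seconds * failure_threshold


# Values copied from the startupProbe and livenessProbe in the manifest.
startup_budget = failure_window_seconds(period_seconds=3, failure_threshold=10)
# For a container that hangs while already running, the initial delay
# no longer applies, so it is left at 0.
liveness_window = failure_window_seconds(period_seconds=15, failure_threshold=3)

print(f"startup budget: ~{startup_budget}s")               # ~30s
print(f"hung container detected after: ~{liveness_window}s")  # ~45s
```

If the app can take longer than about 30 seconds to boot, raising failureThreshold on the startupProbe is the simplest way to widen the budget.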

Readiness with Dependencies

A more realistic readiness check that verifies the database:

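import os

import asyncpg
from fastapi.responses import JSONResponse

# Reuses the `app` instance created in the earlier snippet.
# The pool stays None until the startup hook has run.
db_pool = None

@app.on_event("startup")
async def init_db():
    global db_pool
    db_pool = await asyncpg.create_pool(os.getenv("DATABASE_URL"))

@app.get("/health/ready")
async def readiness():
    # No pool yet: the app is still initializing, so report "not ready".
    if not db_pool:
        return JSONResponse({"status": "not ready", "reason": "no db pool"}, status_code=503)
    try:
        # Cheap round trip that proves the database answers queries.
        async with db_pool.acquire() as conn:
            await conn.fetchval("SELECT 1")
        return {"status": "ready"}
    except Exception as e:
        return JSONResponse({"status": "not ready", "reason": str(e)}, status_code=503)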

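An HTTP probe passes on any status from 200 to 399, so returning 503 counts as a failure. After failureThreshold consecutive failures, Kubernetes removes the pod from the Service until the probe passes again.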
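
The counting behind that behavior can be pictured as a small state machine. The class below is a simplified model written for illustration, not the kubelet's actual implementation. `ReadinessTracker` and `is_success` are hypothetical names, while `failure_threshold` and `success_threshold` mirror the probe fields failureThreshold and successThreshold.

```python
from dataclasses import dataclass


def is_success(status_code: int) -> bool:
    """HTTP probes treat any status from 200 to 399 as a pass."""
    return 200 <= status_code < 400


@dataclass
class ReadinessTracker:
    failure_threshold: int = 3
    success_threshold: int = 1
    ready: bool = False
    _failures: int = 0
    _successes: int = 0

    def observe(self, status_code: int) -> bool:
        """Record one probe result and return the resulting Ready state."""
        if is_success(status_code):
            self._successes += 1
            self._failures = 0
            if self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._failures += 1
            self._successes = 0
            # Only a full run of consecutive failures flips the state.
            if self._failures >= self.failure_threshold:
                self.ready = False
        return self.ready


tracker = ReadinessTracker()
codes = (200, 503, 503, 200, 503, 503, 503, 200)
history = [tracker.observe(code) for code in codes]
print(history)
# [True, True, True, True, True, True, False, True]
```

Two isolated 503 responses do not remove the pod. Only the run of three consecutive failures does, and a single success brings it back.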

Checking Probe Status

# Show the pod's conditions
kubectl describe pod <pod-name> -n dev | grep -A 5 "Conditions"

# Show probe failure events
kubectl get events -n dev --field-selector reason=Unhealthy

# Check whether a pod is Ready
kubectl get pods -n dev
# The READY column shows 1/1 when readiness passes
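
For scripting, the same information is available in the JSON printed by `kubectl get pod <pod-name> -n dev -o json`. The sketch below reads the Ready condition and the restart counts from that structure. The `readiness_summary` helper and the sample pod (including its name) are made up for illustration, and the sample is heavily trimmed.

```python
import json


def readiness_summary(pod: dict) -> dict:
    """Extract the Ready condition and total restarts from pod JSON."""
    status = pod.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}
    restarts = sum(cs.get("restartCount", 0)
                   for cs in status.get("containerStatuses", []))
    return {
        "name": pod.get("metadata", {}).get("name"),
        "ready": conditions.get("Ready") == "True",
        "restarts": restarts,
    }


# Hypothetical, trimmed sample of `kubectl get pod ... -o json` output.
sample = json.loads("""
{
  "metadata": {"name": "fastapi-app-example"},
  "status": {
    "conditions": [
      {"type": "Initialized", "status": "True"},
      {"type": "Ready", "status": "False"}
    ],
    "containerStatuses": [{"name": "fastapi", "restartCount": 2}]
  }
}
""")

print(readiness_summary(sample))
# {'name': 'fastapi-app-example', 'ready': False, 'restarts': 2}
```

A pod that is not Ready and has a growing restart count usually points at a failing liveness or startup probe rather than at readiness alone.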

Common Probe Errors

Pod restarting constantly:

The liveness probe fails before the app finishes starting
Fix: add a startupProbe or increase initialDelaySeconds

Pod is Ready but receives no traffic:

Readiness passes but the Service selector does not match the pod labels
Check with: kubectl get endpoints <service> -n dev

Probe timeouts:

The endpoint takes longer than 1 second to respond
Fix: increase timeoutSeconds or optimize the endpoint
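
One way to keep a health endpoint inside the probe timeout is to put a deadline on each dependency check, so a slow dependency produces a quick "not ready" answer instead of a hanging request. The sketch below uses only `asyncio`. The `check_within` helper and the two fake dependencies are hypothetical stand-ins for a real DB or cache check.

```python
import asyncio
from typing import Awaitable, Callable


async def check_within(check: Callable[[], Awaitable[object]],
                       deadline: float) -> bool:
    """Run a dependency check; report False if it errors or exceeds the deadline."""
    try:
        await asyncio.wait_for(check(), timeout=deadline)
        return True
    except Exception:
        # Covers the timeout raised by wait_for as well as check failures.
        return False


async def fast_dependency() -> None:
    await asyncio.sleep(0.01)  # stands in for a quick query


async def slow_dependency() -> None:
    await asyncio.sleep(0.5)  # stands in for a stalled connection


async def main() -> None:
    print("fast:", await check_within(fast_dependency, deadline=0.2))
    print("slow:", await check_within(slow_dependency, deadline=0.2))


if __name__ == "__main__":
    asyncio.run(main())
```

Choosing a deadline somewhat below timeoutSeconds leaves room for the HTTP response itself to reach the kubelet in time.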

In the next chapter: scaling and zero-downtime updates.