Chapter 7: OpenTelemetry Collector

By: Artiko
Tags: opentelemetry, collector, otelcol, pipeline, receivers, exporters

What is the Collector

The OTel Collector is a telemetry proxy: it receives data from your applications, processes it, and forwards it to one or more backends. It is vendor-neutral and is configured in YAML.

flowchart LR
    APP1[App Python] -->|OTLP/gRPC| COL
    APP2[App Node.js] -->|OTLP/HTTP| COL
    APP3[App Go] -->|OTLP/gRPC| COL
    PROM[Prometheus\nscrape] -->|pull| COL

    subgraph COL[OTel Collector]
        R[Receivers] --> P[Processors] --> E[Exporters]
    end

    COL -->|traces| JAE[Jaeger / Tempo]
    COL -->|metrics| PRO[Prometheus / Thanos]
    COL -->|logs| LOK[Loki / Elasticsearch]
    COL -->|all signals| HNY[Honeycomb / Datadog]

Why use the Collector instead of exporting directly from each application:

- Decoupling: applications only speak OTLP; you can swap backends without touching application code.
- Central processing: batching, retries, filtering, attribute redaction, and sampling happen in one place.
- Resilience: the Collector buffers data and absorbs backend outages and traffic spikes.
- Fan-out: the same signals can be sent to several backends at once, as the diagram above shows.

Architecture: Pipelines

The unit of configuration is the pipeline. Each pipeline handles one signal type (traces, metrics, logs) and connects receivers → processors → exporters:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, otlp/honeycomb]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Receivers

Receivers take in telemetry from the sources:

OTLP Receiver (the most common)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["https://my-app.com"]
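
On the application side, nothing more than pointing the SDK at these ports is needed. A minimal sketch, assuming the app's OTel SDK honors the standard OTEL_* environment variables (the service name, image, and hostnames are placeholders):

# docker-compose.yaml (app service fragment)
services:
  my-service:
    image: my-service:latest                                      # placeholder image
    environment:
      - OTEL_SERVICE_NAME=my-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317    # gRPC port
      - OTEL_EXPORTER_OTLP_PROTOCOL=grpc                          # or http/protobuf with :4318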

Prometheus Receiver (scrape)

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-service'
          static_configs:
            - targets: ['localhost:8080']
          scrape_interval: 30s

Other useful receivers

receivers:
  # Read logs from files (like Fluentd/Logstash)
  filelog:
    include: [/var/log/app/*.log]
    operators:
      - type: json_parser

  # Host metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

  # Docker metrics
  docker_stats:
    endpoint: unix:///var/run/docker.sock
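
None of these receivers does anything until it is referenced by a pipeline. A sketch of how they could be wired in, reusing exporter names defined later in this chapter:

service:
  pipelines:
    metrics/host:
      receivers: [hostmetrics, docker_stats]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs/files:
      receivers: [filelog]
      processors: [memory_limiter, batch]
      exporters: [loki]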

Processors

Processors transform telemetry in transit:

memory_limiter — Memory control

Always the first one in the pipeline:

processors:
  memory_limiter:
    limit_mib: 512           # total limit
    spike_limit_mib: 128     # headroom for spikes
    check_interval: 5s

batch — Groups data before exporting

Always the last one before the exporters:

processors:
  batch:
    send_batch_size: 1000
    timeout: 5s
    send_batch_max_size: 1500

attributes — Modify attributes

processors:
  attributes:
    actions:
      # Add a fixed attribute
      - key: environment
        value: production
        action: insert

      # Hash sensitive data
      - key: enduser.email
        action: hash

      # Remove an attribute
      - key: http.request.header.authorization
        action: delete

      # Rename: copy the value to a new key, then remove the old one
      # (the attributes processor has no single "rename" action)
      - key: new.attribute.name
        from_attribute: old.attribute.name
        action: insert
      - key: old.attribute.name
        action: delete

filter — Drop signals

processors:
  filter:
    # Drop health-check spans
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/metrics"'
    # Drop high-cardinality metrics in dev
    metrics:
      metric:
        - 'name == "debug.internal.counter"'

resource — Modify resource attributes

processors:
  resource:
    attributes:
      - key: service.namespace
        value: payments
        action: insert
      - key: cloud.provider
        value: aws
        action: insert

tail_sampling — Smart sampling

Tail sampling waits until it has seen the whole trace before deciding whether to keep it:

processors:
  tail_sampling:
    decision_wait: 10s    # wait 10s to see the complete trace
    num_traces: 100000    # traces kept in the buffer
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep traces with errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always keep slow traces (>500ms)
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 500}

      # 1% of the rest
      - name: sample-all-else
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
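
Tail sampling only works if every span of a trace reaches the same Collector instance. With a single gateway this is automatic; with several gateway replicas, a common pattern is a first tier that routes by trace ID using the loadbalancing exporter. A sketch, assuming a DNS-resolvable headless service for the gateway (the hostname is a placeholder):

exporters:
  loadbalancing:
    routing_key: traceID       # all spans of a trace go to the same gateway instance
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317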

Exporters

Exporters send the data to the backends:

exporters:
  # Generic OTLP (Honeycomb, Datadog OTLP, etc.)
  otlp:
    endpoint: https://api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

  # Jaeger: recent Collector releases removed the dedicated jaeger exporter;
  # Jaeger accepts OTLP natively, so point an otlp exporter at it
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Prometheus remote write
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

  # Loki (logs)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

  # Debug (console output)
  debug:
    verbosity: detailed

  # Multiple destinations: type/name defines another instance of the same exporter type
  otlp/datadog:
    endpoint: https://trace.agent.datadoghq.com
    headers:
      DD-API-KEY: ${env:DD_API_KEY}
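
Most exporters also accept the shared queue and retry settings, which is where you tune resilience against a slow or briefly unavailable backend. A sketch with illustrative values:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
    sending_queue:             # buffer in front of the exporter
      enabled: true
      num_consumers: 4
      queue_size: 5000
    retry_on_failure:          # exponential backoff on failed exports
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s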

Complete example configuration

# otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

  batch:
    send_batch_size: 512
    timeout: 5s

  filter:
    traces:
      span:
        - 'attributes["http.route"] == "/health"'

  resource:
    attributes:
      - key: deployment.environment
        value: ${env:ENV}
        action: insert

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, resource, batch]
      exporters: [otlp/jaeger, debug]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]

  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888  # the Collector's own metrics

Deployment: Agent vs Gateway

flowchart TD
    subgraph HOSTS[Nodos / Pods]
        A1[App] --> AG1[Collector\nAgent]
        A2[App] --> AG2[Collector\nAgent]
        A3[App] --> AG3[Collector\nAgent]
    end

    AG1 --> GW[Collector\nGateway]
    AG2 --> GW
    AG3 --> GW

    GW --> JAEGER[Jaeger]
    GW --> PROM[Prometheus]
    GW --> LOKI[Loki]

Agent (sidecar or DaemonSet) — Runs close to the application. Low overhead. Local batching and buffering.

Gateway — Centralizes fan-out to multiple backends. Tail sampling. Centralized authentication.
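
In the two-tier layout the agent stays thin: it receives OTLP locally, batches, and forwards everything to the gateway over OTLP. A minimal agent-side sketch (the gateway hostname is a placeholder; with several gateway replicas behind tail sampling, replace the plain otlp exporter with the loadbalancing exporter shown earlier):

# otel-agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    limit_mib: 256
    spike_limit_mib: 64
    check_interval: 5s
  batch:

exporters:
  otlp/gateway:
    endpoint: otel-gateway:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]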

Running it with Docker Compose

# docker-compose.yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Métricas del Collector
    environment:
      - ENV=development

Distributions: Core vs Contrib

| Distribution | Description |
| --- | --- |
| otel/opentelemetry-collector | Core — only official components |
| otel/opentelemetry-collector-contrib | Contrib — also includes community-maintained components |
| grafana/otelcol-distributions | Grafana Labs distribution |

For most use cases, use contrib, since it ships far more receivers and exporters.