Observability Part 2 - OTeL Collector

OTeL Collector is the workhorse of telemetry collection.

 

What is OTeL Collector

OpenTelemetry describes OpenTelemetry Collector as:

The OpenTelemetry Collector offers a vendor-agnostic implementation of how to receive, process, and export telemetry data. It removes the need to run, operate, and maintain multiple agents/collectors. This works with improved scalability and supports open-source observability data formats (e.g. Jaeger, Prometheus, Fluent Bit, etc.) sending to one or more open source or commercial backends.

It operates in two modes when collecting telemetry - receiving and scraping:

  • Receiving telemetry on exposed HTTP and gRPC ports
  • Scraping endpoints exposed by tooling such as cAdvisor or Node Exporter

Configuration

OTeL Collector ships with essentially no configuration out of the box.
It is configured in YAML, and the structure is easy to understand.

You have different sections:


receivers:
extensions:
processors:
exporters:
service:
  pipelines:

Receivers

Receivers are prebuilt "plugins" You can use to configure how You want to receive Your telemetry.

 

We are interested in 3 for our journey - prometheus, filelog and otlp.
OTLP stands for OpenTelemetry Protocol, which is a standardized way to send and receive telemetry.
We will use the prometheus receiver to collect data from Node Exporter and cAdvisor, the filelog receiver to tail system logs like syslog, auth.log and kern.log, and finally the otlp receiver to expose HTTP and gRPC endpoints our application can send telemetry to.

 

Due to some restrictions with rootless Docker, OTeL Collector's hostmetrics receiver cannot get access to the full system resources.
That is why, for the time being, we will use Node Exporter as a binary on the system for full access.

 

Because we have cAdvisor running in a container and Node Exporter running as a binary on the host, the receiver will look slightly different.
First, let's define cAdvisor receiver.

 

We will use docker-compose.yaml to bring up both otel-collector and cAdvisor in the same network. That is why we will use its container name to connect to it. See Deploying observers below for the Docker-related details.

 

NB! Note the /cadvisor and /node suffixes in this configuration. This is how You "name" plugin instances, so You can use the same plugin for more than one receiver/processor/exporter.

 


receivers:
  prometheus/cadvisor:
    config:
      scrape_configs:
        - job_name: cadvisor
          scrape_interval: 10s
          static_configs:
            - targets: ['cadvisor:8080']

 

For Node Exporter, it looks similar, but we have to set the host IP so that otel-collector, running inside the Docker network, can reach it on the host.


receivers:
  ...

  prometheus/node:
    config:
      scrape_configs:
        - job_name: node
          scrape_interval: 10s
          static_configs:
            - targets: ['192.168.1.X:9100']

Don't forget to replace the placeholder IP with Your current host IP.
You can find it with ip addr show and look for the line starting with inet 192.168....
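
For example, a quick way to filter straight to it (assuming Your LAN uses the common 192.168.0.0/16 range):

ip addr show | grep 'inet 192.168'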

 

Next, we have the filelog receiver.
Your system might have different log formats in use.

 

For RFC 5424 (time format 2024-04-15T19:30:22.123456+00:00), use:


regex: ^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}\+\d{2}:\d{2}) (?P<user>\S+) (?P<service>[^:\[]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$

For RFC 3164 (time format Apr 13 19:30:22), use:


regex: ^(?P<time>[A-Z][a-z]{2} [ 0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}) (?P<host>\S+) (?P<service>[^\[:]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
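
As a quick sanity check, here is how the RFC 3164 regex splits a typical auth.log line (the log line itself is a made-up example):

# Example line:
#   Apr 13 19:30:22 myhost sshd[1234]: Failed password for invalid user admin
# Captured groups:
#   time="Apr 13 19:30:22", host="myhost", service="sshd", pid="1234",
#   msg="Failed password for invalid user admin"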

We need to adjust the log parser a bit to extract the labels we want to pass on to Loki.


receivers:
  ...
  filelog/syslog:
    include: [/var/log/syslog, /var/log/auth.log, /var/log/kern.log]
    operators:
      - type: regex_parser
        id: syslog_parser
        regex: ^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}\+\d{2}:\d{2}) (?P<user>\S+) (?P<service>[^:\[]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
        attributes:
          source: syslog

      - type: regex_parser
        id: detect_error_level
        regex: '(?i)(?P<level>error)'
        parse_from: attributes.msg
        if: 'attributes.msg contains "error"'

      - type: add
        id: fallback_level
        if: 'attributes.level == nil'
        field: attributes.level
        value: INFO

So let's break this down a bit.
We want to be able to display logs in different panels in Grafana, so we are using regex to parse information into Loki labels.
Those labels are:

  • time
  • user
  • service
  • pid
  • msg
  • level

That way we can later make different panels to show auth, system and docker logs.

 

Finally, we have the otlp endpoints.


receivers:
  ...
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

We are only going to use the gRPC endpoint because of its efficient binary format.
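
For reference, an application using an OpenTelemetry SDK can be pointed at this endpoint through the standard exporter environment variables; <collector-host> is a placeholder for wherever the collector runs:

# standard OpenTelemetry SDK exporter settings
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://<collector-host>:4317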

 

Our receivers will now look like this.


receivers:
  prometheus/cadvisor:
    config:
      scrape_configs:
        - job_name: cadvisor
          scrape_interval: 10s
          static_configs:
            - targets: ['cadvisor:8080']

  prometheus/node:
    config:
      scrape_configs:
        - job_name: node
          scrape_interval: 10s
          static_configs:
            - targets: ['192.168.1.X:9100']

  filelog/syslog:
    include: [/var/log/syslog, /var/log/auth.log, /var/log/kern.log]
    operators:
      - type: regex_parser
        id: syslog_parser
        regex: ^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}\+\d{2}:\d{2}) (?P<user>\S+) (?P<service>[^:\[]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
        attributes:
          source: syslog

      - type: regex_parser
        id: detect_error_level
        regex: '(?i)(?P<level>error)'
        parse_from: attributes.msg
        if: 'attributes.msg contains "error"'

      - type: add
        id: fallback_level
        if: 'attributes.level == nil'
        field: attributes.level
        value: INFO

  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

 

Of course, there are many, many more receivers to use.
See otel-collector-contrib receivers for the full list of available receivers.
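
For illustration only, the hostmetrics receiver mentioned earlier would look roughly like this if the collector had full host access (we stick with Node Exporter instead):

receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network: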

Extensions

We are not using extensions in this series, at least not for now.
OpenTelemetry describes extensions as:

Extensions provide capabilities on top of the primary functionality of the collector. Generally, extensions are used for implementing components that can be added to the Collector, but which do not require direct access to telemetry data and are not part of the pipelines (like receivers, processors, or exporters).
Example extensions are: Health Check extension that responds to health check requests or PProf extension that allows fetching Collector's performance profile.
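
We won't wire any in, but for a rough idea, enabling the Health Check extension would look something like this:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]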

Exporters

Exporters are "plugins" for exporting telemetry data to desired backends.
For us, those are Loki for logs, Prometheus for metrics, and Tempo for traces.

 

For this series, there are two exporters we can use - otlp and otlphttp.
Which one to use depends on the backends.
For us, Tempo accepts gRPC, so we use otlp for it; Prometheus and Loki accept http/protobuf, so we use otlphttp for those.
We are also taking a first step in securing the observability stack by encrypting data in transit with TLS. We will get back to the application-to-otel-collector TLS connection in the future.

 

For this series, I will denote backends as if they were docker containers.
Substitute them with Your appropriate backends.


exporters:
  otlp/tempo:
    endpoint: "https://tempo:4318"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  otlphttp/prometheus:
    endpoint: "https://prometheus:9090/api/v1/otlp"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  otlphttp/loki:
    endpoint: "https://loki:3100/otlp"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  debug:

We also include the debug exporter in case You need to troubleshoot problems with pipelines.
NB! otlphttp knows to append the signal-specific path (e.g. /v1/metrics), so don't append it here.
We will need to append it ourselves later when we export OTeL Collector's own telemetry via http/protobuf.

Processors

Processors are a neat way to manipulate or filter telemetry.
We are going to use a few to set common attributes.

 

First, we want to batch telemetry for exporting. It is more efficient that way.
This one is easy.


processors:
  batch:
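
The defaults are fine for us, but if You ever need to tune it, the batch processor accepts knobs such as a timeout and a batch size; a minimal sketch:

processors:
  batch:
    timeout: 5s
    send_batch_size: 8192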

Next, we want to filter out some Node Exporter metrics to reduce some noise in Prometheus.


processors:
  batch:
  filter/metrics:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - go_.*
          - process_.*
          - misc_.*
          - scrape_.*
          - promhttp_.*
          

And that covers all processors for now. We will come back to this later to expand it as the series progresses.

Service

The last thing is the service section.
It ties all the components together into pipelines. But first, we want to export otel-collector's own telemetry as well.


service:
  telemetry:
    metrics:
      readers:
        - periodic:
            interval: 15000
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: "https://prometheus:9090/api/v1/otlp/v1/metrics"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [prometheus/cadvisor, prometheus/node, otlp]
      processors: [batch, filter/metrics]
      exporters: [otlphttp/prometheus]
    logs/application:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    logs/system:
      receivers: [filelog/syslog]
      processors: [batch]
      exporters: [otlphttp/loki]

You should now see how all the receivers, processors, and exporters are tied together in the pipelines. You simply add the desired plugin by name in the appropriate place and it works.
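
For example, if system logs never show up in Loki, You could temporarily add the debug exporter we defined earlier to that pipeline and watch the collector's own output:

    logs/system:
      receivers: [filelog/syslog]
      processors: [batch]
      exporters: [otlphttp/loki, debug]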

 

The config file should now look like this:


receivers:
  prometheus/cadvisor:
    config:
      scrape_configs:
        - job_name: cadvisor
          scrape_interval: 10s
          static_configs:
            - targets: ['cadvisor:8080']

  prometheus/node:
    config:
      scrape_configs:
        - job_name: node
          scrape_interval: 10s
          static_configs:
            - targets: ['192.168.1.X:9100']

  filelog/syslog:
    include: [/var/log/syslog, /var/log/auth.log, /var/log/kern.log]
    operators:
      - type: regex_parser
        id: syslog_parser
        regex: ^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}\+\d{2}:\d{2}) (?P<user>\S+) (?P<service>[^:\[]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
        attributes:
          source: syslog

      - type: regex_parser
        id: detect_error_level
        regex: '(?i)(?P<level>error)'
        parse_from: attributes.msg
        if: 'attributes.msg contains "error"'

      - type: add
        id: fallback_level
        if: 'attributes.level == nil'
        field: attributes.level
        value: INFO

  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp/tempo:
    endpoint: "https://tempo:4318"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  otlphttp/prometheus:
    endpoint: "https://prometheus:9090/api/v1/otlp"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  otlphttp/loki:
    endpoint: "https://loki:3100/otlp"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  debug:

processors:
  batch:
  filter/metrics:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - go_.*
          - process_.*
          - misc_.*
          - scrape_.*
          - promhttp_.*

service:
  telemetry:
    metrics:
      readers:
        - periodic:
            interval: 15000
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: "https://prometheus:9090/api/v1/otlp/v1/metrics"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [prometheus/cadvisor, prometheus/node, otlp]
      processors: [batch, filter/metrics]
      exporters: [otlphttp/prometheus]
    logs/application:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    logs/system:
      receivers: [filelog/syslog]
      processors: [batch]
      exporters: [otlphttp/loki]

Deploying observers

There is no point in instrumenting Your code if You have nothing that collects and transmits that telemetry.
That is why we first set up the "observer" stack on the same host where the application will be deployed.

Now create a docker-compose.yaml file and spin up the observers.


services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: otel-collector
    environment:
      SSL_CERT_FILE: /etc/otel/certs/ca.pem
    command:
      - --config=/etc/otel/config/otel-collector-config.yaml
    volumes:
      - ./ca.pem:/etc/otel/certs/ca.pem
      - ./client_cert.pem:/etc/otel/certs/client_cert.pem
      - ./client_key.pem:/etc/otel/certs/client_key.pem
      - ./otel-collector-config.yaml:/etc/otel/config/otel-collector-config.yaml
      - /var/log:/var/log:ro
    ports:
      - "4317:4317"
      - "4318:4318"
    networks:
      - observers

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.52.1
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /run/user/1000:/var/run:ro
      - /home/<user>/bin/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
      - /etc/machine-id:/etc/machine-id
    devices:
      - /dev/kmsg
    ports:
      - "8080:8080"
    networks:
      - observers


networks:
  observers:
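
With the config and certificate files sitting next to docker-compose.yaml, bring the stack up and tail the collector's logs to confirm the configuration loaded:

docker compose up -d
docker compose logs -f otel-collector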

In the next part, we set up the observability stack so we can receive telemetry from OTeL Collector.