Observability Part 2 - OTeL Collector
OTeL Collector is the workhorse of telemetry collection.
Parts in this series:
- Part 1 - Introduction
- Part 2 - OTeL Collector (this part)
What is OTeL Collector
OpenTelemetry describes OpenTelemetry Collector as:
The OpenTelemetry Collector offers a vendor-agnostic implementation of how to receive, process, and export telemetry data. It removes the need to run, operate, and maintain multiple agents/collectors. This works with improved scalability and supports open-source observability data formats (e.g. Jaeger, Prometheus, Fluent Bit, etc.) sending to one or more open source or commercial backends.
It operates in two modes for receiving telemetry - receiving and scraping:
- Receiving telemetry pushed to exposed HTTP and gRPC ports
- Scraping endpoints exposed by tooling such as cAdvisor or Node Exporter
Configuration
OTeL Collector starts as a blank slate when it comes to configuration.
It is configured in YAML, and the structure is quite easy to understand.
You have different sections:
receivers:
extensions:
processors:
exporters:
service:
  pipelines:
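To see how these sections fit together before we fill them in, here is a minimal, purely illustrative config (not the one we will build) that wires an otlp receiver through the batch processor into the debug exporter:
receivers:
  otlp:                # accept OTLP over gRPC
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:               # batch telemetry before exporting
exporters:
  debug:               # print telemetry to the collector's own log
service:
  pipelines:
    traces:            # wire the named components into a traces pipeline
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
A component only does something once it is referenced in a pipeline under service; defining it alone is not enough.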
Receivers
Receivers are prebuilt "plugins" You can use to configure how You want to receive Your telemetry.
We are interested in 3 for our journey - prometheus, filelog and otlp.
OTLP stands for OpenTelemetry Protocol, which is a standardized way to
send and receive telemetry.
We will use the prometheus receiver to collect data from Node Exporter and cAdvisor, the filelog receiver to tail system logs like syslog, auth.log and kern.log, and finally the otlp receiver to expose HTTP and gRPC endpoints for our application to send telemetry to.
Due to some restrictions with rootless Docker, OTeL Collector's hostmetrics receiver cannot get full access to system resources.
That is why, for the time being, we will use Node Exporter as a binary on
the system for full access.
Because we have cAdvisor running in a container and Node Exporter running
as a binary on the host, the receiver will look slightly different.
First, let's define the cAdvisor receiver.
We will use docker-compose.yml to bring up both otel-collector and cAdvisor in the same network, which is why we can use its container name to connect to it. See deploying observers below for the Docker-related details.
NB! Note the /cadvisor, /node and similar suffixes in this configuration. This is the way to "name" plugin instances, so You can use one plugin for more than one receiver/processor/exporter.
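For example (purely illustrative, not part of our config), the same debug exporter plugin could be instantiated twice under different names and verbosity settings:
exporters:
  debug/brief:
    verbosity: basic      # one instance of the debug exporter
  debug/verbose:
    verbosity: detailed   # a second, differently configured instance
With that out of the way, here is the cAdvisor receiver: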
receivers:
prometheus/cadvisor:
config:
scrape_configs:
- job_name: cadvisor
scrape_interval: 10s
static_configs:
- targets: ['cadvisor:8080']
For Node Exporter, it looks similar, but since it runs on the host rather than inside the Docker network, we have to point otel-collector at the host IP to reach it.
receivers:
...
prometheus/node:
config:
scrape_configs:
- job_name: node
scrape_interval: 10s
static_configs:
- targets: ['192.168.1.X:9100']
Don't forget to replace the placeholder IP with Your current host IP.
You can check it with ip addr show and look for the line starting with inet 192.168....
Next, we have the filelog receiver.
Your system might have different log formats in use.
For RFC 5424 (time format 2024-04-15T19:30:22.123456+00:00), use:
regex: ^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}\+\d{2}:\d{2}) (?P<user>\S+) (?P<service>[^:\[]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
For RFC 3164 (time format Apr 13 19:30:22), use:
regex: ^(?P<time>[A-Z][a-z]{2} [ 0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}) (?P<host>\S+) (?P<service>[^\[:]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
We need to adjust the log parser a bit to extract the labels we want to pass on to Loki.
receivers:
...
filelog/syslog:
include: [/var/log/syslog, /var/log/auth.log, /var/log/kern.log]
operators:
- type: regex_parser
id: syslog_parser
regex: ^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}\+\d{2}:\d{2}) (?P<user>\S+) (?P<service>[^:\[]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
attributes:
source: syslog
- type: regex_parser
id: detect_error_level
regex: '(?i)(?P<level>error)'
parse_from: attributes.msg
if: 'attributes.msg contains "error"'
- type: add
id: fallback_level
if: 'attributes.level == nil'
field: attributes.level
value: INFO
So let's break this down a bit.
We want to be able to display logs in different panels in Grafana, so we
are using regex to parse information into Loki labels.
Those labels are:
- time
- user
- service
- pid
- msg
- level
That way we can later make different panels to show auth, system and docker logs.
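To make that concrete, here is a hypothetical log line and the attributes the operators above would produce from it (the line and its values are made up for illustration):
# hypothetical input line:
#   2024-04-15T19:30:22.123456+00:00 myhost sshd[1234]: Failed password for invalid user admin from 203.0.113.7
# resulting attributes:
time: "2024-04-15T19:30:22.123456+00:00"
user: "myhost"       # second syslog field (the hostname)
service: "sshd"
pid: "1234"
msg: "Failed password for invalid user admin from 203.0.113.7"
source: "syslog"     # set statically in the parser config above
level: "INFO"        # msg does not contain "error", so the fallback operator applies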
Finally, we have the otlp endpoints.
receivers:
...
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
We are only going to use the gRPC endpoint because its binary format is more efficient.
Our receivers will now look like this.
receivers:
prometheus/cadvisor:
config:
scrape_configs:
- job_name: cadvisor
scrape_interval: 10s
static_configs:
- targets: ['cadvisor:8080']
prometheus/node:
config:
scrape_configs:
- job_name: node
scrape_interval: 10s
static_configs:
- targets: ['192.168.1.X:9100']
filelog/syslog:
include: [/var/log/syslog, /var/log/auth.log, /var/log/kern.log]
operators:
- type: regex_parser
id: syslog_parser
regex: ^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}\+\d{2}:\d{2}) (?P<user>\S+) (?P<service>[^:\[]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
attributes:
source: syslog
- type: regex_parser
id: detect_error_level
regex: '(?i)(?P<level>error)'
parse_from: attributes.msg
if: 'attributes.msg contains "error"'
- type: add
id: fallback_level
if: 'attributes.level == nil'
field: attributes.level
value: INFO
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
Of course, there are many, many more receivers to use.
See otel-collector-contrib receivers for the full list.
Extensions
We are not using extensions in this series, at least not for now.
OpenTelemetry describes extensions as:
Extensions provide capabilities on top of the primary functionality of the collector. Generally, extensions are used for implementing components that can be added to the Collector, but which do not require direct access to telemetry data and are not part of the pipelines (like receivers, processors, or exporters).
Example extensions are: Health Check extension that responds to health check requests or PProf extension that allows fetching Collector's performance profile.
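We are not wiring any in, but for reference, enabling the Health Check extension mentioned above would be a small sketch like this (13133 is its default port; treat this as illustrative only):
extensions:
  health_check:                # answers HTTP health probes with the collector's status
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]   # extensions are enabled here, outside the pipelines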
Exporters
Exporters are "plugins" for exporting telemetry data to desired
backends.
For us, those are Loki for logs, Prometheus for metrics, and Tempo for
traces.
For this series, there are two exporters we can use - otlp and otlphttp.
Which one to use depends on the backend.
Tempo accepts gRPC, so we use otlp for it; Prometheus and Loki use http/protobuf, so we use otlphttp for those.
We are also taking a first step in securing the observability stack by encrypting data in transit with TLS. We will get back to the application-to-otel-collector TLS connection in a future part.
For this series, I will refer to the backends as if they were Docker containers.
Substitute the hostnames with Your own backends.
exporters:
  otlp/tempo:
    endpoint: "https://tempo:4317"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  otlphttp/prometheus:
    endpoint: "https://prometheus:9090/api/v1/otlp"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  otlphttp/loki:
    endpoint: "https://loki:3100/otlp"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  debug:
We will also include the debug exporter in case You need to debug problems with the pipelines.
NB! The otlphttp exporter knows to append /v1/metrics itself, so don't append it here. We will need to append it ourselves later when we export OTeL Collector's own telemetry via http/protobuf.
Processors
Processors are a neat way to manipulate or filter telemetry.
We are going to use a couple of them for now.
First, we want to batch telemetry for exporting. It is more efficient
that way.
This one is easy.
processors:
batch:
Next, we want to filter out some Node Exporter metrics to reduce noise in Prometheus.
processors:
batch:
filter/metrics:
metrics:
exclude:
match_type: regexp
metric_names:
- go_.*
- process_.*
- misc_.*
- scrape_.*
- promhttp_.*
And that covers all processors for now. We will come back to this later to expand it as the series progresses.
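As a taste of what that expansion could look like, a resource processor could stamp every piece of telemetry with a common attribute. This is only a sketch - the resource/common name, the attribute key and its value are made up, and we are not adding it to the pipelines yet:
processors:
  ...
  resource/common:
    attributes:
      - key: deployment.environment   # hypothetical attribute key
        value: homelab                # hypothetical value
        action: upsert                # add it, or overwrite if already present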
Service
The last thing is the service section.
It ties all the components together into pipelines.
But first, we want to export otel-collector's own telemetry as well.
service:
  telemetry:
    metrics:
      readers:
        - periodic:
            interval: 15000
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: "https://prometheus:9090/api/v1/otlp/v1/metrics"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [prometheus/cadvisor, prometheus/node, otlp]
      processors: [batch, filter/metrics]
      exporters: [otlphttp/prometheus]
    logs/application:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    logs/system:
      receivers: [filelog/syslog]
      processors: [batch]
      exporters: [otlphttp/loki]
You should now see how all the receivers, processors, and exporters are tied together in the pipelines. You simply add the desired plugin by name in the appropriate place and it works.
The config file should now look like this:
receivers:
prometheus/cadvisor:
config:
scrape_configs:
- job_name: cadvisor
scrape_interval: 10s
static_configs:
- targets: ['cadvisor:8080']
prometheus/node:
config:
scrape_configs:
- job_name: node
scrape_interval: 10s
static_configs:
- targets: ['192.168.1.X:9100']
filelog/syslog:
include: [/var/log/syslog, /var/log/auth.log, /var/log/kern.log]
operators:
- type: regex_parser
id: syslog_parser
regex: ^(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}\+\d{2}:\d{2}) (?P<user>\S+) (?P<service>[^:\[]+)(\[(?P<pid>\d+)\])?: (?P<msg>.*)$
attributes:
source: syslog
- type: regex_parser
id: detect_error_level
regex: '(?i)(?P<level>error)'
parse_from: attributes.msg
if: 'attributes.msg contains "error"'
- type: add
id: fallback_level
if: 'attributes.level == nil'
field: attributes.level
value: INFO
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
exporters:
  otlp/tempo:
    endpoint: "https://tempo:4317"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  otlphttp/prometheus:
    endpoint: "https://prometheus:9090/api/v1/otlp"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  otlphttp/loki:
    endpoint: "https://loki:3100/otlp"
    tls:
      ca_file: /etc/otel/certs/ca.pem
      cert_file: /etc/otel/certs/client_cert.pem
      key_file: /etc/otel/certs/client_key.pem
      insecure_skip_verify: false
  debug:
processors:
batch:
filter/metrics:
metrics:
exclude:
match_type: regexp
metric_names:
- go_.*
- process_.*
- misc_.*
- scrape_.*
- promhttp_.*
service:
  telemetry:
    metrics:
      readers:
        - periodic:
            interval: 15000
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: "https://prometheus:9090/api/v1/otlp/v1/metrics"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [prometheus/cadvisor, prometheus/node, otlp]
      processors: [batch, filter/metrics]
      exporters: [otlphttp/prometheus]
    logs/application:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    logs/system:
      receivers: [filelog/syslog]
      processors: [batch]
      exporters: [otlphttp/loki]
Deploying observers
There is no point in instrumenting Your code if You have nothing that
collects and transmits that telemetry.
That is why we first set up the "observer" stack on the same host where
the application will be deployed.
Now create a docker-compose.yml file and spin up the observers.
services:
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
container_name: otel-collector
environment:
SSL_CERT_FILE: /etc/otel/certs/ca.pem
command:
- --config=/etc/otel/config/otel-collector-config.yaml
volumes:
- ./ca.pem:/etc/otel/certs/ca.pem
- ./client_cert.pem:/etc/otel/certs/client_cert.pem
- ./client_key.pem:/etc/otel/certs/client_key.pem
- ./otel-collector-config.yaml:/etc/otel/config/otel-collector-config.yaml
- /var/log:/var/log:ro
ports:
- "4317:4317"
- "4318:4318"
networks:
- observers
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.52.1
container_name: cadvisor
volumes:
- /:/rootfs:ro
- /sys:/sys:ro
- /run/user/1000:/var/run:ro
- /home/<user>/bin/docker:/var/lib/docker:ro
- /dev/disk:/dev/disk:ro
- /etc/machine-id:/etc/machine-id
devices:
- /dev/kmsg
ports:
- "8080:8080"
networks:
- observers
networks:
observers:
In the next part, we set up the observability stack so we can receive telemetry from OTeL Collector.