OpenTelemetry Collector
StackState v6.0
The OpenTelemetry Collector offers a vendor-agnostic implementation to receive, process and export telemetry data. Applications instrumented with OpenTelemetry SDKs can use the collector to send telemetry data (traces and metrics) to StackState.
By default the collector receives this data via OTLP, the native OpenTelemetry protocol. It can also receive data in other formats provided by other instrumentation SDKs, such as Jaeger and Zipkin for traces, and Influx and Prometheus for metrics.
The collector usually runs close to your application, for example in the same Kubernetes cluster, which makes sending telemetry to it efficient.
StackState exposes an OTLP endpoint over gRPC and uses bearer tokens for authentication. Configuring your OpenTelemetry collector to send data to StackState therefore only requires standard collector configuration.
Pre-requisites
A Kubernetes cluster with an application that is instrumented with OpenTelemetry
An API key for StackState
Permissions to deploy the OpenTelemetry collector in a namespace on the cluster (that is, to create resources such as deployments and configmaps in that namespace). To enrich the data with Kubernetes attributes, permission to create a cluster role and role binding is also needed.
Kubernetes configuration and deployment
To install and configure the collector for use with StackState, we'll use the OpenTelemetry Collector Helm chart and add the configuration needed for StackState:
Configure the Helm chart
Generate metrics from traces
Send the data to StackState
Combine it all together in pipelines
Configure the collector
Here is the full values file needed. Continue reading below the file for an explanation of the different parts, or skip ahead to the next step, but make sure to replace:
<otlp-stackstate-endpoint> with the OTLP endpoint of your StackState instance. If, for example, you access StackState on play.stackstate.com, the OTLP endpoint is otlp-play.stackstate.com. Simply prefixing otlp- to the normal StackState URL will do.
<your-cluster-name> with the cluster name you configured in StackState. This must be the same cluster name used when installing the StackState agent. Using a different cluster name will result in an empty traces perspective for Kubernetes components.
The Kubernetes attributes and the span metrics namespace are required for StackState to provide full functionality.
The suggested configuration includes tail sampling for traces. Sampling can be fully customized and, depending on your applications and the volume of traces, you may need to change this configuration, for example by increasing or decreasing max_total_spans_per_second. It is highly recommended to keep sampling enabled to keep resource usage and cost under control.
The config section customizes the collector config itself and is discussed in the next section. The other parts are:
extraEnvsFrom: Sets environment variables from the specified secret. The next step creates this secret to store the StackState API key (Receiver or Ingestion API Key).
mode: Runs the collector as a Kubernetes deployment. When to use the other modes is discussed here.
ports: Enables the metrics port so that the collector can scrape its own metrics.
presets: Enables the default configuration for adding Kubernetes metadata as attributes. This includes Kubernetes labels and metadata such as namespace, pod and deployment. Enabling the metadata also introduces the cluster role and role binding mentioned in the prerequisites.
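As a rough illustration of the parts described above, the top level of the values file might look like the sketch below. The secret name open-telemetry-collector and the extractAllPodLabels setting are assumptions made for this example; adjust them to your own setup.

```yaml
# Sketch of the top-level Helm values (not the complete values file).
extraEnvsFrom:
  - secretRef:
      name: open-telemetry-collector  # assumed secret name, created in a later step
mode: deployment                      # run the collector as a Kubernetes deployment
ports:
  metrics:
    enabled: true                     # expose the collector's own metrics port
presets:
  kubernetesAttributes:
    enabled: true                     # adds the cluster role and role binding
    extractAllPodLabels: true         # assumption: also copy pod labels to attributes
# config: the collector configuration itself goes here (see the next section)
```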
Configuration
The service
section determines what components of the collector are enabled. The configuration for those components comes from the other sections (extensions, receivers, connectors, processors and exporters). The extensions
section enables:
health_check
, doesn't need additional configuration but adds an endpoint for Kubernetes liveness and readiness probesbearertokenauth
, this extension adds an authentication header to each request with the StackState API key. In its configuration, we can see it is getting the StackState API key from the environment variableAPI_KEY
.
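A minimal sketch of the extensions described above is shown here. The scheme value StackState is an assumption for this example and should be checked against your StackState installation; health_check is assumed to be defined by the Helm chart defaults, so it is only enabled here.

```yaml
config:
  extensions:
    bearertokenauth:
      scheme: StackState        # assumed authentication scheme
      token: "${env:API_KEY}"   # API key read from the API_KEY environment variable
  service:
    extensions:
      - health_check            # endpoint for liveness and readiness probes
      - bearertokenauth
```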
The pipelines section defines pipelines for traces and metrics. The metrics pipeline defines:
receivers: Receive metrics from instrumented applications (the otlp receiver, using the OTLP protocol), from spans (the spanmetrics connector) and by scraping Prometheus endpoints (the prometheus receiver). The collector Helm chart configures the prometheus receiver by default to scrape the collector's own metrics.
processors: The memory_limiter processor helps to prevent out-of-memory errors. The batch processor helps to compress the data better and reduces the number of outgoing connections required to transmit it. The resource processor adds additional resource attributes (discussed separately).
exporters: The debug exporter logs to stdout, which helps when troubleshooting. The otlp/stackstate exporter sends telemetry data to StackState using the OTLP protocol. It uses the bearertokenauth extension to authenticate against the StackState OTLP endpoint.
For traces, there are 3 connected pipelines:
traces: Receives traces from SDKs (via the otlp receiver) and does the initial processing using the same processors as for metrics. It exports into a router which routes all spans to both other traces pipelines. This setup makes it possible to calculate span metrics for all spans while applying sampling to the traces that are exported.
traces/spanmetrics: Uses the spanmetrics connector as an exporter to generate metrics from the spans (otel_span_duration and otel_span_calls). It is configured to stop reporting time series when no spans have been observed for 5 minutes. StackState expects the span metrics to be prefixed with otel_span_, which is taken care of by the namespace configuration.
traces/sampling: Exports traces to StackState using the OTLP protocol, but uses the tail sampling processor to keep the trace volume sent to StackState, and with it the cost, predictable. Sampling is discussed in a separate section.
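The three connected traces pipelines can be sketched roughly as below. The connector name routing/traces, its error_mode and the processor order are assumptions for this example; the 5 minute expiration and the otel_span namespace follow the description above.

```yaml
config:
  connectors:
    spanmetrics:
      metrics_expiration: 5m   # stop reporting series after 5 minutes without spans
      namespace: otel_span     # yields otel_span_duration and otel_span_calls
    routing/traces:            # assumed name for the router
      error_mode: ignore
      table:
        - statement: route()   # route every span to both pipelines
          pipelines: [traces/sampling, traces/spanmetrics]
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, resource, batch]
        exporters: [routing/traces]
      traces/spanmetrics:
        receivers: [routing/traces]
        processors: []
        exporters: [spanmetrics]
      traces/sampling:
        receivers: [routing/traces]
        processors: [tail_sampling, batch]
        exporters: [debug, otlp/stackstate]
```

Because the router feeds both downstream pipelines, span metrics are calculated over all spans, while only the sampled traces are exported.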
The resource processor is configured for both metrics and traces. It adds extra resource attributes:
k8s.cluster.name is added by providing the cluster name in the configuration. StackState needs the cluster name and OpenTelemetry does not have a consistent way of determining it. Because some SDKs, in some environments, provide a cluster name that does not match what StackState expects, the cluster name is set with an upsert (which overwrites any pre-existing value).
service.instance.id is added based on the pod uid. It is recommended to always provide a service instance id, and the pod uid is an easy way to get a unique identifier if the SDKs don't provide one.
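A sketch of a resource processor that matches this description is shown below. The insert action for service.instance.id is an assumption for this example; it only sets the attribute when the SDK has not provided one.

```yaml
config:
  processors:
    resource:
      attributes:
        - key: k8s.cluster.name
          action: upsert                 # overwrite any value set by the SDK
          value: <your-cluster-name>
        - key: service.instance.id
          from_attribute: k8s.pod.uid    # pod uid added by the Kubernetes attributes preset
          action: insert                 # assumed: keep an existing id if present
```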
Trace Sampling
It is highly recommended to use sampling for traces:
To manage resource usage by only processing and storing the most relevant traces
To keep costs manageable and predictable
To reduce noise and focus on the important traces only, for example by filtering out health checks
There are 2 approaches to sampling: head sampling and tail sampling. This OpenTelemetry docs page discusses the pros and cons of both approaches in detail. The collector configuration provided here uses tail sampling to support these requirements:
1. Have predictable cost by having a predictable trace volume
2. Have a large sample of all errors
3. Have a large sample of all slow traces
4. Have a sample of all other traces to see the normal application behavior
Requirements 2 and 3 can only be fulfilled by tail sampling. Let's now look at the sampling policies used in the configuration of the tail sampler:
There is only one top-level policy, a composite policy. It applies a rate limit of at most 500 spans per second (max_total_spans_per_second), giving a predictable trace volume. It uses other policies as sub-policies to make the actual sampling decisions.
The errors policy is of type status_code and is configured to only sample traces that contain errors. 33% of the rate limit is reserved for errors, via the rate_allocation section of the composite policy.
The slow-traces policy is of type latency and selects all traces slower than 1 second. 33% of the rate limit is reserved for slow traces.
The rest policy is of type always_sample. It samples all traces until it hits its share of the rate limit enforced by the composite policy, which is 34% of the total rate limit of 500.
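The policies described above could be expressed along the following lines. The top-level policy name rate-limited-composite and the decision_wait value are assumptions for this example; the sub-policy names, types and percentages follow the description.

```yaml
config:
  processors:
    tail_sampling:
      decision_wait: 10s                    # assumed wait before deciding per trace
      policies:
        - name: rate-limited-composite      # assumed name
          type: composite
          composite:
            max_total_spans_per_second: 500
            policy_order: [errors, slow-traces, rest]
            composite_sub_policy:
              - name: errors
                type: status_code
                status_code:
                  status_codes: [ERROR]
              - name: slow-traces
                type: latency
                latency:
                  threshold_ms: 1000        # slower than 1 second
              - name: rest
                type: always_sample
            rate_allocation:
              - policy: errors
                percent: 33                 # 165 of 500
              - policy: slow-traces
                percent: 33                 # 165 of 500
              - policy: rest
                percent: 34                 # 170 of 500
```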
There are many more policies available that can be added to the configuration when needed. For example, it is possible to filter traces based on certain attributes (only for a specific application or customer). The tail sampler can also be replaced with the probabilistic sampler. For all configuration options, refer to the documentation of the tail sampling and probabilistic sampler processors.
Create a secret for the API key
The collector needs a Kubernetes secret that holds the StackState API key. Create it in the namespace where the collector will be installed (this guide uses the open-telemetry namespace), replacing <stackstate-api-key> with your API key.
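One declarative way to create such a secret is a manifest like the sketch below, applied with kubectl apply -f. The secret name open-telemetry-collector is an assumption for this example and must match the name referenced under extraEnvsFrom in the values file. The key API_KEY becomes the environment variable that the bearertokenauth extension reads.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: open-telemetry-collector   # assumed name, must match extraEnvsFrom
  namespace: open-telemetry
type: Opaque
stringData:
  API_KEY: <stackstate-api-key>    # Receiver or Ingestion API Key
```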
StackState supports two types of keys:
Receiver API Key
Ingestion API Key
Receiver API Key
You can find the API key for StackState on the Kubernetes StackPack installation screen:
Open StackState
Navigate to StackPacks and select the Kubernetes StackPack
Open one of the installed instances
Scroll down to the first set of installation instructions. It shows the API key as STACKSTATE_RECEIVER_API_KEY in text and as 'stackstate.apiKey' in the command.
Ingestion API Key
StackState supports creating multiple Ingestion Keys. This allows you to assign a unique key to each OpenTelemetry Collector for better security and access control. For instructions on generating an Ingestion API Key, refer to the documentation page.
Deploy the collector
To deploy the collector, first make sure you have the OpenTelemetry Helm charts repository configured:
Now install the collector, using the configuration defined in the previous steps:
Configure applications
The collector, as configured now, is ready to receive and send telemetry data. The only thing left to do is to update the SDK configuration of your applications so that they send their telemetry via the collector to StackState.
Use the generic configuration for the SDKs to export data to the collector. Follow the language-specific instrumentation instructions to enable the SDK for your applications.
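As an illustration, the standard OpenTelemetry SDK environment variables can be set on an application container as sketched below. The service address is an assumption: it presumes a Helm release named opentelemetry-collector in the open-telemetry namespace and the default OTLP gRPC port 4317. The service name my-service is a placeholder.

```yaml
# Fragment of a container spec in an application Deployment.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    # assumed collector service address, depends on your Helm release name
    value: http://opentelemetry-collector.open-telemetry.svc.cluster.local:4317
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc
  - name: OTEL_SERVICE_NAME
    value: my-service              # placeholder service name
```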
Related resources
The OpenTelemetry documentation provides many more details on the configuration and alternative installation options:
Open Telemetry Collector configuration: https://opentelemetry.io/docs/collector/configuration/
Kubernetes installation of the collector: https://opentelemetry.io/docs/kubernetes/helm/collector/
Using the Kubernetes operator instead of the collector Helm chart: https://opentelemetry.io/docs/kubernetes/operator/
Open Telemetry sampling: https://opentelemetry.io/blog/2022/tail-sampling/