Golden signals
StackState Self-hosted v4.5.x
This page describes StackState v4.5.x. The StackState 4.5 version range is End of Life (EOL) and no longer supported. We encourage customers still running the 4.5 version range to upgrade to a more recent release.
To assist in monitoring distributed systems with a defined SLO (Service Level Objective), StackState can be configured to alert you if an SLI (Service Level Indicator) falls below a defined threshold. StackState Agent V2 deployed on a Linux host will retrieve telemetry that can be used to monitor the four golden signals (sre.google). These metrics can then be used to build a check in StackState that responds to fluctuations in service level.
The checks described on this page do not ensure that you are meeting your SLO directly, but they can help prevent an SLO violation by catching and alerting on changes in your SLIs as soon as possible.
To work with golden signals and the checks described on this page, you need to have:
The StackState Agent V2 StackPack installed in StackState.
StackState Agent V2 version 2.12 or higher running on a Linux host with a network tracer.
To monitor the time it takes to service a request, StackState supports metric streams for HTTP response time for processes and services.
To add a latency stream, add a telemetry stream and select the following metric: http_response_time_seconds. You can filter the stream on any HTTP response code or any of the predefined groups:
any
success (100-399)
1xx
2xx
3xx
4xx
5xx
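As a rough illustration of how the predefined groups above partition response codes, the following Python sketch tests whether a status code belongs to a group. The function name matches_group is hypothetical and is not part of StackState; only the group names and the success range (100-399) come from the list above.

```python
def matches_group(status_code: int, group: str) -> bool:
    """Return True if an HTTP status code falls in the named filter group.

    Illustrative only: this mirrors the group definitions listed above,
    not StackState's internal implementation.
    """
    if group == "any":
        return True
    if group == "success":
        # The success group is documented as covering 100-399.
        return 100 <= status_code <= 399
    if group in ("1xx", "2xx", "3xx", "4xx", "5xx"):
        # The first digit of the group name is the status class.
        return status_code // 100 == int(group[0])
    raise ValueError(f"unknown group: {group}")
```

Under this reading, a 301 redirect counts as a success, while a 404 matches only the 4xx and any groups.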
By default, the following response time streams are set for processes and services that serve HTTP requests:
HTTP total response time (s) (95th percentile)
HTTP 5xx error response time (s) (95th percentile)
HTTP 4xx error response time (s) (95th percentile)
HTTP Success response time (s) (95th percentile)
Similar to measuring latency, StackState Agent V2 supports the http_requests_per_second telemetry stream for monitoring traffic. The same response codes and predefined groups are supported for the traffic stream.
By default, the following request rate streams are set for processes and services that serve HTTP requests:
HTTP total rate (req/s)
HTTP 5xx error rate (req/s)
HTTP 4xx error rate (req/s)
HTTP Success rate (req/s)
StackState allows you to monitor any specific HTTP error code or one of the 4xx or 5xx error groups, as explained above. If your SLO specifies a limit for the rate of errors in your system, you can add a check on the corresponding stream.
There are many ways StackState can help monitor the saturation of a system, for example:
HTTP Requests per second
CPU usage
Memory usage
To help you meet your SLA (Service Level Agreement), you can create checks in StackState. Examples of using a check function to monitor error percentage and response time are given below.
When selecting a metric stream for a health check, you will have some options to configure its behavior:
Windowing method - See the windowing method documentation for details.
Aggregation - The following aggregation methods are available:
MEAN - mean
PERCENTILE_25 - 25th percentile
PERCENTILE_50 - 50th percentile
PERCENTILE_75 - 75th percentile
PERCENTILE_90 - 90th percentile
PERCENTILE_95 - 95th percentile
PERCENTILE_98 - 98th percentile
PERCENTILE_99 - 99th percentile
MAX - maximum
MIN - minimum
SUM - sum
EVENT_COUNT - the number of occurrences during the bucket interval
SUM_NO_ZEROS - sum of the values (missing values from a data source won't be filled with zeros)
EVENT_COUNT_NO_ZEROS - the number of occurrences during the bucket interval (missing values from a data source won't be filled with zeros)
Time window (or window size) - By default, the time window is 300000 milliseconds (5 minutes). The time window directly influences the number of false positive and false negative alerts. The longer the time window, the less sensitive the check will be to short-lived changes. If the window is too short, brief spikes can trigger a burst of unwanted alerts, which will not help you meet your SLO. Balance the time window against the metric being monitored and how early you want to be alerted to spikes.
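To make the interaction between aggregation method and time window concrete, here is a minimal Python sketch that aggregates samples inside a window. It is an illustration under stated assumptions, not StackState's implementation: the function names, the (timestamp_ms, value) sample format, and the nearest-rank percentile method are all hypothetical, and only a subset of the aggregation methods is covered.

```python
import math

DEFAULT_WINDOW_MS = 300_000  # 5 minutes, the default stated above


def percentile(values, pct):
    """Nearest-rank percentile (assumed method; the product may interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate_window(samples, now_ms, method, window_ms=DEFAULT_WINDOW_MS):
    """Aggregate (timestamp_ms, value) samples in the window ending at now_ms."""
    values = [v for ts, v in samples if now_ms - window_ms < ts <= now_ms]
    if method == "EVENT_COUNT":
        return len(values)
    if not values:
        return None  # nothing to aggregate in this window
    if method == "MEAN":
        return sum(values) / len(values)
    if method == "MAX":
        return max(values)
    if method == "MIN":
        return min(values)
    if method == "SUM":
        return sum(values)
    if method.startswith("PERCENTILE_"):
        return percentile(values, int(method.split("_")[1]))
    raise ValueError(f"unsupported aggregation in this sketch: {method}")


# Four normal response times (0.25 s) followed by one 2.0 s spike.
samples = [(60_000, 0.25), (120_000, 0.25), (180_000, 0.25),
           (240_000, 0.25), (300_000, 2.0)]
five_minute_mean = aggregate_window(samples, 300_000, "MEAN")
one_minute_mean = aggregate_window(samples, 300_000, "MEAN", window_ms=60_000)
```

With these made-up samples, the 5-minute mean is 0.6 while the 1-minute mean is 2.0: the longer window dampens the spike, and the shorter window reports it at full strength.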
The Error percentage check function can be used to monitor two streams: one reporting errors and one reporting a total. A DEVIATING or CRITICAL health state will be returned if the percentage of errors/total crosses the specified DeviatingThresholdPercentage or CriticalThresholdPercentage.
If your SLO defines that a service can have a maximum of 5% of requests failing, you can create a check using the Error percentage function and set the CriticalThresholdPercentage to 5.0.
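The logic of an error percentage check can be sketched in Python as follows. This is not the StackState check function itself; the function name, the CLEAR state used for the healthy case, the handling of zero traffic, and the choice of "greater than or equal" for crossing a threshold are assumptions made for illustration. The parameter names echo DeviatingThresholdPercentage and CriticalThresholdPercentage from the description above.

```python
def error_percentage_state(error_count, total_count,
                           deviating_threshold_percentage,
                           critical_threshold_percentage):
    """Map an error/total ratio to a health state (illustrative sketch)."""
    if total_count <= 0:
        # Assumption: with no traffic there is nothing to alert on.
        return "CLEAR"
    percentage = 100.0 * error_count / total_count
    # Check the stricter threshold first so CRITICAL wins over DEVIATING.
    if percentage >= critical_threshold_percentage:
        return "CRITICAL"
    if percentage >= deviating_threshold_percentage:
        return "DEVIATING"
    return "CLEAR"


# SLO: at most 5% failing requests. The 2.5 warning level is a made-up value.
state = error_percentage_state(error_count=6, total_count=100,
                               deviating_threshold_percentage=2.5,
                               critical_threshold_percentage=5.0)
```

Setting the deviating threshold below the SLO limit, as in this sketch, gives an early warning before the 5% limit itself is reached.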
The Greater than or equal check function can alert you when one of your telemetry streams is above a certain threshold. A DEVIATING or CRITICAL health state will be returned if the specified DeviatingThreshold or CriticalThreshold is crossed.
Use this function to make sure you meet your SLO for maximum response time, for example by applying it to one of the 95th percentile HTTP response time streams.
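A threshold check of this kind can be sketched in Python as shown below. The function name, the CLEAR state for the healthy case, and the example threshold values are hypothetical; the sketch only mirrors the described behavior of comparing a stream value against DeviatingThreshold and CriticalThreshold, and is not StackState's own check function.

```python
def greater_than_or_equal_state(value, deviating_threshold, critical_threshold):
    """Return a health state for a single aggregated stream value (sketch)."""
    # Check the stricter threshold first so CRITICAL wins over DEVIATING.
    if value >= critical_threshold:
        return "CRITICAL"
    if value >= deviating_threshold:
        return "DEVIATING"
    return "CLEAR"


# Example: an aggregated 95th percentile response time of 0.8 seconds,
# checked against made-up thresholds of 0.5 s (warn) and 1.0 s (SLO limit).
state = greater_than_or_equal_state(0.8, deviating_threshold=0.5,
                                    critical_threshold=1.0)
```

In this example the value sits between the two thresholds, so the sketch reports DEVIATING: the response time is degraded but still within the assumed 1.0 second limit.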
The metrics described on this page are gathered by the StackState Agent and can be disabled. Refer to the StackState Agent documentation for more information. Currently, the StackState Agent can only report request rate and response time for the HTTP/1 protocol. HTTP/2, HTTP/3 and HTTPS are not yet supported.