To assist in monitoring distributed systems with a defined SLO (Service Level Objective), StackState can be configured to alert you if an SLI (Service Level Indicator) falls below a defined threshold. StackState Agent V2 deployed on a Linux host will retrieve telemetry that can be used to monitor the four golden signals (sre.google). These metrics can then be used to build a check in StackState that responds to fluctuations in service level.
The checks described on this page do not ensure that you are meeting your SLO directly, but they can help prevent an SLO violation by catching and alerting on changes in your SLIs as soon as possible.
To work with golden signals and the checks described on this page, you need to have:
To monitor the time it takes to service a request, StackState supports metric streams for HTTP response time for processes and services.
HTTP total response time (s)
- success (100-399)
By default, the following response time streams are set for processes and services that serve HTTP requests:
- HTTP total response time (s) (95th percentile)
- HTTP 5xx error response time (s) (95th percentile)
- HTTP 4xx error response time (s) (95th percentile)
- HTTP Success response time (s) (95th percentile)
HTTP response code
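The grouping and the 95th percentile behind these streams can be illustrated with a short Python sketch. It is not StackState Agent code: the function names, the input format and the nearest-rank percentile method are assumptions chosen for illustration, and only the group boundaries (success is 100-399, plus the 4xx and 5xx error groups) come from this page.

```python
import math
from collections import defaultdict


def status_group(code):
    """Map an HTTP status code to one of the groups named on this page."""
    if 100 <= code <= 399:
        return "success"
    if 400 <= code <= 499:
        return "4xx"
    if 500 <= code <= 599:
        return "5xx"
    raise ValueError(f"not an HTTP status code: {code}")


def percentile(samples, pct):
    """Nearest-rank percentile (assumed method, for illustration only)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Smallest rank such that at least pct% of the samples are at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_group(requests):
    """requests: iterable of (status_code, response_time_seconds) pairs."""
    grouped = defaultdict(list)
    for code, seconds in requests:
        grouped["total"].append(seconds)
        grouped[status_group(code)].append(seconds)
    return {group: percentile(times, 95) for group, times in grouped.items()}


if __name__ == "__main__":
    sample = [(200, 0.1), (200, 0.2), (404, 0.3), (500, 1.5)]
    print(p95_by_group(sample))
```

Every request counts towards the total and towards exactly one group, which mirrors the four default streams listed above.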
As with latency, StackState Agent V2 supports a telemetry stream for traffic: http_requests_per_second. The same response codes and predefined groups are also supported for the traffic stream.
By default, the following request rate streams are set for processes and services that serve HTTP requests:
- HTTP total rate (req/s)
- HTTP 5xx error rate (req/s)
- HTTP 4xx error rate (req/s)
- HTTP Success rate (req/s)
HTTP total requests per second
StackState allows you to monitor any specific HTTP error code or one of the 4xx or 5xx error groups, as explained above. If your SLO specifies a limit for the rate of errors in your system, you can add a check.
HTTP 5xx error rate
There are many ways StackState can help monitor the saturation of a system, for example:
- HTTP Requests per second
- CPU usage
- Memory usage
The Error percentage check function can be used to monitor two streams: one reporting errors and one reporting a total. A CRITICAL health state will be returned if the percentage of errors/total crosses the specified threshold.
If your SLO defines that a service can have a maximum of 5% of requests failing, you can create a check using the Error percentage function and set the critical threshold to 5%.
Error percentage check
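The logic of such a check can be sketched in Python. This is an illustration under stated assumptions, not the StackState check function itself: the function and parameter names are hypothetical, the CLEAR state for a healthy result is assumed, and treating "crosses" as "greater than or equal to" is also an assumption.

```python
def error_percentage_state(error_rates, total_rates, critical_threshold_pct):
    """Compare errors to total over a window of samples from two streams.

    error_rates / total_rates: request rates (req/s) sampled over the same
    window, for example the 5xx error rate and the total rate.
    """
    errors = sum(error_rates)
    total = sum(total_rates)
    if total == 0:
        # Assumption: with no traffic there is nothing to judge.
        return "CLEAR"
    percentage = 100.0 * errors / total
    # Assumption: the threshold counts as crossed when it is reached.
    return "CRITICAL" if percentage >= critical_threshold_pct else "CLEAR"


if __name__ == "__main__":
    # SLO: at most 5% of requests may fail.
    print(error_percentage_state([1, 0, 1], [20, 20, 20], 5))  # about 3.3% errors
    print(error_percentage_state([2, 1, 1], [20, 20, 20], 5))  # about 6.7% errors
```

Summing both streams over the window before dividing avoids overreacting to a single sample with very little traffic.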
The Greater than or equal check function can alert you when one of your telemetry streams is above a certain threshold. A DEVIATING or CRITICAL health state will be returned if the specified deviating or critical threshold is reached or exceeded.
Use this function to make sure you meet your SLO for maximum response time, for example:
Response time check
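A minimal Python sketch of the comparison such a check performs is given here. The function name, the parameter names, the example thresholds and the CLEAR state for a healthy result are assumptions for illustration; the StackState check function itself is configured in the StackState UI rather than written like this.

```python
def greater_than_or_equal_state(value, deviating_threshold, critical_threshold):
    """Return a health state for one stream value, most severe match first."""
    if value >= critical_threshold:
        return "CRITICAL"
    if value >= deviating_threshold:
        return "DEVIATING"
    return "CLEAR"


if __name__ == "__main__":
    # Hypothetical SLO: 95th percentile response time stays below 1.0 s,
    # with an early warning at 0.5 s.
    for p95_seconds in (0.2, 0.5, 1.2):
        print(p95_seconds, greater_than_or_equal_state(p95_seconds, 0.5, 1.0))
```

Setting the deviating threshold below the SLO limit gives an early warning before the SLO itself is violated.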
Note that StackState Agent can only report on the request rate and response time of the HTTP/1 protocol. The HTTP/2, HTTP/3 and HTTPS protocols are not currently supported.