Agent check API
StackState Self-hosted v5.1.x
Overview
The Agent check API can be used to create checks that run on the StackState Agent. This page explains how to work with the Agent check API to write checks that send topology, metrics, events and service status information to StackState.
Code examples for the open source StackState Agent checks can be found on GitHub at: https://github.com/StackVista/stackstate-agent-integrations.
Agent checks
Agent 2.18 introduced Agent Check V2 (AgentCheckV2), which differs from the historic Agent Checks in two key ways:
A V2 Agent Check must return a value in the form of a CheckResponse.
V2 adds two new check base classes: StatefulAgentCheck and TransactionalAgentCheck.
Agent Check V2 (Agent 2.18+)
An Agent Check is a Python class that inherits from AgentCheckV2 and implements the check method.
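A minimal sketch of such a class is shown below. So that it runs without the Agent installed, AgentCheckV2 and CheckResponse are local stand-ins. Their constructor arguments and fields are assumptions for illustration, and a real check would import both from the stackstate_checks package instead.

```python
class CheckResponse:
    """Stand-in for the real CheckResponse (fields are assumptions)."""

    def __init__(self, check_error=None):
        self.check_error = check_error


class AgentCheckV2:
    """Stand-in base class; the constructor arguments are assumed."""

    def __init__(self, name, init_config, instances):
        self.name = name
        self.init_config = init_config
        self.instances = instances


class MyCheck(AgentCheckV2):
    def check(self, instance):
        # Collect and submit data for this instance here.
        # A V2 check always returns a CheckResponse.
        return CheckResponse()


check = MyCheck("my_check", {}, [{"url": "http://localhost"}])
response = check.check(check.instances[0])
```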
Error Handling
In the event of a check error, the exception should be returned as part of the check response:
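The sketch below illustrates this pattern. The check_error field name is an assumption, and both classes are local stand-ins so that the example runs on its own.

```python
class CheckResponse:
    """Stand-in for the real CheckResponse; check_error is an assumed field name."""

    def __init__(self, check_error=None):
        self.check_error = check_error


class AgentCheckV2:
    """Stand-in base class."""


class MyCheck(AgentCheckV2):
    def check(self, instance):
        try:
            url = instance["url"]  # raises KeyError when 'url' is not configured
            # ... query `url` and submit data here ...
        except Exception as e:
            # Hand the exception back to the Agent instead of raising it.
            return CheckResponse(check_error=e)
        return CheckResponse()


ok = MyCheck().check({"url": "http://localhost"})
failed = MyCheck().check({})
```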
A more comprehensive example can be found in the StackState Agent Integrations repo
StatefulAgentCheck (Agent 2.18+)
A Stateful Agent Check is a Python class that inherits from StatefulAgentCheck and implements the stateful_check method. It is intended for Agent checks that need to persist data across check runs and keep it available in the event of Agent failure. If an Agent failure occurs, the persisted state is used in the next check run. Persistent state is saved even if the check fails. The stateful_check method receives the current persistent state as an input parameter. The persistent_state parameter of the CheckResponse return value is then set as the new persistent state.
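A minimal sketch of this state round trip, using local stand-in classes. The argument order of stateful_check and the use of a dictionary as state are assumptions. The loop at the end only simulates the Agent passing the returned state into the next run.

```python
class CheckResponse:
    """Stand-in for the real CheckResponse (fields are assumptions)."""

    def __init__(self, persistent_state=None, check_error=None):
        self.persistent_state = persistent_state
        self.check_error = check_error


class StatefulAgentCheck:
    """Stand-in base class."""


class MyStatefulCheck(StatefulAgentCheck):
    def stateful_check(self, instance, persistent_state):
        # persistent_state holds whatever the previous run returned.
        runs = persistent_state.get("runs", 0) + 1
        return CheckResponse(persistent_state={"runs": runs})


# Simulate the Agent feeding the stored state back in on each run.
check = MyStatefulCheck()
state = {}
for _ in range(2):
    state = check.stateful_check({}, state).persistent_state
```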
A more comprehensive example can be found in the StackState Agent Integrations repo
TransactionalAgentCheck (Agent 2.18+)
A Transactional Agent Check is a Python class that inherits from TransactionalAgentCheck and implements the transactional_check method. It is intended for Agent checks that require transactional behavior when updating their state. An Agent Check transaction is considered a success if the data submitted by the Agent Check reaches StackState. This ensures that a check never processes or submits data that StackState has already received. Persistent state is saved even if the check fails, while transactional state is only saved once a transaction has succeeded. The transactional_check method receives the current transactional and persistent state as input parameters. The transactional_state and persistent_state parameters of the CheckResponse return value are then set as the corresponding new state values.
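The sketch below shows one way the two kinds of state could be used together: an offset kept in transactional state so that already delivered events are skipped, and a run counter kept in persistent state. The classes are local stand-ins, and the argument order, the dictionary state and the SOURCE_EVENTS list are assumptions for illustration.

```python
class CheckResponse:
    """Stand-in for the real CheckResponse (fields are assumptions)."""

    def __init__(self, transactional_state=None, persistent_state=None, check_error=None):
        self.transactional_state = transactional_state
        self.persistent_state = persistent_state
        self.check_error = check_error


class TransactionalAgentCheck:
    """Stand-in base class."""


SOURCE_EVENTS = ["e1", "e2", "e3"]  # hypothetical data source


class MyTransactionalCheck(TransactionalAgentCheck):
    def transactional_check(self, instance, transactional_state, persistent_state):
        # Only process events that were not part of a successful transaction.
        offset = transactional_state.get("offset", 0)
        new_events = SOURCE_EVENTS[offset:]
        # ... submit new_events to StackState here ...
        return CheckResponse(
            transactional_state={"offset": offset + len(new_events)},
            persistent_state={"runs": persistent_state.get("runs", 0) + 1},
        )


check = MyTransactionalCheck()
first = check.transactional_check({}, {}, {})
# Assuming the first transaction succeeded, the next run starts at the new offset.
second = check.transactional_check({}, first.transactional_state, first.persistent_state)
```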
A more comprehensive example can be found in the StackState Agent Integrations repo
Agent Check (To be deprecated)
An Agent Check is a Python class that inherits from AgentCheck and implements the check method.
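A minimal sketch of a check in this older style. AgentCheck is a local stand-in with assumed constructor arguments. In a real check it would be imported from stackstate_checks.base. Unlike a V2 check, the check method does not return a CheckResponse.

```python
class AgentCheck:
    """Stand-in base class; the constructor arguments are assumed."""

    def __init__(self, name, init_config, instances):
        self.name = name
        self.init_config = init_config
        self.instances = instances


class MyCheck(AgentCheck):
    def check(self, instance):
        # Collect and submit data for this instance; no return value is needed.
        self.last_url = instance.get("url")


check = MyCheck("my_check", {}, [{"url": "http://localhost"}])
result = check.check(check.instances[0])
```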
Agent Checks (all)
The Agent creates an object of type MyCheck for each element contained in the instances sequence of the corresponding Agent Check configuration file.
Each mapping included in the instances section of the Agent Check configuration file is passed to the check method as the instance argument.
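The relationship between the configuration file and the check objects can be sketched as follows. The config dictionary is a hypothetical, already parsed equivalent of a configuration file with two instances, and the loop only imitates what the Agent does. The keys url, timeout and default_timeout are made up for this example.

```python
# Hypothetical parsed configuration: one init_config section, two instances.
config = {
    "init_config": {"default_timeout": 5},
    "instances": [
        {"url": "http://localhost:8080"},
        {"url": "http://localhost:9090", "timeout": 10},
    ],
}


class MyCheck:
    """Simplified check without the Agent base class."""

    def __init__(self, name, init_config, instances):
        self.name = name
        self.init_config = init_config
        self.instances = instances

    def check(self, instance):
        # Instance settings override the shared init_config defaults.
        timeout = instance.get("timeout", self.init_config["default_timeout"])
        return instance["url"], timeout


# One check object per element of the instances sequence.
results = []
for instance in config["instances"]:
    check = MyCheck("my_check", config["init_config"], [instance])
    results.append(check.check(instance))
```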
The AgentCheck, AgentCheckV2, StatefulAgentCheck and TransactionalAgentCheck classes provide the following methods and attributes:
self.name - the name of the check.
self.init_config - the init_config section of the check configuration.
self.log - a Python logger (python.org).
Scheduling
Multiple instances of the same check can run concurrently. If a check is already running, it isn't necessary to schedule another one.
Send data
Topology
Topology elements can be sent to StackState with the following methods:
self.component - Create a component in StackState. See send components.
self.relation - Create a relation between two components in StackState. See send relations.
self.start_snapshot() - Start a topology snapshot for a specific topology instance source.
self.stop_snapshot() - Stop a topology snapshot for a specific topology instance source.
Send components
Components can be sent to StackState using the self.component(id, type, data) method.
The method requires the following details:
id - string. A unique ID for this component. This has to be unique for this instance.
type - string. The name of the component type.
data - dictionary. A JSON blob of arbitrary data. The fields within this object can be referenced in the ComponentTemplateFunction and the RelationTemplateFunction within StackState.
See the example of creating a component in StackState in the StackState MySQL check (github.com).
All submitted topology elements are collected and flushed together with all the other Agent metrics at the end of the check function.
Send relations
Relations can be sent to StackState using the self.relation(source_id, target_id, type, data) method.
The method requires the following details:
source_id - string. The source component externalId.
target_id - string. The target component externalId.
type - string. The type of relation.
data - dictionary. A JSON blob of arbitrary data. The fields within this object can be referenced in the ComponentTemplateFunction and the RelationTemplateFunction within StackState.
See the example of creating a relation in StackState in the StackState SAP check (github.com).
All submitted topology elements are collected and flushed together with all the other Agent metrics at the end of the check function.
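The following sketch sends two components and one relation inside a topology snapshot. AgentCheck is a local stand-in that only records the calls, so the example runs without the Agent. The identifiers, type names and the relation type IS_HOSTED_ON are made up for illustration.

```python
class AgentCheck:
    """Stand-in that records topology calls instead of sending them."""

    def __init__(self):
        self.submitted = []

    def start_snapshot(self):
        self.submitted.append(("start_snapshot",))

    def stop_snapshot(self):
        self.submitted.append(("stop_snapshot",))

    def component(self, id, type, data):
        self.submitted.append(("component", id, type, data))

    def relation(self, source_id, target_id, type, data):
        self.submitted.append(("relation", source_id, target_id, type, data))


class TopologyCheck(AgentCheck):
    def check(self, instance):
        host_id = "urn:example:/host:this-host"
        app_id = "urn:example:/application:some-app"
        self.start_snapshot()
        self.component(host_id, "Host", {"name": "this-host"})
        self.component(app_id, "Application", {"name": "some-app"})
        # The relation points from the application to the host it runs on.
        self.relation(app_id, host_id, "IS_HOSTED_ON", {})
        self.stop_snapshot()


check = TopologyCheck()
check.check({})
```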
Metrics
Metrics can be sent to StackState with the following methods:
self.gauge - Sample a gauge metric.
self.count - Sample a raw count metric.
self.rate - Sample a point, with the rate calculated at the end of the check.
self.increment - Increment a counter metric.
self.decrement - Decrement a counter metric.
self.histogram - Sample a histogram metric.
self.historate - Sample a histogram based on rate metrics.
self.monotonic_count - Sample an increasing counter metric.
Each method accepts the following metric details:
name - the name of the metric.
value - the value for the metric. Defaults to 1 on increment, -1 on decrement.
tags - optional. A list of tags to associate with this metric.
hostname - optional. A hostname to associate with this metric. Defaults to the current host.
All submitted metrics are collected and flushed with all the other Agent metrics at the end of the check function.
For an example of sending metrics, see the StackState MySQL check (github.com).
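As a rough illustration of the metric methods and their arguments, the sketch below uses a local stand-in for AgentCheck that records submissions instead of sending them. Only three of the methods are stubbed, and the metric names and tags are made up.

```python
class AgentCheck:
    """Stand-in that records metric submissions instead of sending them."""

    def __init__(self):
        self.metrics = []

    def _submit(self, kind, name, value, tags, hostname):
        self.metrics.append((kind, name, value, tags or [], hostname))

    def gauge(self, name, value, tags=None, hostname=None):
        self._submit("gauge", name, value, tags, hostname)

    def increment(self, name, value=1, tags=None, hostname=None):
        self._submit("increment", name, value, tags, hostname)

    def monotonic_count(self, name, value, tags=None, hostname=None):
        self._submit("monotonic_count", name, value, tags, hostname)


class MetricsCheck(AgentCheck):
    def check(self, instance):
        tags = ["queue:orders"]
        self.gauge("example.queue.size", 42, tags=tags)
        self.increment("example.requests")  # value defaults to 1
        self.monotonic_count("example.bytes_sent", 1024, tags=tags)


check = MetricsCheck()
check.check({})
```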
Events
Events can be sent to StackState with the self.event(event_dict) method.
The event_dict is a valid event JSON dictionary.
Note that msg_title and msg_text are required fields from Agent V2.11.0.
All events are collected and flushed with the rest of the Agent payload at the end of the check function.
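A minimal sketch of sending an event. Only msg_title and msg_text are named on this page. The other keys (timestamp, event_type, tags) are assumptions about a typical event dictionary, and AgentCheck is a local stand-in that records events.

```python
import time


class AgentCheck:
    """Stand-in that records events instead of sending them."""

    def __init__(self):
        self.events = []

    def event(self, event_dict):
        self.events.append(event_dict)


class EventCheck(AgentCheck):
    def check(self, instance):
        self.event({
            "timestamp": int(time.time()),  # assumed key
            "event_type": "example_backup",  # assumed key
            "msg_title": "Backup finished",  # required
            "msg_text": "Nightly backup completed without errors.",  # required
            "tags": ["job:backup"],  # assumed key
        })


check = EventCheck()
check.check({})
```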
Status (Agent Check only)
The status of a service is reported by calling the service_check method.
The method can accept the following arguments:
name - the name of the service check.
status - a constant describing the service status, defined in the AgentCheck class:
AgentCheck.OK for success status.
AgentCheck.WARNING for warning status.
AgentCheck.CRITICAL for failure status.
AgentCheck.UNKNOWN for indeterminate status.
tags - a list of tags to associate with the check. (optional)
message - additional information about the current status. (optional)
Check the usage in the following example.
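A sketch of how service_check could be called is shown here. AgentCheck is a local stand-in, the numeric values of the status constants are assumptions, and the service check name and the reachable flag are made up for the example.

```python
class AgentCheck:
    """Stand-in; the numeric values of the constants are assumed."""

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def __init__(self):
        self.service_checks = []

    def service_check(self, name, status, tags=None, message=None):
        self.service_checks.append((name, status, tags or [], message))


class ConnectCheck(AgentCheck):
    def check(self, instance):
        tags = ["url:%s" % instance["url"]]
        # 'reachable' is a hypothetical flag standing in for a real connection test.
        if instance.get("reachable"):
            self.service_check("example.can_connect", AgentCheck.OK, tags=tags)
        else:
            self.service_check(
                "example.can_connect",
                AgentCheck.CRITICAL,
                tags=tags,
                message="Cannot connect to %s" % instance["url"],
            )


check = ConnectCheck()
check.check({"url": "http://localhost", "reachable": False})
```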
The service_check method will be fully deprecated in Agent Check V2 in favour of the CheckResponse.
Health
Health information can be sent to StackState with the following methods:
self.health.check_state - send a check state as part of a snapshot.
self.health.start_snapshot() - start a health snapshot. StackState will only process health information if it's sent as part of a snapshot.
self.health.stop_snapshot() - stop the snapshot, signaling that all submitted data is complete. This should be done at the end of the check after all data has been submitted. If exceptions occur in the check or not all data can be produced for some other reason, this function should not be called.
Set up a health stream
To make the self.health API available, override the get_health_stream function to define a URN identifier for the health synchronization stream.
The HealthStream class has the following options:
urn - HealthStreamUrn. The stream urn under which the health information will be grouped.
sub_stream - string. Optional. Allows for separating disjoint data sources within a single health synchronization stream. For example, the data for the streams is reported separately from different hosts.
repeat_interval_seconds - integer. Optional. The interval with which data will be repeated, defaults to collection_interval (min_collection_interval for Agent V2.14.x or earlier). This allows StackState to detect when data arrives later than expected.
expiry_seconds - integer. Optional. The time after which all data from the stream or substream should be removed. Set to '0' to disable expiry (this is only possible when the sub_stream parameter is omitted). Default 4*repeat_interval_seconds.
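A sketch of a get_health_stream override that sets the options listed above. HealthStream and HealthStreamUrn are local stand-ins. The keyword names follow the option list, while the two arguments of HealthStreamUrn and the instance argument of get_health_stream are assumptions.

```python
class HealthStreamUrn:
    """Stand-in; the (source, stream_id) arguments are assumed."""

    def __init__(self, source, stream_id):
        self.source = source
        self.stream_id = stream_id


class HealthStream:
    """Stand-in; keyword names follow the options listed on this page."""

    def __init__(self, urn, sub_stream=None, repeat_interval_seconds=None, expiry_seconds=None):
        self.urn = urn
        self.sub_stream = sub_stream
        self.repeat_interval_seconds = repeat_interval_seconds
        self.expiry_seconds = expiry_seconds


class AgentCheck:
    """Stand-in base class."""


class ExampleCheck(AgentCheck):
    def get_health_stream(self, instance):
        return HealthStream(
            urn=HealthStreamUrn("example", "health"),
            sub_stream=instance.get("host"),  # one sub stream per reporting host
            repeat_interval_seconds=30,
            expiry_seconds=120,
        )


stream = ExampleCheck().get_health_stream({"host": "this-host"})
```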
For more information on URNs, health synchronization streams, snapshots and how to debug, see Health Synchronization.
Send check states
Check states can be sent to StackState using the self.health.check_state(check_state_id, name, health_value, topology_element_identifier, message) method.
The method requires the following details:
check_state_id - string. Uniquely identifies the check state within the (sub)stream.
name - string. Display name for the health check state.
health_value - Health. The StackState health value, can be CLEAR, DEVIATING or CRITICAL.
topology_element_identifier - string. The component or relation identifier that the check state should bind to. The check state will be associated with all components/relations that have the specified identifier.
message - string. Optional. Extended message to display with the health state. Supports Markdown.
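The sketch below sends a single check state inside a health snapshot, using the parameters listed above. Health, the health API object and AgentCheck are local stand-ins that record calls, and the identifiers are made up for illustration.

```python
class Health:
    """Stand-in for the health value constants."""

    CLEAR = "CLEAR"
    DEVIATING = "DEVIATING"
    CRITICAL = "CRITICAL"


class HealthApi:
    """Stand-in for self.health that records calls."""

    def __init__(self):
        self.calls = []

    def start_snapshot(self):
        self.calls.append(("start_snapshot",))

    def stop_snapshot(self):
        self.calls.append(("stop_snapshot",))

    def check_state(self, check_state_id, name, health_value,
                    topology_element_identifier, message=None):
        self.calls.append(("check_state", check_state_id, name, health_value,
                           topology_element_identifier, message))


class AgentCheck:
    def __init__(self):
        self.health = HealthApi()


class HealthCheck(AgentCheck):
    def check(self, instance):
        self.health.start_snapshot()
        self.health.check_state(
            "disk_space_this-host",
            "Disk space",
            Health.DEVIATING,
            "urn:example:/host:this-host",
            message="Disk is **85%** full",  # Markdown is supported
        )
        # Reached only when everything above succeeded without an exception.
        self.health.stop_snapshot()


check = HealthCheck()
check.check({})
```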
For an example of how to send a check state, see the StackState Static Health check (github.com).
Checks and streams
Streams and health checks can be sent to StackState together with a topology component. These can then be mapped together in StackState by a StackPack to give you telemetry streams and health states on your components.
All telemetry classes and methods can be imported from stackstate_checks.base. The following stream types can be added:
Metric stream - a metric stream and associated metric health checks.
Events stream - a log stream with events and associated event health checks.
Service check stream - a log stream with service check statuses for a specific integration and associated event health checks.
For example, a MetricStream can be created on the metric system.cpu.usage with some conditions specific to a component. A health check maximum_average is then created on this metric stream using this_host_cpu_usage.identifier. The stream and check are then added to the streams and checks lists for the component this-host.
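This scenario can be sketched as follows. MetricStream, MetricHealthChecks and AgentCheck are local stand-ins so that the example runs on its own. Their signatures follow the parameter lists on this page but are assumptions, as are the streams and checks keyword arguments of self.component, the MEAN aggregation value and the generated identifier. In a real check these classes would be imported from stackstate_checks.base.

```python
import uuid


class MetricStream:
    """Stand-in; argument names follow the metric stream details on this page."""

    def __init__(self, name, metricField, conditions=None, unit_of_measure=None,
                 aggregation=None, priority=None):
        self.name = name
        self.metricField = metricField
        self.conditions = conditions or {}
        self.unit_of_measure = unit_of_measure
        self.aggregation = aggregation
        self.priority = priority
        self.identifier = str(uuid.uuid4())  # assumed: any unique value


class MetricHealthChecks:
    """Stand-in that returns a plain dictionary describing the check."""

    @staticmethod
    def maximum_average(stream_id, name, deviating_value, critical_value,
                        description=None, remediation_hint=None, max_window=None):
        return {"stream_id": stream_id, "name": name,
                "deviating_value": deviating_value, "critical_value": critical_value,
                "remediation_hint": remediation_hint}


class AgentCheck:
    def __init__(self):
        self.components = []

    def component(self, id, type, data, streams=None, checks=None):
        self.components.append({"id": id, "type": type, "data": data,
                                "streams": streams or [], "checks": checks or []})


class ExampleCheck(AgentCheck):
    def check(self, instance):
        this_host_cpu_usage = MetricStream(
            "Host CPU Usage", "system.cpu.usage",
            conditions={"tags.hostname": "this-host"},
            unit_of_measure="Percentage", aggregation="MEAN", priority="HIGH")
        cpu_max_average_check = MetricHealthChecks.maximum_average(
            this_host_cpu_usage.identifier, "Max CPU Usage (Average)", 75, 90,
            remediation_hint="There is too much activity on this-host")
        self.component("urn:example:/host:this-host", "Host",
                       data={"name": "this-host"},
                       streams=[this_host_cpu_usage],
                       checks=[cpu_max_average_check])


check = ExampleCheck()
check.check({})
```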
Events stream
Log streams containing events can be added to a component using the EventStream class.
Each events stream has the following details:
name - The name for the stream in StackState.
conditions - A dictionary of key:value arguments that are used to filter the event values for the stream.
Event stream health check
Event stream health checks can optionally be mapped to an events stream using the stream identifier. The following event stream health checks are supported out of the box:
contains_key_value - Checks that the last event contains, at the top level, the specified value for a key.
use_tag_as_health - Returns the value of a tag in the event as the health state.
custom_health_check - Provides the functionality to send in a custom event health check.
For details see the EventHealthChecks class (github.com).
An event stream health check includes the details listed below. Note that a custom_health_check only requires a name and check_arguments:
stream_id - the identifier of the stream the check should run on.
name - the name the check will have in StackState.
description - the description for the check in StackState.
remediation_hint - the remediation hint to display when the check returns a CRITICAL health state.
contains_key - for check contains_key_value only. The key that should be contained in the event.
contains_value - for check contains_key_value only. The value that should be contained in the event.
found_health_state - for check contains_key_value only. The health state to return when this tag and value is found.
missing_health_state - for check contains_key_value only. The health state to return when the tag/value isn't found.
tag_name - for check use_tag_as_health only. The key of the tag that should be used as the health state.
For details see the EventHealthChecks class (github.com).
Metric stream
Metric streams can be added to a component using the MetricStream class.
Each metric stream has the following details:
name - The name for the stream in StackState.
metricField - The name of the metric to select.
conditions - A dictionary of key:value arguments that are used to filter the metric values for the stream.
unit_of_measure - Optional. The unit of measure for the metric points. It is appended after the stream name: name (unit_of_measure)
aggregation - Optional. Sets the aggregation function for the metrics in StackState. See aggregation methods.
priority - Optional. The stream priority in StackState, one of NONE, LOW, MEDIUM, HIGH. HIGH priority streams are used for anomaly detection in StackState.
Metric stream health check
Metric stream health checks can optionally be mapped to a metric stream using the stream identifier. Note that some metric health checks require multiple streams for ratio calculations.
The following metric stream health checks are supported out of the box:
maximum_average - Calculates the health state by comparing the average of all metric points in the time window against the configured maximum values.
maximum_last - Calculates the health state only by comparing the last value in the time window against the configured maximum values.
maximum_percentile - Calculates the health state by comparing the specified percentile of all metric points in the time window against the configured maximum values. For the median specify 50 for the percentile. The percentile parameter must be a value > 0 and <= 100.
maximum_ratio - Calculates the ratio between the values of two streams and compares it against the critical and deviating value. If the ratio is larger than the specified critical or deviating value, the corresponding health state is returned.
minimum_average - Calculates the health state by comparing the average of all metric points in the time window against the configured minimum values.
minimum_last - Calculates the health state only by comparing the last value in the time window against the configured minimum values.
minimum_percentile - Calculates the health state by comparing the specified percentile of all metric points in the time window against the configured minimum values. For the median specify 50 for the percentile. The percentile parameter must be a value > 0 and <= 100.
failed_ratio - Calculates the ratio between the last values of two streams (one is the normal metric stream and one is the failed metric stream). This ratio is compared against the deviating or critical value.
custom_health_check - Provides the functionality to send in a custom metric health check.
For details see the MetricHealthChecks class (github.com).
A metric stream health check has the details listed below. Note that a custom_health_check only requires a name and check_arguments:
name - the name the health check will have in StackState.
description - the description for the health check in StackState.
deviating_value - the threshold at which point the check will return a DEVIATING health state.
critical_value - the threshold at which point the check will return a CRITICAL health state.
remediation_hint - the remediation hint to display when the check returns a CRITICAL health state.
max_window - the max window size for the metrics.
percentile - for maximum_percentile and minimum_percentile checks only. The percentile value to use for the calculation.
stream identifier(s):
stream_id - for maximum_percentile, maximum_last, maximum_average, minimum_average, minimum_last, minimum_percentile checks. The identifier of the stream the check should run on.
denominator_stream_id - for maximum_ratio checks only. The identifier of the denominator stream the check should run on.
numerator_stream_id - for maximum_ratio checks only. The identifier of the numerator stream the check should run on.
success_stream_id - for failed_ratio checks only. The identifier of the success stream this check should run on.
failed_stream_id - for failed_ratio checks only. The identifier of the failures stream this check should run on.
For details see the MetricHealthChecks class (github.com).
Service check stream
A Service Check stream can be added to a component using the ServiceCheckStream class. It expects a stream name and conditions for the metric telemetry query in StackState. A Service Check stream has one check supported out of the box, which can be mapped using the stream identifier.
Logging
The self.log field is a Python logger (python.org) instance that prints to the main Agent log file. The log level can be set in the Agent configuration file stackstate.yaml.
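Since self.log is a standard Python logger, the usual logging methods apply. In the sketch below, AgentCheck is a local stand-in that creates the logger itself, and the logger name is an assumption. In the real Agent the logger is provided by the base class.

```python
import logging


class AgentCheck:
    """Stand-in that provides self.log like the real base class does."""

    def __init__(self, name):
        self.name = name
        self.log = logging.getLogger("checks.%s" % name)  # assumed logger name


class MyCheck(AgentCheck):
    def check(self, instance):
        self.log.debug("Running check for instance %s", instance.get("url"))
        if "url" not in instance:
            self.log.warning("No url configured, skipping instance")
            return
        self.log.info("Collected data from %s", instance["url"])
```

Passing the values as arguments rather than formatting the string up front means the message is only built when the configured log level lets it through.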
Example taken from the StackState MySQL Agent check (github.com).
Error handling
A check should raise a meaningful exception when it can't work correctly, for example due to a wrong configuration or runtime error. Exceptions are logged and shown in the Agent status page. The warning method can be used to log a warning message and display it on the Agent status page.
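A sketch of both mechanisms in a check that uses the AgentCheck style. AgentCheck is a local stand-in whose warning method only records the message, and the url and timeout settings are made up for the example.

```python
class AgentCheck:
    """Stand-in whose warning method records messages."""

    def __init__(self):
        self.warnings = []

    def warning(self, message):
        self.warnings.append(message)


class MyCheck(AgentCheck):
    def check(self, instance):
        if "url" not in instance:
            # The check cannot work at all: raise an exception that explains why.
            raise ValueError("Configuration error: 'url' is required")
        if "timeout" not in instance:
            # The check can continue, but the user should know about the fallback.
            self.warning("No timeout configured, using the default of 5 seconds")


check = MyCheck()
check.check({"url": "http://localhost"})
```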
Example taken from the StackState MySQL Agent check (github.com).