Comment on page
Monitor functions
StackState Self-hosted v5.0.x
This page describes StackState version 5.0.
Monitor functions, much like any other type of function in StackState, are represented using the STJ file format.
The following snippet represents an example monitor function definition:
{
"_version": "1.0.39",
"timestamp": "2022-05-23T13:16:27.369269Z[GMT]",
"nodes": [
{
"_type": "MonitorFunction",
"name": "Metric Threshold",
"identifier": "urn:system:default:monitor-function:metric-threshold",
"description": "Validates the provided metric stream against the supplied threshold. Values above the threshold result in a CRITICAL health state..",
"parameters": [
...
],
"script": {
"_type": "ScriptFunctionBody",
"scriptBody": "..."
}
}
]
}
Much like the monitor definition, the file contains the requisite STJ paramateres and defines a single node instance. For brevity, the specific parameter definitions and the function body have been ommited. These parts will be elaborated upon in the following sections.
The supported fields are:
- name - a human-readable name of the function that ideally indicates its purpose.
- identifier a StackState-URN-formatted values that uniquely identifies this particular function, it is used by the monitor Definition during the invocation of this function.
- description - a more thorough description of the logic encapsulated by this function.
- parameters - a set of parameters "accepted" by this function, these allow the computation to be parameterized.
- script - names or lists out the concrete algorithm to use for this computation; currently, monitors only support scripts of type
ScriptFunctionBody
written using the Groovy scripting language.
A monitor function can use any number of parameters of an array of available types. These parameter types are not unlike the ones available to the Check functions with the one notable exception being the
SCRIPT_METRIC_QUERY
parameter type.SCRIPT_METRIC_QUERY
parameter type represents a Telemetry Query and ensures that a well-formed Telemetry Query builder value is supplied to the function. Here's an example declaration of such a parameter:"parameters": [{
"_type": "Parameter",
"name": "metrics",
"type": "SCRIPT_METRIC_QUERY",
"required": true,
"multiple": false
}]
The above declaration represents a single Telemetry Query accepting parameter named
metrics
that is required and allows no duplicates. Arguments of an invocation of a function using the above declaration:"arguments": [{
"_type": "ArgumentScriptMetricQueryVal",
"parameter": {{ get "<identifier-of-the-function>" "Type=Parameter;Name=metrics" }},
"script": "Telemetry.query('StackState Metrics', '').groupBy('tags.pid', 'tags.createTime', 'host').metricField('cpu_systemPct').start('-1m').aggregation('mean', '15s')"
}]
After the invocation, the
metrics
parameter assumes the value of the following telemetry query expression which in itself indicates the specific metric values to fetch along with the aggregation method and the time window to use:Telemetry
.query('StackState Metrics', '')
.groupBy('tags.pid', 'tags.createTime', 'host')
.metricField('cpu_systemPct')
.start('-1m')
.aggregation('mean', '15s')
A specialized telemetry query parameter type is useful as it ensures well-formedness of the above query - in case of any syntactic or type errors a suitable error will be reported and the system will prevent execution of the monitor function with potentially bogus values.
Similar to other kinds of functions in StackState, the monitor function follows an arbitrarily complex algorithm encoded as a Groovy source script. It is fairly flexible in what it can do allowing for Turing-complete computation and even communication with external services.
The monitor runner, which is responsible for the execution of monitor functions, expects every function to return a specific result type - an array of JSON-encoded
MonitorHealthState
objects. For example:return [
[
_type: "MonitorHealthState",
id: uniqueHealthStateIdentifier,
topologyIdentifier: topologyElementURN,
state: healthState,
displayTimeSeries: optionalChartConfiguration
],
// More results to follow...
]
Each object in this array represents a single monitor result - usually, it is a one to one mapping between a specific metrics' time series and a topology element, but in general it is possible to express many-to-one mappings as well. The supported fields are:
- id - the identifier for this concrete monitor result, it is used to deduplicate results arriving from different sources of computation,
- topologyIdentifier - indicates to which Topology element this result "belongs",
- state - indicates the resulting health state of the validation rule,
- displayTimeSeries - an optional configuration of chart(s) that will be displayed on the StackState interface together with the monitor result.
The
displayTimeSeries
field can optionally be used to instrument StackState to display a helpful metric chart (or multiple charts) together with the monitor result. It is expected to be an array of JSON-encoded objects of type DisplayTimeSeries
of the following shape:[
_type: "DisplayTimeSeries",
name: "A descriptive name for the chart",
query: aQueryToUseWhenFetchingMetrics,
timeSeriesId: aSpecificTimeSeriesId
]
The meaning of different
DisplayTimeSeries
fields:- name - a short name for the chart, it is displayed as the chart title on the StackState interface,
- query - a Telemetry query used to fetch the metrics data to display on the interface,
- timeSeriesId - an identifier of the concrete time series to use if the
query
produces multiple lines; when defined, the displayed chart will be limitted to just one line.
The computation performed by a monitor function is restricted in numerous ways, including by restricting its access to certain classes (sandboxing) and by resource utilization (both in execution time and memory, compute usage). Some of these limits can be lifted by changing the default StackState configuration.
Monitor functions can leverage existing StackState Script APIs, including:
- Telemetry - used to fetch Metric and Log data,
- Async - allowing for combining multiple asynchronous results in one computation,
- View - StackState View related operations,
- Component - StackState Component related operations.
Additionally, the following Script APIs are optionally available. They are considered to be experimental:
- Topology - used to fetch Topology data,
- Http - used to fetch external data via the HTTP protocol,
- Graph - a generic way to query the StackGraph database.
To use the above, experimental APIs they must be explicitly named in your StackState configuration file by appending the following line at the end of the
etc/application_stackstate.conf
file.stackstate.featureSwitches.monitorEnableExperimentalAPIs = true
The following example describes a step-by-step process of creating a monitor function. In this case, a metric thereshold rule is introduced parameterized with the exact metrics query to use and the threshold itself.
The first step is to create an STJ import file following the usual format and file organization:
{
"_version": "1.0.39",
"timestamp": "2022-06-23T23:23:23.269369Z[GMT]",
"nodes": [
...
]
}
StackState expects a
_version
property to be present in each and every export file. monitor functions are supported in versions 1.0.39
and up. Each export file can contain multiple monitor function nodes and also nodes of other types.The next step is the function node itself.
{
"_type": "MonitorFunction",
"name": "The name of the monitor function",
"description": "A longer, meaningful description of a monitor function",
"identifier": "urn:system:default:monitor-function:a-name-of-the-monitor-function",
"parameters": [
...
],
"script": {
"_type": "ScriptFunctionBody",
"scriptBody": ...
}
}
Here you can define the basic information about the function such as its name and description to help you distinguish different monitor functions. An important field is the
identifier
- it is a unique value of the StackState URN format that can be used to refer to this objects in various other places, such as an invocation of a monitor function. The identifier should be formatted as follows:urn : <prefix> : monitor-function : <unique-function-identification>
The
prefix
is described in more detail in topology identifiers, while the unique-function-identification
is user-definable and free-form.The most important fields of the monitor function node are its
scriptBody
and parameters
. In this step, we will focus on the body of the function and we'll determine the required parameters next.As described above, the function we're creating will check a given metric against a given threshold and based on that produce health states for the affected topology. To make things concrete, let's start with a simple CPU usage metric:
def metrics =
Telemetry.query("StackState Metrics", "cluster='cluster-a.example.com'")
.metricField("system.cpu.system")
.groupBy("tags.host")
.start("-10m")
.aggregation("mean", "30s")
The above snippet queries
StackState Metrics
datasource for a CPU usage, grouped per each hostname of a cluster-a.example.com
Kubernetes cluster, starting at 10 minutes in the past, averaged every 30 seconds. The results are grouped per hostname and represent the CPU usage of that host averaged across 10 minutes. Next we need to check the returned values against a threshold of 90%:def metrics = /* Same as above. */
def threshold = 90.0
def checkThreshold(timeSeries, threshold) {
timeSeries.points.any { point -> point.last() > threshold }
}
metrics.map { result ->
def state = "CLEAR"
if (checkThreshold(result.timeSeries, threshold)) {
state = "CRITICAL";
}
return state
}
For each hostname, we check the resulting values to see if any of them are above the threshold of 90%, and if so we indicate a
CRITICAL
health state, otherwise a CLEAR
one is reported. To make these results conform to the required format, we need to include a few more mandatory fields. The most important one is the topologyIdentifier
which determines to which topology element a given monitor result belongs. We also give each monitor result a uniqe id
(derived from the metrics that were used to create it), so that the system can match and replace successively supplied results accross time:def metrics = /* Same as above. */
def threshold = /* Same as above. */
def topologyIdentifierPattern = "urn:host:/${tags.host}"
def checkThreshold = /* Same as above. */
metrics.map { result ->
def state = "CLEAR"
if (checkThreshold(result.timeSeries, threshold)) {
state = "CRITICAL";
}
return [
_type: "MonitorHealthState",
id: result.timeSeries.id.toIdentifierString(),
state: state,
topologyIdentifier: StringTemplate.runForTimeSeriesId(topologyIdentifierPattern, result.timeSeries.id)
]
}
With the above change, each timeseries is now validated against the threshold and reported as a monitor health state. Each such result is uniquely identified by the metrics that were used to compute it, so any future changes based on the same metrics will result in health state updates instead of new states being created. Furthermore, we used the same metrics to extract the
tags.host
grouping from it and turned it into a specific topology identifier of a hostname represented in StackState. This topology identifier reconstruction, called topology mapping, is performed for all the groupped metric timeseries meaning it automatically applies to all the hosts in the topology that have CPU metrics associated with them. An improvement to the above definition can be made by introducing a result chart with each result. This is done by populating the optional displayTimeseries
property of the monitor health state:def metrics = /* Same as above. */
def threshold = /* Same as above. */
def topologyIdentifierPattern = /* Same as above. */
def checkThreshold = /* Same as above. */
metrics.map { result ->
def state = "CLEAR"
if (checkThreshold(result.timeSeries, threshold)) {
state = "CRITICAL";
}
return [
_type: "MonitorHealthState",
id: result.timeSeries.id.toIdentifierString(),
state: state,
topologyIdentifier: StringTemplate.runForTimeSeriesId(topologyIdentifierPattern, result.timeSeries.id),
displayTimeSeries: [
[
_type: "DisplayTimeSeries",
name: "The resulting metric values",
query: result.query,
timeSeriesId: result.timeSeries.id
]
]
]
}
Each chart configuration supplied this way will be shown together with the monitor result on the StackState UI. In the above case, the specific timeseries that resulted in the monitor health state being computed will be displayed for host affected by this monitor.
The full function body created so far:
def metrics =
Telemetry.query("StackState Metrics", "cluster='cluster-a.example.com'")
.metricField("system.cpu.system")
.groupBy("tags.host")
.start("-10m")
.aggregation("mean", "30s")
def threshold = 90.0
def topologyIdentifierPattern = "urn:host:/${tags.host}"
def checkThreshold(timeSeries, threshold) {
timeSeries.points.any { point -> point.last() > threshold }
}
metrics.map { result ->
def state = "CLEAR"
if (checkThreshold(result.timeSeries, threshold)) {
state = "CRITICAL";
}
return [
_type: "MonitorHealthState",
id: result.timeSeries.id.toIdentifierString(),
state: state,
topologyIdentifier: StringTemplate.runForTimeSeriesId(topologyIdentifierPattern, result.timeSeries.id),
displayTimeSeries: [
[
_type: "DisplayTimeSeries",
name: "The resulting metric values",
query: result.query,
timeSeriesId: result.timeSeries.id
]
]
]
}
To parameterize this function and make it reusable for different metrics and thresholds, we can extract the
metrics
, threshold
and topologyIdentifierPattern
into StackState functions parameters. Furthermore, in order to use this function as part of an STJ import file it needs to be encoded as a JSON string property:{
"_type": "ScriptFunctionBody",
"scriptBody": "def checkThreshold(timeSeries, threshold) {\n timeSeries.points.any { point -> point.last() > threshold }\n}\n\nmetrics.map { result ->\n def state = \"CLEAR\"\n if (checkThreshold(result.timeSeries, threshold)) {\n state = \"CRITICAL\";\n }\n\n return [\n _type: \"MonitorHealthState\",\n id: result.timeSeries.id.toIdentifierString(),\n state: state,\n topologyIdentifier: StringTemplate.runForTimeSeriesId(topologyIdentifierPattern, result.timeSeries.id)\n displayTimeSeries: [\n [\n _type: \"DisplayTimeSeries\",\n name: \"The resulting metric values\",\n query: result.query,\n timeSeriesId: result.timeSeries.id\n ]\n ]\n ]\n}"
}
In the previous steps, we created a monitor function and implemented the validation rule for CPU usage metrics, we determined that in order to make the function reusable, we need to extract three parameters -
metrics
, threshold
and the topologyIdentifierPattern
.The
threshold
and topologyIdentifierPattern
are simple basic types, a floating point value and a string respectively. For the metrics
parameter, we can utilize the aforementioned SCRIPT_METRIC_QUERY
parameter type, which ensures well-formedness of the metrics query supplied as the value of this parameter:{
"_type": "Parameter",
"name": "threshold",
"type": "DOUBLE",
"required": true,
"multiple": false
},
{
"_type": "Parameter",
"name": "topologyIdentifierPattern",
"type": "STRING",
"required": true,
"multiple": false
},
{
"_type": "Parameter",
"name": "metrics",
"type": "SCRIPT_METRIC_QUERY",
"required": true,
"multiple": false
},
The above specification ensures that each function invocation is passed all the requisite parameters (all marked as
required
) and that their types will be safe to use in the context of the body of the function. Any type mismatches will be reported during the importing of this function definition.The final step is giving the function a descriptive name and uploading it to StackState by importing the file containing the function definition. The following snippet contains the full function created so far:
{
"_version": "1.0.39",
"timestamp": "2022-06-23T23:23:23.269369Z[GMT]",
"nodes": [
{
"_type": "MonitorFunction",
"name": "Metric above threshold",
"description": "Validates that a metric value stays below a given threshold, reports a CRITICAL state otherwise.",
"parameters": [
{
"_type": "Parameter",
"name": "threshold",
"type": "DOUBLE",
"required": true,
"multiple": false
},
{
"_type": "Parameter",
"name": "topologyIdentifierPattern",
"type": "STRING",
"required": true,
"multiple": false
},
{
"_type": "Parameter",
"name": "metrics",
"type": "SCRIPT_METRIC_QUERY",
"required": true,
"multiple": false
}
],
"identifier": "urn:system:default:monitor-function:metric-above-threshold",
"script": {
"_type": "ScriptFunctionBody",
"scriptBody": "def checkThreshold(timeSeries, threshold) {\n timeSeries.points.any { point -> point.last() > threshold }\n}\n\nmetrics.map { result ->\n def state = \"CLEAR\"\n if (checkThreshold(result.timeSeries, threshold)) {\n state = \"CRITICAL\";\n }\n\n return [\n _type: \"MonitorHealthState\",\n id: result.timeSeries.id.toIdentifierString(),\n state: state,\n topologyIdentifier: StringTemplate.runForTimeSeriesId(topologyIdentifierPattern, result.timeSeries.id)\n displayTimeSeries: [\n [\n _type: \"DisplayTimeSeries\",\n name: \"The resulting metric values\",\n query: result.query,\n timeSeriesId: result.timeSeries.id\n ]\n ]\n ]\n}"
}
}
]
}
The function can be uploaded to StackState in one of three ways:
- Use the Import/Export facility under StackState settings.
CLI: sts (new)
CLI: stac
sts settings import < path/to/file.stj
⚠️ PLEASE NOTE - from StackState v5.0, the old
sts
CLI has been renamed to stac
and there is a new sts
CLI. The command(s) provided here are for use with the new sts
CLI.stac graph import < path/to/file.stj
⚠️ PLEASE NOTE - from StackState v5.0, the old
sts
CLI is called stac
.In a future release of StackState, the new
sts
CLI will fully replace the stac
CLI. It is advised to install the new sts
CLI and upgrade any installed instance of the old sts
CLI to stac
. For details see:After the function has been uploaded, it will be generally available for any monitor definition to invoke it.
Last modified 1yr ago