Debug health synchronization
StackState Self-hosted v5.0.x

Overview

The StackState CLI can be used to troubleshoot a health synchronization and fix issues that might prevent health data from being correctly ingested and displayed in StackState. This page describes the general troubleshooting steps for debugging a health synchronization, the CLI commands to use, and the error messages that can be returned.

General troubleshooting steps

When debugging a health synchronization, there are some common verification steps that can be taken regardless of the specific issue:
  1. If you are using sub streams, verify that the sub stream exists. The response also shows the number of check states on the sub stream, which tells you whether data is being ingested and processed.
  2. Investigate further:
    • Stream present - Check the stream status. This shows the latency metrics of the stream and any errors.
    • Streams / sub streams present, but there are no check states - Confirm that the payload sent to the Receiver API adheres to the health payload specification.
    • No streams / sub streams are present - Use the CLI command below to verify that health data sent to the Receiver API is arriving in StackState:
    stac topic show sts_health_sync
Command not currently available in the new sts CLI. Use the stac CLI.

Common issues

Check state not visible on the component

There can be two reasons for a check state not to show on a component in StackState:
  • The health check state has not been created. Follow the general troubleshooting steps to confirm that the stream / sub stream has been created and that data is arriving in StackState.
  • The health check state was created, but its topologyElementIdentifier does not match any identifiers from the StackState topology. Use the CLI command show sub stream status to verify if there are any Check states with identifier which has no matching topology element.

Check state slow to update in StackState

The main reason for this is that the latency of the health synchronization is higher than expected. Use the CLI command show stream status to confirm the latency of the stream as well as the throughput of messages and specific check operations. It may be necessary to tweak the data sent to the health synchronization, or the frequency with which data is sent.

Useful CLI commands

List streams

Returns a list of all current synchronized health streams and the number of sub streams included in each.
    # List streams
    stac health list-streams

    stream urn                                          sub stream count
    --------------------------------------------------  ------------------
    urn:health:sourceId:streamId                        1
Command not currently available in the new sts CLI. Use the stac CLI.

List sub streams

Returns a list of all sub streams for a given stream URN, together with the number of check states in each.
    # List sub streams
    stac health list-sub-streams urn:health:sourceId:streamId

    sub stream id                   check state count
    ------------------------------  -------------------
    subStreamId1                    20
    subStreamId2                    17
Command not currently available in the new sts CLI. Use the stac CLI.

Show stream status

The stream status command returns the aggregated stream latency and throughput metrics. This is helpful when debugging why a health check takes a long time to land on the expected topology elements, and helps to diagnose whether the frequency of data sent to StackState should be adjusted. The output contains the section "Errors for non-existing sub streams:" because some errors are only relevant when a sub stream could not be created, for example StreamMissingSubStream. Sub stream errors can be any of the documented error messages.
    # Show a stream status
    stac health show urn:health:sourceId:streamId

    Aggregate metrics for the stream and all substreams:

    metric                             value between now and 300 seconds ago    value between 300 and 600 seconds ago    value between 600 and 900 seconds ago
    ---------------------------------  ---------------------------------------  ---------------------------------------  ---------------------------------------
    latency (Seconds)                  1.102                                    1.102                                    -
    messages processed (per second)    0.256                                    0.16                                     -
    check states created (per second)  0.10555555555555556                      0.10666666666666667                      -
    check states updated (per second)  -                                        -                                        -
    check states deleted (per second)  -                                        -                                        -

    Errors for non-existing sub streams:

    error message                                                                                   error occurrence count
    ----------------------------------------------------------------------------------------------  ------------------------
    Sub stream `substream with id `subStreamId2`` not started when receiving snapshot stop          6
Command not currently available in the new sts CLI. Use the stac CLI.
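When the latency has to be checked repeatedly, the metrics section of this output can be read programmatically. The following is a minimal Python sketch, not a StackState tool: it assumes that columns in the CLI output are separated by at least two spaces, and the function name `parse_metrics`, the whitespace-trimmed sample text and the 5 second threshold are all hypothetical.

```python
import re

# Sample rows taken from the example output above, with padding trimmed.
# Assumption: columns are separated by two or more spaces.
SAMPLE = """\
metric  value between now and 300 seconds ago  value between 300 and 600 seconds ago  value between 600 and 900 seconds ago
------  -------------------------------------  -------------------------------------  -------------------------------------
latency (Seconds)  1.102  1.102  -
messages processed (per second)  0.256  0.16  -
check states created (per second)  0.10555555555555556  0.10666666666666667  -
"""


def parse_metrics(text):
    """Return {metric name: [newest window, middle window, oldest window]}.

    A "-" cell (no value reported) becomes None.
    """
    rows = {}
    for line in text.splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        # Skip blank lines, the header row and the dashed separator row.
        if len(cells) != 4 or cells[0] == "metric" or set(cells[0]) == {"-"}:
            continue
        rows[cells[0]] = [None if c == "-" else float(c) for c in cells[1:]]
    return rows


metrics = parse_metrics(SAMPLE)
latest_latency = metrics["latency (Seconds)"][0]

# Hypothetical threshold, chosen only for this illustration.
LATENCY_THRESHOLD_S = 5.0
if latest_latency is not None and latest_latency > LATENCY_THRESHOLD_S:
    print(f"latency {latest_latency}s is above {LATENCY_THRESHOLD_S}s")
else:
    print(f"latency {latest_latency}s is within the threshold")
```

Comparing the three time windows in the returned lists shows whether latency is growing, which is the signal that the amount or frequency of data sent may need adjusting.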

Show sub stream status

The sub stream status provides useful information to verify that check states sent to StackState from an external system could be bound and linked to existing topology elements. This information is helpful to debug why a specific check is not visible on the expected topology element.
    # Show a sub stream status.
    stac health show urn:health:sourceId:streamId -s "subStreamId3"

    Synchronized check state count: 32
    Repeat interval (Seconds): 120
    Expiry (Seconds): 240

    Synchronization errors:

    code    level    message    occurrence count
    ------  -------  ---------  ------------------

    Synchronization metrics:

    metric                             value between now and 300 seconds ago    value between 300 and 600 seconds ago    value between 600 and 900 seconds ago
    ---------------------------------  ---------------------------------------  ---------------------------------------  ---------------------------------------
    latency (Seconds)                  0.23                                     0.125                                    0.265
    messages processed (per second)    0.256                                    0.2773333333333333                       0.256
    check states created (per second)  -                                        -                                        -
    check states updated (per second)  -                                        -                                        -
    check states deleted (per second)  -                                        -                                        -
Command not currently available in the new sts CLI. Use the stac CLI.
A sub stream status will show the metadata related to the consistency model:
  • Repeat Snapshots - Shows the repeat interval and expiry
  • Repeat States - Shows the repeat interval and expiry
  • Transactional Increments - Shows the checkpoint offset and checkpoint batch index
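The repeat interval and expiry shown in the status feed directly into several of the error conditions described under Error messages on this page. As a rough worked example, the Python sketch below applies the documented rules (a maximum repeat interval of 30 minutes, and an automatic stop after two times the repeat interval) to the values from the example output. The function name and the returned dictionary are hypothetical and only illustrate the arithmetic.

```python
# Documented maximum for repeat_interval_s: 30 minutes.
MAX_REPEAT_INTERVAL_S = 30 * 60


def snapshot_timing(repeat_interval_s, expiry_interval_s):
    """Derive the timeouts implied by a sub stream's repeat interval and expiry."""
    errors = []
    if repeat_interval_s > MAX_REPEAT_INTERVAL_S:
        errors.append("SubStreamRepeatIntervalTooHigh")
    return {
        # SubStreamMissingStop: no stop snapshot within 2 x repeat_interval_s.
        "missing_stop_timeout_s": 2 * repeat_interval_s,
        # SubStreamExpired: no data for longer than expiry_interval_s.
        "expired_after_s": expiry_interval_s,
        "errors": errors,
    }


# Values from the example output: repeat interval 120 s, expiry 240 s.
print(snapshot_timing(120, 240))
# A repeat interval of one hour is above the documented maximum.
print(snapshot_timing(3600, 7200)["errors"])
```

With the example values, a snapshot that is not stopped within 240 seconds would be stopped automatically, and a sub stream that stays silent for more than 240 seconds would expire.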
The sub stream status can be expanded to include details of matched and unmatched check states using the -t command line argument. This is helpful to identify any health states that are not attached to a topology element. In the example below, checkStateId2 is listed under Check states with identifier which has no matching topology element. This means that it was not possible to match the check state to a topology element with the identifier server-2.
    # Show a sub stream status matched/unmatched check states.
    stac health show urn:health:sourceId:streamId -s "subStreamId3" -t
    # If we configured our stream to not use explicit substreams then a default
    # sub stream can be reached by omitting the optional substreamId parameter as in:
    # stac health show urn:health:sourceId:streamId -t

    Check states with identifier matching exactly 1 topology element: 32

    Check states with identifier which has no matching topology element:

    check state id    topology element identifier
    ----------------  -----------------------------
    checkStateId2     server-2

    Check states with identifier which has multiple matching topology elements:

    check state id    topology element identifier    number of matched topology elements
    ----------------  -----------------------------  -------------------------------------
Command not currently available in the new sts CLI. Use the stac CLI.
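The three groups in this output come down to counting how many topology elements carry the identifier that a check state points to. The Python sketch below reproduces that grouping as an aid to reading the output. It is not StackState code: the function name, the data shapes and the sample topology are hypothetical.

```python
from collections import defaultdict


def classify_check_states(check_states, topology):
    """Group check state ids by the number of topology elements they match.

    check_states: list of (check_state_id, topology_element_identifier) pairs
    topology: dict mapping an element name to the set of its identifiers
    Both shapes are hypothetical and chosen only for this illustration.
    """
    # Count how many topology elements carry each identifier.
    element_count = defaultdict(int)
    for identifiers in topology.values():
        for identifier in identifiers:
            element_count[identifier] += 1

    groups = {"exactly_one": [], "no_match": [], "multiple": []}
    for check_state_id, identifier in check_states:
        count = element_count.get(identifier, 0)
        if count == 1:
            groups["exactly_one"].append(check_state_id)
        elif count == 0:
            groups["no_match"].append(check_state_id)
        else:
            groups["multiple"].append(check_state_id)
    return groups


# Hypothetical topology: two elements share the identifier "database".
topology = {
    "host-a": {"server-1"},
    "db-primary": {"database"},
    "db-replica": {"database"},
}
check_states = [
    ("checkStateId1", "server-1"),
    ("checkStateId2", "server-2"),  # no element has this identifier
    ("checkStateId3", "database"),  # two elements have this identifier
]
report = classify_check_states(check_states, topology)
print(report)
```

A check state in the `no_match` group corresponds to checkStateId2 in the example output: its topologyElementIdentifier has to be corrected in the sent data, or a topology element with that identifier has to exist, before the check state becomes visible on a component.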

Delete a health stream

The delete stream functionality is helpful while setting up a health synchronization in StackState. It allows you to experiment, delete the data and start over again clean. You can also delete a stream and drop its data when you are sure that you do not want to keep using it.
    # Delete a health synchronization stream
    stac health delete urn:health:sourceId:streamId
Command not currently available in the new sts CLI. Use the stac CLI.

Clear health stream errors

The clear-errors option removes all errors from a health stream. This is helpful while setting up a health synchronization in StackState or, in the case of the TRANSACTIONAL_INCREMENTS consistency model, when some errors can't be removed organically. For example, a request to delete a check state might raise an error if the check state is not known to StackState. The only way to remove such an error is to use the clear-errors command.
    # Clear health stream errors
    stac health clear-errors urn:health:sourceId:streamId
Command not currently available in the new sts CLI. Use the stac CLI.
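To see why such an error cannot disappear on its own, consider the toy Python model below. It is a sketch under stated assumptions, not StackState's implementation: the class `SubStreamModel` and its methods are hypothetical, and the error name is taken from the error messages described on this page.

```python
class SubStreamModel:
    """Toy model of a Transactional Increments sub stream (illustration only)."""

    def __init__(self):
        self.check_states = {}  # check_state_id -> health
        self.errors = []        # (error name, check_state_id) pairs

    def upsert(self, check_state_id, health):
        self.check_states[check_state_id] = health

    def delete(self, check_state_id):
        if check_state_id not in self.check_states:
            # No later message from the sender remediates this error,
            # so it stays on the stream until it is cleared explicitly.
            self.errors.append(("SubStreamUnknownCheckState", check_state_id))
            return
        del self.check_states[check_state_id]

    def clear_errors(self):
        # Counterpart of clearing the errors of the stream with the CLI.
        self.errors.clear()


stream = SubStreamModel()
stream.upsert("checkStateId1", "deviating")
stream.delete("checkStateId9")  # unknown id: an error is recorded
errors_before = list(stream.errors)
stream.upsert("checkStateId1", "clear")  # further valid data does not remove it
errors_still_open = list(stream.errors)
stream.clear_errors()
print(errors_before, errors_still_open, stream.errors)
```

In this model the error survives any amount of valid data and only goes away when the errors are cleared, which mirrors the situation described above.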

Error messages

Errors will be closed once the described issue has been remediated.
For example, a SubStreamStopWithoutStart error will be closed once the health synchronization observes a start snapshot message followed by a stop snapshot message.
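The example above can be pictured as a small state machine over snapshot messages. The Python sketch below is a simplified model, not StackState's implementation: the class and method names are hypothetical, the error names are those listed below, and only the documented close condition for SubStreamStopWithoutStart is modelled.

```python
class SnapshotTracker:
    """Tracks whether a snapshot is open and which sequencing errors are open."""

    def __init__(self):
        self.snapshot_open = False
        self.open_errors = set()

    def start_snapshot(self):
        if self.snapshot_open:
            # A second start while the previous snapshot is still open.
            self.open_errors.add("SubStreamStartWithoutStop")
        self.snapshot_open = True

    def check_state(self):
        if not self.snapshot_open:
            self.open_errors.add("SubStreamCheckStateOutsideSnapshot")

    def stop_snapshot(self):
        if not self.snapshot_open:
            self.open_errors.add("SubStreamStopWithoutStart")
            return
        self.snapshot_open = False
        # A start followed by a stop was observed, so this error is closed.
        # Close conditions for the other errors are not modelled here.
        self.open_errors.discard("SubStreamStopWithoutStart")


tracker = SnapshotTracker()
tracker.stop_snapshot()             # stop without a start
print(sorted(tracker.open_errors))  # ['SubStreamStopWithoutStart']

tracker.start_snapshot()
tracker.check_state()
tracker.stop_snapshot()             # complete start/stop cycle
print(sorted(tracker.open_errors))  # []
```

Sending a well-formed snapshot is therefore the remediation for this error; no manual action is needed.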
  • StreamMissingSubStream - Raised when the health synchronization receives messages without a previous stream setup message, such as start_snapshot or expiry.
  • StreamConsistencyModelMismatch - Raised when a message is received that belongs to a different consistency model than the one specified when the stream was created.
  • StreamMissingSubStream - Raised when the health synchronization receives messages with a previous start snapshot in place.
  • SubStreamRepeatIntervalTooHigh - Raised when the health synchronization receives a repeat_interval_s greater than the configured maximum of 30 minutes.
  • SubStreamStartWithoutStop - Raised when the health synchronization receives a second message to open a snapshot while a previous snapshot is still open.
  • SubStreamCheckStateOutsideSnapshot - Raised when the health synchronization receives external check states without a snapshot having been opened first.
  • SubStreamStopWithoutStart - Raised when the health synchronization receives a stop snapshot message without a snapshot having been started at all.
  • SubStreamMissingStop - Raised when the health synchronization does not receive a stop snapshot within a timeout period of two times the repeat_interval_s established in the start snapshot message. In this case, an automatic stop snapshot will be applied.
  • SubStreamExpired - Raised when the health synchronization stops receiving data on a particular sub stream for longer than the configured expiry_interval_s. In this case, the sub stream will be deleted.
  • SubStreamLateData - Raised when the health synchronization does not receive a complete snapshot in time, based on the established repeat_interval_s.
  • SubStreamTransformerError - Raised when the health synchronization is unable to interpret the payload sent to the receiver. For example, "Missing required field 'name'" with payload {"checkStateId":"checkStateId3","health":"deviating","message":"Unable to provision the device. ","topologyElementIdentifier":"server-3"} and transformation Default Transformation.
  • SubStreamMissingCheckpoint - Raised when a Transactional Increments sub stream previously observed a checkpoint, but the received message is missing the previous_checkpoint.
  • SubStreamInvalidCheckpoint - Raised when a Transactional Increments sub stream previously observed a checkpoint, but the received message has a previous_checkpoint that is not equivalent to the last observed one.
  • SubStreamOutdatedCheckpoint - Raised when a Transactional Increments sub stream previously observed a checkpoint, but the received message has a checkpoint that precedes the last observed one, meaning that it contains data that StackState has already received.
  • SubStreamUnknownCheckState - Raised when deleting a Transactional Increments check_state and the check_state_id is not present on the sub stream.
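The three checkpoint errors for Transactional Increments can be easier to tell apart when written as a single validation function. The Python sketch below is one possible reading of the descriptions above, not StackState's implementation. It assumes that a checkpoint consists of an offset and a batch index compared in that order, and the order in which the conditions are tested is also an assumption. The names `Checkpoint` and `validate_increment` are hypothetical.

```python
from typing import NamedTuple, Optional


class Checkpoint(NamedTuple):
    # Assumption: checkpoints are ordered by offset, then by batch index.
    offset: int
    batch_index: int


def validate_increment(
    last_seen: Optional[Checkpoint],
    checkpoint: Checkpoint,
    previous_checkpoint: Optional[Checkpoint],
) -> Optional[str]:
    """Return the error name for an incoming message, or None if it is accepted."""
    if last_seen is None:
        # No checkpoint observed yet, so there is nothing to compare against.
        return None
    if checkpoint < last_seen:
        # The message carries data that was already received.
        return "SubStreamOutdatedCheckpoint"
    if previous_checkpoint is None:
        return "SubStreamMissingCheckpoint"
    if previous_checkpoint != last_seen:
        return "SubStreamInvalidCheckpoint"
    return None


last = Checkpoint(offset=5, batch_index=0)
print(validate_increment(last, Checkpoint(6, 0), last))              # None
print(validate_increment(last, Checkpoint(6, 0), None))              # SubStreamMissingCheckpoint
print(validate_increment(last, Checkpoint(6, 0), Checkpoint(4, 0)))  # SubStreamInvalidCheckpoint
print(validate_increment(last, Checkpoint(3, 0), Checkpoint(2, 0)))  # SubStreamOutdatedCheckpoint
```

In each failing case the fix is on the sending side: the sender has to continue from the checkpoint that the sub stream status reports as last observed.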

See also