StackState-Consul Integration

Overview

The StackState Agent collects many metrics from Consul nodes, including those for:

  • Total Consul peers

  • Service health - for a given service, how many of its nodes are up, passing, warning, critical?

  • Node health - for a given node, how many of its services are up, passing, warning, critical?

  • Network coordinates - inter- and intra-datacenter latencies

The Consul Agent can provide further metrics via StsStatsD. These metrics describe the internal health of Consul itself rather than the services that depend on it. There are metrics for:

  • Serf events and member flaps
  • The Raft protocol
  • DNS performance

And many more.

Finally, in addition to metrics, the StackState Agent also sends a service check for each of Consul’s health checks, and an event after each new leader election.

Setup

Configuration

Connect StackState Agent to Consul Agent

Create a consul.yaml in the StackState Agent’s conf.d directory:

init_config:

instances:
    # URL of the Consul HTTP server
    # use 'https' if Consul is configured for SSL
    - url: http://localhost:8500
      # client certificate, if Consul is configured for SSL
      # client_cert_file: '/path/to/client.concatenated.pem'

      # submit per-service node status and per-node service status?
      catalog_checks: yes

      # emit leader election events
      self_leader_check: yes

      # submit network latency metrics between nodes and datacenters
      network_latency_checks: yes

See the sample consul.yaml for all available configuration options.

Restart the Agent to start sending Consul metrics to StackState.

Connect Consul Agent to StsStatsD

In the main Consul configuration file, add the address where StsStatsD listens, nested under the top-level telemetry key. Consul's own name for this setting is dogstatsd_addr:

{
  ...
  "telemetry": {
    "dogstatsd_addr": "127.0.0.1:8125"
  },
  ...
}

Reload the Consul Agent to start sending more Consul metrics to StsStatsD.
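
To check that something is listening on the configured address before relying on Consul's own telemetry, a test datagram can be sent by hand. The following is a minimal sketch, assuming StsStatsD accepts the plain StatsD line format (name:value|type) over UDP on 127.0.0.1:8125. The helper names and the metric name example.consul.test are hypothetical.

```python
import socket


def format_gauge(name, value):
    """Encode a gauge in the plain StatsD line format: <name>:<value>|g"""
    return "{}:{}|g".format(name, value).encode("ascii")


def send_gauge(name, value, addr=("127.0.0.1", 8125)):
    """Send one gauge datagram over UDP and return the number of bytes sent.

    UDP is connectionless, so a successful send does not prove that
    anything received the datagram; check the receiving side as well.
    """
    payload = format_gauge(name, value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, addr)
```

Calling send_gauge("example.consul.test", 1) sends a single gauge; whether and where it shows up depends on how StsStatsD is configured.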

Validation

StackState Agent to Consul Agent

Run the Agent’s info subcommand and look for consul under the Checks section:

Checks
======

  [...]

  consul
  ------
      - instance #0 [OK]
      - Collected 8 metrics & 0 events

Also, if your Consul nodes have debug logging enabled, you’ll see the StackState Agent’s regular polling in the Consul log:

    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/leader (59.344µs) from=127.0.0.1:53768
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/status/peers (62.678µs) from=127.0.0.1:53770
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/health/state/any (106.725µs) from=127.0.0.1:53772
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/catalog/services (79.657µs) from=127.0.0.1:53774
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/health/service/consul (153.917µs) from=127.0.0.1:53776
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/coordinate/datacenters (71.778µs) from=127.0.0.1:53778
    2017/03/27 21:38:12 [DEBUG] http: Request GET /v1/coordinate/nodes (84.95µs) from=127.0.0.1:53780
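
If the check reports errors, the same endpoints can be requested by hand to see what Consul returns. This is a rough sketch using only the Python standard library. The paths are copied from the log lines above; the function names are hypothetical, and the code is not the Agent's own implementation.

```python
import json
import urllib.request
from typing import Dict, List

# Paths taken from the debug log; "consul" in the health path is a service name.
POLLED_PATHS = [
    "/v1/status/leader",
    "/v1/status/peers",
    "/v1/health/state/any",
    "/v1/catalog/services",
    "/v1/health/service/consul",
    "/v1/coordinate/datacenters",
    "/v1/coordinate/nodes",
]


def endpoint_urls(base_url: str) -> List[str]:
    """Join the configured Consul URL with each polled path."""
    base = base_url.rstrip("/")
    return [base + path for path in POLLED_PATHS]


def probe(base_url: str, timeout: float = 5.0) -> Dict[str, object]:
    """GET each endpoint and return the decoded JSON body keyed by URL."""
    results = {}
    for url in endpoint_urls(base_url):
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            results[url] = json.loads(resp.read().decode("utf-8"))
    return results
```

For example, probe("http://localhost:8500") uses the url value from the consul.yaml shown earlier and raises an exception on the first endpoint that cannot be reached.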

Consul Agent to StsStatsD

Use netstat to verify that Consul is sending its metrics, too:

$ sudo netstat -nup | grep "127.0.0.1:8125.*ESTABLISHED"
udp        0      0 127.0.0.1:53874         127.0.0.1:8125          ESTABLISHED 23176/consul

Data Collected

Metrics

consul.catalog.nodes_critical (gauge)
    Number of registered nodes with service status `critical`. Shown as node.

consul.catalog.nodes_passing (gauge)
    Number of registered nodes with service status `passing`. Shown as node.

consul.catalog.nodes_up (gauge)
    Number of nodes. Shown as node.

consul.catalog.nodes_warning (gauge)
    Number of registered nodes with service status `warning`. Shown as node.

consul.catalog.services_critical (gauge)
    Total critical services on nodes. Shown as service.

consul.catalog.services_passing (gauge)
    Total passing services on nodes. Shown as service.

consul.catalog.services_up (gauge)
    Total services registered on nodes. Shown as service.

consul.catalog.services_warning (gauge)
    Total warning services on nodes. Shown as service.
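
To make the catalog metrics concrete, here is an illustrative tally in Python. It is a sketch under stated assumptions, not the Agent's actual logic: the records mimic the Node, ServiceName, and Status fields of Consul health check entries, the sample data is made up, and the rule that a node's status for a service is the worst status among its checks is an assumption.

```python
# Higher number means worse status; used to pick the worst check per pair.
SEVERITY = {"passing": 0, "warning": 1, "critical": 2}

# Made-up sample records; an empty ServiceName marks a node-level check.
SAMPLE_CHECKS = [
    {"Node": "node-a", "ServiceName": "web", "Status": "passing"},
    {"Node": "node-a", "ServiceName": "web", "Status": "warning"},
    {"Node": "node-b", "ServiceName": "web", "Status": "critical"},
    {"Node": "node-a", "ServiceName": "db", "Status": "passing"},
    {"Node": "node-a", "ServiceName": "", "Status": "passing"},
]


def worst_status_per_pair(checks):
    """Map (node, service) to the worst status among that pair's checks."""
    worst = {}
    for check in checks:
        service = check.get("ServiceName")
        status = check.get("Status")
        if not service or status not in SEVERITY:
            continue  # skip node-level checks and unknown statuses
        key = (check["Node"], service)
        if key not in worst or SEVERITY[status] > SEVERITY[worst[key]]:
            worst[key] = status
    return worst


def nodes_by_status(checks, service):
    """Count the nodes running `service`, grouped by worst check status."""
    counts = {"up": 0, "passing": 0, "warning": 0, "critical": 0}
    for (_node, svc), status in worst_status_per_pair(checks).items():
        if svc == service:
            counts["up"] += 1
            counts[status] += 1
    return counts


def services_by_status(checks, node):
    """Count the services on `node`, grouped by worst check status."""
    counts = {"up": 0, "passing": 0, "warning": 0, "critical": 0}
    for (n, _svc), status in worst_status_per_pair(checks).items():
        if n == node:
            counts["up"] += 1
            counts[status] += 1
    return counts
```

With the sample data, nodes_by_status(SAMPLE_CHECKS, "web") corresponds to the consul.catalog.nodes_* gauges for the web service, and services_by_status(SAMPLE_CHECKS, "node-a") to the consul.catalog.services_* gauges for node-a.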

See Consul’s Telemetry doc for a description of metrics the Consul Agent sends to StsStatsD.

See Consul’s Network Coordinates doc if you’re curious about how the network latency metrics are calculated.
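
As a rough guide to that calculation, Consul's documentation describes estimating the round-trip time between two nodes from their coordinates: the Euclidean distance between the vectors, plus both heights, plus both adjustments when the adjusted value stays positive. The Python sketch below assumes coordinate dicts with Vec, Height, and Adjustment fields, all in seconds. The function name is hypothetical, and how the Agent turns these estimates into metrics is not shown.

```python
import math


def estimated_rtt_seconds(a, b):
    """Estimate the round-trip time in seconds between two coordinates."""
    if len(a["Vec"]) != len(b["Vec"]):
        raise ValueError("coordinates have different dimensions")
    # Euclidean distance between the two coordinate vectors.
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a["Vec"], b["Vec"])))
    rtt = euclidean + a["Height"] + b["Height"]
    # Apply the adjustment terms only if the result stays positive.
    adjusted = rtt + a["Adjustment"] + b["Adjustment"]
    return adjusted if adjusted > 0.0 else rtt
```

Two nodes whose vectors are 5 ms apart, each with a height of 1 ms and no adjustment, come out at roughly 7 ms.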

Service Checks

consul.check: The StackState Agent submits a service check for each of Consul’s health checks, tagging each with:

  • service:<name>, if Consul reports a ServiceName
  • consul_service_id:<id>, if Consul reports a ServiceID
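
The tagging rule above can be sketched in a few lines of Python. This is an illustration only, assuming a health check record is a dict with optional ServiceName and ServiceID fields; the function name is hypothetical.

```python
def service_check_tags(check):
    """Build the tag list for one Consul health check record."""
    tags = []
    # Node-level checks carry empty ServiceName/ServiceID and get no tags.
    if check.get("ServiceName"):
        tags.append("service:{}".format(check["ServiceName"]))
    if check.get("ServiceID"):
        tags.append("consul_service_id:{}".format(check["ServiceID"]))
    return tags
```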

Events

consul.new_leader: The StackState Agent emits an event when the Consul cluster elects a new leader, tagging it with prev_consul_leader, curr_consul_leader, and consul_datacenter.
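
Leader-change detection of this kind can be sketched as a comparison between the leader seen on the previous poll and the current one. The Python below is a hypothetical illustration: the function name, the event dict layout, and the key:value tag format are assumptions, and only the three tag names come from the description above.

```python
def leader_change_event(prev_leader, curr_leader, datacenter):
    """Return an event dict if the leader changed, otherwise None."""
    if prev_leader is None or prev_leader == curr_leader:
        return None  # first observation, or no change since the last poll
    return {
        "event_type": "consul.new_leader",
        "tags": [
            "prev_consul_leader:{}".format(prev_leader),
            "curr_consul_leader:{}".format(curr_leader),
            "consul_datacenter:{}".format(datacenter),
        ],
    }
```

The caller would store curr_leader after each poll and pass it as prev_leader on the next one.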