StackState-Hadoop HDFS Integration

Overview

Capture NameNode and DataNode HDFS metrics in StackState to:

  • Visualize cluster health, performance, and utilization.
  • Analyze and inspect the utilization of individual nodes.

Configuration

  1. Configure the Agent's NameNode check to connect to the JMX URI: edit conf.d/hdfs_namenode.yaml

    init_config:
    
    instances:
      #
      # The HDFS NameNode check retrieves metrics from the HDFS NameNode's JMX
      # interface. This check must be installed on the NameNode. The HDFS
      # NameNode JMX URI is composed of the NameNode's hostname and port.
      #
      # The hostname and port can be found in the hdfs-site.xml conf file under
      # the property dfs.http.address or dfs.namenode.http-address
      #
      - hdfs_namenode_jmx_uri: http://localhost:50070
    
  2. Configure the Agent's DataNode check to connect to the JMX URI: edit conf.d/hdfs_datanode.yaml

    init_config:
    
    instances:
      #
      # The HDFS DataNode check retrieves metrics from the HDFS DataNode's JMX
      # interface. This check must be installed on an HDFS DataNode. The HDFS
      # DataNode JMX URI is composed of the DataNode's hostname and port.
      #
      # The hostname and port can be found in the hdfs-site.xml conf file under
      # the property dfs.datanode.http.address
      #
      - hdfs_datanode_jmx_uri: http://localhost:50075
    
  3. Restart the Agent
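Before restarting the Agent, it can help to confirm that the NameNode's JMX HTTP interface is reachable from the host running the check. The sketch below queries the FSNamesystem MBean; the base URL matches the default localhost configuration above, and the bean name and attribute names are standard Hadoop JMX names, but verify them against your cluster's /jmx output.

```python
import json
from urllib.request import urlopen

def parse_fsnamesystem(payload):
    """Pull the capacity figures out of a parsed /jmx JSON response."""
    bean = payload["beans"][0]
    return {
        "capacity_total": bean["CapacityTotal"],
        "capacity_used": bean["CapacityUsed"],
        "capacity_remaining": bean["CapacityRemaining"],
    }

def fetch_fsnamesystem(base_url="http://localhost:50070"):
    # The qry parameter restricts the response to a single MBean.
    url = base_url + "/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    with urlopen(url, timeout=5) as resp:
        return parse_fsnamesystem(json.load(resp))
```

A quick `fetch_fsnamesystem()` from the NameNode host should return the capacity figures; an HTTP error here means the check will also fail to connect.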

For more details about configuring this integration, refer to the example configuration files on GitHub.

Validation

Execute the Agent's info command and verify that the integration check passed. The output should contain a section similar to the following:

Checks
======

  [...]
  hdfs_datanode
  -------------
      - instance #0 [OK]
      - Collected 8 metrics & 0 events
  hdfs_namenode
  -------------
      - instance #0 [OK]
      - Collected 8 metrics & 0 events
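If you want to script this validation step, the [OK] markers are easy to check for. A minimal sketch, assuming the output format shown above:

```python
import re

def checks_passed(info_output):
    """True if at least one instance is listed and all report [OK]."""
    statuses = re.findall(r"instance #\d+ \[(\w+)\]", info_output)
    return bool(statuses) and all(s == "OK" for s in statuses)

sample = """\
  hdfs_datanode
  -------------
      - instance #0 [OK]
      - Collected 8 metrics & 0 events
"""
print(checks_passed(sample))  # True
```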

Metrics

The available metrics are collected using df from Spotify's Snakebite. hdfs.in_use is calculated by dividing used by capacity.

You may experience reduced functionality when using Hadoop versions earlier than 2.2.0, since the check must fall back to Snakebite v1.3.9 for those versions. If you are using HA Mode, you may want to upgrade.
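The hdfs.in_use calculation described above is a simple ratio of used to total capacity; a sketch:

```python
def hdfs_in_use(used_bytes, capacity_bytes):
    """hdfs.in_use: used divided by capacity, as a fraction."""
    if capacity_bytes <= 0:
        return 0.0  # avoid division by zero on an empty or unknown capacity
    return used_bytes / capacity_bytes

# e.g. 250 GiB used out of 1 TiB total capacity
print(hdfs_in_use(250 * 1024**3, 1024**4))  # 0.244140625
```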

hdfs.namenode.capacity_total (gauge): Total disk capacity in bytes (shown as byte)
hdfs.namenode.capacity_used (gauge): Disk usage in bytes (shown as byte)
hdfs.namenode.capacity_remaining (gauge): Remaining disk space in bytes (shown as byte)
hdfs.namenode.total_load (gauge): Total load on the file system
hdfs.namenode.fs_lock_queue_length (gauge): Lock queue length
hdfs.namenode.blocks_total (gauge): Total number of blocks (shown as block)
hdfs.namenode.max_objects (gauge): Maximum number of files HDFS supports (shown as object)
hdfs.namenode.files_total (gauge): Total number of files (shown as file)
hdfs.namenode.pending_replication_blocks (gauge): Number of blocks pending replication (shown as block)
hdfs.namenode.under_replicated_blocks (gauge): Number of under-replicated blocks (shown as block)
hdfs.namenode.scheduled_replication_blocks (gauge): Number of blocks scheduled for replication (shown as block)
hdfs.namenode.pending_deletion_blocks (gauge): Number of blocks pending deletion (shown as block)
hdfs.namenode.num_live_data_nodes (gauge): Total number of live DataNodes (shown as node)
hdfs.namenode.num_dead_data_nodes (gauge): Total number of dead DataNodes (shown as node)
hdfs.namenode.num_decom_live_data_nodes (gauge): Number of decommissioning live DataNodes (shown as node)
hdfs.namenode.num_decom_dead_data_nodes (gauge): Number of decommissioning dead DataNodes (shown as node)
hdfs.namenode.volume_failures_total (gauge): Total volume failures
hdfs.namenode.estimated_capacity_lost_total (gauge): Estimated capacity lost in bytes (shown as byte)
hdfs.namenode.num_decommissioning_data_nodes (gauge): Number of decommissioning DataNodes (shown as node)
hdfs.namenode.num_stale_data_nodes (gauge): Number of stale DataNodes (shown as node)
hdfs.namenode.num_stale_storages (gauge): Number of stale storages
hdfs.namenode.missing_blocks (gauge): Number of missing blocks (shown as block)
hdfs.namenode.corrupt_blocks (gauge): Number of corrupt blocks (shown as block)
hdfs.datanode.dfs_remaining (gauge): Remaining disk space in bytes (shown as byte)
hdfs.datanode.dfs_capacity (gauge): Disk capacity in bytes (shown as byte)
hdfs.datanode.dfs_used (gauge): Disk usage in bytes (shown as byte)
hdfs.datanode.cache_capacity (gauge): Cache capacity in bytes (shown as byte)
hdfs.datanode.cache_used (gauge): Cache usage in bytes (shown as byte)
hdfs.datanode.num_failed_volumes (gauge): Number of failed volumes
hdfs.datanode.last_volume_failure_date (gauge): Date/time of the last volume failure, in milliseconds since epoch (shown as millisecond)
hdfs.datanode.estimated_capacity_lost_total (gauge): Estimated capacity lost in bytes (shown as byte)
hdfs.datanode.num_blocks_cached (gauge): Number of blocks cached (shown as block)
hdfs.datanode.num_blocks_failed_to_cache (gauge): Number of blocks that failed to cache (shown as block)
hdfs.datanode.num_blocks_failed_to_uncache (gauge): Number of blocks that failed to be removed from the cache (shown as block)