Stackstate-Hadoop MapReduce Integration

Overview

Capture MapReduce metrics to:

  • Analyze and inspect individual MapReduce jobs and tasks.
  • Visualize performance of individual tasks.

Installation

Install the Stackstate Agent on the Master Node where the ResourceManager is running.

Configuration

  1. Configure the Agent to connect to the ResourceManager by editing conf.d/mapreduce.yaml:

    instances:
      # The MapReduce check retrieves metrics from YARN's ResourceManager. This
      # check must be run from the Master Node and the ResourceManager URI must
      # be specified below. The ResourceManager URI is composed of the
      # ResourceManager's hostname and port.
      # The ResourceManager port can be found in the yarn-site.xml conf file under
      # the property yarn.resourcemanager.webapp.address
      - resourcemanager_uri: http://localhost:8088
    
    init_config:
      general_counters:
        - counter_group_name: 'org.apache.hadoop.mapreduce.TaskCounter'
          counters:
            - counter_name: 'MAP_INPUT_RECORDS'
            - counter_name: 'MAP_OUTPUT_RECORDS'
            - counter_name: 'REDUCE_INPUT_RECORDS'
            - counter_name: 'REDUCE_OUTPUT_RECORDS'
    
        # Additional counters can be specified as follows:
        # - counter_group_name: 'org.apache.hadoop.mapreduce.FileSystemCounter'
        #   counters:
        #     - counter_name: 'HDFS_BYTES_READ'
    
  2. Restart the Agent
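
The configuration comments above say that the ResourceManager address is stored in yarn-site.xml under yarn.resourcemanager.webapp.address. As a minimal sketch of how resourcemanager_uri could be derived from that file, the snippet below parses an inline sample. The helper name resourcemanager_uri, the sample hostname, and the http scheme default are assumptions made for illustration; they are not part of the Agent.

```python
import xml.etree.ElementTree as ET

# Hypothetical yarn-site.xml content; a real file lives in the Hadoop conf directory.
SAMPLE_YARN_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master-node.example.com:8088</value>
  </property>
</configuration>
"""

PROPERTY_NAME = "yarn.resourcemanager.webapp.address"


def resourcemanager_uri(yarn_site_xml, scheme="http"):
    """Build '<scheme>://<host>:<port>' from the webapp address property."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == PROPERTY_NAME:
            address = prop.findtext("value", "").strip()
            if address:
                return f"{scheme}://{address}"
    # The property is absent or empty, so the value cannot be read from this file.
    raise KeyError(f"{PROPERTY_NAME} is not set")


print(resourcemanager_uri(SAMPLE_YARN_SITE))
```

The resulting string is what would go into the resourcemanager_uri entry of conf.d/mapreduce.yaml. If the property is missing from the file, the sketch raises an error instead of guessing a value.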

For more details about configuring this integration refer to the following file(s) on GitHub:

Validation

Execute the info command and verify that the integration check has passed. The output of the command should contain a section similar to the following:

Checks
======

  [...]

  mapreduce
  ---------
      - instance #0 [OK]
      - Collected 8 metrics & 0 events
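
This verification can also be scripted. The following is a rough sketch that scans text laid out like the sample above and reports the instance statuses and metric count for one check. It assumes the info output keeps exactly this layout (check name, dashed underline, then "- instance" and "- Collected" lines); check_status is a hypothetical helper, not an Agent command.

```python
import re

# Sample copied from the expected output above; real output lists more checks.
SAMPLE_INFO_OUTPUT = """\
Checks
======

  mapreduce
  ---------
      - instance #0 [OK]
      - Collected 8 metrics & 0 events
"""


def check_status(info_output, check_name):
    """Return (instance_statuses, metric_count) for one check section."""
    lines = info_output.splitlines()
    statuses, metric_count = [], None
    in_check = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # A check section starts with its name followed by a dashed underline.
        if i + 1 < len(lines) and stripped and set(lines[i + 1].strip()) == {"-"}:
            in_check = stripped == check_name
            continue
        if not in_check:
            continue
        match = re.match(r"- instance #\d+ \[(\w+)\]", stripped)
        if match:
            statuses.append(match.group(1))
        match = re.match(r"- Collected (\d+) metrics?", stripped)
        if match:
            metric_count = int(match.group(1))
    return statuses, metric_count


statuses, metric_count = check_status(SAMPLE_INFO_OUTPUT, "mapreduce")
print(statuses, metric_count)
```

A check that does not appear in the output yields an empty status list and a metric count of None, which a wrapper script could treat as a failure.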

Metrics

The MapReduce check retrieves the following metrics from YARN's ResourceManager.

mapreduce.job.elapsed_time.max (gauge)
  Max elapsed time since the application started.
  Shown as millisecond.

mapreduce.job.elapsed_time.avg (gauge)
  Average elapsed time since the application started.
  Shown as millisecond.

mapreduce.job.elapsed_time.median (gauge)
  Median elapsed time since the application started.
  Shown as millisecond.

mapreduce.job.elapsed_time.95percentile (gauge)
  95th percentile elapsed time since the application started.
  Shown as millisecond.

mapreduce.job.elapsed_time.count (rate)
  Number of times the elapsed time was sampled.

mapreduce.job.maps_total (rate)
  Total number of maps.
  Shown as task/second.

mapreduce.job.maps_completed (rate)
  Number of completed maps.
  Shown as task/second.

mapreduce.job.reduces_total (rate)
  Total number of reduces.
  Shown as task/second.

mapreduce.job.reduces_completed (rate)
  Number of completed reduces.
  Shown as task/second.

mapreduce.job.maps_pending (rate)
  Number of pending maps.
  Shown as task/second.

mapreduce.job.maps_running (rate)
  Number of running maps.
  Shown as task/second.

mapreduce.job.reduces_pending (rate)
  Number of pending reduces.
  Shown as task/second.

mapreduce.job.reduces_running (rate)
  Number of running reduces.
  Shown as task/second.

mapreduce.job.new_reduce_attempts (rate)
  Number of new reduce attempts.
  Shown as task/second.

mapreduce.job.running_reduce_attempts (rate)
  Number of running reduce attempts.
  Shown as task/second.

mapreduce.job.failed_reduce_attempts (rate)
  Number of failed reduce attempts.
  Shown as task/second.

mapreduce.job.killed_reduce_attempts (rate)
  Number of killed reduce attempts.
  Shown as task/second.

mapreduce.job.successful_reduce_attempts (rate)
  Number of successful reduce attempts.
  Shown as task/second.

mapreduce.job.new_map_attempts (rate)
  Number of new map attempts.
  Shown as task/second.

mapreduce.job.running_map_attempts (rate)
  Number of running map attempts.
  Shown as task/second.

mapreduce.job.failed_map_attempts (rate)
  Number of failed map attempts.
  Shown as task/second.

mapreduce.job.killed_map_attempts (rate)
  Number of killed map attempts.
  Shown as task/second.

mapreduce.job.successful_map_attempts (rate)
  Number of successful map attempts.
  Shown as task/second.

mapreduce.job.counter.reduce_counter_value (rate)
  Counter value of reduce tasks.
  Shown as task/second.

mapreduce.job.counter.map_counter_value (rate)
  Counter value of map tasks.
  Shown as task/second.

mapreduce.job.counter.total_counter_value (rate)
  Counter value of all tasks.
  Shown as task/second.

mapreduce.job.map.task.elapsed_time.max (gauge)
  Max elapsed time of all map tasks.
  Shown as millisecond.

mapreduce.job.map.task.elapsed_time.avg (gauge)
  Average elapsed time of all map tasks.
  Shown as millisecond.

mapreduce.job.map.task.elapsed_time.median (gauge)
  Median elapsed time of all map tasks.
  Shown as millisecond.

mapreduce.job.map.task.elapsed_time.95percentile (gauge)
  95th percentile elapsed time of all map tasks.
  Shown as millisecond.

mapreduce.job.map.task.elapsed_time.count (rate)
  Number of times the map task elapsed time was sampled.

mapreduce.job.reduce.task.elapsed_time.max (gauge)
  Max elapsed time of all reduce tasks.
  Shown as millisecond.

mapreduce.job.reduce.task.elapsed_time.avg (gauge)
  Average elapsed time of all reduce tasks.
  Shown as millisecond.

mapreduce.job.reduce.task.elapsed_time.median (gauge)
  Median elapsed time of all reduce tasks.
  Shown as millisecond.

mapreduce.job.reduce.task.elapsed_time.95percentile (gauge)
  95th percentile elapsed time of all reduce tasks.
  Shown as millisecond.

mapreduce.job.reduce.task.elapsed_time.count (rate)
  Number of times the reduce task elapsed time was sampled.
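
The elapsed_time metrics above are each reported with the suffixes .max, .avg, .median, .95percentile, and .count. To illustrate how one set of samples could produce those five values, here is a small sketch. It uses a nearest-rank percentile as an assumption; the Agent's exact aggregation method is not described in this document and may differ. The summarize helper and the sample values are hypothetical.

```python
from statistics import mean, median


def summarize(samples_ms):
    """Summarize elapsed-time samples (milliseconds) into the five suffixes."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    count = len(ordered)
    # Nearest-rank 95th percentile: ceil(0.95 * count), done in integer math.
    rank = (95 * count + 99) // 100
    return {
        "max": ordered[-1],
        "avg": mean(ordered),
        "median": median(ordered),
        "95percentile": ordered[rank - 1],
        "count": count,
    }


# Hypothetical elapsed times, in milliseconds, for five map tasks of one job.
summary = summarize([1200, 800, 950, 4000, 1100])
for suffix, value in summary.items():
    print(f"mapreduce.job.map.task.elapsed_time.{suffix} = {value}")
```

With only five samples the 95th percentile coincides with the maximum, while the median stays close to the typical task. A large gap between the two is one way to spot a single slow task.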