This wiki is obsolete; see the NorduGrid web pages for up-to-date information.

Gangliarc

Gangliarc provides a way to monitor various properties of an ARC server installation through an existing ganglia installation. It uses gmetric to dynamically add ARC-related metrics to the ganglia feed from the ARC host. The code can be found in subversion.
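As a rough illustration of how a metric can be handed to gmetric, here is a minimal Python sketch. It is not taken from the gangliarc source: the function names build_gmetric_command and publish are hypothetical, and the only gmetric options assumed are --name, --value, --type and --units.

```python
import subprocess


def build_gmetric_command(name, value, metric_type="int32", units="", gmetric="gmetric"):
    """Return the argument list for publishing one metric with gmetric.

    Hypothetical helper for illustration; not part of gangliarc itself.
    """
    cmd = [gmetric, "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        cmd += ["--units", units]
    return cmd


def publish(name, value, **kwargs):
    """Run gmetric for one metric; raises CalledProcessError if gmetric fails."""
    subprocess.check_call(build_gmetric_command(name, value, **kwargs))


if __name__ == "__main__":
    # Only print the command here, so the sketch runs on hosts without ganglia.
    print(" ".join(build_gmetric_command("ARC_JOBS_FAILED_PER_100", 3, units="jobs")))
```

Calling publish("ARC_JOBS_FAILED_PER_100", 3, units="jobs") on a host running gmond would then add the value to that host's ganglia feed.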

Installation

A package nordugrid-arc-gangliarc is provided for the operating systems normally supported by ARC. To start the service, run /etc/init.d/gangliarc start

Gangliarc can also be installed from source. First check out the code from subversion:

svn co http://svn.nordugrid.org/repos/nordugrid/contrib/gangliarc/trunk gangliarc
cd gangliarc

Follow the instructions in README. For those who don't want to read instructions and just want to get it going, run (as root):

python setup.py install
/etc/init.d/gangliarc start

Look in /var/log/arc/gangliarc.log to check that everything is running OK, then watch the graphs appear in ganglia.

Note that the latest svn version is not guaranteed to work or be free from bugs.

Metrics

It is possible to configure which metrics are used - see the README for details.

Currently Available

  • Last modification time of the A-REX heartbeat
  • Free space in each cache and total free cache space
  • Free space in each session directory and total session directory free space
  • Number of files in various states of the new data staging framework, e.g. transferring, waiting on cache etc.
  • Number of downloaders/uploaders, if old data staging is used
  • Number of current running jobs (jobs between PREPARING and FINISHING states)
  • Number of failed jobs among last 100 finished
  • Number of jobs in each A-REX state (ACCEPTING, PREPARING etc.)
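The free-space metrics in the list above could be gathered along the following lines. This is a sketch under assumptions, not the gangliarc implementation: the function name free_space is hypothetical, and the list of cache or session directories is assumed to be already known (for example from the ARC configuration).

```python
import shutil


def free_space(directories):
    """Return free bytes per directory and their sum.

    Caveat: directories on the same file system each report that file
    system's free space, so the total counts shared space more than once.
    """
    per_directory = {path: shutil.disk_usage(path).free for path in directories}
    return {"per_directory": per_directory, "total": sum(per_directory.values())}


if __name__ == "__main__":
    # "/" is only a placeholder; real use would pass cache or session directories.
    print(free_space(["/"]))
```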

Example

The screenshot below shows several gangliarc metrics:

  • (top left) The time since the A-REX heartbeat was modified. This graph shows the time since the last modification of the gm-heartbeat file in the control directory. In normal operation this value should be low (under 2 minutes), as it is here. If A-REX is stopped or stuck, this metric gradually increases.
  • (top right) The free space in the cache file system. Here the cache space decreases in stages until around 12:35, when it increases because the cache cleaner has run. Note that since ganglia does not allow the slash character in metric names, it is replaced by an underscore, so this graph shows free space in /var/arc/cache.
  • (bottom left) The number of failed jobs in the last 100. This graph shows how many of the last 100 finished jobs failed.
  • (bottom right) The number of processing jobs. This includes all jobs from ACCEPTING to FINISHING, i.e. the jobs active inside A-REX. At the start of the graph 100 jobs are running and they gradually finish; at 12:15 another batch of jobs is submitted.
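Two of the points above lend themselves to a short sketch: deriving the heartbeat age from the modification time of gm-heartbeat, and replacing slashes when building a metric name from a path. The code below is an illustration only. The function names and the ARC_CACHE_FREE prefix are hypothetical, and the real metric names used by gangliarc may differ.

```python
import os
import time


def heartbeat_age(control_dir, filename="gm-heartbeat", now=None):
    """Seconds since the heartbeat file in control_dir was last modified."""
    if now is None:
        now = time.time()
    mtime = os.path.getmtime(os.path.join(control_dir, filename))
    # Clamp at zero in case of small clock differences.
    return max(0, int(now - mtime))


def ganglia_safe_name(prefix, path):
    """Build a metric name from a path; ganglia does not allow '/' in names."""
    return prefix + path.replace("/", "_")


if __name__ == "__main__":
    # "ARC_CACHE_FREE" is an assumed prefix chosen for this example.
    print(ganglia_safe_name("ARC_CACHE_FREE", "/var/arc/cache"))
```

An age that keeps growing past a couple of minutes is the signal described for the top-left graph: A-REX has probably stopped or is stuck.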

Gangliarc-Screenshot-2.png

Integration with Nagios

Combining Nagios with Gangliarc metrics has been tried experimentally. The following screenshot shows monitoring of the number of failed jobs out of the previous 100 and the time since the last A-REX heartbeat (in WARNING state since it is greater than 120s, probably meaning A-REX has stopped):

Screenshot-nagios.png

It is fairly simple to add these to a standard Nagios setup. It only involves creating a new configuration file that defines the metrics to monitor:

define servicegroup {
 servicegroup_name ganglia-metrics
 alias Ganglia Metrics
}

define command {
 command_name check_ganglia
 command_line /etc/nagios/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$
}

define service {
 use local-service
 name ganglia-service
 hostgroup_name test
 servicegroups ganglia-metrics
 notifications_enabled 0
 register 0
}

define service {
 use ganglia-service
 service_description ARC Heartbeat
 check_command check_ganglia!ARC_AREX_HEARTBEAT_LAST_SEEN!120!300
}

define service {
 use ganglia-service
 service_description ARC Failed Jobs
 check_command check_ganglia!ARC_JOBS_FAILED_PER_100!50!80
}

Just change hostgroup_name to the group of hosts exposing Gangliarc data.

check_ganglia.py comes with ganglia version 3.1 and later. It can also be downloaded from online sources and works fine with a 3.0 installation.
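To make the meaning of the warning and critical arguments concrete, the following sketch maps a metric value to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not the code of check_ganglia.py. The classify function is hypothetical, and the strict "greater than" comparison is an assumption based on the WARNING state described for the heartbeat above.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warning, critical):
    """Map a metric value to a Nagios state using two thresholds."""
    if value is None:
        # The metric could not be read from ganglia.
        return UNKNOWN
    if value > critical:
        return CRITICAL
    if value > warning:
        return WARNING
    return OK


if __name__ == "__main__":
    # Thresholds from the ARC Heartbeat service definition: 120 and 300.
    for age in (60, 150, 400, None):
        print(age, classify(age, 120, 300))
```

With the heartbeat thresholds of 120 and 300, an age of 150 seconds yields WARNING and an age of 400 seconds yields CRITICAL.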