This wiki is obsolete; see the NorduGrid web pages for up-to-date information.
Gangliarc
Gangliarc provides a way to monitor various properties of an ARC server installation through an existing ganglia installation. It uses gmetric to dynamically add ARC-related metrics to the ganglia feed from the ARC host. The code can be found in subversion.
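As a rough illustration of how a metric can be pushed through gmetric, the sketch below builds a gmetric command line for one value. The helper name build_gmetric_command and the "jobs" unit are assumptions made for this example, and this is not necessarily how gangliarc itself invokes gmetric. The long options --name, --value, --type and --units are standard gmetric options.

```python
def build_gmetric_command(name, value, metric_type="int32", units=""):
    """Return the argument list for a gmetric call that publishes one metric.

    Hypothetical helper for illustration; not taken from the gangliarc source.
    """
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        cmd += ["--units", units]
    return cmd


# Build (but do not run) the command for a failed-jobs metric.
cmd = build_gmetric_command("ARC_JOBS_FAILED_PER_100", 3, units="jobs")
print(" ".join(cmd))
```

On a host running gmond, the returned list could be passed to subprocess.run(cmd, check=True) to actually publish the value.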
Installation
A package, nordugrid-arc-gangliarc, is provided for the usual operating systems supported by ARC. To start it, run /etc/init.d/gangliarc start
Gangliarc can also be installed from source: check out the code from subversion
svn co http://svn.nordugrid.org/repos/nordugrid/contrib/gangliarc/trunk gangliarc
cd gangliarc
Follow the instructions in README. For those who don't want to read instructions and just want to get it going, run (as root)
python setup.py install
/etc/init.d/gangliarc start
Look in /var/log/arc/gangliarc.log to check everything is running ok and watch the graphs appear on ganglia.
Note that the latest svn version is not guaranteed to work or be free from bugs.
Metrics
It is possible to configure which metrics are used - see the README for details.
Currently Available
- Last modification time of the A-REX heartbeat
- Free space in each cache and total free cache space
- Free space in each session directory and total session directory free space
- Number of files in various states of the new data staging framework, e.g. transferring, waiting on cache, etc.
- Number of downloaders/uploaders, if old data staging is used
- Number of current running jobs (jobs between PREPARING and FINISHING states)
- Number of failed jobs among last 100 finished
- Number of jobs in each A-REX state (ACCEPTING, PREPARING etc.)
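The free-space metrics in the list above could be gathered along the following lines. This is a minimal sketch and not gangliarc's actual code: the function name free_space_gib is hypothetical, and the temporary directory is only a stand-in for real cache or session directories. The simple sum assumes each directory sits on its own file system; directories sharing a file system would be counted twice.

```python
import shutil
import tempfile


def free_space_gib(directories):
    """Return (per-directory free space, total free space), both in GiB.

    Hypothetical helper for illustration only.
    """
    per_dir = {d: shutil.disk_usage(d).free / 1024 ** 3 for d in directories}
    return per_dir, sum(per_dir.values())


# The system temporary directory stands in for a cache or session directory.
per_dir, total = free_space_gib([tempfile.gettempdir()])
for directory, free in per_dir.items():
    print(f"{directory}: {free:.1f} GiB free")
print(f"total: {total:.1f} GiB free")
```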
Example
The screenshot below shows several gangliarc metrics:
- (top left) The time since the A-REX heartbeat was modified. This graph shows the time since the last modification time of the gm-heartbeat file in the control directory. In normal operation this value should be low (under 2 minutes) like it is here. If A-REX is stopped or stuck then this metric will gradually increase in value.
- (top right) The free space in the cache file system. Here we can see that the cache space decreased in stages until around 12:35, when it increased due to the cache cleaner. Note that since ganglia does not allow the slash character in metric names, each slash is replaced by an underscore, so this graph shows free space in /var/arc/cache.
- (bottom left) The number of failed jobs in the last 100. This graph shows how many of the last 100 finished jobs failed.
- (bottom right) The number of processing jobs. This includes all jobs from ACCEPTING to FINISHING, i.e. the jobs active inside A-REX. At the start of the graph there are 100 jobs running, which gradually finish; at 12:15 another batch of jobs was submitted.
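The quantities behind three of these graphs are simple to compute. The sketch below shows one possible way, under stated assumptions: the function names are hypothetical, a temporary file stands in for the gm-heartbeat file, and the list of booleans is an invented representation of finished jobs. It is not taken from the gangliarc source.

```python
import os
import tempfile
import time


def heartbeat_age_seconds(heartbeat_path, now=None):
    """Seconds elapsed since the heartbeat file was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(heartbeat_path)


def sanitize_metric_name(path):
    """Replace slashes, which ganglia rejects in metric names, with underscores."""
    return path.replace("/", "_")


def failed_in_last(job_failed_flags, window=100):
    """Count failures among the most recent `window` finished jobs.

    `job_failed_flags` is ordered oldest to newest; True marks a failed job.
    """
    return sum(1 for failed in job_failed_flags[-window:] if failed)


# A temporary file stands in for gm-heartbeat in the control directory.
with tempfile.NamedTemporaryFile() as heartbeat:
    age = heartbeat_age_seconds(heartbeat.name)
    print("heartbeat age in seconds:", round(age, 1))

print(sanitize_metric_name("/var/arc/cache"))  # prints "_var_arc_cache"
print(failed_in_last([False] * 97 + [True] * 3))  # prints 3
```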
Integration with Nagios
Some experimental combination of Nagios and Gangliarc metrics has been tried. The following screenshot shows monitoring of the number of failed jobs out of the previous 100 and the time since the last A-REX heartbeat (in WARNING state since it is greater than 120s, probably meaning A-REX has stopped):
It is fairly simple to add these to a standard Nagios setup. It just involves creating a new configuration file to define the metrics to monitor:
define servicegroup {
    servicegroup_name   ganglia-metrics
    alias               Ganglia Metrics
}

define command {
    command_name        check_ganglia
    command_line        /etc/nagios/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$
}

define service {
    use                     local-service
    name                    ganglia-service
    hostgroup_name          test
    servicegroups           ganglia-metrics
    notifications_enabled   0
    register                0
}

define service {
    use                     ganglia-service
    service_description     ARC Heartbeat
    check_command           check_ganglia!ARC_AREX_HEARTBEAT_LAST_SEEN!120!300
}

define service {
    use                     ganglia-service
    service_description     ARC Failed Jobs
    check_command           check_ganglia!ARC_JOBS_FAILED_PER_100!50!80
}
Just change hostgroup_name to the group of hosts exposing Gangliarc data.
check_ganglia.py comes with ganglia version 3.1 and later, but it can also be downloaded from online sources and works fine with a 3.0 installation.
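To make the warning and critical thresholds in the service definitions concrete, the sketch below maps a metric value to a Nagios state using the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The classify function is hypothetical, and the strict "greater than" comparison is an assumption. It illustrates the idea only and is not the actual logic of check_ganglia.py.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(value, warning, critical):
    """Return the Nagios state for a metric where higher values are worse.

    Assumes strict comparison: a value equal to a threshold does not trip it.
    """
    if value > critical:
        return CRITICAL
    if value > warning:
        return WARNING
    return OK


# Heartbeat last seen 150 s ago, with the 120/300 thresholds from the config.
print(STATUS_NAMES[classify(150, 120, 300)])  # prints "WARNING"
```

With the ARC Failed Jobs thresholds of 50 and 80, a value of 10 would map to OK under the same assumptions.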