This wiki is obsolete; see the NorduGrid web pages for up-to-date information.

Service Monitoring

EGI uses a Nagios-based system called SAM (https://tomtools.cern.ch/confluence/display/SAMDOC/SAM+Overview) to remotely monitor services. Several ARC-related probes have been used in production to monitor ARC services. As a result of the EMI project, the next-generation ARC Nagios probes (http://git.nbi.ku.dk/downloads/NorduGridARCNagiosPlugins/) were developed. Update 22 of the SAM framework (https://tomtools.cern.ch/confluence/display/SAMDOC/Update-22) is the first SAM release that comes with the new ARC probes integrated. As of this SAM version, the new probes are the official ARC probes.

Integrating an ARC site with EGI monitoring means that the ARC deployment successfully passes all the ARC Nagios probes executed remotely by the EGI SAM monitoring framework. Below is a short summary of how to enable successful EGI monitoring of ARC services.

The SAM framework

The Service Availability Monitoring (SAM) Framework is a distributed monitoring system developed by EGI, partly based on commodity components like Nagios and ActiveMQ. As part of this infrastructure, NGIs run SAM-Nagios installations, which run an automatically configured Nagios, MySQL, Apache, and various EGI components. SAM-Nagios instances determine what to probe based on data from GOCDB, BDII, and other central sources. They probe the CEs using both direct queries of the service and job submissions, which may include scripts for further testing on the worker nodes. The results are published to central systems for availability computations. See SAM Overview for a few summary pages and graphs.

The installation guide for SAM Nagios can be found at https://tomtools.cern.ch/confluence/display/SAMDOC/SAM-Nagios+Administrator+Guide, though you normally don't need that as a site admin.

SAM version (or update) 22 is the first version that supports the next generation ARC probes. Please make sure you have the right SAM version.

NeIC (NDGF) runs its own SAM deployment and the Kosice TestBed will also have a SAM framework deployed locally.

ARC Nagios probes

The ARC Nagios probes are part of the official ARC releases and are distributed as a separate package. The probes should NOT be installed on the ARC CE; instead, they are typically deployed on the SAM server.

The Nagios plugins are mainly used to remotely monitor NorduGrid ARC Compute Elements and related services (e.g. information system components, storage services), but some probes should also be usable to test non-ARC resources. The package includes the following probes:

  • check_arcce_submit - Submit a job to a CE to perform configured tests.
  • check_arcce_monitor - Monitor jobs, fetch finished ones, and publish results to passive services.
  • check_arcce_clean - Run occasionally (e.g. once a day) to tidy the job list.
  • check_aris - Test the integrity of the ARIS data and optionally check that parameters are within acceptable ranges.
  • check_egiis - Test the integrity of EGIIS data.
  • check_arcglue2 - Check the integrity of GLUE2 over LDAP.
  • check_arcservice - Check CE health using GLUE2 over SOAP.
  • check_gridstorage - Perform read, list, write, read-back-and-compare checks on SEs using one of the ARC-supported storage protocols.

For a detailed description, please consult the official documentation: http://git.nbi.ku.dk/downloads/NorduGridARCNagiosPlugins

Configuration and deployment of the ARC probes

In the current production SAM service running at (), the ARC probes are deployed as follows. The services below are set up for each CE. The main entries are active services with the probe in parentheses; the sub-entries are the corresponding passive services with their options in parentheses.

  • org.nordugrid.ARC-CE-ARIS (check_arcce_aris)
  • org.nordugrid.ARC-CE-LFC-submit (check_arcce_submit)
    • org.nordugrid.ARC-CE-LFC-result
    • org.nordugrid.ARC-CE-lfc (--test dist-stage-lfc)
  • org.nordugrid.ARC-CE-SRM-submit (check_arcce_submit)
    • org.nordugrid.ARC-CE-SRM-result
    • org.nordugrid.ARC-CE-srm (--test dist-stage-srm)
  • org.nordugrid.ARC-CE-submit (check_arcce_submit)
    • org.nordugrid.ARC-CE-result
    • org.nordugrid.ARC-CE-IGTF (--test dist-caversion)
    • org.nordugrid.ARC-CE-sw-csh (--test dist-sw-csh)
    • org.nordugrid.ARC-CE-sw-gcc (--test dist-sw-gcc)
    • org.nordugrid.ARC-CE-sw-perl (--test dist-sw-perl)
    • org.nordugrid.ARC-CE-sw-python (--test dist-sw-python)

The actual services will have an -ops or other suffix corresponding to the VO.
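As a small illustration of the naming rule, the sketch below appends a VO name to a base service name. The helper vo_service_name is hypothetical and not part of the probes. It assumes the suffix is simply the VO name joined with a hyphen, which matches the -ops names seen in the sample job script but may not hold for every VO.

```shell
#!/bin/sh
# Hypothetical helper (not part of the ARC Nagios plugins):
# build "<base service name>-<VO>" as an illustration of the suffix rule.
# Assumption: the suffix is the plain VO name.
vo_service_name() {
    printf '%s-%s\n' "$1" "$2"
}
```

For example, `vo_service_name org.nordugrid.ARC-CE-IGTF ops` yields the name used for the IGTF test in the ops VO.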

Sample Job Description and Script

The job description generated for the org.nordugrid.ARC-CE-submit service looks like:

 <?xml version="1.0" encoding="UTF-8"?>
 <JobDefinition xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl"
 	       xmlns:jsdl-arc="http://www.nordugrid.org/ws/schemas/jsdl-arc">
   <JobDescription>
     <JobIdentification>
       <JobName>org.nordugrid.ARC-CE-result-ops</JobName>
     </JobIdentification>
     <Application>
       <ApplicationName>ARCCE-probe</ApplicationName>
       <POSIXApplication xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
         <Executable>job.sh</Executable>
         <Output>stdout.txt</Output>
         <Error>stderr.txt</Error>
         <WallTimeLimit>600</WallTimeLimit>
         <MemoryLimit>536870912</MemoryLimit>
       </POSIXApplication>
     </Application>
     <DataStaging>
       <FileName>job.sh</FileName>
       <Source><URI>file:/var/spool/nagios/plugins/arcce/ops/gateway03.dcsc.ku.dk/job.sh</URI></Source>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>arcce_igtf.py</FileName>
       <Source><URI>file:/usr/share/nordugrid-arc-nagios-plugins/jobscripts/arcce_igtf.py</URI></Source>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>ca-policy-egi-core.release</FileName>
       <Source><URI>http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.release</URI><URIOption>cache=no</URIOption></Source>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>ca-policy-egi-core.list</FileName>
       <Source><URI>http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.list</URI><URIOption>cache=no</URIOption></Source>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>ca-policy-egi-core.obsoleted</FileName>
       <Source><URI>http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.obsoleted</URI><URIOption>cache=no</URIOption></Source>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>caversion.out</FileName>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>csh.out</FileName>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>gcc.out</FileName>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>python.out</FileName>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>perl.out</FileName>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <Resources>
       <jsdl-arc:RunTimeEnvironment>
         <jsdl-arc:Software>
           <jsdl-arc:Name>ENV/PROXY</jsdl-arc:Name>
         </jsdl-arc:Software>
       </jsdl-arc:RunTimeEnvironment>
     </Resources>
   </JobDescription>
 </JobDefinition>

where the shell script is

 #! /bin/sh
 
 echo "Job started `date -Is`."
 # Scripted test org.nordugrid.ARC-CE-IGTF-ops
 python arcce_igtf.py >caversion.out
 err=$?
 [ $err -eq 0 ] || echo "__exit $err" >>caversion.out
 
 # Scripted test org.nordugrid.ARC-CE-sw-csh-ops
 echo >csh-test.csh '#! /bin/csh'; echo >>csh-test.csh 'env >csh.out';   chmod +x csh-test.csh; ./csh-test.csh
 err=$?
 [ $err -eq 0 ] || echo "__exit $err" >>csh.out
 
 # Scripted test org.nordugrid.ARC-CE-sw-gcc-ops
 gcc -v >gcc.out 2>&1
 err=$?
 [ $err -eq 0 ] || echo "__exit $err" >>gcc.out
 
 # Scripted test org.nordugrid.ARC-CE-sw-python-ops
 python -V >python.out 2>&1
 err=$?
 [ $err -eq 0 ] || echo "__exit $err" >>python.out
 
 # Scripted test org.nordugrid.ARC-CE-sw-perl-ops
 perl -v >perl.out 2>&1
 err=$?
 [ $err -eq 0 ] || echo "__exit $err" >>perl.out
 
 echo "Present files before termination:"
 ls -l
 echo "Job finished `date -Is`, status = $status."
 exit $status

The commands involved are standard except for arcce_igtf.py, which comes from the nordugrid-arc-nagios-plugins package. The __exit strings and the *.out files are parsed by the org.nordugrid.ARC-CE-monitor service and submitted back to the passive services shown above.
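To make the role of the __exit marker concrete, here is a minimal sketch of how one *.out file could be classified after the job has been fetched. It is not the logic of check_arcce_monitor, which may inspect the file contents in more detail. The function name classify_output and the mapping to OK, CRITICAL and UNKNOWN are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only, not the real monitor probe.
# Classify one *.out file written by the job script:
#   "UNKNOWN"          the file is missing
#   "CRITICAL <code>"  the file carries an "__exit <code>" marker
#   "OK"               otherwise
classify_output() {
    out_file=$1
    if [ ! -f "$out_file" ]; then
        echo "UNKNOWN"
        return 0
    fi
    # The job script appends "__exit <code>" only when the test command
    # returned a non-zero exit status; take the last such marker.
    code=$(sed -n 's/^__exit \([0-9][0-9]*\)$/\1/p' "$out_file" | tail -n 1)
    if [ -n "$code" ]; then
        echo "CRITICAL $code"
    else
        echo "OK"
    fi
}
```

Calling `classify_output gcc.out` in the directory of a fetched job would then report whether the gcc test left a failure marker.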

The job description for the LFC test looks like:

 <?xml version="1.0" encoding="UTF-8"?>
 <JobDefinition xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl"
 	       xmlns:jsdl-arc="http://www.nordugrid.org/ws/schemas/jsdl-arc">
   <JobDescription>
     <JobIdentification>
       <JobName>org.nordugrid.ARC-CE-LFC-result-ops</JobName>
     </JobIdentification>
     <Application>
       <ApplicationName>ARCCE-probe</ApplicationName>
       <POSIXApplication xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
         <Executable>job.sh</Executable>
         <Output>stdout.txt</Output>
         <Error>stderr.txt</Error>
         <WallTimeLimit>600</WallTimeLimit>
         <MemoryLimit>536870912</MemoryLimit>
       </POSIXApplication>
     </Application>
     <DataStaging>
       <FileName>job.sh</FileName>
       <Source><URI>file:/var/spool/nagios/plugins/arcce/ops/ce00.example.org#dist-stage-lfc/job.sh</URI></Source>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>dist-stage-lfc-in-0</FileName>
       <Source><URI>lfc://prod-lfc-shared-central.cern.ch/grid/ops/nagios-sam.example.org/arcce/lfc-input</URI></Source>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
     <DataStaging>
       <FileName>dist-stage-lfc-out-0</FileName>
       <Target><URI>lfc://srm://srm.example.org/ops/nagios-ngi-sam.example.org/arcce/lfc-20130726T1211-ce00.example.org@prod-lfc-shared-central.cern.ch/grid/ops/nagios-sam.example.org/arcce/lfc-20130726T1211-ce00.example.org</URI></Target>
       <DeleteOnTermination>false</DeleteOnTermination>
       <CreationFlag>overwrite</CreationFlag>
     </DataStaging>
   </JobDescription>
 </JobDefinition>

Testing

In the near future, the Kosice test team will provide a public test SAM framework deployed together with the latest ARC probes. Instructions on how to request that an ARC service be included in the monitoring will be provided here. The foreseen procedure will be very lightweight and look similar to the following:

  1. Send an email to contact@testbed.kosice with the list of service endpoints to be monitored. For each endpoint specify ???
  2. Obtain the certificate, CA file ???
  3. Authorize the DN under which the probes are executed ???
  4. Check the results at the public Nagios website: ???


Site configuration for EGI production SAM

Since the ARC Nagios probes are executed remotely, very little configuration is needed on the site hosting the monitored services:

  • Ensure all the necessary certificates are installed and up to date. The meta-package for the IGTF CA bundles is called ca-policy-egi-core on RHEL and similar systems.
  • Authorize the ops VO to submit jobs on the CEs. This allows the Nagios box to submit jobs to your CE. The VO attributes of the proxy certificate used by Nagios look something like the following:
    === VO ops extension information ===
    VO        : ops
    subject   : <nagios-operator-dn>
    issuer    : /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch
    attribute : /ops/Role=NULL/Capability=NULL
    attribute : /ops/NGI/Role=NULL/Capability=NULL
    attribute : /ops/NGI/<ngi-name>/Role=NULL/Capability=NULL
    timeleft  : 10:32:19
    uri       : lcg-voms.cern.ch:15009
    

    To authorize ops, you can use something like

    [vo]
    id="vo_ops"
    vo="ops"
    file="/etc/grid-security/grid-mapfile"
    source="vomss://voms.cern.ch:8443/voms/ops?/ops"
    mapped_unixid="ops-user"
    

    though it may be necessary to adjust the details to your local setup. See also How_to_plug_an_ARC_site_into_EGI/NDGF#Monitoring_.28SAM_.2F_Nagios.29. The VOMS string, if you need it, is

    "ops" "lcg-voms.cern.ch" "15009" "/DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch" "ops"
    
  • Install the ENV/PROXY runtime environment on the CEs. This is used to check the IGTF CA certificates. Here is the reference implementation, which should work for most sites:
    #!/bin/bash
     
    x509_cert_dir="/etc/grid-security/certificates"
     
    case $1 in
      0) mkdir -pv $joboption_directory/arc/certificates/
         cp -rv $x509_cert_dir/ $joboption_directory/arc
         cat ${joboption_controldir}/job.${joboption_gridid}.proxy >$joboption_directory/user.proxy
         ;;
      1) export X509_USER_PROXY=$RUNTIME_JOB_DIR/user.proxy
         export X509_USER_CERT=$RUNTIME_JOB_DIR/user.proxy
         export X509_CERT_DIR=$RUNTIME_JOB_DIR/arc/certificates
         ;;
      2) :
         ;;
    esac
    

To install the runtime environment, place the script into a file PROXY in a subdirectory ENV under the directory specified as runtimedir in arc.conf.
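The installation step can be sketched as a small shell function. This is an illustration under assumptions, not an official installer: the function name install_proxy_rte is hypothetical, the runtimedir value must be taken from your own arc.conf, and the file mode chosen here is a guess that should be adjusted to local policy.

```shell
#!/bin/sh
# Sketch only: copy a runtime environment script to <runtimedir>/ENV/PROXY.
#   $1  path to the file holding the script shown above
#   $2  the directory configured as runtimedir in arc.conf
install_proxy_rte() {
    rte_script=$1
    runtimedir=$2
    if [ ! -f "$rte_script" ]; then
        echo "no such file: $rte_script" >&2
        return 1
    fi
    # Create the ENV subdirectory if it does not exist yet.
    mkdir -p "$runtimedir/ENV" || return 1
    # The file must be named PROXY so that it resolves as ENV/PROXY.
    cp "$rte_script" "$runtimedir/ENV/PROXY" || return 1
    # Assumption: mode 755 is acceptable; adjust to local policy.
    chmod 755 "$runtimedir/ENV/PROXY"
}
```

A call such as `install_proxy_rte ./PROXY /path/to/runtimedir` would then place the script, where /path/to/runtimedir stands for whatever your arc.conf specifies.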

Monitoring your site/services in EGI

In EGI, the GOCDB site and service endpoint entries have a few parameters that define whether a specific site or service is monitored by the central production EGI SAM system. Check and, if necessary, change the following parameters to turn SAM monitoring of your service or site on or off:

  • GOCDB Service Endpoint: Production (Y/N)
  • GOCDB Service Endpoint: Monitored (Y/N)
  • GOCDB Site: Visible to EGI (Y/N)

Comments (please read the official GOCDB documentation for further details): All production resources MUST be monitored. A failing test of a production service endpoint generates an alarm in the ROD Operations Dashboard. Non-production service endpoints can be either Monitored or Not Monitored, at the administrator's choice. If Monitored is set to YES, Service Availability Monitoring (SAM) tests the service endpoint, but the Availability Computation Engine (ACE) ignores the SAM test results. If Monitored is set to NO, SAM ignores the service endpoint and no alarms are raised in the Operations Dashboard in case of a CRITICAL failure.
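The combinations of the two service endpoint flags can be summarised as a lookup. The sketch below only restates the rules in the paragraph above; the function name sam_behaviour and the wording of its output are made up for illustration, and the site-level "Visible to EGI" flag is not covered.

```shell
#!/bin/sh
# Sketch only: restate the rules above for one service endpoint.
#   $1  Production flag, "Y" or "N"
#   $2  Monitored flag,  "Y" or "N"
sam_behaviour() {
    production=$1
    monitored=$2
    case "$production$monitored" in
        YY) echo "tested by SAM; failures raise alarms" ;;
        YN) echo "invalid: production endpoints must be monitored" ;;
        NY) echo "tested by SAM; results ignored by ACE" ;;
        NN) echo "ignored by SAM; no alarms" ;;
        *)  echo "unknown flag values: $production $monitored" >&2
            return 2 ;;
    esac
}
```

For a non-production endpoint that is still monitored, `sam_behaviour N Y` reports that SAM tests it while ACE ignores the results.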

You can check whether your service or site appears correctly in the EGI service monitor at this URL: http://operations-portal.egi.eu/