This wiki is obsolete; see the NorduGrid web pages for up-to-date information.

Logging of CE performance numbers

As agreed at the ARC4eInfrastructure kick-off meeting at CERN, a script for collecting CE performance numbers on production CEs should be created.

The following people are assigned to this task, each responsible for the area listed:

  • Aleksandr - CE (frontend)
  • Balazs (Florido) - Information system
  • David - Data staging
  • Martin (Christian) - Backend scripts

Performance numbers to collect

  • Server load per measurement
  • Find a way of correlating load, profiling and timings
  • Data should be uploaded to a central log server
  • Collect system architecture (filesystem, network bandwidth etc.)
  • Number of jobs in batch system
  • How many jobs in ARC queue vs how many jobs in global batch system
  • How big (number of files) the control dir is

System (general data)

Data to be collected by a helper script called systemdata:

  • number of files in controldir
  • disk space free/used
  • cpuload
  • cpuload by the ARC processes
  • memory status
  • opsys version
  • arc version
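
As an illustration of what the systemdata helper could gather, here is a minimal sketch in Python. The function name collect_systemdata and the dictionary keys are hypothetical, not part of any agreed interface. It covers only the items available from the standard library (file count, disk space, 1-minute load, OS version); memory status, per-process CPU load and the ARC version are left out because how to obtain them was not specified.

```python
import os
import platform
import shutil


def collect_systemdata(controldir):
    """Return a dict of general system numbers for one measurement.

    `controldir` is the path to the A-REX control directory; all key
    names below are illustrative assumptions.
    """
    usage = shutil.disk_usage(controldir)
    # Count every non-directory entry anywhere below the control directory.
    nfiles = sum(len(files) for _, _, files in os.walk(controldir))
    data = {
        "controldir_files": nfiles,
        "disk_free_bytes": usage.free,
        "disk_used_bytes": usage.used,
        "opsys": platform.platform(),
    }
    try:
        # 1-minute load average; only available on Unix-like systems.
        data["load_1min"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return data
```

Calling collect_systemdata("/path/to/controldir") once per measurement would give one sysinfo record; the path shown is a placeholder.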

CE (frontend)

  • Time spent processing for every job state change
  • Time spent scanning for new jobs
  • Time spent scanning for old jobs
  • Time spent scanning for job state change requests
  • Time spent generating XML (maybe already in the log)

Information system

  • Time spent reading each file in controldir
  • Time spent reading each file of a job in controldir
  • Time spent reading all files in controldir
  • Time spent executing LRMS commands

Data staging

  • Load on each delivery server
  • Total number of files in DTR
  • File and byte throughput on each server
  • Time of delivery process vs time between DTR scheduling the process and picking it up when finished

Backend scripts

  • Time to create a script
  • Time to submit a script
  • Time every call to the LRMS (also for infosys)
  • Time controldir access
  • Time log parsing
  • Time processing of finished jobs

Proposed metrics format

  • timestamp, ARC-CE host name, ARC version, system, subsystem, subsubsystem, metrics-structure
    • metrics-structure: metrics name, metrics parameters, metrics value
  • full example from infosys system:
    • timestamp, ARC-CE host name, ARC version, system, subsystem, subsubsystem, controldirreadtiming, job, jobid, durationinseconds
    • 2016-02-21 12:12:34,, ARC 5.0.4, infosys, subsystem, subsubsystem, controldirreadtiming, job,, 245
  • full example from a-rex system (collected in a dedicated local file, hence no need for host name, version and system):
    • 2016-02-29T16:04:27Z ACCEPTED-PREPARING yS8LDmJUfunnOSAtDmABFKDmABFKDmABFKDmucNKDmABFKDmEzxRhn 22045244 nS
  • full example from back-end system:
    • 2016-02-21 12:12:34,, ARC 5.0.4, submit-SLURM-job, lrms submit, <jobid>, duration, 100us
  • full example from data staging:
    • 2016-02-21 12:12:34,, ARC 5.0.4, dtr, dtr_id, transfer_time, 9.345s

Backend subsystem metrics

  • JobCreationTiming, lrmstype, jobid, durationinseconds
  • lrmscalltiming, lrmstype, lrms full command, durationinseconds
  • controldirscantiming, lrmstype, durationinseconds
  • logscantiming, lrmstype, all, durationinseconds
  • logscantiming, lrmstype, jobid, durationinseconds
  • finishedjobprocessingtiming, lrmstype, jobid, finishstate (success, fail etc.), durationinseconds
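
One possible way to produce timing metrics like those above is to wrap each measured step in a timer. The sketch below uses Python for illustration only (the actual back-end scripts may be written in another language); timed_metric and the records list are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_metric(records, name, *params):
    """Time the enclosed block and append (name, *params, seconds)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the timed step raised an exception.
        duration = time.perf_counter() - start
        records.append((name, *params, duration))
```

A call such as `with timed_metric(records, "lrmscalltiming", "SLURM", "sbatch job.sh"):` around the LRMS invocation would then append one lrmscalltiming entry with the duration in seconds.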

Infosys subsystem metrics

  • lrmscalltiming, lrmstype, lrms full command, durationinseconds
  • controldirreadtiming, all, directoryname, durationinseconds
  • controldirreadtiming, file, filename, durationinseconds
  • controldirreadtiming, job, jobid, durationinseconds

Data staging metrics

  • deliveryload, deliveryhost, 1min load
  • numberofdtr, total number of DTRs
  • dtrthroughput, dtr_id, average bytes/sec per dtr
  • schedulertime, dtr_id, durationinseconds
  • transfertime, dtr_id, durationinseconds

A-REX metrics

  • action, job id, duration time

A-REX subsystems and their metrics

  • old_state_name-new_state_name, job_identifier
  • SCAN-JOBS, *
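
For reading the A-REX records back, a minimal Python parser for the whitespace-separated example line shown earlier could look as follows. It assumes the trailing "nS" means nanoseconds and that every line has exactly five fields; parse_arex_line is a hypothetical name.

```python
from datetime import datetime, timezone


def parse_arex_line(line):
    """Parse 'timestamp action job_id value unit' into a dict."""
    timestamp, action, job_id, value, unit = line.split()
    if unit.lower() != "ns":
        # Assumption: the only unit in use is nanoseconds ("nS").
        raise ValueError("unexpected unit: %r" % unit)
    when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return {
        "timestamp": when.replace(tzinfo=timezone.utc),
        "action": action,    # e.g. ACCEPTED-PREPARING or SCAN-JOBS
        "job_id": job_id,    # "*" when the action is not tied to one job
        "duration_seconds": int(value) / 1e9,
    }
```

Converting to seconds at parse time is one way to meet the need for consistent units across subsystems.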


Visualisation of data

  • Elasticsearch? Something to process all the data.
  • Need to have consistent units etc.

Details, agreements

  • Each subsystem stores its performance records in a dedicated file under /var/log/arc/perfdata/ (arex.perflog, infosys.perflog, data.perflog, backends.perflog and sysinfo.perflog), following the metrics syntax defined above.
  • CEs push perf data files to a central service (the perf datastore) using a dedicated publisher script, the perforator, which
    • is distributed with the ARC packages
    • is run by the helper
    • uses host certs for authentication
    • uses curl to push the files
  • The central perf datastore will be an HTTPS server run in Copenhagen.
    • It is essential to have a very open policy on which ARC CEs can post data to the perf datastore: trust any host cert issued by an IGTF CA.
    • for data mining purposes the perf datastore will offer ??? interface
  • Sysadmins can turn perf data reporting on or off with the [common] variable enable_perflog_reporting=yes/no (default: no).
  • Who (which subsystem) makes sure the perfdata directory exists?
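
The push step agreed above can be sketched in Python as follows. This is only an illustration under several assumptions: the function names are hypothetical, the datastore URL is a placeholder, the certificate paths are the conventional grid locations rather than anything fixed here, and the datastore is assumed to accept an HTTP PUT (which is what curl's --upload-file performs). The configuration is parsed with configparser as if arc.conf were plain INI, which may not hold for every real file.

```python
import configparser


def reporting_enabled(conf_text):
    """Return True only if [common] enable_perflog_reporting is 'yes'."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    raw = parser.get("common", "enable_perflog_reporting", fallback="no")
    # Tolerate optional quotes around the value; the default is "no".
    return raw.strip().strip("\"'").lower() == "yes"


def build_push_command(perflog_path, url,
                       cert="/etc/grid-security/hostcert.pem",
                       key="/etc/grid-security/hostkey.pem",
                       capath="/etc/grid-security/certificates"):
    """Build (but do not run) a curl command authenticating with the host cert."""
    return [
        "curl", "--fail", "--silent", "--show-error",
        "--cert", cert, "--key", key, "--capath", capath,
        "--upload-file", perflog_path,
        url,
    ]
```

The helper could then check reporting_enabled() and, for each .perflog file, pass the list from build_push_command() to subprocess.run(). Returning the command as a list rather than a shell string avoids quoting problems with file names.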