
Logging of CE performance numbers


As agreed at the ARC4eInfrastructure kick-off meeting at CERN, a script for collecting CE performance numbers on production CEs should be created.

The following persons are assigned to this task, each responsible for one of the areas below:

  • Aleksandr - CE (frontend)
  • Balazs (Florido) - Information system
  • David - Data staging
  • Martin (Christian) - Backend scripts


Performance numbers to collect

  • Server load per measurement
  • Find a way of correlating load, profiling and timings
  • Data should be uploaded to a central log server
  • Collect system architecture (filesystem, network bandwidth, etc.)
  • Number of jobs in the batch system
  • How many jobs are in the ARC queue vs. how many jobs are in the global batch system
  • How big (number of files) the control dir is

System (general data)

Data to be collected by a helper script called systemdata (see the sketch after the list):

  • number of files in controldir
  • disk space free/used
  • cpuload
  • cpuload by the ARC processes
  • memory status
  • operating system version
  • ARC version
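
A rough sketch of what such a systemdata helper could look like (Python is used here purely for illustration; the controldir path and the set of output fields are assumptions, and the real helper may well be a shell script):

#!/usr/bin/env python3
"""Illustrative sketch of a 'systemdata' helper; paths and fields are assumptions."""
import os
import shutil
import time

CONTROLDIR = "/var/spool/arc/jobstatus"   # assumed controldir location


def collect_systemdata(controldir=CONTROLDIR):
    """Gather the general system numbers listed above into a dict."""
    # number of files in the controldir
    nfiles = sum(len(files) for _, _, files in os.walk(controldir))

    # disk space free/used on the filesystem holding the controldir
    usage = shutil.disk_usage(controldir)

    # 1-minute CPU load average
    load1, _, _ = os.getloadavg()

    # memory status from /proc/meminfo (Linux only)
    meminfo = {}
    with open("/proc/meminfo") as fh:
        for line in fh:
            key, value = line.split(":", 1)
            meminfo[key] = value.strip()

    return {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "controldir_files": nfiles,
        "disk_free_bytes": usage.free,
        "disk_used_bytes": usage.used,
        "load_1min": load1,
        "memory_free": meminfo.get("MemFree"),
    }


if __name__ == "__main__":
    print(collect_systemdata())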

CE (frontend)

  • Time spent processing for every job state change
  • Time spent scanning for new jobs
  • Time spent scanning for old jobs
  • Time spent scanning for job state change requests
  • Time spent generating XML (maybe already in the log)

Information system

  • Time spent reading each file in controldir
  • Time spent reading each file of a job in controldir
  • Time spent reading all files in controldir
  • Time spent executing LRMS commands

Data staging

  • Load on each delivery server
  • Total number of files in DTR
  • File and byte throughput on each server
  • Time taken by the delivery process vs. the time between the DTR scheduler starting the process and picking it up when finished

Backend scripts

  • Time to create a script
  • Time to submit a script
  • Time of every call to the LRMS (also for infosys)
  • Time of controldir access
  • Time of log parsing
  • Time of processing finished jobs

Proposed metrics format

  • timestamp, ARC-CE host name, ARC version, system, subsystem, subsubsystem, metrics-structure
    • metrics-structure: metrics name, metrics parameters, metrics value
  • full example from infosys system:
    • timestamp, ARC-CE host name, ARC version, system, subsystem, subsubsystem, controldirreadtiming, job, jobid, durationinseconds
    • 2016-02-21 12:12:34, piff.lu.se, ARC 5.0.4, infosys, subsystem, subsubsystem, controldirreadtiming, job, job.id.2333467, 245
  • full example from the a-rex system (collected in a dedicated local file, hence no need for host name, version and system):
    • 2016-02-29T16:04:27Z ACCEPTED-PREPARING yS8LDmJUfunnOSAtDmABFKDmABFKDmABFKDmucNKDmABFKDmEzxRhn 22045244 nS
  • full example from back-end system:
    • 2016-02-21 12:12:34, piff.lu.se, ARC 5.0.4, submit-SLURM-job, lrms submit, <jobid>, duration, 100us
  • full example from data staging:
    • 2016-02-21 12:12:34, piff.lu.se, ARC 5.0.4, dtr, dtr_id, transfer_time, 9.345s
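
As an illustration of the proposed record layout, here is a small hypothetical helper that assembles one comma-separated record (the function and its argument handling are assumptions, not an existing ARC API):

import time

def perflog_record(host, arc_version, system, subsystem, subsubsystem,
                   metric_name, metric_params, metric_value):
    """Build one record in the proposed format: timestamp, host, ARC version,
    system, subsystem, subsubsystem, metric name, metric parameters, value."""
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    fields = ([timestamp, host, arc_version, system, subsystem, subsubsystem,
               metric_name] + list(metric_params) + [str(metric_value)])
    return ", ".join(fields)

# Produces a record analogous to the infosys example above (current timestamp):
print(perflog_record("piff.lu.se", "ARC 5.0.4", "infosys", "subsystem",
                     "subsubsystem", "controldirreadtiming",
                     ["job", "job.id.2333467"], 245))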

Backend subsystem metrics

  • JobCreationTiming, lrmstype, jobid, durationinseconds
  • lrmscalltiming, lrmstype, lrms full command, durationinseconds
  • controldirscantiming, lrmstype, durationinseconds
  • logscantiming, lrmstype, all, durationinseconds
  • logscantiming, lrmstype, jobid, durationinseconds
  • finishedjobprocessingtiming, lrmstype, jobid, finishstate (success, fail, etc.), durationinseconds
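
For the lrmscalltiming metric a back-end script would essentially wrap each LRMS command in a timer and append a record to its perflog file. A minimal sketch under that assumption (only the metric fields are written here; a complete record would also carry the common prefix of timestamp, host name and ARC version described above):

import subprocess
import time

PERFLOG = "/var/log/arc/perfdata/backends.perflog"   # file name agreed on below


def timed_lrms_call(command, lrmstype="SLURM"):
    """Run an LRMS command and append an 'lrmscalltiming' record for it:
    metric name, lrmstype, full command, duration in seconds."""
    start = time.time()
    result = subprocess.run(command, capture_output=True, text=True)
    duration = time.time() - start

    record = ", ".join(["lrmscalltiming", lrmstype,
                        " ".join(command), "%.3f" % duration])
    with open(PERFLOG, "a") as fh:
        fh.write(record + "\n")
    return result

# e.g. timed_lrms_call(["squeue", "-h", "-o", "%i %T"])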

Infosys subsystem metrics

  • lrmscalltiming, lrmstype, lrms full command, durationinseconds
  • controldirreadtiming, all, directoryname, durationinseconds
  • controldirreadtiming, file, filename, durationinseconds
  • controldirreadtiming, job, jobid, durationinseconds
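
The three controldirreadtiming variants could be obtained by timing the read of each individual file and of the directory as a whole in one pass; a sketch, assuming a flat directory of plain files:

import os
import time

def time_controldir_reads(controldir):
    """Return per-file read durations and the total time for reading
    every file in the controldir (the 'file' and 'all' variants above)."""
    per_file = {}
    total_start = time.time()
    for name in os.listdir(controldir):
        path = os.path.join(controldir, name)
        if not os.path.isfile(path):
            continue
        start = time.time()
        with open(path, "rb") as fh:
            fh.read()
        per_file[name] = time.time() - start
    total = time.time() - total_start
    return per_file, total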

Data staging metrics

  • deliveryload, deliveryhost, 1min load
  • numberofdtr, total number of DTRs
  • dtrthroughput, dtr_id, average bytes/sec per dtr
  • schedulertime, dtr_id, durationinseconds
  • transfertime, dtr_id, durationinseconds
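
The dtrthroughput value is simply the number of bytes transferred divided by the transfer time; for example:

def dtr_throughput(bytes_transferred, transfer_seconds):
    """Average bytes per second for one DTR (the dtrthroughput metric)."""
    return bytes_transferred / transfer_seconds if transfer_seconds else 0.0

# A 1 GB file transferred in 9.345 s gives roughly 107 MB/s:
print(dtr_throughput(10**9, 9.345))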

A-REX metrics

  • action, job id, duration

A-REX subsystems and their metrics

  • old_state_name-new_state_name, job_identifier
  • SCAN-MARKS, *
  • SCAN-MARKS-NEW, *
  • SCAN-JOBS, *
  • SCAN-JOBS-NEW, *
  • SCAN-JOBS-OLD, *
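
The A-REX records shown above are space separated (timestamp, state change or scan action, job identifier or '*', duration, unit), so parsing them back out is straightforward; a sketch assuming exactly that layout:

def parse_arex_perflog_line(line):
    """Split one arex.perflog line into its fields."""
    timestamp, action, job_id, duration, unit = line.split()
    return {
        "timestamp": timestamp,
        "action": action,      # e.g. ACCEPTED-PREPARING or SCAN-JOBS
        "job_id": job_id,      # '*' for scan actions covering all jobs
        "duration": int(duration),
        "unit": unit,
    }

example = ("2016-02-29T16:04:27Z ACCEPTED-PREPARING "
           "yS8LDmJUfunnOSAtDmABFKDmABFKDmABFKDmucNKDmABFKDmEzxRhn 22045244 nS")
print(parse_arex_perflog_line(example))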

Implementation

Visualisation of data

  • Elasticsearch? Something to process all the data.
  • Need to have consistent units etc.
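
If Elasticsearch were chosen, the perflog records could be converted to JSON documents and sent via its bulk API; a hedged sketch of the conversion step (index name and field mapping are assumptions, and only the common prefix of the full record format is mapped):

import json

def perflog_to_bulk(lines, index="arc-perflog"):
    """Convert comma-separated perflog records into Elasticsearch bulk-API
    payload lines (one action line plus one document line per record)."""
    out = []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        doc = {
            "timestamp": fields[0],
            "host": fields[1],
            "arc_version": fields[2],
            "system": fields[3],
            "metric": fields[6] if len(fields) > 6 else None,
            "value": fields[-1],
        }
        out.append(json.dumps({"index": {"_index": index}}))
        out.append(json.dumps(doc))
    return "\n".join(out) + "\n"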

Details and agreements

  • each subsystem stores performance records in a dedicated file under /var/log/arc/perfdata/ (arex.perflog, infosys.perflog, data.perflog, backends.perflog and sysinfo.perflog), following the metrics syntax defined above.
  • CEs push perf data files to a central service (the perf datastore) using a dedicated publisher script, the perforator (see the sketch after this list), which is
    • distributed with the ARC packages
    • run as a helper
    • using host certs for authentication
    • using curl to push the files
  • the central perf datastore will be an HTTPS server run in Copenhagen.
    • it is essential to have a very open policy with respect to which ARC CEs can post data to the perflog datastore: trust any host cert issued by IGTF.
    • for data mining purposes the perf datastore will offer a ??? interface
  • sysadmins can turn perf data reporting on or off with the [common] configuration variable enable_perflog_reporting=yes/no (default: no).
  • Who (which subsystem) makes sure the perfdata directory exists?
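
A very rough sketch of what the perforator publisher could look like, pushing each perflog file over HTTPS with the host certificate via curl as agreed above (the datastore URL and upload endpoint are placeholders, not the real service):

import glob
import subprocess

PERFDATA_DIR = "/var/log/arc/perfdata"
DATASTORE_URL = "https://perflog.example.org/upload"   # placeholder URL
HOST_CERT = "/etc/grid-security/hostcert.pem"
HOST_KEY = "/etc/grid-security/hostkey.pem"


def push_perflogs():
    """Upload every *.perflog file to the central perf datastore,
    authenticating with the host certificate."""
    for path in glob.glob(PERFDATA_DIR + "/*.perflog"):
        subprocess.run([
            "curl", "--fail", "--silent",
            "--cert", HOST_CERT, "--key", HOST_KEY,
            "--upload-file", path,
            DATASTORE_URL,
        ], check=True)


if __name__ == "__main__":
    push_perflogs()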