This wiki is obsolete, see the NorduGrid web pages for up to date information.
Logging of CE performance numbers
As agreed at the ARC4eInfrastructure kick-off meeting at CERN, a script for collecting CE performance numbers on production CEs should be created.
The following people are assigned to this task, each responsible for the area listed:
- Aleksandr - CE (frontend)
- Balazs (Florido) - Information system
- David - Data staging
- Martin (Christian) - Backend scripts
Performance numbers to collect
- Server load per measurement
- Find a way of correlating load, profiling and timings
- Data should be uploaded to a central log server
- Collect system architecture (filesystem, network bandwidth, etc.)
- Number of jobs in batch system
- How many jobs in ARC queue vs how many jobs in global batch system
- How big the control dir is (number of files)
System (general data)
Data to be collected by a helper script called systemdata:
- number of files in controldir
- disk space free/used
- cpuload
- cpuload by the ARC processes
- memory status
- opsys version
- arc version
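A minimal sketch of what part of the systemdata helper could look like, assuming a Unix host and Python. The function names and the dictionary keys are illustrative, not an agreed interface. Memory status, per-process ARC load and the ARC version are left out because how to obtain them is not settled here.

```python
import os
import platform
import shutil


def count_files(top):
    """Count non-directory entries under `top`, including subdirectories."""
    return sum(len(files) for _root, _dirs, files in os.walk(top))


def collect_systemdata(controldir, disk_path="/"):
    """Return one measurement of general system numbers as a dict.

    `controldir` is the control directory to inspect; `disk_path` is any
    path on the filesystem whose free/used space should be reported.
    """
    usage = shutil.disk_usage(disk_path)
    try:
        # 1-minute load average; os.getloadavg() is only available on Unix.
        load_1min = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1min = None
    return {
        "controldir_files": count_files(controldir),
        "disk_free_bytes": usage.free,
        "disk_used_bytes": usage.used,
        "load_1min": load_1min,
        "opsys_version": platform.platform(),
    }
```

The helper would be called with the controldir path configured on the CE, and the resulting values written out as one record per measurement.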
CE (frontend)
- Time spent processing for every job state change
- Time spent scanning for new jobs
- Time spent scanning for old jobs
- Time spent scanning for job state change requests
- Time spent generating XML (maybe already in the log)
Information system
- Time spent reading each file in controldir
- Time spent reading each file of a job in controldir
- Time spent reading all files in controldir
- Time spent executing LRMS commands
Data staging
- Load on each delivery server
- Total number of files in DTR
- File and byte throughput on each server
- Time of delivery process vs time between DTR scheduling the process and picking it up when finished
Backend scripts
- Time to create a script
- Time to submit a script
- Time every call to the lrms (also for infosys)
- Time controldir access
- Time log parsing
- Time processing of finished jobs
Proposed metrics format
- timestamp, ARC-CE host name, ARC version, system, subsystem, subsubsystem, metrics-structure
- metrics-structure: metrics name, metrics parameters, metrics value
- full example from infosys system:
- timestamp, ARC-CE host name, ARC version, system, subsystem, subsubsystem, controldirreadtiming, job, jobid, durationinseconds
- 2016-02-21 12:12:34, piff.lu.se, ARC 5.0.4, infosys, subsystem, subsubsystem, controldirreadtiming, job, job.id.2333467, 245
- full example from a-rex system (collected in dedicated local file hence no need for host name, version and system):
- 2016-02-29T16:04:27Z ACCEPTED-PREPARING yS8LDmJUfunnOSAtDmABFKDmABFKDmABFKDmucNKDmABFKDmEzxRhn 22045244 nS
- full example from back-end system:
- 2016-02-21 12:12:34, piff.lu.se, ARC 5.0.4, submit-SLURM-job, lrms submit, <jobid>, duration, 100us
- full example from data staging
- 2016-02-21 12:12:34, piff.lu.se, ARC 5.0.4, dtr, dtr_id, transfer_time, 9.345s
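The full record layout above (six leading fields followed by the metrics-structure) can be sketched as a small parser and formatter. This is an illustration only: the class and function names are made up, and the code handles just the full comma-separated form shown in the infosys example. The shorter back-end, data staging and a-rex examples would need their own handling, and a parameter that itself contains a comma (for example a full LRMS command) would need quoting that the format does not yet define.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PerfRecord:
    timestamp: str
    host: str
    arc_version: str
    system: str
    subsystem: str
    subsubsystem: str
    metric_name: str
    metric_params: List[str]
    metric_value: str


def parse_record(line):
    """Split one full-format record into its named parts."""
    fields = [f.strip() for f in line.split(",")]
    # Six leading fields + metric name + metric value is the minimum.
    if len(fields) < 8:
        raise ValueError("not a full-format record: %r" % line)
    return PerfRecord(
        *fields[:6],
        metric_name=fields[6],
        metric_params=fields[7:-1],  # everything between name and value
        metric_value=fields[-1],
    )


def format_record(rec):
    """Inverse of parse_record: join the parts with ', '."""
    head = [rec.timestamp, rec.host, rec.arc_version,
            rec.system, rec.subsystem, rec.subsubsystem]
    return ", ".join(head + [rec.metric_name] + rec.metric_params
                     + [rec.metric_value])
```

Keeping the value as a string avoids guessing at units, which the examples above do not yet express consistently.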
Backend subsystem metrics
- JobCreationTiming, lrmstype, jobid, durationinseconds
- lrmscalltiming, lrmstype, lrms full command, durationinseconds
- controldirscantiming, lrmstype, durationinseconds
- logscantiming, lrmstype, all, durationinseconds
- logscantiming, lrmstype, jobid, durationinseconds
- finishedjobprocessingtiming, lrmstype, jobid, finishstate (success, fail, etc.), durationinseconds
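One way the backend timings could be captured is by wrapping each measured step in a timing helper that emits a metrics-structure when the step finishes. The sketch below assumes Python; `timed_metric` and the list used as a sink are hypothetical names, and the real backends may be written in another language entirely.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_metric(sink, name, *params):
    """Time the enclosed block and append 'name, params..., seconds' to sink."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the timed block raises, so failures are visible too.
        duration = time.perf_counter() - start
        sink.append(", ".join([name, *params, "%.6f" % duration]))


records = []
# "sbatch job.sh" only labels the record here; no command is executed.
with timed_metric(records, "lrmscalltiming", "SLURM", "sbatch job.sh"):
    time.sleep(0.01)  # stand-in for the real LRMS call
```

After the block runs, `records` holds one line that starts with `lrmscalltiming, SLURM, sbatch job.sh,` followed by the measured duration in seconds.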
Infosys subsystem metrics
- lrmscalltiming, lrmstype, lrms full command, durationinseconds
- controldirreadtiming, all, directoryname, durationinseconds
- controldirreadtiming, file, filename, durationinseconds
- controldirreadtiming, job, jobid, durationinseconds
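The per-file and whole-directory controldirreadtiming metrics could be gathered in a single pass, as in the sketch below. It is written in Python for illustration, the function name is hypothetical, and it only looks at files directly inside the given directory. The per-job variant is omitted because it depends on how files are mapped to job ids.

```python
import os
import time


def time_controldir_reads(controldir):
    """Read every file in controldir once and return timing tuples.

    Each tuple is (metric name, parameter, target, duration in seconds),
    mirroring the 'file' and 'all' variants of controldirreadtiming.
    """
    records = []
    start_all = time.perf_counter()
    for name in sorted(os.listdir(controldir)):
        path = os.path.join(controldir, name)
        if not os.path.isfile(path):
            continue  # subdirectories are skipped in this sketch
        start = time.perf_counter()
        with open(path, "rb") as fh:
            fh.read()
        records.append(("controldirreadtiming", "file", name,
                        time.perf_counter() - start))
    # The 'all' record covers listing the directory plus every read above.
    records.append(("controldirreadtiming", "all", controldir,
                    time.perf_counter() - start_all))
    return records
```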
Data staging metrics
- deliveryload, deliveryhost, 1min load
- numberofdtr, total number of DTRs
- dtrthroughput, dtr_id, average bytes/sec per dtr
- schedulertime, dtr_id, durationinseconds
- transfertime, dtr_id, durationinseconds
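As an illustration of how numberofdtr, transfertime and dtrthroughput relate to each other, the sketch below derives them from a list of finished transfers. The input tuple layout and the function name are assumptions made for this example, not part of DTR.

```python
def dtr_records(transfers):
    """Build metric tuples from (dtr_id, bytes_transferred, seconds) items."""
    transfers = list(transfers)
    records = [("numberofdtr", len(transfers))]
    for dtr_id, nbytes, seconds in transfers:
        records.append(("transfertime", dtr_id, seconds))
        if seconds > 0:
            # Average throughput in bytes/sec over the whole transfer.
            records.append(("dtrthroughput", dtr_id, nbytes / seconds))
    return records
```

A transfer with zero duration gets no throughput record rather than a division error.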
A-REX metrics
- action, job id, duration time
A-REX subsystems and their metrics
- old_state_name-new_state_name, job_identifier
- SCAN-MARKS, *
- SCAN-MARKS-NEW, *
- SCAN-JOBS, *
- SCAN-JOBS-NEW, *
- SCAN-JOBS-OLD, *
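A line from the dedicated a-rex file, such as the full example given earlier, could be read back as sketched below. Two things are assumptions here: that the fields are whitespace-separated in the order timestamp, action, job id, value, unit, and that the unit `nS` means nanoseconds. The function name is illustrative.

```python
from datetime import datetime, timezone


def parse_arex_line(line):
    """Parse 'timestamp action job_id value unit' into a dict."""
    ts, action, job_id, value, unit = line.split()
    when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc)
    if unit.lower() != "ns":
        raise ValueError("unexpected unit: %r" % unit)
    return {
        "time": when,
        "action": action,
        # SCAN-* actions are not tied to one job and use '*' as the job id.
        "job_id": None if job_id == "*" else job_id,
        "duration_s": int(value) / 1e9,  # assumed nanoseconds -> seconds
    }
```

Converting to seconds at parse time would keep a-rex durations comparable with the durationinseconds values of the other subsystems.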
IMPLEMENTATION
Visualisation of data
- Elasticsearch? Something to process all the data.
- Need to have consistent units etc.
Details, agreements
- each subsystem stores performance records in a dedicated file under /var/log/arc/perfdata/ (arex.perflog, infosys.perflog, data.perflog, backends.perflog and sysinfo.perflog), following the metrics syntax defined above.
- CEs push perf data files to a central service (the perf datastore) using a dedicated publisher script, the perforator, which:
- is distributed with arc packages
- is run by helper
- uses host certs for authentication
- uses curl to push the files
- the central perf datastore will be an HTTPS server run in Copenhagen.
- it is essential to have a very open policy on which ARC CEs can post data to the perflog datastore: trust any host cert issued by IGTF.
- for data mining purposes the perf datastore will offer ??? interface
- sysadmins can turn perf data reporting on or off with the [common] variable enable_perflog_reporting=yes/no (default: no).
- Who (which subsystem) makes sure the perfdata directory exists?
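The publishing step agreed above could be sketched roughly as follows. Everything not stated in the list is an assumption: the datastore URL is a placeholder, the host certificate and key paths are the conventional grid locations and may differ per site, and the configuration is assumed to be parseable as an ini-style file. The sketch only builds the curl command line and does not run it.

```python
import configparser

# Hypothetical values, for illustration only.
DATASTORE_URL = "https://perfdata.example.org/upload/"
HOST_CERT = "/etc/grid-security/hostcert.pem"
HOST_KEY = "/etc/grid-security/hostkey.pem"


def reporting_enabled(conf_text):
    """Return True only if [common] enable_perflog_reporting is 'yes'."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    value = parser.get("common", "enable_perflog_reporting", fallback="no")
    # Tolerate values written with surrounding double quotes.
    return value.strip().strip('"').lower() == "yes"


def curl_command(perflog, url=DATASTORE_URL, cert=HOST_CERT, key=HOST_KEY):
    """Build the curl argument list that uploads one perflog file."""
    return [
        "curl", "--fail", "--silent", "--show-error",
        "--cert", cert,           # client authentication with the host cert
        "--key", key,
        "--upload-file", perflog,
        url,
    ]
```

A publisher along these lines would first check `reporting_enabled(...)` on the CE configuration, then pass each list returned by `curl_command(...)` to something like `subprocess.run`, once per perflog file.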