
LRMS Backends/workshop2015


Details

Time: 10/9 2015 10:00 - 17:00

Location: NBI UCPH - Lille frokoststue (across from the canteen, Blegdamsvej 19)

Fun: YES!

Provisional agenda:

10:00 - 10:30 localtransfer

10:30 - 12:00 common lrms calls between infosys and jobcontrol backends

Lunch

13:00 - 15:00 re-integrating python-lrms

Coffee

15:30 - 17:00 API and Ze Future

Notes

localtransfer

The localtransfer functionality should be removed from the backends, together with the old uploader and downloader. We should keep the option of reimplementing the feature if there are requests. The option should be kept in arc.conf, and the backends should log a suitable warning if it is turned on.
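
A minimal sketch of the agreed behaviour, assuming the backends see the arc.conf option as a shell variable (the name joboption_localtransfer is only illustrative):

# Sketch only: warn and ignore localtransfer if a site still has it enabled.
# The variable name below is hypothetical; the real option plumbing may differ.
if [ "${joboption_localtransfer:-no}" = "yes" ]; then
  echo "WARNING: localtransfer is no longer supported and will be ignored" 1>&2
fi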

Actions:

CUS - remove localtransfer from the backends and the documentation, add the warning, etc.

AW - remove uploaders and downloaders from distribution

Oxana - register as known issue

common lrms commands

We need a process that uses batch system commands to fill in a table with full job and queue information - a full batch system state. This should be all of the information needed by scan and by the information system. When this job is done, it moves the information to a common file used by both the backends and the information system.
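
A rough sketch of what such a process could look like for SLURM, reusing the commands listed in the table further down; the state file location and the section headers are assumptions, not a decided format:

#!/bin/sh
# Sketch: dump the full SLURM state into a temporary file and move it into place
# atomically, so backends and infosystem never read a half-written file.
statefile=/var/spool/arc/lrms-state/slurm.state   # hypothetical location
tmpfile=$statefile.$$
{
  echo "[config]";     scontrol show config
  echo "[nodes]";      scontrol show node --oneliner
  echo "[partitions]"; sinfo -a -h -o "PartitionName=%P TotalCPUs=%C TotalNodes=%D MaxTime=%l"
  echo "[jobs]";       squeue -a -h -t all -o "JobId=%i TimeUsed=%M Partition=%P JobState=%T ReqNodes=%D ReqCPUs=%C TimeLimit=%l Name=%j NodeList=%N"
} > "$tmpfile" && mv "$tmpfile" "$statefile"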

Actions:

We are targeting SLURM and CONDOR.

List all information needed for infosystem and scan - Florido, Chrulle

Decide on a "format" for common file. - Florido, Chrulle

Get the information from lrms - Florido, Chrulle
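
As a starting point for the format discussion, the common file could end up holding key=value lines like the ones the sinfo/squeue format strings further down already produce; everything below is illustrative, not a decided format:

# Illustrative content of a possible common state file (values are made up).
[partitions]
PartitionName=main TotalCPUs=128 TotalNodes=16 MaxTime=3-00:00:00

[jobs]
JobId=123456 TimeUsed=1:23:45 Partition=main JobState=RUNNING ReqNodes=1 ReqCPUs=8 TimeLimit=12:00:00 Name=gridjob NodeList=node042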


Reintegrating python

Actions:

Add ssh option to gridmanager configuration section (see the ssh sketch after this list) - Martin

Add the new backend options to a-rex and arc.conf, and move the backends over from the branch - Christian?

Create separate package for python backends containing the inline python - AW
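
The ssh-based setup could boil down to the backends running their usual LRMS commands on a remote frontend; a minimal sketch, where the user and host stand in for whatever the new gridmanager options end up being called:

# Sketch: drive a remote SLURM frontend over ssh instead of running sbatch locally.
# remote_user and remote_host are placeholders for the future gridmanager options.
remote_user=arcuser
remote_host=slurm-frontend.example.org
ssh -o BatchMode=yes "$remote_user@$remote_host" "sbatch /path/to/jobscript"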

API

Finish the overview table: exactly who is writing/reading what in the control dir - Christian
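
As raw input for that table, one can simply list what exists in the control dir for a single job; only the processing/job.<id>.status name is taken from the notes below, the directory and the rest of the pattern are guesses:

# Sketch: show every control-dir file belonging to one job id.
controldir=/var/spool/arc/jobstatus   # placeholder path
jobid=1234567890abcde                 # placeholder job id
ls -l "$controldir"/job.$jobid.* "$controldir"/processing/job.$jobid.status 2>/dev/null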

Actions

  • Investigate whether localtransfer is used by any sites - Christian
 It does not seem to be in use anywhere. The following sites are not using it: UCPH, IJS, Arnes, HPC2N, UiO, CSC, Glasgow, Bristol, RAL, Brunel, CERN, UA, UNIBE. No sites have reported it in use.
  • Create a full overview of the "API" i.e. all information flowing between arc and the backends - Christian, Florido
information | description | source | comment
config | | a-rex |
joboption_lrms | Which lrms is configured | backends set this | should not have the joboption prefix; it is not a joboption
  • Investigate what lrms functions are called and with which options - Christian, Florido
For each backend, the commands called by the jobcontrol backend and by the infosystem are listed below.
LL

jobcontrol:
llcancel <jobid>
llq -r %st %id <jobids>
llq -l <jobid>
llclass -l
llclass -l <queuename>
llsub <jobscript>

infosystem:
llclass
llclass -l <queuename>
llq
llq -c <queuename>
llq -l -x <jobid>
llstatus
llstatus -f %sta
llstatus -l
llstatus -R
llstatus -r %cpu %sta
llstatus -v

Condor

jobcontrol:
condor_rm <joboption_jobid>%.`hostname -f`
condor_q
condor_config_val
condor_submit <jobscript>

infosystem:
condor_q -constraint "NiceUser == False" -format "ClusterId = %V\n" ClusterId -format "ProcId = %V\n" ProcId -format "JobStatus = %V\n" JobStatus -format "CurrentHosts = %V\n" CurrentHosts -format "LastRemoteHost = %V\n" LastRemoteHost -format "RemoteHost = %V\n" RemoteHost -format "ImageSize = %V\n" ImageSize -format "RemoteWallClockTime = %V\n" RemoteWallClockTime -format "RemoteUserCpu = %V\n" RemoteUserCpu -format "RemoteSysCpu = %V\n" RemoteSysCpu -format "JobTimeLimit = %V\n" JobTimeLimit -format "JobCpuLimit = %V\n\n" JobCpuLimit
condor_status -format "%s\n" Machine
condor_status -format "Name = %V\n" Name -format "Machine = %V\n" Machine -format "State = %V\n" State -format "Cpus = %V\n" Cpus -format "TotalCpus = %V\n" TotalCpus -format "SlotType = %V\n\n" SlotType

Uses a find to collect job ids from controldir: find $controldir/processing -maxdepth 1 -name 'job.??????????*.status'

PBS

jobcontrol:
qdel <jobid>
qstat -a
qsub -r n -S /bin/bash -m n < <jobscript>

infosystem:
pbsnodes -a
qmgr -c "list server"
qstat -f
qstat -Q
qstat -Q -f <queuename>
showq
showbf -u <userid>

LSF

jobcontrol:
bkill <jobid>
bjobs -a -u all
bjobs -W -w <jobid>
bsub < <jobscript>
bparams -a

infosystem:
bhosts -w
bjobs -W -w <jobid>
bqueues -w
bqueues -w <userid> <queue name>
bqueues -l <userid> <queue name>
lshosts -w
lsid -V

SGE

jobcontrol:
qdel <jobid>
qstat -j <jobid>
qstat -j <jobid> -f <briefaccttempfile>
qstat -u '*'
qsub -S @posix_shell@ < <jobscript>
qconf -spl
qconf -sc

infosystem:
qconf -sconf global
qconf -sg <array/list of queue names>
qconf -sql
qhost -f -H
qhost -xml
qstat -f
qstat -help
qstat -j <jobid>
qstat -u '*' (for compatibility with other versions, either -F or -f is used)

Slurm

jobcontrol:
scancel <jobid>
squeue -a -h -o "%i:%T" -t all -j <jobids>
sacct -j <jobid>.batch -o ExitCode -P
scontrol -o show job <jobid>
sacct -j <localid>.batch -o NCPUS,NNODES,CPUTime,Start,End,ExitCode,State -P
sbatch <jobscript>

infosystem:
scontrol show config
scontrol show node --oneliner
sinfo -a -h -o \"cpuinfo=%C\"
sinfo -a -h -o \"PartitionName=%P TotalCPUs=%C TotalNodes=%D MaxTime=%l\"
squeue -a -h -t all -o \"JobId=%i TimeUsed=%M Partition=%P JobState=%T ReqNodes=%D ReqCPUs=%C TimeLimit=%l Name=%j NodeList=%N\"

DGBridge

wsclient -e <endpoint URL> -m status -j <jobid>

fork

code in Sysinfo.pm
ps -e -o ppid,pid,vsz,time,etime,user,comm
uptime
ulimit -t

# profile code and write database to ./nytprof.out
perl -d:NYTProf some_perl.pl

# convert database into a set of html files, e.g., ./nytprof/index.html
# and open a web browser on the nytprof/index.html file
nytprofhtml --open

# or into comma separated files, e.g., ./nytprof/*.csv
nytprofcsv
  • Investigate ssh in HED - Silje, Martin
  • inline python - Jon
Available in EPEL and Debian (it turned out Gregor Herrmann added it to Debian in July...)