A-REX

This page collects issues, possible solutions, discussions and other useful information about A-REX, the ARC Execution Service.

Data Management bottleneck

Scheduling of data staging tasks in A-REX is not flexible enough for complex environments. Current efforts to provide a new infrastructure are recorded on a dedicated page.

Scalability issues with huge numbers of jobs

A-REX (and grid-manager too) and InfoSystem have problems handling a big number of computing jobs. The following reasons have been identified.

  1. A-REX periodically scans for new jobs to start processing by reading the job.*.status files. Each job is represented by 5-7 files in a single control directory. Depending on the filesystem used, scanning through a huge number of files may be slow and resource consuming.
  2. For every job that finishes in the batch system, the scan-*-jobs backend scripts scan the content of the control directory, namely the job.*.local files, in order to map the batch id to the grid id and report the finished job. Such scanning may be resource consuming.
  3. The infosystem provider periodically scans known jobs and produces an information document which contains all jobs and serviced users. This scanning may be resource consuming, and big information documents, especially when rendered into XML, are difficult to handle and query.
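
To make the cost of the first issue concrete, here is a minimal Python sketch of a scan over a single flat control directory. It is an illustration only, not A-REX code: the function name scan_flat, the job id format, and the assumption that a status file holds a single state word are all hypothetical.

```python
import os
import re

# Matches job.<id>.status; the id format (no dots) is an assumption of this sketch.
STATUS_RE = re.compile(r"^job\.(?P<id>[^.]+)\.status$")


def scan_flat(control_dir):
    """Scan one flat control directory for job status files.

    Returns (states, visited): a dict mapping job id to the content of its
    status file, and the number of directory entries that had to be examined.
    """
    states = {}
    visited = 0
    with os.scandir(control_dir) as entries:
        for entry in entries:
            # Every file of every job is examined, not only the status files.
            visited += 1
            match = STATUS_RE.match(entry.name)
            if match is None:
                continue
            with open(entry.path) as fh:
                states[match.group("id")] = fh.read().strip()
    return states, visited
```

With 5-7 files per job, the number of entries visited is several times the number of jobs, and the scan touches files of finished jobs as well as new ones, so the work per scan grows with the total number of jobs kept in the directory.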

The following solutions may be applied to address these issues.

  • Move from filesystem to database. This solution is questionable because a filesystem is itself a kind of specialized database. Also, taking into account the caching capabilities of the operating system, repeatedly reading the content of an object may be even faster with a filesystem.
  • [Currently implemented] Split the content of the control directory. One possible approach is to keep job.*.status files in separate subdirectories. There could be a few such subdirectories depending on job state. For example, there could be separate directories for active jobs, for jobs waiting to be picked up by A-REX, and for jobs which have finished processing. This would solve the first issue above if combined with the imposed limit on the number of active jobs (already present) and a somewhat smarter way to scan for new jobs.
  • Provide a better way to get from batch id to grid id. If a common list of id mappings existed, or some other way to obtain the batch-to-grid link directly, scanning would be eliminated from the scan-*-jobs scripts.
  • [Currently implemented] Remove jobs and serviced users from the information documents. This makes their size independent of the number of jobs and the size of the served VOs, which would enhance scalability drastically.
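
Two of the ideas above, per-state subdirectories and a direct batch-to-grid id lookup, can be sketched in Python as follows. This is a sketch under stated assumptions, not the A-REX implementation: the subdirectory names, the "batchmap" directory, and all function names are hypothetical. It also assumes that job ids contain no dots and that batch ids are valid file names.

```python
import os

# Hypothetical subdirectory names, one per group of job states.
ACCEPTING, PROCESSING, FINISHED = "accepting", "processing", "finished"
BATCHMAP = "batchmap"  # hypothetical directory holding one file per batch id


def init_control_dir(control_dir):
    """Create the per-state subdirectories and the id map directory."""
    for sub in (ACCEPTING, PROCESSING, FINISHED, BATCHMAP):
        os.makedirs(os.path.join(control_dir, sub), exist_ok=True)


def submit(control_dir, job_id, state="ACCEPTED"):
    """Write a new status file where the scanner looks for new jobs."""
    path = os.path.join(control_dir, ACCEPTING, "job.%s.status" % job_id)
    with open(path, "w") as fh:
        fh.write(state + "\n")


def pick_up_new_jobs(control_dir, max_active):
    """Move waiting jobs into the active directory, up to the active-job limit.

    Only the directory of waiting jobs is listed, so the cost does not depend
    on how many finished jobs are kept around.
    """
    waiting_dir = os.path.join(control_dir, ACCEPTING)
    active_dir = os.path.join(control_dir, PROCESSING)
    free = max_active - len(os.listdir(active_dir))
    picked = []
    for name in sorted(os.listdir(waiting_dir)):
        if free <= 0:
            break
        # rename is atomic on POSIX when both paths are on the same filesystem
        os.rename(os.path.join(waiting_dir, name), os.path.join(active_dir, name))
        picked.append(name.split(".")[1])  # job.<id>.status -> <id>
        free -= 1
    return picked


def record_batch_id(control_dir, batch_id, grid_id):
    """Remember which grid job a batch job belongs to (one small file per batch id)."""
    with open(os.path.join(control_dir, BATCHMAP, batch_id), "w") as fh:
        fh.write(grid_id + "\n")


def grid_id_for(control_dir, batch_id):
    """Look up the grid id with a single open() instead of scanning job.*.local files."""
    try:
        with open(os.path.join(control_dir, BATCHMAP, batch_id)) as fh:
            return fh.read().strip()
    except FileNotFoundError:
        return None
```

The design choice in both parts is the same: the thing being searched for is encoded in a path, so each lookup reads one small directory or one file rather than every file of every job.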