Here we describe the priority and shares system for the new data staging framework. During the design stage, several ideas were taken from other research in the field and from the first implementation of the transfer shares model in ARC.
The initial idea was to give every DTR entering the system a fixed priority, sort the queue by priority, and launch the N DTRs with the highest priorities. This scheme also allows easy incorporation of pre-emption: if a job with higher priority appears, its DTRs push others out of the front of the queue, and during the next scheduler loop the new DTRs can be started and the displaced ones suspended.
However, this idea can lead to exactly the situation that prompted the implementation of transfer shares in ARC in the first place: if a user or VO with the highest priority submits a bunch of jobs at once, all the others are blocked, because DTRs from this bunch occupy the front of the queue for a long time.
This is where the idea of transfer shares comes in handy: the available transfer slots are shared among different VOs/Users, so nobody is blocked entirely. VOs/Users with higher priority get more transfer slots than the others. However, strict limits on the number of slots per share are not flexible enough: if the transfer pattern changes, strict limits can squeeze many users or jobs into one share with few slots while blocking others. The User or VO must also be able to decide the relative priority of jobs within its own share.
The ideas above led to the creation of two configurable properties: user-defined job priority and server-defined share priority. Users may define a priority for their jobs in the job description (the "priority" attribute in xrsl); this is a measure of the priority given to the job within the share it is assigned to on submission. On the server side it is possible to define a share type and the priorities of certain shares. The share priority is used to determine the number of transfer slots assigned to the share, taking into account which shares are currently active (active meaning the share has at least one DTR in the system).
When the Scheduler receives a new DTR, it is placed into a transfer share, which is defined by a User DN, VO, group inside a VO or role inside a VO, as in previous versions of ARC. Currently only one sharing criterion can be used in the configuration, i.e. it is not possible to share simultaneously by User and by VO.
Priority is defined as a number between 1 and 100 inclusive; a higher number means a higher priority. In the A-REX configuration it is possible to specify a base priority for certain shares. If the DTR doesn't belong to any of the specified shares, it is placed in a "_default" share with the default base priority (50). The scheduler sets the priority of the DTR to the base priority of the share, multiplied by the user priority from the job description (default 50), divided by 100; the default priority of a DTR is therefore 25 (50 x 50 / 100). In this system the priority set in the job description effectively defines a percentage of the base priority. Thus service administrators can set maximum priority limits for certain shares, while users or VOs retain full control of their jobs' priority within the share.
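As an illustration, this calculation can be sketched in a few lines of Python (the function name is ours for the sketch; the actual scheduler is not implemented in Python):

```python
def dtr_priority(share_base: int, job_priority: int = 50) -> int:
    """Effective priority of a DTR.

    The "priority" attribute from the job description acts as a
    percentage of the share's base priority; both values lie in 1-100.
    """
    return share_base * job_priority // 100

# A DTR in the default share (base 50) with the default job priority:
dtr_priority(50)        # 25
# A job with priority 80 in a share with base priority 80:
dtr_priority(80, 80)    # 64
```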
While reviewing the Delivery and Processor queues, the scheduler separates DTRs according to the shares they belong to, and within each share DTRs are sorted by priority. The scheduler then determines the number of transfer slots each active share can occupy. This number is determined dynamically from the priorities of the active shares: each share receives a number of slots corresponding to the weight of its priority in the summed priority of all active shares. Once the slot counts are determined, the scheduler launches the N[i] highest priority DTRs in each share, where N[i] is the number of transfer slots for the i-th share.
The reason for weighting the DTR priority by the share priority is for occasions when the Scheduler considers the entire queue of DTRs, for example when allowing the highest priority DTRs to pass certain limits.
Example: there are two active shares, one with base priority 60, the other with 40. The summed priority is 100 (60 + 40), so the first share has a weight of 60% and the second 40%; the first grabs 60% of the configured transfer slots and the second 40%. If the system is configured with 5 Delivery slots, the first share takes 3 slots and the second 2. The 3 highest priority DTRs from the first share and the 2 highest priority DTRs from the second are assigned to those slots.
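The slot distribution in this example can be sketched as follows. The rounding strategy here (round down, then hand any leftover slots to the higher-priority shares first) is an assumption made for the sketch; the real scheduler may resolve rounding differently:

```python
def allocate_slots(total_slots: int, share_priorities: dict) -> dict:
    """Split total_slots among active shares in proportion to their
    base priorities.  Leftover slots from rounding down are given to
    the highest-priority shares first (an assumption of this sketch)."""
    total_priority = sum(share_priorities.values())
    slots = {share: total_slots * prio // total_priority
             for share, prio in share_priorities.items()}
    leftover = total_slots - sum(slots.values())
    for share in sorted(share_priorities, key=share_priorities.get, reverse=True):
        if leftover == 0:
            break
        slots[share] += 1
        leftover -= 1
    return slots

# Two active shares with priorities 60 and 40, 5 Delivery slots:
allocate_slots(5, {"high": 60, "low": 40})   # {'high': 3, 'low': 2}
```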
To avoid the situation where a fixed limit of slots is used up by slow transfers and a new high priority transfer has to wait for a slot, there are "emergency" transfer slots. If a particular share has transfers in the queue but all slots are filled with transfers from other shares, one emergency slot can be assigned to that share so its transfers can start immediately. The share may use the emergency slot until any other transfer finishes, at which point the emergency slot becomes a regular slot and no new transfer is started from the queue.
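The emergency-slot decision can be sketched like this, under the assumption that a share qualifies when it has queued DTRs, none currently running, and every regular slot occupied (all names here are illustrative):

```python
def grant_emergency_slot(share: str,
                         queued: dict,
                         running: dict,
                         regular_slots_total: int,
                         regular_slots_used: int) -> bool:
    """Return True if `share` should receive an emergency slot:
    it has DTRs waiting, none of its own transfers running, and all
    regular slots are occupied (necessarily by other shares)."""
    return (queued.get(share, 0) > 0
            and running.get(share, 0) == 0
            and regular_slots_used >= regular_slots_total)

# Uploads are queued while downloads hold every regular slot:
grant_emergency_slot("uploads", {"uploads": 3}, {"downloads": 5}, 5, 5)  # True
```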
The Generator can assign DTRs to "sub-shares" to give a higher granularity than the standard criteria when assigning transfer slots. Sub-shares are treated as separate shares. In A-REX, downloads and uploads are assigned to different sub-shares, and here emergency transfer slots prove useful: they prevent jobs from being unable to finish because all transfer slots are taken by downloads. If this happens, emergency slots can be used for the uploads.
- Within a share, high priority jobs block low priority jobs. Thus if there is a constant stream of high priority jobs in a share, then some low priority jobs in the same share may never run. Possible solutions:
- Increasing the priority as the time spent in the queue increases (returning to previous priority after leaving the queue). This is currently implemented as increasing the priority by 1 every 5 minutes after the DTR's timeout has passed.
- Changing simple highest-priority-first to a random algorithm where higher priorities are weighted higher
- Making a higher granularity of shares by splitting each priority or priority range into its own share - this is probably too complicated
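The first solution, which is the one currently implemented, might look like this in outline. The 5-minute step is taken from the text above and the cap at 100 follows from the defined priority range; the function and parameter names are illustrative:

```python
def aged_priority(base_priority: int,
                  seconds_past_timeout: int,
                  step_seconds: int = 300) -> int:
    """Priority aging: once a DTR's timeout has passed, its priority
    grows by 1 for every 5 minutes spent waiting in the queue, capped
    at the maximum priority of 100."""
    if seconds_past_timeout <= 0:
        return base_priority
    return min(100, base_priority + seconds_past_timeout // step_seconds)

# A default-priority DTR that has waited 10 minutes past its timeout:
aged_priority(25, 600)   # 27
```

The DTR returns to its original priority once it leaves the queue, so the aging affects only queue ordering, not the stored priority.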
The configuration varies depending on the ARC version. In the examples below VO roles are used to assign shares: the atlas slow-prod role is assigned a low priority share and the atlas validation role a higher priority share.
Shares are defined using the same options in the [grid-manager] section of arc.conf as in the old framework, but rather than a set number of slots per share, a priority is specified.
maxloadshare="1 voms:role" share_limit="atlas:slow-prod 20" share_limit="atlas:validation 80"
In newer ARC versions, the new options in the [data-staging] section should be used:
sharetype="voms:role" definedshare="atlas:slow-prod 20" definedshare="atlas:validation 80"
If both shares are active and there are 10 slots available, then DTRs in the slow-prod share will get 2 slots and those in the validation share get 8 slots, and so the jobs in the validation share will have a higher throughput (assuming similar numbers of files and file sizes in each type of job).
A user wants their job to be high (but not top) priority and specifies ("priority" = "80") in the job description. The user has a VOMS proxy with no role defined and submits the job to a site with the above configuration. The job is assigned to the default share and its DTRs have priority 40 (50 x 80 / 100). The user then creates a VOMS proxy with the ATLAS validation role and submits another job with the same priority to the same site. This time the job goes to the configured atlas:validation share and the DTRs have priority 64 (80 x 80 / 100). Note that the priority of a DTR only affects its position within a share and does not influence the distribution of slots between shares.