Data Staging

From NorduGrid

Jump to: navigation, search

This page describes the new data staging framework for ARC, from design to implementation. The framework is code-named DTR (Data Transfer Request).

Contents

Overview

ARC's Computing Element (A-REX) performs the task of data transfer for jobs before and after the jobs run. After a growing number of issues with the data management model used by A-REX, it was decided that an entirely new framework should be designed, which could be used by A-REX but also as a stand-alone system for other applications. The issues and the design steps are described in Data_Staging/Design. The new data staging framework uses a three-layer architecture, shown in the figure below:

Datastagingdesign.png

The Generator uses user input of tasks to construct a Data Transfer Request (DTR) per file that needs to be transferred. These DTRs are sent to the Scheduler for processing. The Scheduler sends DTRs to the Pre-processor for anything that needs to be done up until the physical transfer takes place (e.g. cache check, resolve replicas) and then to Delivery for the transfer itself. Once the transfer has finished the Post-processor handles any post-transfer operations (e.g. register replicas, release requests). The number of slots available for each component is limited, so the Scheduler controls queues and decides when to allocate slots to specific DTRs, based on the prioritisation algorithm implemented. See Data_Staging/Prioritizing for more information.

This layered architecture allows any implementation of a particular component to be easily substituted for another, for example a GUI with which users can enter DTRs (Generator) or an external point-to-point file transfer service (Delivery).

Implementation

The new framework is available as of ARC release 11.05. In A-REX it is turned off by default, but it can be turned on to replace the current downloaders and uploaders. It was initially developed in the data_staging subversion branch of arc1 until coding was completed and it was ready to be tested by the wider community. Implementation details can be found in Data_Staging/Design. This branch was merged to the arc1 trunk on 4/2/11 (revision 20148).

The middle and lower layers of the architecture (Scheduler, Processor and Delivery) are implemented as a separate library libarcdatastaging in src/libs/data-staging. This library depends on some low-level common ARC libraries and the DMC modules (which enable various data access protocols) but is independent of other components such as A-REX or ARC clients. A simple Generator is included in this library for testing purposes. A Generator for A-REX is implemented in src/services/a-rex/grid-manager/jobs/dtr_generator.(h|cpp), which performs the task of the down and uploaders - turning job descriptions into data transfer requests.

Using the new framework in A-REX

To enable the new framework the following option should be present in the [grid-manager] section of arc.conf:

newdatastaging=yes

If there is a problem it can be turned back off by setting the option to "no" or removing it. A-REX must be restarted for any change to take effect. However, before doing this it is recommended to drain the queue of jobs in staging states (PREPARING and FINISHING) by for example setting allownew=no, or maxjobs to a very small number so that jobs can still be accepted. This is because in the new framework there is no limit on jobs in staging states, since the limits on number of files transferring etc are handled internally by the data staging framework, and so there can potentially be hundreds or thousands of jobs in these states. Restarting in the old mode will mean that A-REX immediately starts downloader and uploader processes for all those jobs and this could easily overload the system.

Configuration (ARC 1.x)

All other configuration can remain the same as with the old framework and will be used by the new framework. The maximum number of delivery, pre-processor and post-processor slots is taken from the "maxload" option (max frontend jobs * max transferred files). The configuration of transfer shares has also changed slightly so that "share_limit" defines a priority for a share (from 1 to 100) instead of a number of slots.

Configuration (ARC 2.x)

As above, it is not necessary to define extra configuration, but the section [data-staging] can be used to set a more fine-grained selection of parameters, and also enable multi-host data staging.

Parameter Explanation Default Value
maxdelivery Maximum delivery slots 10
maxprocessor Maximum processor slots per state 10
maxemergency Maximum emergency slots for delivery and processor 1
maxprepared Maximum prepared files (for example pinned files using SRM) 200
sharetype Transfer share scheme (dn, voms:vo, voms:group or voms:role) None
definedshare Defined share and priority _default 50
dtrlog Path to file where DTR state is dumped controldir/dtrstate.log
Multi-host related parameters
deliveryservice URL of remote host which can perform data delivery None
localdelivery Whether local delivery should also be done no
remotesizelimit File size limit (in bytes) below which local transfer is always used 0
usehostcert Whether the host certificate should be used in communication with remote delivery services instead of the user's proxy no

Setting these parameters overrides any options set using the old-style configuration for arc 1.x.

The multi-host parameters are explained in more detail in Data_Staging/Multi-host

Example:

[data-staging]
maxdelivery="10"
maxprocessor="20"
maxemergency="2"
maxprepared="50"

sharetype="voms:role"
definedshare="myvo:production 80"

deliveryservice="https://spare.host:60003/datadeliveryservice"
localdelivery="yes"
remotesizelimit="1000000"

Client-side priorities

To specify the priority of jobs on the client side, the "priority" element can be added to an XRSL job description, eg

("priority" = "80")

For a full explanation of how priorities work see Data_Staging/Prioritizing.

gm-jobs -s

The command "gm-jobs -s" to show transfer shares information now shows the same information at the per-file level rather than per-job. The number in "Preparing" are the number of DTRs in TRANSFERRING state, i.e. doing physical transfer. Other DTR states count towards the "Pending" files. For example:

Preparing/Pending files        Transfer share
        2/86                   atlas:null-download
        3/32                   atlas:production-download

As before, per-job logging information is in the <control_dir>/job.id.errors files.

Using the new framework in third-party applications

Data_Staging/API gives examples on how to integrate the data staging framework in third-party applications.

Supported Protocols

The following access and transfer protocols are supported (more details in Data_Staging/Protocols_overview). Note that third-party transfer is not supported.

  • file
  • HTTP(s/g)
  • GridFTP
  • SRM
  • LFC
  • Xrootd (read-only)
  • GFAL (experimental)
  • RLS (deprecated)

Multi-host Data Staging

To increase overall bandwidth, multiple hosts can be used to perform the physical transfers. See Data_Staging/Multi-host for details.

Monitoring

In A-REX the state, priority and share of all DTRs is logged to the file <controldir>/dtrstate.log periodically (every second). This is then used by the Gangliarc framework to show data staging information as ganglia metrics. See Data_Staging/Monitoring for more information.

Advantages

The new system offers many advantages over the previous system, including:

  • High performance - When a transfer finishes in Delivery, there is always another prepared and ready, so the network is always fully used. A file stuck in a pre-processing step does not block others preparing or affect any physical transfers running or queued. Cached files are processed instantly rather than waiting behind those needing transferred. Bulk calls are implemented for some operations of LFC and SRM protocols.
  • Fast - All state is held in memory, which enables extremely fast queue processing. The system knows which files are writing to cache and so does not need to constantly poll the file system for lock files.
  • Clean - When a DTR is cancelled mid-transfer, the destination file is deleted and all resources such as SRM pins and cache locks are cleaned up before returning the DTR to the Generator. On A-REX shutdown all DTRs can be cleanly cancelled in this way.
  • Fault tolerance - The state of the system is frequently dumped to a file, so in the event of crash or power cut, this file can be read to recover the state of ongoing transfers. Transfers stopped mid-way are automatically restarted after cleaning up the half-finished attempt.
  • Intelligence - Error handling has vastly improved so that temporary errors caused by network glitches, timeouts, busy remote services etc are retried transparently.
  • Prioritisation - Both the server admins and users have control over which data transfers have which priority.
  • Monitoring - Admins can see at a glance the state of the system and using a standard framework like Ganglia means admins can monitor ARC in the same way as the rest of their system.
  • Scaleable - An arbitrary number of extra hosts can be easily added to the system to scale up the bandwidth available. The system has been tested with up to tens of thousands of concurrent DTRs.
  • Configurable - The system can run with no configuration changes, or many detailed options can be tweaked.
  • Generic flexible framework - The framework is not specific to ARC's Computing Element (A-REX) and can be used by any generic data transfer application.

Remaining Issues

These issues should be kept in sync with the current state of development

High Priority

  • Testing, testing, testing

Medium Priority

  • Provide a way for the infosys to obtain DTR status information
    • First basic implementation: when DTR changes state write current state to .input or .output file
  • Retry strategy for failed transfers

Low Priority or Future Work/Research

  • Decide whether or not to cancel all DTRs in a job when one fails
    • Current logic: if downloading, cancel all DTRs in job, if uploading don't cancel any
    • Should be configurable by user - also EMI execution service interface allows specifying per-file what to do in case of error
  • Priorities: more sophisticated algorithms for handling priorities
  • Advanced features such as pausing and resuming transfers

Related Pages

Requirements and design phases of the project

Detailed description of DTRs

How to use the data staging library in an external application

Ideas and implementation of priorities

Some numbers showing code performance

Running data staging on multiple hosts

Description of Dmytro Karpenko's conribution to allow others continue his work

Personal tools