This wiki is obsolete; see the NorduGrid web pages for up-to-date information.

Data Staging/DTR

DTR description

DTR stands for Data Transfer Request. It is a structure whose fields fully describe a file transfer to be performed. The generator creates one DTR per file transfer.

Fields of the DTR

More or less required:

  • DTR ID
  • source endpoint
  • destination endpoint
    • for source and destination, a list of metadata such as file size, checksum, creation date etc
    • for source and destination (if applicable) a list of replicas
    • for source and destination (if applicable) current replica
    • for source and destination (if applicable) TURL or delivery-level URL used for transfer
    • for source and destination (if applicable) request ID (in the case of asynchronous requests to remote storage services)
  • credentials
  • cache information
    • if the file is cacheable, the filename in cache
    • cache directories configuration
    • caching state (already in cache, cache currently locked etc)
  • local user information (uid/gid)
  • Job ID this transfer belongs to
  • priority of the transfer - a number set by the generator which flattens priorities
  • transfer share this DTR belongs to
  • sub-share the DTR belongs to - may be set by the Generator
  • tries left
  • flags to handle properties and strategies when dealing with index servers
    • flag to say whether DTR is replicating inside the same logical filename
    • flag to say whether DTR should force registration to an existing logical filename, if the source is different
  • mapping info - local files to which remote files may be mapped in the configuration (copyurl/linkurl)
  • status of the DTR
  • error status
    • type of error
    • location of error
    • text description of error detail
  • number of bytes transferred/offset
  • timing properties
    • timeout - time which DTR is allowed to remain in current state
    • creation time
    • last modification time
    • process time - wait until this time to do further processing
  • cancel (set to true if request is to be cancelled)
  • bulk operation flags to combine several DTRs in a bulk request
  • delivery endpoint, whether Delivery is to be carried out by a local process or remote service
  • current owner - which component is in charge of this DTR right now
  • logger object, so each DTR can have its own log
  • lock, to avoid write collisions, since DTRs can be modified by several processes
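
The field list above can be pictured as a plain record type. The following is a minimal Python sketch covering only a subset of the fields; it is not the ARC implementation. All class names, attribute names, default values (such as a priority of 50 and 3 tries) and the example URLs are assumptions made for illustration.

```python
import threading
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Endpoint:
    """One side of a transfer (source or destination). Hypothetical layout."""
    url: str
    metadata: dict = field(default_factory=dict)   # file size, checksum, creation date...
    replicas: list = field(default_factory=list)   # known replicas, if applicable
    current_replica: Optional[str] = None
    turl: Optional[str] = None                     # delivery-level URL used for transfer
    request_id: Optional[str] = None               # for asynchronous storage requests


@dataclass
class DTR:
    """A subset of the fields listed above; names and defaults are illustrative."""
    source: Endpoint
    destination: Endpoint
    job_id: str
    priority: int = 50            # assumed default, set by the generator
    share: str = ""
    tries_left: int = 3           # assumed default
    status: str = "NEW"
    error: Optional[str] = None
    bytes_transferred: int = 0
    cancel: bool = False
    owner: str = "generator"
    dtr_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created: float = field(default_factory=time.time)
    lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )

    def set_status(self, status: str) -> None:
        # Hold the lock so concurrent components do not interleave writes.
        with self.lock:
            self.status = status


# Hypothetical endpoints, for illustration only.
dtr = DTR(
    Endpoint("srm://example.org/data/file1"),
    Endpoint("file:///tmp/file1"),
    job_id="job-1",
)
dtr.set_status("CHECK_CACHE")
```

Keeping the lock inside the record mirrors the last item in the list: every component that modifies the DTR goes through the same lock.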

Possible

  • affiliation (if we use the affiliation of multiple DTRs, see right below).
  • history of states

Multiple DTRs may be affiliated together. Possible reasons and uses:

  • They belong to the same job
  • They belong to a group of jobs which the user has indicated should preferably be processed together
  • They belong to the same VO, with assigned priorities to be applied within the group
  • Failure of one DTR in the group may cancel processing of the other DTRs (should be implemented in the Generator)
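
As a rough illustration of the last point, the sketch below shows how a generator might group DTRs and flag the rest of a group for cancellation when one member fails. The `AffiliationRegistry` class, its methods and the dictionary-based DTRs are hypothetical and chosen only to keep the example short.

```python
from collections import defaultdict


class AffiliationRegistry:
    """Hypothetical helper a generator could use to group DTRs."""

    def __init__(self):
        self._groups = defaultdict(list)   # group key -> list of DTR records

    def add(self, key, dtr):
        self._groups[key].append(dtr)

    def on_failure(self, key, failed_dtr):
        """Flag every other DTR in the group for cancellation.

        Returns the number of DTRs newly flagged.
        """
        flagged = 0
        for dtr in self._groups[key]:
            if dtr is not failed_dtr and not dtr["cancel"]:
                dtr["cancel"] = True
                flagged += 1
        return flagged


registry = AffiliationRegistry()
a = {"id": "dtr-a", "job": "job-1", "cancel": False}
b = {"id": "dtr-b", "job": "job-1", "cancel": False}
c = {"id": "dtr-c", "job": "job-2", "cancel": False}
for d in (a, b, c):
    registry.add(d["job"], d)   # here the affiliation key is the job ID

flagged = registry.on_failure("job-1", a)
```

Here the group key is the job ID, but the same structure would work for a VO name or a user-defined bundle of jobs.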

State transitions of DTR

All possible states of a DTR, with arrows indicating the normal flow of DTRs between states. Each state is explained in detail below. Error conditions are not included here but are shown in another diagram further down.

[Figure: DTR state diagram]

Status codes

The following table describes all non-error status codes, and also the action taken in the event of a cancellation request being received while in that state. In general if all of the data transfer has been completed before receiving a cancellation request, the destination file is not deleted. The main reason for this is to preserve cache files, as the user may wish to run the same job soon after cancelling it.

Statuses of the DTR

Statuses set by the generator

| Status Code | Text Description | Action on cancel |
|---|---|---|
| NEW | The DTR has just been built by the generator | Return to generator |
| CANCEL | A request has been made to cancel the DTR | n/a |

Statuses set by the scheduler

| Status Code | Text Description | Action on cancel |
|---|---|---|
| CHECK_CACHE | The DTR destination is cacheable and the cache should be checked for the file's existence | Return to generator |
| RESOLVE | The DTR source is a meta-protocol and should be resolved | Set to PROCESS_CACHE to remove any cache locks |
| QUERY_REPLICA | The DTR source should be queried to check existence, check file size, checksum etc. | Set to REGISTER_REPLICA to remove pre-registered destination |
| PRE_CLEAN | The destination in the DTR should be deleted before writing | Set to REGISTER_REPLICA to remove pre-registered destination |
| STAGE_PREPARE_SOURCE | The DTR source is a meta-protocol which must be prepared or staged | Set to REGISTER_REPLICA to remove pre-registered destination |
| STAGE_PREPARE_DESTINATION | The DTR destination is a meta-protocol which must be prepared or staged | Set to REGISTER_REPLICA to remove pre-registered destination |
| TRANSFER_WAIT | The DTR is ready to be sent to delivery but must wait due to transfer limits or priority settings | Set to RELEASE_REQUEST |
| TRANSFER | The DTR should be transferred immediately | Set to RELEASE_REQUEST |
| RELEASE_REQUEST | The DTR transfer has finished and any requests made on remote storage should be released | Abort request and delete destination, set to REGISTER_REPLICA |
| REGISTER_REPLICA | The DTR destination is a meta-protocol and the new replica should be registered | Delete destination and set to PROCESS_CACHE |
| PROCESS_CACHE | The DTR destination is cacheable and the cached file should be unlocked and linked/copied to the session dir | Delete cache file |
| DONE | The DTR completed successfully | Do nothing |
| CANCELLED | The DTR has been cancelled successfully | n/a |
| ERROR | An error occurred with the DTR | Do nothing |

Statuses set by the pre-processor

| Status Code | Text Description | Action on cancel |
|---|---|---|
| CHECKING_CACHE | The pre-processor is checking the cache | Wait until complete, then set to CACHE_CHECKED. The scheduler will then set to PROCESS_CACHE |
| CACHE_WAIT | The cache file is locked and the scheduler should wait before trying to obtain the lock | Scheduler will return to generator |
| CACHE_CHECKED | The cache check is complete | Scheduler will set to PROCESS_CACHE |
| RESOLVING | The pre-processor is resolving replicas | Wait until complete, then set to RESOLVED. The scheduler will then set to REGISTER_REPLICA |
| RESOLVED | The replica resolution is complete | Scheduler will set to REGISTER_REPLICA |
| QUERYING_REPLICA | The pre-processor is querying a replica | Wait until complete, then set to REPLICA_QUERIED. The scheduler will then set to REGISTER_REPLICA |
| REPLICA_QUERIED | The replica querying is complete | Scheduler will set to REGISTER_REPLICA |
| PRE_CLEANING | The pre-processor is deleting the destination file | Wait until complete, then set to PRE_CLEANED. The scheduler will then set to REGISTER_REPLICA |
| PRE_CLEANED | The destination file has been deleted | Scheduler will set to REGISTER_REPLICA |
| STAGING_PREPARING | The pre-processor is making a staging or preparing request | Wait until complete, then scheduler will set to RELEASE_REQUEST so it can be aborted |
| STAGING_PREPARING_WAIT | The staging or preparing request is not ready and the scheduler should wait before polling the status of the request | Scheduler will set to RELEASE_REQUEST so it can be aborted |
| STAGED_PREPARED | The staging or preparing request is complete | Scheduler will set to RELEASE_REQUEST so it can be aborted |

Statuses set by the delivery

| Status Code | Text Description | Action on cancel |
|---|---|---|
| TRANSFERRING | The transfer of the DTR is on-going | Stop transfer and set to RELEASE_REQUEST. Delivery will delete the incomplete file and the request will be aborted |
| TRANSFERRED | The transfer completed successfully | Scheduler will abort the request |

Statuses set by the post-processor

| Status Code | Text Description | Action on cancel |
|---|---|---|
| RELEASING_REQUEST | The post-processor is releasing a stage or prepare request | Wait until finished, then set to REGISTER_REPLICA to unregister the file |
| REQUEST_RELEASED | The release of stage or prepare request is complete | Set to REGISTER_REPLICA to unregister the file |
| REGISTERING_REPLICA | The post-processor is registering a replica in an index service | Continue as normal |
| REPLICA_REGISTERED | Replica registration is complete | Continue as normal |
| PROCESSING_CACHE | The post-processor is releasing locks and copying/linking the cached file to the session dir | Continue as normal |
| CACHE_PROCESSED | Cache processing is complete | Continue as normal |
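
The "Action on cancel" entries for the scheduler statuses can be read as a small lookup table. The sketch below encodes only those scheduler rows whose action is "set to" another status, and follows them until no further entry applies. The names `CANCEL_NEXT` and `cancel_path` are hypothetical, and chaining the steps into one clean-up path is an assumption about how the scheduler applies the table, not something the table states outright.

```python
# Next status when a cancellation request arrives in a given scheduler status.
# Only rows whose cancel action is "set to <status>" are included.
CANCEL_NEXT = {
    "RESOLVE": "PROCESS_CACHE",
    "QUERY_REPLICA": "REGISTER_REPLICA",
    "PRE_CLEAN": "REGISTER_REPLICA",
    "STAGE_PREPARE_SOURCE": "REGISTER_REPLICA",
    "STAGE_PREPARE_DESTINATION": "REGISTER_REPLICA",
    "TRANSFER_WAIT": "RELEASE_REQUEST",
    "TRANSFER": "RELEASE_REQUEST",
    "RELEASE_REQUEST": "REGISTER_REPLICA",
    "REGISTER_REPLICA": "PROCESS_CACHE",
}


def cancel_path(status):
    """Return the statuses visited after a cancel arrives in `status`.

    Assumes each clean-up status is itself processed under cancellation,
    so the table is applied repeatedly until no entry matches.
    """
    path = []
    while status in CANCEL_NEXT:
        status = CANCEL_NEXT[status]
        path.append(status)
    return path


transfer_cleanup = cancel_path("TRANSFER")
resolve_cleanup = cancel_path("RESOLVE")
```

Under this reading, a DTR cancelled while waiting for transfer first has its storage requests released, then its pre-registered replica removed, and finally its cache locks cleaned up.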

Error Conditions of DTRs

The following diagram shows possible error conditions and actions taken. For simplicity and because all error handling logic takes place within the scheduler, the pre- and post-processor and the delivery layers are not shown.

[Figure: DTR error state diagram]

Errors are categorised into the following types:

| Error | Explanation | Retryable? | Action |
|---|---|---|---|
| INTERNAL_LOGIC_ERROR | Internal error in data staging logic | No | Stop processing and report back to generator |
| INTERNAL_PROCESS_ERROR | Internal error like losing contact with an external process | Yes | Clean if necessary and retry |
| SELF_REPLICATION_ERROR | Attempt to copy a file to itself | No | Return to generator |
| CACHE_ERROR | A problem occurred in cache handling | Yes | Retry without caching |
| TEMPORARY_REMOTE_ERROR | Error such as connection timeout on remote service | Yes | Retry with an increasing back-off |
| PERMANENT_REMOTE_ERROR | Error such as file not existing, permission denied etc on remote service | No | Follow cancellation steps and return failed DTR to generator |
| LOCAL_FILE_ERROR | Error with a local file | No | Follow cancellation steps and return to generator |
| TRANSFER_SPEED_ERROR | Transfer rate was below specified limits | Yes | Retry transfer. If all retries fail, report back to generator - it will make the decision on whether to cancel other related DTRs. (Future work: make decision on whether other transfers caused slow transfer and whether cancelling others would help or should be done) |
| STAGING_TIMEOUT_ERROR | The staging process took too long | No | Try a different replica - if none available, cancel and report back to generator |
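
The "Retryable?" column combines naturally with the "tries left" field of the DTR. The sketch below shows one way a scheduler could decide whether to retry. The function names are hypothetical, and the exponential back-off with a 10 second base and 600 second cap is an assumption; the table only says the back-off increases.

```python
# Retryable flag per error type, copied from the table above.
RETRYABLE = {
    "INTERNAL_LOGIC_ERROR": False,
    "INTERNAL_PROCESS_ERROR": True,
    "SELF_REPLICATION_ERROR": False,
    "CACHE_ERROR": True,
    "TEMPORARY_REMOTE_ERROR": True,
    "PERMANENT_REMOTE_ERROR": False,
    "LOCAL_FILE_ERROR": False,
    "TRANSFER_SPEED_ERROR": True,
    "STAGING_TIMEOUT_ERROR": False,
}


def should_retry(error, tries_left):
    """Retry only retryable errors, and only while tries remain."""
    return RETRYABLE[error] and tries_left > 0


def backoff_delay(attempt, base=10.0, cap=600.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Exponential growth with a cap is an assumed schedule.
    """
    return min(cap, base * 2 ** attempt)


delays = [backoff_delay(n) for n in range(4)]
```

The computed delay could be stored in the DTR's process time field so that the scheduler leaves the DTR alone until that time has passed.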

Methods of DTRs

DTR::push (DTR, receiver) -- pass the DTR from one process to another, e.g. DTR::push (dtr, preprocessor)

Implementation

Within the Data Staging framework there is a global list of DTRs. Pointers to the DTRs are passed around between components, which can modify them directly and push them to each other.
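
The push method and the shared list can be illustrated with a short Python sketch, where object references play the role of pointers. The `Component` class, its `inbox` and `receive` members, and the attribute names on `DTR` are assumptions made for this example and do not reflect the real C++ interfaces.

```python
class Component:
    """Hypothetical stand-in for the scheduler, pre-processor, delivery etc."""

    def __init__(self, name):
        self.name = name
        self.inbox = []

    def receive(self, dtr):
        self.inbox.append(dtr)


class DTR:
    def __init__(self, dtr_id):
        self.dtr_id = dtr_id
        self.owner = None
        self.status = "NEW"

    @staticmethod
    def push(dtr, receiver):
        # Hand over the same object (not a copy) and record the new owner.
        dtr.owner = receiver.name
        receiver.receive(dtr)


all_dtrs = []                      # the global list of DTRs
scheduler = Component("scheduler")
preprocessor = Component("preprocessor")

dtr = DTR("dtr-1")
all_dtrs.append(dtr)

DTR.push(dtr, scheduler)
dtr.status = "CHECK_CACHE"
DTR.push(dtr, preprocessor)

# The pre-processor modifies the DTR it was handed; the global list sees it too.
preprocessor.inbox[0].status = "CHECKING_CACHE"
```

Because every component holds a reference to the same object, a change made by one is immediately visible to the others, which is why each DTR carries its own lock.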