This wiki is obsolete, see the NorduGrid web pages for up to date information.

Data Staging/Existing Solutions


Existing third-party solutions

gLite FTS

  • FTS is a service for point-to-point data transfer using the SRM protocol
  • Channels are set up between endpoints (SRM hostnames registered in info system)
    • A channel must exist between two endpoints for data to be transferred between them
    • "Catch-all" channels can also exist for endpoints not configured with standard channels
  • Each channel is configured with a maximum number of transfers, the allowed VOs, and each VO's share (% of simultaneous transfers)
  • Some users (DNs) can be configured as channel managers or VO managers
    • VO managers can change the priority of transfers within the VO
  • Several types of agents pick up and manage requests from the database
  • Pros:
    • Fulfils requirements for priority and VO share handling
    • Web service, so we could create our own client rather than requiring gLite tools at all sites
  • Cons:
    • Administration of extra service, including database
    • Only SRM endpoints supported
    • A central service would require caches to be exposed
  • Possibilities
    • One central ARC FTS server
    • Deploy an FTS service per site with direct access to cache
    • Take the VO and channel handling parts of the service and integrate them with the Grid Manager
    • Implement our own solution inspired by some ideas from FTS
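The channel and VO-share mechanism described above can be sketched as follows. This is a purely illustrative model, not the real FTS schema or API; all class and field names are assumptions. Each channel caps its simultaneous transfers and divides that capacity among VOs by percentage share:

```python
# Illustrative sketch of FTS-style channel scheduling (hypothetical names,
# not the real FTS data model): a channel between two SRM endpoints caps
# simultaneous transfers and splits capacity among VOs by percentage share.

from dataclasses import dataclass, field

@dataclass
class Channel:
    source: str                  # source SRM endpoint hostname
    dest: str                    # destination SRM endpoint hostname
    max_transfers: int           # cap on simultaneous transfers on this channel
    vo_share: dict               # VO name -> % of simultaneous transfers
    active: dict = field(default_factory=dict)  # VO name -> running count

    def can_start(self, vo: str) -> bool:
        """True if one more transfer for `vo` fits both the channel cap
        and the VO's configured share."""
        if sum(self.active.values()) >= self.max_transfers:
            return False
        allowed = self.max_transfers * self.vo_share.get(vo, 0) / 100
        return self.active.get(vo, 0) < allowed

    def start(self, vo: str) -> bool:
        """Start a transfer for `vo` if limits permit."""
        if not self.can_start(vo):
            return False
        self.active[vo] = self.active.get(vo, 0) + 1
        return True

# A channel must exist between two endpoints for data to move between them.
ch = Channel("srm.site-a.example.org", "srm.site-b.example.org",
             max_transfers=10, vo_share={"atlas": 70, "alice": 30})
```

With this configuration, "atlas" transfers stop being admitted after 7 simultaneous slots and "alice" after 3, matching the 70/30 share split.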


BitTorrent

  • BitTorrent is a peer-to-peer file-sharing protocol.
  • A seeder (a client that wants to expose files) creates a torrent file, containing information about a file or set of files to be made available, and passes it to a so-called tracker. The tracker coordinates distribution and makes other clients aware of the availability of these files. Clients must download the torrent file before they can start the real download.
  • A peer (a client that wants to download) obtains information about available files through the tracker and connects to the seeder to download the file. As new peers connect and request the same file, each receives a different piece of the data from the seeder. Once multiple peers hold different pieces, BitTorrent allows each of them to become a source for the portions it has. After a peer has successfully downloaded the complete file, it can shift roles and become an additional seeder. This distributed nature of BitTorrent leads to a flood-like spreading of a file throughout the peers.
  • Pros:
    • Decentralized by its nature, so perfectly compliant with grid (and ARC in particular) principles.
    • By design has some features we'd like to have in our framework, namely:
      • Effective usage of bandwidth.
      • Transfer should be capable of pausing (temporary cancel) and resuming.
      • Transfer from multiple alternative locations.
    • The current information system of ARC could be well utilized for BitTorrent with only small extensions: local information providers create torrent files and pass them to trackers, and index services additionally serve as file trackers.
  • Cons:
    • The behaviour of the transfer system seems very hard to predict, which may make the planned data-aware job scheduling hard or even impossible to implement (note: since grid infrastructures are usually far more stable than typical BitTorrent networks, this may not be a big problem)
    • All participants in a BitTorrent network are equal – sharing and prioritization capabilities would need to be introduced from scratch.
    • Significant delays before the download starts, because it may take time for enough peer connections to be established. This may raise the transfer time, especially for small files, to unacceptable levels.
    • It is currently unclear how to combine BitTorrent with GridFTP, SRM and the other protocols used in grids.
  • Possibilities
    • Implement our own solution inspired by some ideas from BitTorrent
    • Use BitTorrent technology for some set of low-priority tasks whose transfer times are not critical.
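The flood-like spreading described above can be illustrated with a toy swarm simulation. This is not BitTorrent's actual wire protocol — just a sketch of the core idea that every peer holding a piece becomes an additional source for it:

```python
# Toy model of BitTorrent-style piece spreading (purely illustrative):
# a file is split into pieces, one seeder starts with all of them, and
# each peer that obtains a piece becomes a source for it, so the file
# floods through the swarm instead of flowing only from the seeder.

import random

def simulate_swarm(n_pieces: int, n_peers: int, seed: int = 1) -> int:
    """Return the number of exchange rounds until every peer has all pieces."""
    rng = random.Random(seed)
    peers = [set() for _ in range(n_peers)]
    peers[0] = set(range(n_pieces))        # peer 0 is the initial seeder
    rounds = 0
    while any(len(p) < n_pieces for p in peers):
        rounds += 1
        for peer in peers:
            missing = set(range(n_pieces)) - peer
            if not missing:
                continue
            # Ask a random peer for one piece it has and we lack.
            other = peers[rng.randrange(n_peers)]
            available = missing & other
            if available:
                peer.add(rng.choice(sorted(available)))
    return rounds
```

Because later downloaders can fetch pieces from every earlier downloader, not just the seeder, completion time grows slowly with swarm size — the property that makes the protocol bandwidth-efficient.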


Stork

  • Stork is a batch system for data transfer
  • The Stork process runs as a daemon
  • Clients submit data transfer "jobs" consisting of pairs of sources and destinations
  • The batch system schedules and executes these jobs
  • Pros
    • Fits with ARC architecture - can run a Stork on each ARC headnode
    • Can configure limits on number of simultaneous jobs
  • Cons
    • Doesn't add anything to what ARC already does
    • No priorities or VO handling
    • Only supports HTTP and (Grid)FTP protocols
    • Limited platform support
  • Possibilities
    • The current version is too limited for our requirements, but it may be possible to plug in an enhanced future version
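The batch-queue model above can be sketched in a few lines. This is an illustration of the idea, not Stork's real API or job language; all names are assumptions. Clients submit source/destination pairs as jobs, and the daemon runs them up to a configured concurrency limit:

```python
# Minimal sketch of the Stork idea (illustrative, not Stork's real API):
# data transfers are "jobs" in a batch queue, executed up to a
# configurable limit on simultaneous transfers.

from collections import deque

class TransferQueue:
    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.pending = deque()   # queued (source_url, dest_url) pairs
        self.running = []        # jobs currently being executed

    def submit(self, source: str, dest: str) -> None:
        """Clients submit a transfer job as a source/destination pair."""
        self.pending.append((source, dest))

    def schedule(self) -> None:
        """Start pending jobs while under the concurrency limit."""
        while self.pending and len(self.running) < self.max_concurrent:
            self.running.append(self.pending.popleft())

q = TransferQueue(max_concurrent=2)
q.submit("gsiftp://site-a.example.org/file1", "https://site-b.example.org/file1")
q.submit("gsiftp://site-a.example.org/file2", "https://site-b.example.org/file2")
q.submit("gsiftp://site-a.example.org/file3", "https://site-b.example.org/file3")
q.schedule()   # two jobs start; the third waits in the queue
```

This is essentially what ARC's downloader/uploader already does per job, which is why the "Cons" above note that Stork adds little beyond a per-headnode concurrency cap.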