Data management requirements
Collected thoughts, requirements and dreams, coming mainly from WLCG experience. Comments, additions and corrections are welcome (oxana).
- 1 Limited nature of European Grid middlewares
- 2 What makes Grid middleware useful
- 3 Storage and data management services and utilities
- 4 On caches and proxy servers
- 5 On archived data
- 6 On data discovery
- 7 Data-driven brokering
- 8 Unsorted implementation-like dreams and examples
Limited nature of European Grid middlewares
Existing European Grid middlewares are inherently handicapped, as they try to solve only the problem of task (job) execution. Currently accepted standards are all about job management: JSDL is a job description language, BES is a basic execution service, Glue2 is probably 90% about execution service description, and so on. All authentication and authorisation tools and utilities are designed with job execution in mind.
Data manipulation and storage management has always been the ugly stepdaughter of Grid middlewares. From GDMP through dCache to Chelonia, data management solutions were designed and developed independently of Grid middlewares, and then painstakingly integrated, with varying degrees of success. The success was never complete: in WLCG, we now have "dark data", orphan records and other peculiarities that require immense human effort to sort out.
What makes Grid middleware useful
Let us consider a degenerate case of a Grid job, namely, a "conventional" job submitted by a user logged in directly to the computing resource via e.g. SSH or a Globus submission utility. Even if all authorisation, book-keeping, monitoring and other execution-related auxiliary services are working flawlessly, the job is doomed if the data are not available. If the data are available, the absence of all such auxiliary services will not prevent normal execution, and the desired result will be achieved.
"Availability" is a complex concept, which can be expressed through the following requirements:
- Availability must not be achieved only through the creation of multiple replicas of data; while this is exploited by P2P file sharing networks, it may impose an unnecessary burden on the owners of very large data sets
- Regardless of the requirement above, availability of "popular" data must be better (i.e., faster), in a way similar to file sharing networks: the more popular a file is, the more copies exist and the faster the access
- Availability includes local and remote access; in case a local copy of the data is not feasible, remote access must be deployed - via a distributed file system, via a specialised protocol, or both
- Ideally, available data in Grid terms should mean that it is possible to mount and unmount remote storage at any time, just like local storage media. Obviously, once mounted, storage cannot be unmounted while it is being accessed by local processes
- Availability must include protection from unauthorised or undesired removal; in a Grid environment, a storage administrator, or even a VO manager, cannot know for sure who needs a particular data set. It must be possible for any user to flag any data as precious to him/her for a specified amount of time, such that no removal can be done without his/her explicit approval
- Availability must not necessarily involve heavyweight security solutions; it must be possible for one researcher to make her data available to another researcher without becoming a member of a VO and without deploying an authorisation service
- Availability must assume mobility of data. This means that if data are on a mobile medium, it should not be necessary to move them to a dedicated Grid storage: it should be sufficient to tell a Grid storage service that the mobile device currently holds data which are shared with specified users or services. The same concept applies to data that need to be moved from one device to another
- Availability must be instant, or as close to instant as possible: as soon as the device holding the data is made known to the Grid, authorised users should be able to discover and access the data on this device. As soon as the device is off-line, Grid queries for its data must return no results.
Availability can be high or low, primarily in terms of time needed to access data. Grid optimisation tools must always prefer data with higher availability (e.g., on a faster medium, or over a faster network).
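As an illustration, the preference for higher availability could be expressed as a simple replica ranking. This is a minimal sketch only; the medium classes, latency figures and field names are assumptions for illustration, not part of any existing Grid middleware:

```python
from dataclasses import dataclass

# Illustrative relative access costs per medium class (lower is better);
# the concrete numbers are assumptions, not measurements.
MEDIUM_LATENCY = {"ssd": 1, "disk": 5, "disk_cache": 5, "tape": 1000}

@dataclass
class Replica:
    url: str                # physical location of this copy
    medium: str             # "ssd", "disk", "disk_cache" or "tape"
    online: bool            # off-line devices must not be returned at all
    network_rtt_ms: float   # measured round-trip time to the replica's host

def rank_replicas(replicas):
    """Return on-line replicas ordered from highest to lowest availability."""
    usable = [r for r in replicas if r.online]
    return sorted(usable, key=lambda r: MEDIUM_LATENCY[r.medium] + r.network_rtt_ms)
```

A brokering tool would then simply pick the first entry of the ranked list, falling back to later ones on failure.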
Storage and data management services and utilities
For huge data volumes, data storage, management, movement and curation are very challenging tasks that require dedicated tools.
Data transfer tasks bear many similarities with computational jobs; namely, they are:
- characterised by resource usage, typically storage space and bandwidth, but also by CPU consumption etc.
- require authorisation for both transfer initiation and data access (e.g., not every person who is authorised to read a data set may be authorised to launch a transfer task; conversely, a person initiating a transfer should always have read access)
- can be queued
- have different priorities
- can have complex source and complex destination (e.g. aggregating data from multiple sources in one destination, or distributing data from one source to multiple destinations)
- may have alternatives that need optimisation/brokering (e.g., between multiple replicas for a source file, or between multiple destinations)
- may have recoverable failures
- may be grouped in batches or workflows
Utilities and tools that facilitate data access and transfer must be separate from computational tasks. While one cannot always prevent computational tasks from executing transfers, useful Grid middleware must always offer dedicated services that pre-stage input data and post-stage output data. This makes it possible to optimise computational resource usage and throttle bandwidth to/from a site.
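The similarities listed above suggest that a transfer task can be modelled much like a computational job, with a priority queue in front of the transfer service. A minimal sketch, assuming nothing beyond the Python standard library (class and field names are illustrative):

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class TransferTask:
    priority: int                               # lower number = higher priority
    sources: list = field(compare=False)        # alternative source replicas
    destinations: list = field(compare=False)   # one or many target endpoints
    retries_left: int = field(default=3, compare=False)  # recoverable failures

class TransferQueue:
    """Priority queue of transfer tasks; only priority affects ordering."""
    def __init__(self):
        self._heap = []

    def submit(self, task):
        heapq.heappush(self._heap, task)

    def next_task(self):
        return heapq.heappop(self._heap) if self._heap else None
```

Complex sources and destinations map naturally onto the two lists, and a failed task with `retries_left > 0` can simply be re-submitted.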
On caches and proxy servers
Local disk caches and caches in proxy servers are often discussed in relation to Grid data availability.
In general, a cache may be considered as a way to implement data "mounting", though it has significant latency. This latency refers both to the time needed to "mount" the medium containing data (populate the cache) and to unmount it (cache expiration).
A local disk cache is not always feasible. Experience shows that an average-sized computational cluster serving a large experiment at the LHC would require 100-150 TB of highly available local disk cache to sustain a two-week lifetime of cached data. This may not be an affordable solution for everybody.
A proxy server may help increase availability by holding additional replicas of popular data; this, however, is a solution that only very large VOs can afford.
Regardless of these limitations, both local disk caches and proxy servers must be considered valid and necessary components of Grid middlewares. They should satisfy the following requirements:
- It must be possible to deliberately pre-populate a local disk cache or a proxy server cache
- These temporary storages must satisfy all the availability requirements listed above; i.e., they must behave no differently from, e.g., a mobile storage device
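The cache behaviour described above - deliberate pre-population ("mount") and expiration ("unmount") - could be sketched as follows; the class name, the lazy-expiry policy and the default lifetime are assumptions for illustration:

```python
import time

class DiskCache:
    """Toy model of a local disk cache with a fixed data lifetime."""

    def __init__(self, lifetime_s=14 * 24 * 3600):  # e.g. a two-week lifetime
        self.lifetime_s = lifetime_s
        self._entries = {}  # GUID -> (local_path, insertion_time)

    def prepopulate(self, guid, local_path, now=None):
        """Deliberately place data in the cache ahead of any job."""
        self._entries[guid] = (local_path, now if now is not None else time.time())

    def lookup(self, guid, now=None):
        """Return the local path, or None if absent or expired."""
        entry = self._entries.get(guid)
        if entry is None:
            return None
        path, inserted = entry
        now = now if now is not None else time.time()
        if now - inserted > self.lifetime_s:
            del self._entries[guid]  # lazy expiration: the "unmount" latency
            return None
        return path
```

The `now` parameter only exists to make the expiry behaviour testable; a real cache would also track space usage and evict least-recently-used entries.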
On archived data
In large projects, data are archived on high-latency media, such as tape. While in general tape storage must satisfy all the requirements listed above, high latency implies special requirements of its own. Sometimes the terms "offline", "near-line" and "online" are used to distinguish these latency classes.
The following requirements stem from the peculiarities of high-latency storage:
- Tools dealing with data access must always be able to recognize high-latency storage.
- Data in high-latency storage must always be considered as having the lowest possible availability.
- In case data in high-latency storage become "popular", their availability must be improved, e.g. by automatic creation of low-latency replicas
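The last requirement could be sketched as a popularity counter that requests staging once an access threshold is crossed; the threshold value and the callback interface are illustrative assumptions, not an existing API:

```python
from collections import Counter

class StagingMonitor:
    """Request a low-latency replica once tape-resident data become popular."""

    def __init__(self, stage_callback, threshold=10):
        self.hits = Counter()          # GUID -> number of tape accesses seen
        self.threshold = threshold     # illustrative popularity threshold
        self.stage_callback = stage_callback  # e.g. submits a tape recall

    def record_access(self, guid, on_tape):
        if not on_tape:
            return  # low-latency replicas need no action
        self.hits[guid] += 1
        if self.hits[guid] == self.threshold:
            # fire exactly once, when the threshold is first reached
            self.stage_callback(guid)
```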
On data discovery
Data indexing is a service necessary to discover Grid data. Reliable data indexing combined with a well-defined namespace would effectively implement a Grid file system.
For completeness, metadata (size, timestamp, ownership, access rights etc.) can and should be stored in such a "file system", much as POSIX defines. Metadata assist in file discovery, as Grid tools are expected to be able to query them.
Historically, data indexing has often been developed as independent services. This causes much grief, as indices become increasingly de-synchronised with the actual stored data.
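As a sketch of metadata-assisted discovery, a query over an index of POSIX-like metadata might look as follows; the catalogue schema, the file names and the query interface are all assumptions for illustration:

```python
def find(index, owner=None, min_size=0):
    """Query a data index: index maps logical name -> metadata dict."""
    return [name for name, meta in index.items()
            if (owner is None or meta["owner"] == owner)
            and meta["size"] >= min_size]

# A toy catalogue; entries and values are purely illustrative.
catalogue = {
    "/grid/exp/run1.root": {"size": 2_000_000_000, "owner": "oxana", "mtime": 0},
    "/grid/exp/log.txt":   {"size": 4_096,         "owner": "oxana", "mtime": 0},
}
```

The desynchronisation problem mentioned above appears exactly when `catalogue` and the actual storage are updated independently; a trustworthy index must be updated in the same transaction as the data it describes.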
Data-driven brokering
The early ideas about optimising the processing of large distributed datasets were developed by the MONARC project, commissioned by CERN. Essentially, the MONARC model foresaw data placement according to a hierarchical storage model consisting of several storage Tiers. Processing would then take place at the locations where the data reside. That is, MONARC argued that it is better to ship tasks to data than the other way around. This was the driving architectural approach for numerous data Grid solutions, e.g. gLite or AliEn. ARC has a different approach: it assumes excellent networks and optimises data movement by means of caching.
Despite being plagued by failures due to inconsistencies among the various databases in existing solutions, or being absent altogether, data-driven brokering is still a much-needed feature. Recently, one can see approaches in which the "moving jobs to data" model is extended, both in terms of data and in terms of location, see e.g. the Panda system. Data may refer to a large set of files, and location may refer to a group of sites over which this set is spread. This approach is somewhere in between moving data to jobs and moving jobs to data.
The following basic functionalities are needed for enabling data-driven brokering:
- It must be possible to resolve user-defined input data into actual physical locations (there can be many, if the input is a data set); typically, one would expect this to be done by the data discovery services
- Physical locations may include disk storage, tape storage or disk cache
- Only locations where the user is authorised can be considered useful
- It must be possible to identify computing services near physical locations; this is expected to be aided by an information system
- Proximity may be defined in network terms, or in administrative terms
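The brokering steps above - resolving input data to physical locations, filtering by authorisation, and matching nearby computing services - could be sketched as follows, using in-memory dictionaries as stand-ins for the data discovery and information services (administrative proximity is assumed; all names are illustrative):

```python
def broker(dataset, replica_catalogue, user_acl, storage_site, compute_sites):
    """Return computing sites 'near' authorised replicas of the dataset.

    Proximity here is administrative: a computing site is considered near
    a replica when the same site hosts both the storage and the CPUs.
    """
    # 1. resolve logical names into physical locations (data discovery)
    locations = {loc for lfn in dataset
                 for loc in replica_catalogue.get(lfn, [])}
    # 2. keep only locations where the user is authorised
    authorised = {loc for loc in locations if loc in user_acl}
    # 3. identify computing services near those locations (information system)
    nearby = {storage_site[loc] for loc in authorised}
    return sorted(site for site in compute_sites if site in nearby)
```

A network-based notion of proximity would replace step 3 with a latency or bandwidth lookup between each computing site and each authorised location.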
Unsorted implementation-like dreams and examples
- Every device holding data will have to be "served" and otherwise managed via a Grid "tool", much as e.g. BitTorrent or Opera manage shared folders. That is, if I want to mount my USB key on the Grid, I must make sure that the PC in the Internet cafe around the corner has the necessary Grid tool installed. If it is not installed, I should be able to run it from my USB key.
- Such a Grid data management tool must have an interface through which I can obtain a unique certificate for my data volume; everybody who has a personal Grid certificate can automatically get certificates for their data. Those who accept these certificates will be able to use the data in the way I allow (read, write, etc.) - provided, of course, that I authorise their certificates, too. This tool must have a nice graphical interface that allows me to manage certificates and authorisation. Different volumes or storage media may have different certificates.
- Every Grid file must get a unique identifier (GUID, hash, whatever) the first time it appears on the Grid. Identical files (replicas) must have identical identifiers.
- Something must keep count of the online copies (replicas) of the same file, much like file sharing networks do.
- It must be possible to share files directly, without having to register them on the Grid. That is, it should be sufficient to send the IP address of the Internet cafe desktop in Japan with my USB key plugged in to a friend in Iceland, and she should be able to mount my USB key, run her volcano analysis programme over my data and tell me whether that Japanese volcano is OK while I'm chewing my sushi. All secure, encrypted, etc.
- There must be a way to aggregate different storage volumes/media and characterise them as a single source or a single destination. For example, my data span several physical volumes distributed across different places, and yet I want to mount them as one folder, because my analysis programme works this way. A distributed storage serving many users at once, like the one created by NDGF, is an example of large storage volumes aggregated into one facility with shared policies. This is not so important for USB keys and data shared in a peer-to-peer manner.
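The unique-identifier and replica-counting dreams above could be sketched with content-derived identifiers: identical files hash to the same GUID, so counting on-line copies reduces to grouping by hash. SHA-256 is one possible choice here, picked purely for illustration:

```python
import hashlib
from collections import Counter

def guid_of(data: bytes) -> str:
    """Derive a GUID from file content; replicas yield the same GUID."""
    return hashlib.sha256(data).hexdigest()

def replica_counts(online_files):
    """Count on-line copies per GUID, much like file sharing networks do.

    online_files: iterable of (location, content) pairs for devices
    currently known to the Grid; off-line devices are simply absent.
    """
    return Counter(guid_of(content) for _, content in online_files)
```

In practice one would hash streams rather than in-memory byte strings, but the property that matters - identical content, identical identifier - is the same.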