
Chelonia Workshop Budapest Nov 2010


This page summarizes the Chelonia workshop in Budapest in the last week of Nov. 2010.

Summary day 1

The main point of today's discussion was where we want to go with Chelonia, at a very high level. We started out with some goals, and with problems and questions related to them:

Goals:

  • it should be very simple to add or remove storage nodes
  • we want fault-tolerant metadata
  • the client tools should be so easy to use that the user wouldn't notice them
  • services should be dynamic in the sense that services can come and go without the users noticing

Problems:

  • Berkeley DB
  • Shepherd reporting of files is too heavy
  • ACL and blacklisting is not satisfactory
  • One-time URLs need revisiting
  • We need phase diagrams to really understand what is happening
  • Most other problems come back to the first one.

Questions:

  • Do we want NFS4.1? It would maybe solve the easy-client goal
  • Block access/replication -- maybe needed for NFS4.1 and could be useful for checksumming large files
  • SRM? Seems like others are moving away from it, do we need it then?

Based on these goals, problems and questions, and after some discussion, we ended up with the following tentative, unordered plan for the week:

  • Investigate Keyspace vs. BDB for fault-tolerant metadata
  • Read about NFS4.1
  • Think about merging services
  • Think about file blocks
  • Rethink Shepherd reports

Summary day 2

Just a quick summary of yesterday (which continued until half past twelve today). We devoted the entire day to metadata in Chelonia, discussing how to replicate the metadata and investigating alternative solutions.

The first thing we discussed was how the metadata replication should be handled. As we saw it, there were two major ways to go: either a dynamic number of metadata replicas, meaning that replicas can be added and removed at any time, or a static, predefined number of replicas. Some pros and cons:

Dynamic number of replicas: Pros:

  • Can merge all services to one Chelonia service -- simplifying
  • Easy to add Chelonia instances
  • Can go from 1 -> 3 -> 2n-1 services without restart

Cons:

  • No known solutions -> lots of work
  • Too much communication between services?

Static number of replicas: Pros:

  • Has been done
  • May simplify implementation considerably

Cons:

  • Need dedicated metadata - cannot merge all services to one Chelonia service

Taking these into account, we decided to go with a static number of metadata replicas, at least until a new PhD student comes along.

After this non-trivial decision, the rest of the day was spent investigating possible alternatives to Berkeley DB, specifically Hadoop with HBase (too heavy), Zookeeper (too light) and Keyspace, which turned out to be (at least) just as unstable as our replicated A-Hash.

Today we are attending a cloud workshop, so there will be limited time to do anything, but we will try to test Berkeley DB's built-in replication manager directly to see if it is less unstable than our replicated A-Hash (which has a rather complicated reimplementation of the replication manager). Hopefully this will show us whether there is something fundamentally wrong with Berkeley DB or just a problem with our implementation.
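
As a starting point, a rough sketch of what such a test could look like from Python via the bsddb3 bindings is shown below; the host names, ports and environment directory are placeholders, and exact call names and flags may differ between Berkeley DB and bsddb3 versions.

  # Rough sketch of driving Berkeley DB's built-in replication manager
  # through the bsddb3 bindings. All host names, ports and paths are
  # placeholders.
  from bsddb3 import db

  ENV_DIR = '/tmp/bdb-repmgr-test'   # placeholder environment directory

  env = db.DBEnv()
  env.open(ENV_DIR,
           db.DB_CREATE | db.DB_RECOVER | db.DB_THREAD |
           db.DB_INIT_LOCK | db.DB_INIT_LOG | db.DB_INIT_MPOOL |
           db.DB_INIT_TXN | db.DB_INIT_REP)

  env.rep_set_nsites(3)       # three replicas, as in the A-Hash setup
  env.rep_set_priority(100)   # non-zero priority: this node may become master
  env.repmgr_set_local_site('node1.example.org', 6000)
  env.repmgr_add_remote_site('node2.example.org', 6000)
  env.repmgr_add_remote_site('node3.example.org', 6000)

  # Start the replication manager threads and let them elect a master.
  env.repmgr_start(3, db.DB_REP_ELECTION)

  # Once a master has been elected, the A-Hash content could be stored
  # through an ordinary transactional DB handle opened in this environment.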

Summary day 4

Problem: There is currently too much communication between the Shepherds and the Librarian. As an example, starting up a Shepherd that holds some thousands of files more or less crashes the Librarian, because all the files immediately need to change state from OFFLINE to ALIVE.

As a starting point for the discussion we drew a state diagram for a file replica in Chelonia, describing the possible states of a file replica from file upload (left) to deletion (right).

[Figure: CheloniaStateDiagramFileReplicas.jpg, the state diagram of a file replica in Chelonia from upload (left) to deletion (right)]
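
The diagram itself is not reproduced here. As a rough illustration of the bookkeeping it implies, the sketch below encodes one plausible set of replica states and transitions; apart from ALIVE and OFFLINE, which are mentioned in the text, the state names and transitions are assumptions rather than a transcription of the diagram.

  # Rough illustration of a replica state machine. Only ALIVE and OFFLINE
  # appear in the surrounding text; CREATING and DELETED are assumed
  # placeholders for the upload and deletion ends of the diagram.
  CREATING, ALIVE, OFFLINE, DELETED = 'CREATING', 'ALIVE', 'OFFLINE', 'DELETED'

  ALLOWED_TRANSITIONS = {
      CREATING: {ALIVE},             # upload finished and checksum registered
      ALIVE:    {OFFLINE, DELETED},  # Shepherd went down, or the file was removed
      OFFLINE:  {ALIVE, DELETED},    # Shepherd came back, or the replica was given up
      DELETED:  set(),               # terminal state
  }

  def change_state(current, new):
      """Validate a replica state change before recording it in the metadata."""
      if new not in ALLOWED_TRANSITIONS[current]:
          raise ValueError('illegal transition %s -> %s' % (current, new))
      return new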

Now, the big question is how to keep track of these state changes while keeping communication at a minimum. Since the main problem we have seen is that the Shepherds communicate too much with the Librarian, we considered an extreme case where the Shepherds make as few decisions as possible, handing over full control to one almighty master Librarian. For the sake of simplicity, Librarian here means a service that controls, stores and gives access to all metadata in the storage system. The role of the Shepherd is to store files and, when a new file has arrived, register its checksum and 'ALIVE' state with the Librarian. The Shepherd does not do any checking of existing files unless told to by the Librarian.
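
As a rough illustration of this minimal Shepherd role, here is a sketch in Python; the class, the librarian proxy and all method names are hypothetical and not part of Chelonia's actual interfaces.

  import hashlib

  class MinimalShepherd:
      """Sketch of the 'extreme case' Shepherd: it only stores files and
      reports new arrivals; all other decisions are left to the Librarian."""

      def __init__(self, librarian, store):
          self.librarian = librarian   # proxy to the master Librarian (hypothetical)
          self.store = store           # local file storage backend (hypothetical)

      def _checksum(self, path):
          # md5 is used here only for the sketch; the actual checksum
          # algorithm may differ.
          with open(path, 'rb') as f:
              return hashlib.md5(f.read()).hexdigest()

      def file_arrived(self, replica_id, path):
          # When a new file has arrived, register its checksum and 'ALIVE'
          # state with the Librarian. Nothing else is reported.
          self.librarian.register_replica(replica_id, self._checksum(path),
                                          state='ALIVE')

      def check_file(self, replica_id, path, expected_checksum):
          # Only invoked when the Librarian explicitly asks for a check.
          return self._checksum(path) == expected_checksum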

Disclaimer: This was a high-level discussion, leaving out technical implementation details. We don't see this as a final solution for Chelonia, only a starting point for finding the direction for future Chelonia implementation.

Extreme case: Shepherd -> Sheep. The Shepherds just sit there

  • Master Librarian controls everything
  • The Shepherds are registered in a list in the Librarian (see the sketch below)
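
A corresponding Librarian-side sketch of this bookkeeping is shown below; it also anticipates the periodic Shepherd polling and replication queue described under case (B) further down. All class and method names are hypothetical placeholders.

  import time

  class MasterLibrarian:
      """Sketch of the almighty master Librarian: it keeps the list of
      registered Shepherds, polls them periodically and queues the replicas
      of offlined Shepherds for replication."""

      POLL_INTERVAL = 60  # seconds between Shepherd health checks (assumed value)

      def __init__(self, metadata):
          self.metadata = metadata       # fault-tolerant metadata store
          self.shepherds = {}            # shepherd_id -> proxy, the registered list
          self.replication_queue = []    # replicas that need a new home

      def register_shepherd(self, shepherd_id, proxy):
          self.shepherds[shepherd_id] = proxy

      def poll_shepherds(self):
          # Periodically check every registered Shepherd (case B, point 3).
          for shepherd_id, proxy in self.shepherds.items():
              if not proxy.is_alive():
                  self.handle_offline_shepherd(shepherd_id)

      def handle_offline_shepherd(self, shepherd_id):
          # Loop through the files on the offlined Shepherd and queue them.
          for replica in self.metadata.replicas_on(shepherd_id):
              self.metadata.set_state(replica, 'OFFLINE')
              self.replication_queue.append(replica)

      def run(self):
          while True:
              self.poll_shepherds()
              time.sleep(self.POLL_INTERVAL)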

Now, there are three main cases to consider in file replication: (A) Librarian goes down during upload or replication, (B) Client uploads file to Shepherd, Shepherd goes down, and (C) Shepherd O sends file to Shepherd N and either O or N goes down.

For case (A), we assume that the Librarian is a replicated service and that one of the slave Librarians automatically takes over as master Librarian. The Shepherds, when failing to register a new file with the master Librarian, find out who the new master is and retry the registration. Note that this requires the master Librarian to rely only on information stored to disk, as information stored in memory can be lost at any time.
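
A hedged sketch of this Shepherd-side retry logic, assuming a hypothetical find_master() helper that asks the replicated Librarian service which instance is currently master:

  import time

  def register_with_retry(replica_id, checksum, find_master, max_attempts=5):
      """Sketch of case (A) on the Shepherd side: if registering with the
      master Librarian fails, look up the new master and try again."""
      for attempt in range(max_attempts):
          master = find_master()   # hypothetical master-discovery helper
          try:
              master.register_replica(replica_id, checksum, state='ALIVE')
              return True
          except Exception:
              # The master may have just gone down; wait for the slaves to
              # elect a new one before asking again.
              time.sleep(2 ** attempt)
      return False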

Considering case (B), the Shepherd receiving the file can go down:

  1. BEFORE replica arrived
    1. client polls the Librarian until the file is in state 'ALIVE' or a timeout expires -> client retries the upload (see the sketch after this list)
  2. AFTER replica arrived, BEFORE registered 'ALIVE' in Librarian
    1. same as (1)
  3. AFTER registering the replica to the Librarian
    1. Librarian polls Shepherds periodically
    2. If Shepherd is down, Librarian loops through all files in the system
    3. Librarian puts offlined files in a queue for replication
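
A sketch of the client-side handling of points 1 and 2 above; the client object and its upload/get_state methods are hypothetical placeholders:

  import time

  def upload_with_retry(client, filename, local_path,
                        timeout=300, poll_interval=10, max_retries=3):
      """Upload a file, poll the Librarian until the file is 'ALIVE' or a
      timeout expires, and retry the upload on timeout."""
      for attempt in range(max_retries):
          client.upload(filename, local_path)      # pick a Shepherd and upload
          deadline = time.time() + timeout
          while time.time() < deadline:
              if client.get_state(filename) == 'ALIVE':
                  return True                      # replica safely registered
              time.sleep(poll_interval)
          # Timeout: the Shepherd probably went down before the replica was
          # registered as 'ALIVE', so the upload is retried.
      return False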

Next we need to consider what happens if the Shepherd comes up again (a sketch of the Librarian's handling follows this list):

  1. AFTER replication start, BEFORE all files replicated
    1. Librarian tells the Shepherd to remove the files that have already been replicated
    2. Librarian stops replication and scanning, and sets back the states of the files still in the queue
  2. AFTER all offline files are replicated
    1. Librarian tells Shepherd to remove all files
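
A sketch of how the Librarian could handle a returning Shepherd in these two situations; all method names are hypothetical placeholders:

  def shepherd_returned(librarian, shepherd_id):
      """Sketch of the Librarian's reaction when an offlined Shepherd comes back."""
      already_replicated = librarian.replicated_files(shepherd_id)
      still_queued = librarian.queued_files(shepherd_id)

      if still_queued:
          # Case 1: replication has started but is not finished. Remove only
          # the replicas that already got a new home, stop replicating and
          # scanning, and set the remaining queued files back to their old state.
          librarian.shepherd(shepherd_id).remove_files(already_replicated)
          librarian.stop_replication(shepherd_id)
          for replica in still_queued:
              librarian.set_state(replica, 'ALIVE')
          librarian.clear_queue(shepherd_id)
      else:
          # Case 2: every offlined file already has a new replica elsewhere,
          # so the returning Shepherd's copies are redundant and can all go.
          librarian.shepherd(shepherd_id).remove_all_files()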

Case (C) is by far the most complicated. We have two Shepherds, O and N. Shepherd O has a file replica and a transfer URL to Shepherd N. Each of O and N can go down (i) BEFORE the transfer, (ii) DURING the transfer, (iii) AFTER the transfer but BEFORE registering, or (iv) not at all. This gives 4x4=16 distinct possibilities. After going through all of them we identified four cases worth spelling out.

  1. N goes down BEFORE or DURING transfer
    1. O notices that N is down
    2. O tells Librarian that transfer failed
    3. O has a list of transfers in progress
    4. If O goes down and comes up again, it reports the transfers in progress to the Librarian
  2. N goes down AFTER transfer, BEFORE registering
    1. After the transfer, O polls the Librarian until the new replica is registered or a timeout expires (see the sketch after this list)
  3. Both O and N go down AFTER transfer, BEFORE registering
    1. see the last two points of (1)
    2. O tells Librarian to check these files
  4. Nothing goes down
    1. N registers replica
    2. O polls Librarian, sees that N registered the new replica, wraps up and lives happily ever after.
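
Finally, a sketch of Shepherd O's post-transfer bookkeeping covering cases (2) and (4) above; the librarian methods are hypothetical placeholders:

  import time

  def wait_for_registration(librarian, replica_id, new_shepherd_id,
                            timeout=300, poll_interval=10):
      """After transferring a replica to Shepherd N, poll the Librarian until
      N has registered the new replica, or report the transfer as failed
      after a timeout."""
      deadline = time.time() + timeout
      while time.time() < deadline:
          if librarian.has_replica(replica_id, new_shepherd_id):
              return True    # case 4: N registered the replica, O is done
          time.sleep(poll_interval)
      # Case 2: N went down after the transfer but before registering, so O
      # tells the Librarian that the transfer failed.
      librarian.report_transfer_failed(replica_id, new_shepherd_id)
      return False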