Saturday, January 1, 2011

reliability attributes

Often times in the discussion of distributed systems and scale out architecture, we demand system being reliable. One of the application, I was part of designing/developing the scale out architecture for it primarily to fulfill the business demand that we should be able to add more hardware to support more load. Now, recently I watched this video from cloudera's free basic hadoop training and it talks about "reliability attributes". I realized that we need same reliability attributes in scale out architecture also and often talk about them( implicitly) in my team. And, this list makes them explicit(coming straight from the linked video)...

Partial Failure: System should be able to support partial failures. That is, if x out of n nodes(where x < n) in the system go down then only system's performance(e.g. throughput) should gracefully go down in the proportion of x/n instead of it being go down completely and not doing any work(or not serving any requests.

Fault Tolerance: This is more related to background jobs, so map-reduce in particular. If one of the nodes go down then its work must be picked up by some other functional unit. This is also sometimes solved by having redundancy in the system. For example, within one cluster, we have multiple app servers serving same set of requests and a load balancer to manage them. In case, one of the server goes down, load balancer detects it and stops sending any requests to it.

Individual Recoverability: Nodes that fail and restart should be able rejoin the group(or cluster) without needing a full restart.

Consistency: Internal failure should not cause externally visible issues.

Scalability: If we add more nodes, we should be able to handle more load in proportion of the new nodes added.

No comments:

Post a Comment