3 Ways of Thinking About Fault Tolerance

When it comes to enterprise business continuity, the number of options can sometimes feel overwhelming. And one of the more confusing terms that gets tossed around is “fault tolerance”.

Fault tolerance means different things in different contexts. To simplify our understanding of the term, I’ll assume that most of the definitions fall into one of three categories.

Component-Level Fault Tolerance

This keeps your servers protected when individual hardware components malfunction, such as storage devices, network devices and controllers. If a component fails, the server continues operating without interruption to the applications.

RAID storage is probably the best-known example of component-level fault tolerance. In a redundant RAID configuration, if one disk fails, the others keep running until the faulty disk is repaired or replaced.
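
To make the idea concrete, here is a minimal sketch of how parity-based redundancy can recover a lost disk. It assumes a simplified, RAID-5-style layout with one parity block per stripe; the block contents and the function names (parity, rebuild) are made up for illustration and do not come from any particular RAID product.

```python
from functools import reduce


def parity(blocks):
    """XOR equal-length blocks byte by byte to produce a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity_block):
    """Recover the single missing block from the survivors plus the parity block."""
    return parity(list(surviving_blocks) + [parity_block])


# Hypothetical data blocks, one per disk (illustrative values only).
disks = [b"ACCT", b"DATA", b"LOGS"]
parity_block = parity(disks)  # would be stored on an additional disk

# Simulate losing the second disk, then rebuild its contents.
recovered = rebuild([disks[0], disks[2]], parity_block)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. This only works when a single block per stripe is lost, which is why the faulty disk still needs attention promptly.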

Server-Level Fault Tolerance

This ensures that your company can endure a complete server failure (or, in the case of virtual servers, a complete host failure) without a second of downtime or interruption.

For this to happen, the system must ensure that memory state is maintained while clients remain connected. It should also ensure that active transactions continue to be processed while the system is in recovery mode.

In other words, neither the application nor the end-user should notice that anything strange has happened. And once the server has been repaired, the application should re-synchronize itself instantly without needing to restart the server or otherwise interrupt operations.
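
The behavior described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: real fault-tolerant servers mirror state at the hardware or hypervisor level, and the class and method names here (Replica, FaultTolerantPair, fail_primary, add_standby) are hypothetical.

```python
class Replica:
    """One server holding a copy of the application state."""

    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value


class FaultTolerantPair:
    """Two hypothetical replicas kept in lockstep."""

    def __init__(self):
        self.primary = Replica()
        self.standby = Replica()

    def write(self, key, value):
        # Every write goes to both replicas, so the standby
        # always holds the same state as the primary.
        self.primary.apply(key, value)
        if self.standby is not None:
            self.standby.apply(key, value)

    def read(self, key):
        return self.primary.state[key]

    def fail_primary(self):
        # Simulate a complete server failure: the standby takes over
        # with its state intact, and callers keep using the same object.
        if self.standby is None:
            raise RuntimeError("no standby left to promote")
        self.primary, self.standby = self.standby, None

    def add_standby(self):
        # Once a server has been repaired, re-synchronize it from the
        # current primary without stopping the service.
        repaired = Replica()
        repaired.state = dict(self.primary.state)
        self.standby = repaired


pair = FaultTolerantPair()
pair.write("balance", 100)
pair.fail_primary()          # server failure; the client keeps going
pair.write("balance", 120)   # transactions continue during recovery
pair.add_standby()           # repaired server rejoins, fully in sync
```

The client only ever talks to the pair, never to an individual server, which is what lets a failure go unnoticed from the outside.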

Geographic Fault Tolerance

This is very similar to server-level fault tolerance, in which applications run across multiple redundant hosts. In this case, however, at least two of the host servers are spread across a wide geographical area. This way, your company is protected from the most severe causes of downtime, including fire, natural disasters and, in some extreme cases, even war.

Since datacenters are very expensive to build or rent, this type of fault tolerance is by far the most expensive to implement. But it’s also the safest.

The next time a vendor starts talking about how their service offers fault tolerance, this classification system should help you ask better questions and get deeper insights into how these solutions can help your business.