Many engineers interpret "no single point of failure" as adding a secondary firewall, adding a secondary ISP, stacking the switches, clustering the servers, and so on. And technically, yes, they may have removed the single point of failure on the diagram, but they may not have removed the failure domain.
Failure is about Systems, not Devices
A single point of failure is not just about the devices we can see.
It can be:
- Same routing path
- Shared power source
- Common software version
- Single control plane
- Single team that understands how it works
As an example: if both firewalls depend on the same upstream link, the same software version, or the same configuration, you still have a single point of failure even though the design looks redundant.
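The shared-dependency idea above can be sketched in a few lines. This is a minimal illustration, not a real tool: the device names, dependency labels, and the `shared_dependencies` helper are all hypothetical.

```python
# Illustrative sketch: find dependencies common to every member of a
# "redundant" pair. All names and labels below are invented examples.

def shared_dependencies(*devices):
    """Return the dependencies that every device in the set shares."""
    dep_sets = [set(d["deps"]) for d in devices]
    return set.intersection(*dep_sets)

# Two firewalls that look redundant on the diagram...
fw1 = {"name": "fw1", "deps": {"isp-a-uplink", "fw-os-9.1.2", "config-v42"}}
fw2 = {"name": "fw2", "deps": {"isp-a-uplink", "fw-os-9.1.2", "config-v42"}}

# ...but every dependency is shared, so the pair is one failure domain.
print(sorted(shared_dependencies(fw1, fw2)))
```

If the result is non-empty, the pair shares a failure domain regardless of how many boxes appear on the diagram.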
Logical vs Physical Redundancy
Even though you have two core switches, two links, and two power supplies, if both share the same control plane or the same management plane, true redundancy is not there. That is why we often call this merely "hardware redundancy".
"No single point of failure" must be evaluated logically, not visually.
Logical failures happen more often than hardware failures, and the biggest mistake is to ignore them and assume the design is perfectly fine.
Examples:
- Same firmware bug crashing both HA nodes
- Same BGP policy mistake propagating everywhere
- Same automation script pushing a bad config
Redundant components don't protect the network from shared risk.
True resilience requires independence of failure domains.
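The independence test can be made concrete: enumerate the components each service path depends on, and check whether any single component appears in every path. A minimal sketch, with all component and path names invented for illustration:

```python
# Hypothetical sketch: a component that appears in *every* path is a
# single failure event that takes the whole service down.

paths = {
    "path-a": {"core-sw1", "fw1", "isp-a"},  # components path-a depends on
    "path-b": {"core-sw2", "fw2", "isp-a"},  # note: same ISP as path-a
}

def single_events_killing_all(paths):
    """Failure events present in every path: each one is a hidden SPoF."""
    return set.intersection(*paths.values())

print(single_events_killing_all(paths))
```

Here the design has two of everything at the device layer, yet the shared ISP survives as a single event that kills both paths, which is exactly the architect's question from the next section.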
Can this Truly be Achieved?
Engineers ask "Do we have redundancy?"
Architects ask "What event could still take this entire service down?"
That architect's question reveals hidden dependencies, shared risks, operational risks, human factors, and more.
Pursuing zero single points of failure increases complexity and cost, and if you think about it deeply, you will realize it is like infinity: something you can approach but never reach. Worse, trying to eliminate every SPoF can introduce new problems; the added complexity creates new SPoFs that were not there originally. So it all depends on the criticality of the applications and the client's willingness to invest in mitigation.
So "no single point of failure" should really mean:
- Independent failure domains
- Predictable failover behaviour
- Tested recovery paths
- Clear operational ownership
- A well-understood blast radius
Not just two of everything.
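Understanding blast radius, the last item above, can also be sketched as a tiny dependency-graph walk. This is an illustrative toy, assuming a made-up topology where both "redundant" core switches hang off one shared management plane:

```python
# Hypothetical sketch: blast radius as everything transitively impacted
# when one component fails. Topology and names are invented.

from collections import deque

# Map: component -> things that depend on it
dependents = {
    "mgmt-plane": ["core-sw1", "core-sw2"],  # one shared management plane
    "core-sw1": ["app-frontend"],
    "core-sw2": ["app-frontend"],
    "app-frontend": [],
}

def blast_radius(failed, dependents):
    """BFS over the dependency graph from the failed component."""
    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(blast_radius("mgmt-plane", dependents)))
```

One management-plane failure reaches both core switches and the application: two of everything, one blast radius.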
