Many engineers interpret "no single point of failure" as adding a secondary firewall, adding a secondary ISP, stacking the switches, clustering the servers, and so on. And technically, yes, they may have removed the single point of failure on the diagram, but they may not have removed the failure domain.
Failure is about Systems, not Devices
A single point of failure is not just about the devices we can see.
It can be:
- Same routing path
- Shared power source
- Common software version
- Single control plane
- Single team that understands how it works
As an example: if both firewalls depend on the same upstream link, the same software version, or the same configuration, you still have a single point of failure even though the design looks redundant.
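The shared-dependency idea above can be sketched in a few lines. This is a minimal illustration, not a real tool: the device names, dependency labels, and the `shared_dependencies` helper are all hypothetical.

```python
# Illustrative sketch: find dependencies common to every member of a
# "redundant" pair. All names and labels below are invented examples.

def shared_dependencies(*devices):
    """Return the dependencies that every device in the set shares."""
    dep_sets = [set(d["deps"]) for d in devices]
    return set.intersection(*dep_sets)

# Two firewalls that look redundant on the diagram...
fw1 = {"name": "fw1", "deps": {"isp-a-uplink", "fw-os-9.1.2", "config-v42"}}
fw2 = {"name": "fw2", "deps": {"isp-a-uplink", "fw-os-9.1.2", "config-v42"}}

# ...but every dependency is shared, so the pair is one failure domain.
print(sorted(shared_dependencies(fw1, fw2)))
```

If the result is non-empty, the pair shares a failure domain regardless of how many boxes appear on the diagram.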
Logical vs Physical Redundancy
Even though you have two core switches, two links, and two power supplies, if both share the same control plane or the same management plane, true redundancy is not there. That is why we often call this merely "hardware redundancy".
"No single point of failure" must be evaluated logically, not visually.
Logical failures happen more often than hardware failures, and the biggest mistake is to ignore them and assume the design is perfectly fine.
Examples:
- Same firmware bug crashing both HA nodes
- Same BGP policy mistake propagating everywhere
- Same automation script pushing a bad config
Redundant components don't protect the network from shared risk.
True resilience requires independence of failure domains.
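The independence test can be made concrete: enumerate the components each service path depends on, and check whether any single component appears in every path. A minimal sketch, with all component and path names invented for illustration:

```python
# Hypothetical sketch: a component that appears in *every* path is a
# single failure event that takes the whole service down.

paths = {
    "path-a": {"core-sw1", "fw1", "isp-a"},  # components path-a depends on
    "path-b": {"core-sw2", "fw2", "isp-a"},  # note: same ISP as path-a
}

def single_events_killing_all(paths):
    """Failure events present in every path: each one is a hidden SPoF."""
    return set.intersection(*paths.values())

print(single_events_killing_all(paths))
```

Here the design has two of everything at the device layer, yet the shared ISP survives as a single event that kills both paths, which is exactly the architect's question from the next section.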
Can this Truly be Achieved?
Engineers ask "Do we have redundancy?"
Architects ask "What event could still take this entire service down?"
That architect's question reveals hidden dependencies, shared risks, operational risks, human factors, and more.
Pursuing zero single points of failure increases complexity and cost, and if you think about it deeply, you will realize it is like infinity: something you can approach but never reach. Worse, trying to eliminate every SPoF can introduce new problems; the added complexity creates new SPoFs that were not there originally. So it all depends on the criticality of the applications and the client's willingness to invest in mitigation.
So "no single point of failure" should really mean:
- Independent failure domains
- Predictable failover behaviour
- Tested recovery paths
- Clear operational ownership
- A well-understood blast radius
Not just two of everything.
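Understanding blast radius, the last item above, can also be sketched as a tiny dependency-graph walk. This is an illustrative toy, assuming a made-up topology where both "redundant" core switches hang off one shared management plane:

```python
# Hypothetical sketch: blast radius as everything transitively impacted
# when one component fails. Topology and names are invented.

from collections import deque

# Map: component -> things that depend on it
dependents = {
    "mgmt-plane": ["core-sw1", "core-sw2"],  # one shared management plane
    "core-sw1": ["app-frontend"],
    "core-sw2": ["app-frontend"],
    "app-frontend": [],
}

def blast_radius(failed, dependents):
    """BFS over the dependency graph from the failed component."""
    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(blast_radius("mgmt-plane", dependents)))
```

One management-plane failure reaches both core switches and the application: two of everything, one blast radius.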
