Redundancy does not automatically equal resiliency. They are related, but they are not the same.
Redundancy can be achieved by adding backup links, placing a standby firewall, clustering a switch and so on, yet resiliency may still not have been achieved.
Redundancy is the "spare tyre" in your car. It's all about having a backup. It's good to have a backup, but what are the chances of both components failing at once?
Do both share the same power source? The same upstream provider? The same software version? The same configuration, synced in real time? The same failure domain? If yes, they can fail together, and you don't have resiliency.
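The questions above can be turned into a simple check. The following is a minimal sketch in Python, not a real inventory tool: the Component fields, device names and values are all hypothetical, chosen only to mirror the questions in the text.

```python
from dataclasses import dataclass

# Hypothetical attributes, one per question asked above.
@dataclass(frozen=True)
class Component:
    name: str
    power_source: str
    upstream_provider: str
    software_version: str
    site: str

def shared_failure_domains(a: Component, b: Component) -> list[str]:
    """Return the attributes on which two components could fail together."""
    checked = ("power_source", "upstream_provider", "software_version", "site")
    return [attr for attr in checked if getattr(a, attr) == getattr(b, attr)]

# A redundant firewall pair (illustrative values only).
primary = Component("fw-1", "PDU-A", "ISP-1", "9.1.4", "room-1")
standby = Component("fw-2", "PDU-A", "ISP-2", "9.1.4", "room-1")

shared = shared_failure_domains(primary, standby)
print(shared)  # ['power_source', 'software_version', 'site']
```

An empty list would mean the pair shares none of the checked domains. Any entry in the list is something that can take down both the primary and its backup at once.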
Resiliency is the Ability to Survive Failure
Resiliency is architectural. It answers bigger questions like:
- What happens if the core fails?
- What happens if the control plane is poisoned?
- What happens if DNS dies?
- What happens if an engineer makes a mistake?
Resiliency is about:
- Failure isolation
- Blast radius control
- Independent failure domains
- Fast detection and recovery
Resiliency is a property of a network.
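Blast radius can be made concrete by walking a dependency map and listing everything that goes down with a given component. This is a rough sketch under stated assumptions: the service names and the dependency map are invented for illustration and do not describe any real network.

```python
from collections import deque

# Hypothetical dependency map: each key is depended on by the listed items.
dependents = {
    "core-switch": ["dns", "check-in", "baggage"],
    "dns": ["check-in", "wifi-portal"],
    "check-in": [],
    "baggage": [],
    "wifi-portal": [],
}

def blast_radius(failed: str, dependents: dict[str, list[str]]) -> set[str]:
    """Everything that is transitively affected when `failed` goes down."""
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

print(sorted(blast_radius("dns", dependents)))
# ['check-in', 'wifi-portal']
print(sorted(blast_radius("core-switch", dependents)))
# ['baggage', 'check-in', 'dns', 'wifi-portal']
```

In this toy map, a DNS failure affects two services while a core failure affects everything. Segmenting failure domains is, in effect, an attempt to keep those sets small.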
If you’ve observed a mission-critical network, an airport for example, you’ll often see two separate core switch clusters of two different models, deployed in two different switch rooms to support routing.
The clustered core switch pair at each location is redundancy.
An entirely separate core switch cluster in a different physical location, capable of taking over if the primary site fails, is resiliency.
How Architects Think Differently
Engineers ask "What should we duplicate?"
Architects ask "What must not fail together?"
Resilient design means:
- Different carriers
- Different physical paths
- Segmented failure domains
- Independent control planes
- Thoughtful simplification
Why Both Matter
Without redundancy, you have a single point of failure.
Without resiliency, you may have a systemic collapse.
Or, in other words:
Redundancy reduces the probability of failure, while resiliency reduces its impact.
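A small worked calculation shows why a redundant pair that shares a failure domain delivers less than it appears to. This is a deliberately simplified model with made-up numbers: it assumes each device fails independently with probability p_device, and that a single common-cause event (shared power, a bad configuration push) with probability p_shared takes both down together.

```python
def p_both_down(p_device: float, p_shared: float = 0.0) -> float:
    """Probability that both devices of a redundant pair are down.

    p_device: chance each device fails on its own (assumed independent).
    p_shared: chance of a common-cause event that takes both down at once.
    """
    return p_shared + (1 - p_shared) * p_device ** 2

# Illustrative numbers only, not measurements.
independent = p_both_down(0.01)             # no shared failure domain
shared = p_both_down(0.01, p_shared=0.005)  # same power feed, for example

print(f"independent pair: {independent:.6f}")
print(f"shared domain:    {shared:.6f}")
print(f"ratio:            {shared / independent:.0f}x")
```

With these assumed numbers, redundancy alone takes the chance of a total outage from 1 in 100 to 1 in 10,000, but the shared failure domain makes that outage roughly 50 times more likely again. The common cause dominates the result, which is why the question "what must not fail together?" matters more than how many spares exist.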
Ultimately, both come at a cost. Whether an organization chooses to invest in redundancy, resiliency, or both depends on how critical uninterrupted operations are to the business and how much risk it is willing to tolerate.
