Some outages are caused by failures.
Many outages are caused by maintenance.
Upgrades, patches, configuration changes, certificate renewals, and hardware replacements are necessary, but they are risky.
If your design cannot survive maintenance, it is not resilient; it is fragile.
Maintenance is a design requirement in environments where availability matters.
It is a design requirement just like performance, scale, redundancy, and security, but it is rarely addressed.
Eliminate Single Maintenance Domains
A single failure domain is dangerous. A single maintenance domain is worse.
If upgrading one device requires:
- Taking down the only firewall
- Rebooting the only core switch
- Restarting the only call manager node
then you don't have high availability.
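One way to spot single maintenance domains early is to scan the device inventory for roles served by exactly one node. The sketch below is only an illustration: the inventory format, device names, and role labels are assumptions, not taken from any real tool.

```python
from collections import Counter

# Hypothetical inventory of (device name, role). All names are illustrative.
INVENTORY = [
    ("fw-01", "firewall"),
    ("core-sw-01", "core-switch"),
    ("core-sw-02", "core-switch"),
    ("cm-01", "call-manager"),
]


def single_maintenance_domains(inventory):
    """Return the roles that are served by exactly one device, sorted by name."""
    counts = Counter(role for _name, role in inventory)
    return sorted(role for role, count in counts.items() if count == 1)


# Any role listed here cannot be upgraded without an outage.
print(single_maintenance_domains(INVENTORY))
```

With the sample inventory, the firewall and the call manager are flagged, while the core switch pair is not.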
Build in design principles such as:
- Dual Data Centers (DC and DR)
- Dual Firewalls (HA)
- Symmetric Design
- Load balancers
- Multiple application nodes
- Independent Control Planes
With these in place, maintenance shifts traffic instead of stopping it.
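The idea of shifting traffic rather than stopping it can be sketched as a rolling procedure: drain one node, maintain it, return it to service, and only then move to the next. The function below is a minimal sketch under stated assumptions; `rolling_maintenance`, `min_active`, and the `upgrade` callback are hypothetical names, and real draining would be done by your load balancer or HA mechanism.

```python
def rolling_maintenance(nodes, min_active, upgrade):
    """Maintain nodes one at a time while keeping at least min_active in service.

    nodes: names of the nodes currently in service (hypothetical).
    upgrade: callable that performs the maintenance on one drained node.
    Returns a log of (node under maintenance, nodes still carrying traffic).
    """
    # Refuse to start if taking one node out would drop below the minimum.
    if len(nodes) - 1 < min_active:
        raise RuntimeError("not enough redundancy: maintenance would stop traffic")

    active = set(nodes)
    log = []
    for node in nodes:
        active.remove(node)  # drain: traffic shifts to the remaining nodes
        log.append((node, sorted(active)))
        upgrade(node)
        active.add(node)  # return to service before touching the next node
    return log


# A firewall pair can be maintained one member at a time.
for step in rolling_maintenance(["fw-a", "fw-b"], 1, lambda node: None):
    print(step)
```

Calling the same function with a single node raises an error before any change is made, which is exactly the condition the design should rule out.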
Imagine a service provider that runs hundreds of firewalls, routers, and switches, and whose frequent upgrades, patches, and reboots interrupt service for its subscribers. Those subscribers will soon find another provider.
Separate Control, Data and Management Planes
Many outages during maintenance trace back to:
- Management traffic sharing the same path as production
- Control plane overload that impacts forwarding
- Logging floods that crash devices during upgrades
This is why you should consider adding the following to a design:
- Out-of-Band Management
- Proper CoPP (Control Plane Policing)
- Solutions with separated Control Planes
- Dedicated path for Management traffic
- Log rate controls
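Rate controls such as policing and log rate limits are commonly explained with a token bucket: a steady refill rate plus a small burst allowance. The class below is a generic sketch of that idea, not the implementation used by any vendor; the `rate` and `burst` values are arbitrary, and the timestamp is passed in explicitly so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Allow up to `rate` events per second, with bursts of up to `burst` events."""

    def __init__(self, rate, burst):
        self.rate = rate  # tokens added per second
        self.burst = burst  # maximum number of stored tokens
        self.tokens = float(burst)
        self.last = 0.0  # timestamp of the previous call, in seconds

    def allow(self, now):
        """Return True if an event arriving at time `now` may pass."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, but never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: drop or summarise the message


# Three log messages arrive at once, then one more a second later.
bucket = TokenBucket(rate=1, burst=2)
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])
```

During an upgrade, a limiter of this kind lets the first messages through and discards the flood, so logging cannot starve the control plane.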
Simulate Maintenance
Everyone tests link failures. If you are an architect designing mission-critical networks, test maintenance scenarios as well, and document them before go-live, at least in a simulation environment.
Simulate:
- Version mismatches
- Cluster upgrades
- License renewals
- Certificate expirations
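Certificate expiration is one of the easier items on this list to simulate, because it only requires moving the clock. The sketch below assumes a hypothetical table of certificate names and expiry dates and reports which ones would expire within a given window of a simulated "today".

```python
from datetime import date, timedelta

# Hypothetical certificate inventory: name -> expiry date (illustrative values).
CERTS = {
    "vpn-gateway": date(2025, 3, 1),
    "call-manager": date(2026, 1, 15),
}


def expiring(certs, today, window_days):
    """Return names of certificates that expire on or before today + window_days."""
    cutoff = today + timedelta(days=window_days)
    return sorted(name for name, not_after in certs.items() if not_after <= cutoff)


# Simulate different dates instead of waiting for the real expiry.
print(expiring(CERTS, date(2025, 1, 1), 90))
print(expiring(CERTS, date(2025, 12, 1), 60))
```

Running the same check against several simulated dates shows when each renewal would become urgent, which can then be written into the maintenance documentation.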
Make Downtime a Choice for the Client - Not a Necessity
A design is good if:
- Downtime for maintenance is optional
- Maintenance windows are for caution only
- Change risk is engineered down
If your architecture requires downtime for routine maintenance, it's not production-ready.
There is a famous saying in IT:
"If everything works, don't touch it."
I believe this saying is the result of poor architectural work that has been done almost everywhere for a long time. If you want to be an admired architect, design networks that engineers are not afraid to maintain.
