Understanding Failure Domains in Enterprise Networks

Failures are inevitable in any network, but outages are not. The difference between the two is explained by the concept of “failure domains,” and controlling them is what separates network architecture from simple network design.

What is a Failure Domain?

A failure domain is the part of the system that is affected together when something fails.

It is not the device that failed.

It is not the protocol that misbehaved.

It is the impact boundary of that failure.

If a single link goes down and only one switch is affected, the failure domain is small.

If the same link goes down and takes down multiple sites, services or users, the failure domain is large.

Architecture is about controlling that boundary.
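To make the idea of an impact boundary concrete, here is a minimal sketch that treats the network as a graph and computes the failure domain of a single link as the set of nodes that can no longer reach the core. The topology, the node names and the helper functions are hypothetical and exist only for illustration; a real network would also need to account for routing policy, capacity and services.

```python
from collections import deque

def reachable(links, start, failed_link=None):
    """Return the nodes reachable from start, skipping failed_link if given."""
    adjacency = {}
    for a, b in links:
        if failed_link is not None and {a, b} == set(failed_link):
            continue  # the failed link carries no traffic
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def failure_domain(links, core, failed_link):
    """Nodes that lose reachability to core when failed_link goes down."""
    return reachable(links, core) - reachable(links, core, failed_link)

# Hypothetical topology: access1 is dual-homed, access2 is single-homed,
# and both branches hang off a single WAN edge.
links = [
    ("core", "dist1"), ("core", "dist2"),
    ("dist1", "access1"), ("dist2", "access1"),
    ("dist1", "access2"),
    ("core", "wan-edge"),
    ("wan-edge", "branch-a"), ("wan-edge", "branch-b"),
]

print(failure_domain(links, "core", ("dist1", "access1")))  # no node is cut off
print(failure_domain(links, "core", ("dist1", "access2")))  # only access2
print(failure_domain(links, "core", ("core", "wan-edge")))  # WAN edge and both branches
```

The same kind of event, one link going down, produces three very different boundaries depending on what sits behind the link.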

Failure vs Outage

A failure is a technical event while an outage is a business event.

Examples of failures include links going down, device crashes and process restarts.

Examples of outages include users losing access and services becoming unavailable.

Since the failure domains determine whether a failure remains a minor event or turns into a widespread outage, good architectures assume failures will happen and focus on limiting how far the impact can spread.

Why Does This Matter?

Network outages usually happen not because something failed, but because too much depended on the same thing. Sometimes redundancy exists, but shared dependencies are overlooked.
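One way to check for overlooked shared dependencies is to list what each redundant member relies on and intersect the lists. The sketch below does exactly that. The uplink names and dependency labels are made up for illustration, and the function is a hypothetical helper rather than part of any tool.

```python
def shared_dependencies(redundant_members):
    """Return the dependencies common to every member of a redundant group."""
    dependency_sets = [set(deps) for deps in redundant_members.values()]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)

# Hypothetical pair of "redundant" uplinks: different carriers and
# power feeds, but both fibers run through the same conduit.
uplinks = {
    "uplink-a": {"carrier-1", "conduit-north", "pdu-1"},
    "uplink-b": {"carrier-2", "conduit-north", "pdu-2"},
}

print(shared_dependencies(uplinks))  # {'conduit-north'}
```

Anything the function returns is a single point of failure hiding behind the redundancy: if it fails, both members fail together.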

Common Failure Domains in Enterprise Networks

Failure domains can be grouped into two types: physical and logical.

Physical failure domains include:

Device Level
Rack Level
Data Center
Branch
Availability Zone
Region

Logical failure domains include:

Routing Areas in OSPF
Management Plane
VLAN
VRF
Services like DNS, NTP

As an example, if a NIC malfunctions and the server has no secondary NIC, the entire server is lost, which makes it a device-level failure domain.

If a power cable fails, causing a router to go down and disrupting a branch office, the root cause is the physical failure of the power cable, the mechanism is the device failure and the failure domain is the entire branch.

Consider a centralized DHCP server that leases IP addresses to all hosts, Wi-Fi clients and IP phones, including those at branch locations. If it is the only DHCP server, its failure domain is the entire network.
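The DHCP case can be sketched as a small lookup: for each site, list the DHCP servers it can use, then ask which sites are left with none when one server fails. The site and server names are hypothetical, and the model ignores lease lifetimes (clients keep working until their current lease expires).

```python
def service_failure_domain(site_servers, failed_server):
    """Return the sites left without any server when failed_server goes down."""
    return {
        site
        for site, servers in site_servers.items()
        if servers and set(servers) <= {failed_server}
    }

# Hypothetical deployment: only branch-b has a local fallback server.
site_servers = {
    "hq": ["dhcp-central"],
    "branch-a": ["dhcp-central"],
    "branch-b": ["dhcp-central", "dhcp-branch-b"],
}

print(service_failure_domain(site_servers, "dhcp-central"))   # hq and branch-a
print(service_failure_domain(site_servers, "dhcp-branch-b"))  # no site is affected
```

Adding a second server for a site removes that site from the central server's failure domain, which is the effect the design is after.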

If you have experienced TCAM exhaustion on distribution switches in a network where every node resides in OSPF Area 0, and a senior engineer suggests redesigning the network with multiple OSPF areas, that engineer is proposing to shrink the failure domain. Area boundaries limit how far topology changes and individual routes propagate.

How Do Architects Design Failure Boundaries?

We cannot eliminate failures, but we can shape where failures are allowed to go.

This is done by intentionally defining the failure domains, keeping them small and localized, and aligning them with operational boundaries.

You might have heard the advice: when planning a simple VLAN design, keep VLANs localized. Now you understand the reason behind it.

Common strategies for eliminating or shrinking failure domains:

Redundancy
Summarization & Filtering
Network segmentation (VLANs, VRFs)
Micro-segmentation
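As an illustration of summarization, the sketch below uses Python's standard ipaddress module to collapse a block of contiguous prefixes into a single summary route. The prefixes are hypothetical, and this shows only the address arithmetic, not how any particular router is configured. The point is that devices outside the summarized area carry one route instead of four and do not need to react when a single /24 inside it flaps.

```python
import ipaddress

def summarize(prefixes):
    """Collapse adjacent or overlapping prefixes into the fewest networks."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]

# Hypothetical prefixes that all live inside one area or site.
area_prefixes = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]

print(summarize(area_prefixes))                     # ['10.1.0.0/22']
print(summarize(["10.1.0.0/24", "10.1.2.0/24"]))    # not contiguous, stays as two routes
```

Summarization only works when the addressing plan is contiguous per area or site, which is one reason address planning and failure domain design go together.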

Things don’t always go as planned, however. During the design phase we might assume an impact will remain confined to a specific area, but in reality it can spread unexpectedly to other areas. This actual extent of the impact is referred to as the “blast radius.”

In summary:

Failures will happen.

Links will go down.

Devices will crash.

The role of the architect is not to prevent these events, but to decide how much of the system is affected when they occur and shrink it to match the operational boundaries.

Failures are inevitable; outages are architectural decisions.
