Decisions @ Layer 3

Archive for January 2026

Understanding Failure Domains in Enterprise Networks

Published Thursday, 15 January 2026

Failures are inevitable in any network, but outages are not. The difference between the two is explained by the concept of “Failure Domains”, and controlling them is what separates network architecture from simple network design.

What is a Failure Domain?

A failure domain is the part of the system that is affected together when something fails.

It is not the device that failed.

It is not the protocol that misbehaved.

It is the impact boundary of that failure.

If a single link goes down and only one switch is affected, the failure domain is small.

If the same link goes down and takes down multiple sites, services or users, the failure domain is large.

Architecture is about controlling that boundary.
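To make the idea of an impact boundary concrete, here is a minimal sketch that computes which nodes lose reachability to a core node when a link fails. The topology, the node names and the `failure_domain` helper are all hypothetical, invented for illustration; a real network would also need to account for routing policy, not just physical connectivity.

```python
from collections import deque


def reachable(adjacency, start, failed_links=frozenset()):
    """Breadth-first search that skips failed links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            link = frozenset((node, neighbor))
            if neighbor in seen or link in failed_links:
                continue
            seen.add(neighbor)
            queue.append(neighbor)
    return seen


def failure_domain(adjacency, root, failed_links):
    """Nodes that can reach root before the failure but not after it."""
    failed = frozenset(frozenset(link) for link in failed_links)
    return reachable(adjacency, root) - reachable(adjacency, root, failed)


# Hypothetical topology: one core, two distribution switches,
# and single-homed access switches.
TOPOLOGY = {
    "core": ["dist1", "dist2"],
    "dist1": ["core", "acc1", "acc2"],
    "dist2": ["core", "acc3", "acc4"],
    "acc1": ["dist1"],
    "acc2": ["dist1"],
    "acc3": ["dist2"],
    "acc4": ["dist2"],
}

# An access uplink fails: only one switch is affected.
small = failure_domain(TOPOLOGY, "core", [("dist1", "acc1")])

# A core uplink fails: everything behind dist2 is affected.
large = failure_domain(TOPOLOGY, "core", [("core", "dist2")])

print(sorted(small))
print(sorted(large))
```

Both events are a single link failure. What differs is how much of the network sat behind that link, which is exactly the boundary the architecture is supposed to control.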

Failure vs Outage

A failure is a technical event while an outage is a business event.

Examples of failures: a link going down, a device crashing, a process restarting.

Examples of outages: users losing access, services becoming unavailable.

Since the failure domains determine whether a failure remains a minor event or turns into a widespread outage, good architectures assume failures will happen and focus on limiting how far the impact can spread.

Why does this matter?

Network outages happen not because something has failed, but because too much depended on the same thing. Sometimes redundancy exists, but the shared dependencies behind it are overlooked.
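A shared dependency can be spotted by listing what each supposedly redundant path relies on and intersecting the lists. The sketch below does exactly that; the uplink names and the dependency inventory are made up for illustration, and in practice the hard part is building an honest inventory, not the intersection.

```python
def shared_dependencies(paths):
    """Return the dependencies common to every path in the mapping."""
    dependency_sets = [set(deps) for deps in paths.values()]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# Hypothetical inventory: two "redundant" uplinks and what each relies on.
UPLINKS = {
    "isp_a": ["router1", "pdu_1", "fiber_duct_north", "dns_central"],
    "isp_b": ["router2", "pdu_2", "fiber_duct_north", "dns_central"],
}

common = shared_dependencies(UPLINKS)
print(sorted(common))
```

Here the two uplinks use separate routers and power feeds, yet both run through the same fiber duct and rely on the same DNS service. Anything in the intersection is a single point of failure that the redundancy does not cover.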

Common Failure Domains in Enterprise Networks

Failure domains can be grouped into two types: physical and logical.

Physical failure domains include:

Device Level
Rack Level
Data Center
Branch
Availability Zone
Region

Logical failure domains include:

Routing Areas in OSPF
Management Plane
VLAN
VRF
Services like DNS, NTP

For example, if a NIC malfunctions and the server has no secondary NIC plugged in, the entire server becomes the failure domain, at the device level.

If a power cable fails, causing a router to go down and disrupting a branch office, the root cause is the physical failure of the power cable, the mechanism is the device failure and the failure domain is the entire branch.

A centralized DHCP server leases IP addresses to all hosts, Wi-Fi clients and IP phones, including those at branch locations. If it is the only DHCP server, its failure domain is the entire network.

Suppose you have experienced TCAM exhaustion on distribution switches in a network where every node resides in OSPF Area 0. When a senior engineer suggests redesigning the network with multiple OSPF areas, they are suggesting a way to shrink the failure domain.
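One reason multiple areas can help is that area boundaries give you a place to summarize, so the rest of the network holds one prefix instead of many. The sketch below uses Python's standard `ipaddress` module to show the effect on route count. The `10.20.x.0/24` addressing plan is an assumption made up for this example, and the sketch only does the prefix arithmetic; it does not model OSPF itself.

```python
import ipaddress

# Hypothetical addressing plan: sixteen contiguous /24 subnets in one area.
area_prefixes = [
    ipaddress.ip_network(f"10.20.{third_octet}.0/24")
    for third_octet in range(16, 32)
]

# collapse_addresses merges contiguous networks into the fewest
# covering prefixes, which is the arithmetic behind a summary route.
summary = list(ipaddress.collapse_addresses(area_prefixes))

print(len(area_prefixes), "prefixes inside the area")
print(len(summary), "prefix advertised outside:", summary[0])
```

With a summary in place, a flapping subnet inside the area no longer changes what the other areas see, and switches outside the area carry one entry where they previously carried sixteen. This only works when the addressing plan is contiguous per area, which is a design decision in its own right.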

How Do Architects Design Failure Boundaries?

We cannot eliminate failures, but we can shape where failures are allowed to go.

This is done by intentionally defining the failure domains, keeping them small and localized, and aligning them with operational boundaries.

You might have heard the advice: when planning a simple VLAN design, keep VLANs localized. Now you understand the reason behind it.

Common strategies for eliminating or shrinking failure domains:

Redundancy
Summarization & Filtering
Network segmentation (VLANs, VRFs)
Micro-segmentation

Things don’t always go as planned. During the design phase we might assume an impact will stay confined to a specific area, but in reality it can spread unexpectedly to other areas. The actual extent of the impact is referred to as the “Blast Radius.”
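The gap between the designed failure domain and the blast radius can be expressed as a simple set difference. This is only a sketch: the `spillover` function and the device names are hypothetical, and the observed set would normally come from monitoring or an incident review.

```python
def spillover(designed_domain, blast_radius):
    """Elements that were impacted but were not supposed to be."""
    return set(blast_radius) - set(designed_domain)


# Hypothetical incident: a branch router failure was expected to stay
# inside the branch, but a service at headquarters was hit as well.
designed = {"branch12-router", "branch12-switch", "branch12-wifi"}
observed = {"branch12-router", "branch12-switch", "branch12-wifi", "hq-voip"}

unexpected = spillover(designed, observed)
print(sorted(unexpected))
```

Anything left in the result points to a dependency the design did not account for, which makes it a useful starting point for a post-incident review.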

In summary:

Failures will happen.

Links will go down.

Devices will crash.

The role of the architect is not to prevent these events, but to decide how much of the system is affected when they occur, and to shrink that impact to match the operational boundaries.

Failures are inevitable; outages are architectural decisions.

Posted in Architecture Foundations

What makes a Network “Architecture” vs a Design Diagram?

Published Monday, 12 January 2026

Most network projects start with a diagram, but a diagram by itself does not represent an architecture.

Most of the time, when an engineer is asked to deliver an architecture, what they actually deliver is just a diagram of boxes, lines, labels, IP ranges, and zones. Over time, this has become so normalized that even clients often see the diagram as the architecture.

This is a typical experience when working with people who have spent most of their time configuring systems by following guides. If you are one of them, this post explains what architecture really means when someone asks you to hand it over.

A design diagram shows what is connected.
Architecture explains why it is connected that way.

The difference may seem subtle, but it is what decides whether a network turns out troublesome or trouble-free.

What Does a Design Diagram Show Us?

What devices are in the network?
How are they connected?
Where are the firewalls, routers and switches?
What VLANs or subnets exist?

Basically, it focuses on structure only.

The main purpose of design diagrams is to help engineers build, troubleshoot and communicate. But if you think about it for a moment, the same diagram can behave very differently in production in two different environments. In other words, a diagram can be correct and still represent a fragile system.

What Does an Architecture Give Us?

Why was this kind of topology chosen?
What trade-offs were made between cost, complexity and resiliency?
Where are failure domains intentionally placed?
What assumptions does this design rely on?
What kind of failures is this network designed to survive?

Architecture captures intent, constraints and decisions, which a diagram alone can never cover.

How Does This Relate to the Architecture of a House or Building?

A house floor plan shows walls, doors, windows and rooms.

Architecture explains why the bedrooms are placed away from noise, how airflow and lighting are handled, and so on.

The same applies to networking.

What Diagrams Don't Show (but Architecture Must)

Failure domains (what breaks when a link, device or control plane fails?)
Blast radius (how far does an incident propagate before it is contained?)
Operational Simplicity (how easy is this network to operate, even on its worst days?)
Security Boundaries (where is trust enforced and why those points?)
Growth Paths (what needs to change when the network has to expand?)

Architecture should answer these questions before reality asks them.

Why does this matter?

Most troublesome networks are not caused by a missing device or link in a diagram. They are caused by architectural blind spots such as hidden dependencies, unclear failure behaviours, overlapping responsibilities and unnecessary complexity.

That is also why network architects exist, and what they are paid for.

A diagram becomes an architecture when it has a story behind it, that is, when it is accompanied by:

Clear reasoning for each major decision
An understanding of trade-offs
Explicit assumptions
Awareness of failure scenarios
Consideration for operations and humans

Architecture is the story behind the diagram.
Without the story, the diagram is just a picture.

I started this blog to document those stories: the decisions, trade-offs and reasoning that turn network diagrams into real architectures. If you are interested, stay tuned.

Posted in Architecture Foundations
