Archive for 2026

Centralized vs Distributed Internet Breakout

"Should internet traffic exit the network at a central DC or directly at the branch?" is a well-known debate among engineers, especially since the cloud emerged. Here is an architect's view on the matter.

Centralized Internet Breakout

All branch traffic is backhauled to a central data center, where the internet links terminate and firewalls, proxies, and IDS/IPS appliances are placed.

Organizations choose this model for:
  • Easy security enforcement
  • Centralized policy control
  • Easier compliance & logging
  • Traditional MPLS-based approach

Drawbacks:
  • Increased latency
  • WAN bandwidth consumption
  • Internet link bandwidth consumption
  • DC becomes a large failure domain

This model worked perfectly when all applications were hosted in the DC, internet usage was limited to certain requirements, and MPLS was dominant.

But cloud changed this landscape.

Distributed Internet Breakout (Direct Internet Access / DIA)

This is where each branch / site has its own internet connection, local firewalls / secure web gateways, and direct SaaS access.

Why it was adopted widely:
  • Optimized SaaS performance
  • Lower latency
  • Reduced WAN link costs due to lower bandwidth consumption
  • Split failure domains

Drawbacks:
  • Security Policy consistency becomes harder
  • Larger attack surface
  • More distributed devices to manage
  • Need to purchase / manage many internet circuits

Note that most of the management-related drawbacks can be countered with innovations in SD-WAN technologies.

The Real Decision Making Point

It's not about security enforcement, device management or circuit management; it's about the user experience of cloud-based applications, especially SaaS traffic like M365, Teams, Zoom and CRMs. That is, in fact, the whole point of the architecture.

Modern enterprise architectures often combine distributed breakouts with central policy control via SASE / cloud security platforms, plus a centralized path for sensitive applications.
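As an illustrative sketch of that hybrid model (the application categories and path names below are hypothetical, not from any specific SD-WAN or SASE vendor), the decision can be expressed as a simple per-application policy lookup:

```python
# Hypothetical sketch of a hybrid breakout policy: trusted SaaS goes
# direct, sensitive traffic is backhauled to the DC, and everything
# else is steered to a cloud security (SASE) PoP for inspection.

BREAKOUT_POLICY = {
    "trusted_saas": "local_breakout",      # e.g. M365, Teams, Zoom
    "sensitive":    "backhaul_to_dc",      # e.g. finance, ERP
    "general_web":  "cloud_security_pop",  # SASE inspection
}

def next_hop(app_category: str) -> str:
    """Return the breakout path for an application category.
    Unknown categories default to the cloud security PoP, the
    safest choice in this illustrative policy."""
    return BREAKOUT_POLICY.get(app_category, "cloud_security_pop")

print(next_hop("trusted_saas"))  # local_breakout
print(next_hop("unknown_app"))   # cloud_security_pop
```

The point of the sketch is the shape of the decision, not the categories themselves: distributed exits for experience-sensitive SaaS, a centralized path only where the data justifies it.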

Final Thought

Centralized breakout optimizes control.
Distributed breakout optimizes performance.

The best designs use both, intentionally, for different application needs.

If SaaS traffic hairpins through a data center 1,000 km away, you are paying for MPLS unnecessarily and getting a poor cloud experience.
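To put a rough number on that hairpin penalty: light in fiber travels at roughly 200,000 km/s, so every 1,000 km of detour adds about 10 ms of round-trip time before any queuing or processing delay. A back-of-the-envelope sketch:

```python
# Propagation-delay-only estimate of the hairpin penalty.
# ~200 km per millisecond is the usual rule of thumb for fiber.
FIBER_SPEED_KM_PER_MS = 200.0

def hairpin_rtt_penalty_ms(detour_km: float) -> float:
    """Extra round-trip time added by backhauling traffic detour_km
    away from the direct path (out and back)."""
    return 2 * detour_km / FIBER_SPEED_KM_PER_MS

print(hairpin_rtt_penalty_ms(1000))  # 10.0 ms added RTT
```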


MPLS vs Internet VPN vs SD-WAN - A Decision Framework

When SD-WAN first emerged and became the industry trend, many network engineers quickly declared:

“MPLS is dead.”
“SD-WAN fixes everything.”

But even after a decade, that narrative hasn’t proven true, and it never will.

The reason is simple:

SD-WAN is not a transport.
It is an orchestration layer.

MPLS, on the other hand, is a transport service.

SD-WAN does not replace MPLS by default; it can actually use MPLS as one of its underlay transports. It sits above the transports (MPLS, Internet, LTE, etc.) and intelligently steers traffic across them.

The right solution starts with the business requirement and depends on what you're optimizing for, not on what's trending.


Before choosing the technology, ask:
  • Does uptime have an SLA?
  • Is application performance critical to revenue?
  • Are sites in remote or unstable ISP regions?
  • Is traffic mostly SaaS and cloud based?
  • Is security centralized or distributed?

Architecture follows answers.
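As a toy illustration of "architecture follows answers", the five questions above can drive a transport recommendation. The rules below are hypothetical and deliberately oversimplified; real decisions also weigh cost, geography and operations:

```python
# Hypothetical decision sketch mapping the five questions above to a
# transport recommendation. Illustrative rules, not a formal method.

def recommend(sla_required: bool, perf_critical: bool,
              unstable_isp: bool, mostly_saas: bool,
              many_transports: bool) -> str:
    if many_transports or (perf_critical and mostly_saas):
        return "SD-WAN"        # orchestrate several imperfect links
    if sla_required or unstable_isp:
        return "MPLS"          # carrier-backed commitment
    return "Internet VPN"      # budget-first, best effort

print(recommend(sla_required=False, perf_critical=True,
                unstable_isp=False, mostly_saas=True,
                many_transports=False))  # SD-WAN
```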

MPLS - Commitment First

Best when:
  • client needs predictable performance
  • low latency between branches is required
  • client needs a carrier-backed SLA
  • a private circuit is required
  • controlled routing is needed

Weaknesses:
  • Expensive
  • Slow to provision
  • Harder cloud breakout

Commitment is not something everyone asks for, and it comes at a high price.

Internet VPN - Budget First

Best when:
  • sites are small
  • client is an SMB
  • budget is constrained
  • traffic is mostly SaaS
  • downtime tolerance is reasonable
  • fast deployment
  • a simple design is preferred

Weaknesses:
  • No SLA guarantees
  • ISP path unpredictability
  • Performance variability

Budget should not be the only criterion for choosing a technology; if it is, you are doing procurement, not architecture.

SD-WAN - Automated Control

Best when:
  • client has many transports which need orchestration
  • applications have different priorities
  • client needs centralized policy control
  • cloud first architecture is in place
  • dynamic steering needed
  • application aware routing
  • better bandwidth utilization
  • integrated security (some vendors)

Weaknesses:
  • Careful planning needed to achieve results
  • Operational complexity
  • Vendor lock-in
  • Added costs

SD-WAN utilizes the MPLS and Internet VPN underlays, performing orchestration over imperfect links automatically to achieve the desired results.

Instead of asking "What's better?" ask:
  • What is the cost of application unreliability?
  • What is the cost of latency to the business?
  • Which is preferred: guaranteed performance or intelligent adaptation?
  • Do you need to optimize for stability or flexibility?
  • Is your team ready to operate a policy-driven WAN?

Final Thought

SD-WAN is clearly the future for most global-scale enterprises; yet for some, a well-designed MPLS VPN or even an Internet VPN can still be entirely sufficient, and even ideal in certain scenarios, even in 2026. Honestly, it depends on the trade-offs: aligning transport characteristics with business impact.


Why “No Single Point of Failure” Is Often Misunderstood

Many engineers interpret "no single point of failure" as adding a secondary firewall, adding a secondary ISP, stacking the switches, clustering the servers, and so on. Technically, yes, they may have removed the single point of failure from the diagram, but they may not have removed the failure domain.

Failure is about Systems, not Devices

A single point of failure is not just about the devices we can see.
It can be:
  • Same routing path
  • Shared power source
  • Common software version
  • Single control plane
  • Single team that understands how it works

As an example: if both firewalls depend on the same upstream link, the same software version or the same configuration, you still have a single point of failure even though the design looks redundant.

Logical vs Physical Redundancy

Even though you have 2 core switches, 2 links and 2 power supplies, if both share the same control plane or the same management plane, true redundancy is not there. That's why we often call it merely "hardware redundancy".

"No single point of failure" must be evaluated logically, not visually.

Logical failures happen more often than hardware failures, and the biggest mistake is to ignore them and assume the design is perfectly fine.

Examples:
  • Same firmware bug crashing both HA nodes
  • Same BGP policy mistake propagating everywhere
  • Same automation script pushing a bad config

Redundant components don't protect the network from shared risk.
True resilience requires independence of failure domains.
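One way to make shared risk concrete: model each "redundant" node by its dependency set and look for intersections. The node and dependency names below are invented for illustration; any non-empty intersection is a hidden single point of failure:

```python
# Hypothetical sketch: find hidden shared dependencies between nodes
# that are supposed to be redundant. A non-empty intersection is a
# shared risk the HA diagram does not show.

deps = {
    "fw1": {"psu_feed_a", "fw_os_7.2", "upstream_link_1"},
    "fw2": {"psu_feed_a", "fw_os_7.2", "upstream_link_2"},
}

shared = set.intersection(*deps.values())
print(sorted(shared))  # ['fw_os_7.2', 'psu_feed_a'] -- fail together
```

Here the pair survives a single upstream link failure, but a bad firmware release or the loss of power feed A still takes down both nodes at once.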

Can this Truly be Achieved?

Engineers ask "Do we have redundancy?"
Architects ask "What event could still take this entire service down?"

That architect's question reveals hidden dependencies, shared risks, operational risks, human factors and more.

The pursuit of zero single points of failure increases complexity and cost; if you think about it deeply, you will realize it is like infinity: something you can approach but never reach. Trying to eliminate every SPoF can also introduce new problems, such as complexity that leads to new SPoFs which were not there originally. So it all depends on the criticality of the applications and the client's willingness to invest in mitigating them.

So it should really mean:
  • Independent failure domains
  • Predictable failover behaviour
  • Tested recovery paths
  • Clear operational ownership
  • A well-understood blast radius

Not just two of everything..


Designing for Maintenance without Downtime

Some outages are caused by failures.
A lot of outages are caused by maintenance.


Upgrades, patches, configuration changes, certificate renewals and hardware replacements are necessary, but they are risky.

If your design cannot survive maintenance, it is not resilient, it is fragile.
Maintenance is a design requirement in environments where availability matters.

It is a design requirement just like Performance, Scale, Redundancy and Security, but it is rarely addressed.

Eliminate Single Maintenance Domains

A single failure domain is dangerous. A single maintenance domain is worse.
If upgrading one device requires:
  • Taking down the only firewall
  • Rebooting the only core switch
  • Restarting the only call manager node
You don't have high availability.

Design principles like the following should be in place, because maintenance should shift traffic, not stop it:
  • Dual Data Centers (DC and DR)
  • Dual Firewalls (HA)
  • Symmetric Design
  • Load balancers
  • Multiple application nodes
  • Independent Control Planes
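"Shift traffic, not stop it" can be sketched as draining a node before touching it. The weights and node names below are hypothetical, standing in for whatever load balancer or HA mechanism is in front of the pair:

```python
# Hypothetical sketch: drain one node of a pair by zeroing its weight
# so new flows land on its peer, then upgrade it with the service up.

weights = {"fw1": 50, "fw2": 50}   # active-active pair, equal share

def drain(node: str) -> None:
    """Stop sending new flows to `node`; peers absorb its share."""
    weights[node] = 0

def pick_node() -> str:
    # Pick the node carrying traffic (real LBs hash per flow).
    return max(weights, key=weights.get)

drain("fw1")          # start the maintenance window for fw1
print(pick_node())    # fw2 -- service continues during the upgrade
```

After the upgrade, the weight is restored and the peer is drained in turn; at no point is downtime required.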

Imagine a service provider environment running hundreds of firewalls, routers and switches. If frequent upgrades, patches and reboots cause service interruptions, subscribers will soon find another provider.

Separate Control, Data and Management Planes

Many outages during maintenance are associated with:
  • Management traffic sharing the same path as production
  • Control plane overload impacting forwarding
  • Logging floods crashing devices during upgrades

This is why you should consider adding the following into a design:
  • Out-of-Band Management
  • Proper CoPP
  • Solutions with separated Control Planes
  • Dedicated path for Management traffic 
  • Log rate controls

Simulate Maintenance

Everyone tests link failures. If you are an architect designing mission-critical networks, also test maintenance scenarios and document them before go-live, at least in a simulation environment.

Simulate:
  • Version mismatches
  • Cluster upgrades
  • License renewals
  • Certificate expirations

Make Downtime a Choice for the Client - Not a Necessity

A design is good if:
  • Downtime for maintenance is optional
  • Maintenance windows are for caution only
  • Change risk is engineered down

If your architecture requires downtime for routine maintenance, it's not production-ready.

There is a famous saying in IT:
"If everything works, don't touch it."
I believe this mindset stems from bad architectural work done over a long time, almost everywhere. If you want to be an admired architect, try designing networks that engineers are not afraid to maintain..


Active-Standby vs Active-Active: When to Use Each

When planning redundancy, especially in firewall deployments, you typically have two options: Active-Standby or Active-Active. Selecting the right model depends on understanding when each design is appropriate and where it makes the most sense to use it.



Active-Standby: The Comfort Design

One device is working, the other waits.
Simple, predictable, troubleshooting friendly.

Split-brain scenarios can occur where both devices become active, often due to software bugs or heartbeat communication failures. That's why it's best practice to connect the heartbeat links directly, avoiding reliance on intermediate devices that could interrupt those signals.
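To illustrate why the heartbeat path matters, here is a toy simulation (the election logic is invented for illustration, not any vendor's HA implementation). If the heartbeat traverses an intermediate switch and that switch fails while both nodes are healthy, each side assumes its peer is dead:

```python
# Toy split-brain sketch: a node goes active whenever it stops hearing
# its peer's heartbeat. With a direct cable, silence means the peer is
# really down; via an intermediate switch, silence may just mean the
# switch died -- and then BOTH nodes go active.

def role(hears_peer: bool, priority_is_higher: bool) -> str:
    if not hears_peer:
        return "active"  # assume the peer is dead
    return "active" if priority_is_higher else "standby"

# Healthy heartbeat: clean active/standby election.
print(role(hears_peer=True, priority_is_higher=True))    # active
print(role(hears_peer=True, priority_is_higher=False))   # standby

# Heartbeat path failed while both nodes are fine: split brain.
print(role(hears_peer=False, priority_is_higher=True))   # active
print(role(hears_peer=False, priority_is_higher=False))  # active
```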

Active-Active: The Performance Design

Both devices work at the same time.
More throughput and better utilization, but an unpredictable forwarding path and a troubleshooting nightmare.

Asymmetric routing issues can occur, therefore careful design is needed. Shared control-plane dependencies can make maintenance harder, and one node can fail silently in this mode.

Common Use Cases

Active-Standby is the most common choice for enterprise networks.

Examples are:
  • Enterprise Campuses / Corporate HQs (1-10Gbps internet edges, common enterprise apps)
  • Environments with Heavy Stateful Inspection (NAT, DPI, SSL Decryption, IPS)
  • Organizations with Small Network Teams

Customers would love Active-Active designs, since it appears that both of the devices they are paying for are being utilized; but in enterprise networks, Active-Active can be justified only in certain situations.

Active-Active becomes necessary when performance or continuity requirements are strict.
Examples are:
  • High-Throughput Data Centers (40G/100G Links, East-West traffic, Spine-leaf Archs)
  • Financial Trading Platforms (microseconds matter)
  • Service Providers (Massive traffic, Future horizontal scaling)

As you can see, Active-Active is typically deployed in environments where traffic symmetry is built into the design, or any asymmetry is intentionally engineered and tightly controlled. These setups commonly leverage ECMP, require horizontal scaling for future growth, and are supported by teams with the expertise to manage the added complexity.
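A rough sketch of why Active-Active depends on engineered symmetry: flows are typically pinned to one node by hashing the 5-tuple, so both directions of a flow must land on the same stateful node (or state must be shared). The hash below is a toy stand-in, not any vendor's ECMP algorithm:

```python
# Toy ECMP-style sketch: pin each flow to one node of an active-active
# pair by hashing its 5-tuple. If the return path hashes elsewhere
# (asymmetry), the stateful peer never saw the SYN and drops the flow.
import hashlib

NODES = ["fw1", "fw2"]

def pin(src, dst, sport, dport, proto="tcp"):
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return NODES[digest % len(NODES)]

fwd = pin("10.0.0.5", "192.0.2.10", 40000, 443)
same = pin("10.0.0.5", "192.0.2.10", 40000, 443)
print(fwd == same)  # True -- the same flow always lands on one node
```

This is why Active-Active designs lean on ECMP with consistent hashing, or on state synchronization between nodes, to keep asymmetry controlled.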


Redundancy vs Resiliency - Why both Matter?

Redundancy does not automatically equal resiliency. The two are related, but they are not the same.
Redundancy can be achieved by adding backup links, placing a standby firewall, clustering switches and so on, but resiliency may still not have been achieved.


Redundancy is the "spare tyre" in your car: it's all about having a backup. It's good to have a backup, but what are the chances of both components failing at once?

Are both of them sharing the same power source? The same upstream provider? The same software version? The same configuration, synced in real time? The same failure domain? If yes, they do not have resiliency..

Resiliency is the Ability to Survive Failure

Resiliency is architectural. It answers bigger questions like:
  • What happens if core fails?
  • What happens if control plane is poisoned?
  • What happens if DNS dies?
  • What happens if an engineer makes a mistake?

Resiliency is about;
  • Failure isolation
  • Blast radius control
  • Independent failure domains
  • Fast detection and recovery

Resiliency is a property of a network.

If you’ve observed a mission-critical network, an airport for example, you’ll often see two separate core switch clusters of two different models, deployed in two different switch rooms, supporting the routing.

The clustered core switch pair at each location is redundancy.

But when there’s an entirely separate core switch cluster in a different physical location, capable of taking over if the primary site fails, that is resiliency.

How Architects Think Differently

Engineers ask "What should we duplicate?"
Architects ask "What must not fail together?"

Resilient design means:
  • Different carriers
  • Different physical paths
  • Segmented failure domains
  • Independent control planes
  • Thoughtful simplification

Why both Matter?

Without redundancy, you have a single point of failure.
Without resiliency, you may have a systemic collapse.

or in other words,

Redundancy reduces probability while Resiliency reduces impact.
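The probability side of that statement can be sketched numerically. Assuming (hypothetically) that each device fails independently with probability p per year, a redundant pair fails together with probability p squared; but if the pair shares a dependency that fails with probability q, the whole pair is only as good as that shared dependency:

```python
# Hypothetical numbers: redundancy multiplies independent failure
# probabilities, but a shared dependency caps the whole pair.
p = 0.02   # assumed yearly failure probability of one device
q = 0.01   # assumed yearly failure probability of a shared power feed

independent_pair = p * p                      # both fail on their own
with_shared_dep = 1 - (1 - p * p) * (1 - q)   # pair OR shared feed fails

print(f"{independent_pair:.6f}")  # 0.000400
print(f"{with_shared_dep:.6f}")   # 0.010396
```

With these illustrative numbers, the shared power feed makes the "redundant" pair roughly 25 times more likely to fail as a whole, which is exactly why resiliency demands independent failure domains, not just duplicated hardware.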

Ultimately, both come at a cost. Whether an organization chooses to invest in redundancy, resiliency, or both depends on how critical uninterrupted operations are to the business and how much risk it is willing to tolerate.


Common Mistakes Engineers Make when Thinking like Architects

At some point in their careers, most engineers are asked to make architectural decisions, or at least to propose solutions for business requirements.

In such situations, thinking like an architect is crucial to achieving the business objective, but unfortunately there are mistakes many engineers make when they don't shift their thinking from engineer to architect.

Engineers focus on making it work somehow.
Architects focus on achieving business goals.

Mistake #1: Believing Deep Technical Knowledge equals Architectural Thinking

Engineering rewards deep dives into technical concepts.
If you deeply understand BGP, STP, MPLS, Firewall traffic flow; you are valuable as an engineer.

And yes, strong technical knowledge is essential, but architecture is about:
  • Trade-offs
  • Constraints
  • Competing priorities
  • Risk distribution
  • Business Impact
Knowing how something works is engineering.
Knowing when and whether to use it is architecture.
This mistake alone can add unnecessary complexity, causing high expenses and fragile network designs.

Mistake #2: Choosing the Technology they Love, Inbox Thinking

Engineers try to solve problems with what they love. As an example, someone who loves Cisco technologies will try to apply them everywhere, even where Cisco is not the strongest fit, where the business cannot afford it, or where better alternatives exist.

Architecture is not about personal preference.

It is about selecting what best fits the business need, budget, operational model, and long-term sustainability.

Loyalty to technology should never override alignment with business reality.

Mistake #3: Solving Local Problems instead of Systemic Weaknesses

Engineers are trained to fix what breaks.
  • Link flapping? Adjust timers
  • High CPU? Tune processes
  • Security Alert? Add a rule

Architectural thinking asks: what decision allowed this to happen, what is the blast radius and how can it be shrunk if it happens again, and what can be introduced to keep the user experience acceptable even when it does?

Engineers fix symptoms, Architects redesign boundaries.

Mistake #4: Optimizing Components instead of Balancing the Whole System

Engineers love optimizations.
  • Best convergence time
  • Lowest Latency
  • Maximum throughput
  • Tightest Security Policy

But architecture is about balance, not technical perfection that adds caveats.

Improving one area may;
  • Increase operational complexity
  • Increase failure probability
  • Increase troubleshooting time
  • Increase dependencies on specific experts

If you optimize without evaluating trade-offs, you are still thinking like an Engineer..

As an example: using PVST+ instead of RSTP in a campus environment where device availability is monitored by an NMS.
In large campus networks with many switches and redundant links, availability is often monitored from a NOC.
RSTP converges very fast, which is technically superior.
But because it converges so quickly, the NMS may not detect the link failure.
No alarm triggers at 2:00 AM.
The on-shift engineer continues sleeping.
No one realizes a redundant link is already down.

The network appears healthy, but its resiliency is reduced.

Architecture considers monitoring systems, operational workflows, and human behaviour; not just protocol performance.

Mistake #5: Ignoring the Human System

Engineers like technological marvels. A design may be technically brilliant, but if it is not designed for the people who are going to work with it, it's fragile.

Systems fail because of people more often than because of traffic.

Architectural thinking always includes the human layer, which engineers normally overlook.

The Real Shift

Engineers assume architecture is just larger-scale engineering, but it is not..

Engineering asks, 
What to configure to achieve this?

Architecture asks,
What trade-offs am I choosing to achieve this?

When you shift your thinking:
  • From technical perfection to balance
  • From technologies you love to technologies that fit
  • From configuration to business experience
  • From “somehow working” to “strategically aligned”
  • From optimization to trade-offs

that is where engineering ends and architecture begins..

So, do you know any other mistakes engineers make while performing architectural work?


Why Simplicity is a Feature of a Network Architecture

In real-world networks, amateur engineers often expect designs to be complex. Many assume that if a network is difficult for an average-skilled engineer to understand, it must be a great design.

In reality, the opposite is true.
Simplicity is one of the hardest architectural qualities to achieve and one of the most valuable.

In other words,
Complexity is easy to add; simplicity must be designed..
Even a small network can become complex if it is poorly designed. New firewall rules, additional routing protocols, temporary workarounds that become permanent over time: these are situations we see all the time in real-world networks. The network will function, but no one fully understands how it works.

The problem isn't the number of devices or technologies. The real issue is unnecessary variation, which makes networks fragile.

Simplicity Improves Predictability

A simple network is predictable which means:
  • Failures behave as expected
  • Changes have limited blast radius
  • Engineers can predict outcomes before deploying

This significantly reduces operational cost. Clients do not need elite or expensive skill sets to operate the network; average-skilled engineers can manage it confidently.

Operational Simplicity beats Feature Richness

Vendors often sell their networking products / devices based on features. Architects should focus on operational simplicity.

After you design a network for a client, ask yourself:
  • Can this be explained to an average engineer in 15 mins?
  • Can it be troubleshot at 2:00 AM under pressure by an average engineer?
  • Can changes be made without unintended side effects?

If the answer is "no" to any of the above questions, the design may be technically correct and it will work, but it is operationally weak; hence, you haven't designed it well.

You should focus on:
  • Fewer protocols
  • Clear responsibility boundaries
  • Consistent patterns repeated everywhere possible

Simplicity Enables Faster Recovery

When there is an outage or failure, speedy recovery matters; a complex network is harder and takes longer to recover, causing business losses.

Simple networks:
  • Reduce the number of possible failure causes
  • Shorten Mean Time To Recovery (MTTR)

Designing for Simplicity is a Skill

Simplicity doesn't mean "basic" or "cheap"; it means intentional trade-offs aligned with the business requirements.

Good architects will not use:
  • Many protocols without clear value
  • Redundant mechanisms solving the same problem

True architectural skill is not adding more just because one knows more or wants to showcase expertise.
It is removing what is not needed.

Final Thought

A simple network is not one with fewer devices.
It's one where every device has a clear purpose, every protocol has a reason, and any average engineer understands the design and how it works.
Simplicity is the result of disciplined architecture.


Understanding Failure Domains in Enterprise Networks

Failures are inevitable in any network, but outages are not; the difference between the two can be explained by the concept of "failure domains", and addressing them is what separates network architecture from simple network design.

What is a Failure Domain?


A failure domain is the part of the system that is affected together when something fails..

It is not the device that failed..
It is not the protocol that misbehaved..
It is the impact boundary of that failure..

If a single link goes down and only one switch is affected, the failure domain is small.
If the same link goes down and takes down multiple sites, services or users, the failure domain is large.

Architecture is about controlling that boundary..

Failure vs Outage

A failure is a technical event, while an outage is a business event.
Examples of failures: link downs, device crashes, process restarts.
Examples of outages: users lose access, services stop.

Since the failure domains determine whether a failure remains a minor event or turns into a widespread outage, good architectures assume failures will happen and focus on limiting how far the impact can spread.

Why do these things matter?

Network outages happen not merely because something failed, but because too much depended on the same thing. Sometimes redundancy even exists, but shared dependencies are ignored.

Common Failure Domains in Enterprise Networks

Failure domains can be grouped into two types: physical and logical.

Physical failure domains include:
  • Device Level
  • Rack Level
  • Data Center
  • Branch
  • Availability Zone
  • Region

Logical failure domains include:
  • Routing Areas in OSPF
  • Management Plane
  • VLAN
  • VRF
  • Services like DNS, NTP

As an example, if a NIC malfunctions, the entire server becomes a device-level failure domain if there is no secondary NIC plugged in.

If a power cable fails, causing a router to go down and disrupting a branch office, the root cause is the physical failure of the power cable, the mechanism is the device failure and the failure domain is the entire branch.

A centralized DHCP server leases IPs to all hosts, wifi clients and IP phones even at branch locations, and if it's the only DHCP server, the entire network is the failure domain for that DHCP server.

If you have experienced TCAM exhaustion on distribution switches where all nodes reside in OSPF Area 0, and a senior engineer suggests redesigning the network with multiple OSPF areas, he is suggesting shrinking the failure domain.

How Architects Design the Failure Boundaries?

We cannot eliminate failures, but we can shape where failures are allowed to go.
This is done by intentionally defining the failure domains, keeping them small and localized, aligning them with operational boundaries etc.

You might have heard the advice: when planning a simple VLAN design, keep VLANs localized. Now you understand the reason behind it.

Common strategies for eliminating or shrinking failure domains:
  • Redundancy
  • Summarization & Filtering
  • Network segmentation (VLANs, VRFs)
  • Micro-segmentation

Well, things don’t always go as planned. We might assume during the design phase that an impact will remain confined to a specific area, but in reality it can unexpectedly spread to other areas. This actual extent of impact is referred to as the “blast radius.”
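The blast radius can be reasoned about by modelling dependencies as a graph and computing everything reachable from a failed component. The component names and dependency edges below are invented for illustration, echoing the power-cable and DHCP examples above:

```python
# Hypothetical sketch: who is impacted when a component fails?
# Edges point from a component to the things that depend on it, so
# the blast radius is everything reachable from the failure.

DEPENDENTS = {
    "power_cable":   ["branch_router"],
    "branch_router": ["branch_lan"],
    "branch_lan":    ["branch_users", "branch_phones"],
    "dhcp_server":   ["branch_lan", "hq_lan"],
}

def blast_radius(failed: str) -> set:
    impacted, stack = set(), [failed]
    while stack:
        node = stack.pop()
        for dep in DEPENDENTS.get(node, []):
            if dep not in impacted:
                impacted.add(dep)
                stack.append(dep)
    return impacted

print(sorted(blast_radius("power_cable")))
# ['branch_lan', 'branch_phones', 'branch_router', 'branch_users']
```

Shrinking a failure domain is, in graph terms, cutting or duplicating edges so that less of the system is reachable from any single failure.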

So as a summary;
Failures will happen,
Links will go down,
Devices will crash,

The role of the architect is not to prevent these events, but to decide how much of the system is affected when they occur and shrink it to match the operational boundaries.
Failures are inevitable, Outages are architectural decisions..


What makes a Network “Architecture” vs a Design Diagram?

Most network projects start with a diagram, but a diagram by itself does not represent an architecture.

Most of the time, when an engineer is asked to deliver an architecture, what they actually deliver is just a diagram of boxes, lines, labels, IP ranges, and zones. Over time, this has become so normalized that even clients often see the diagram as the architecture.

This is a typical experience when working with people who have spent most of their time configuring systems by following guides. If you are one of them, this post explains what architecture really means when someone asks you to hand it over.

A design diagram shows what is connected.
Architecture explains why it is connected that way..
The difference may seem subtle, but it is where most troublesome, or trouble-free, networks are decided.

What a Design Diagram shows us?
  • What devices are in the network?
  • How are they connected?
  • Where are the firewalls, routers and switches?
  • What VLANs or subnets exist?

Basically, it focuses on structure only..

The main purpose of design diagrams is to help engineers build, troubleshoot and communicate. But think about it for a moment: the same diagram in two different environments can behave very differently in production. In other words, a diagram can be correct and still represent a fragile system.

What an Architecture gives us?
  • Why was this kind of topology chosen?
  • What trade-offs were made between cost, complexity and resiliency?
  • Where are failure domains intentionally placed?
  • What assumptions does this design rely on?
  • What kind of failures is this network designed to survive?

Architecture captures intent, constraints and decisions which can never be covered by a diagram..

How does this relate to the architecture of a house or building in the real world?

A house floor plan shows walls, doors, windows, rooms..
Architecture explains how the bedrooms are placed away from noise, how airflow and lighting are handled etc..

Same story applies to Networking..

What Diagrams don't show (but Architecture must..)
  • Failure domains (what breaks when a link, device or control plane fails?)
  • Blast radius (how far does an incident propagate before it is contained?)
  • Operational Simplicity (how easily can this network be operated, even on its worst days?)
  • Security Boundaries (where is trust enforced and why those points?)
  • Growth Paths (what must change when the network needs to expand?)

Architecture should answer these questions before they are asked by reality..

Why do these things matter?

Most troublesome networks are not caused by a device or link missing from a diagram; they are caused by architectural blind spots like hidden dependencies, unclear failure behaviours, overlapping responsibilities and unwanted complexity.

And that's also why network architects exist, and why they are paid..

A Diagram becomes An Architecture when it has a story behind it,
Means when it is accompanied by;
  • Clear reasoning for each major decision
  • An understanding of trade-offs
  • Explicit assumptions
  • Awareness of failure scenarios
  • Consideration for operations and humans
Architecture is the story behind the diagram.
Without the story, the diagram is just a picture.
I wanted to start this blog to document those stories: the decisions, trade-offs and reasoning that turn network diagrams into real architectures. If you are interested, stay tuned..



All rights reserved. Copyright © 2026 by DecL3.net - Swedish Greys - a WordPress theme from Nordic Themepark. Converted by Lite Themes.