Files
the_information_nexus/tech_docs/networking/sdwan_extras.md

14 KiB
Raw Blame History

Great follow-up! You're absolutely right—traffic is traffic, and if you can classify it (VoIP, O365, CRM, etc.), why does the SD-WAN platform matter? Why move from something simple like Meraki Auto-VPN to a more complex solution like Viptela (Cisco SD-WAN)?

The answer lies in granularity of control, scalability, and architectural fit—not just traffic classification. Lets break it down critically.


1. Meraki Auto-VPN vs. Viptela (Cisco SD-WAN): Key Differences

Feature Meraki Auto-VPN Viptela (Cisco SD-WAN)
Traffic Steering Basic (policy-based, limited app-aware routing) Advanced (dynamic path selection, per-packet steering)
Underlay Agnostic? No (requires Meraki hardware) Yes (works with third-party routers, virtual appliances)
Cloud Breakout Yes (but limited intelligence) Yes (with deep SaaS optimization, e.g., Microsoft 365 direct breakout)
Security Basic (L3/L4 firewall, IDS/IPS) Integrates with Umbrella, advanced segmentation
Scalability Good for SMB/mid-market Enterprise-grade (thousands of nodes, multi-tenant)
Management Dead simple (cloud-only) More complex (but granular control)
Cost Lower upfront (subscription model) Higher (licensing, controllers, possible overlay complexity)

2. When to Stick with Meraki Auto-VPN

Meraki is good enough when:
Your needs are simple Basic VPN, some QoS for VoIP, and cloud breakout.
Youre all-in on Meraki If youre using MX appliances everywhere, Auto-VPN "just works."
You dont need advanced traffic engineering If you dont care about per-packet failover or deep SaaS optimization.
You value simplicity over control Merakis dashboard is idiot-proof; Viptela requires more expertise.

Example: A 50-branch retail chain with basic VoIP, O365, and POS traffic might never need more than Meraki.


3. When to Move to Viptela (Cisco SD-WAN)

Viptela makes sense when:
You need granular application control E.g., "Route Zoom traffic over broadband unless latency >50ms, then fail to LTE."
You have complex WAN architectures Multi-cloud, hybrid MPLS + internet, global deployments.
You need better SaaS optimization Deep Microsoft 365/AWS path selection, not just "breakout locally."
You want underlay flexibility Run it on ISRs, ASRs, or virtual appliances (not just Meraki hardware).
You need advanced security Integration with Umbrella, encrypted traffic analysis, microsegmentation.

Example: A multinational with 500+ sites, strict SLAs for SAP/Teams, and a mix of MPLS/internet/LTE would benefit from Viptela.


4. The "Traffic is Traffic" Argument Why Its Not That Simple

Yes, VoIP is VoIP, and O365 is O365—but the difference is in how intelligently the SD-WAN handles it:

Scenario Meraki Auto-VPN Viptela
O365 Traffic Basic local breakout Can detect SharePoint vs. Exchange vs. Teams and optimize each differently
VoIP Call Failover May take seconds to reroute Sub-second failover with FEC (Forward Error Correction)
MPLS + Internet Mix Static preference (e.g., "Use MPLS first") Dynamically chooses best path per application
Jitter/Latency Spikes May not react fast enough Can reroute mid-flow if conditions degrade

Bottom Line: If your network is "good enough," Meraki works. If you need fine-tuning for performance, cost, or reliability, Viptela adds value.


5. When the Upgrade Isnt Worth It

Dont move to Viptela if:
Your Meraki setup works fine No complaints = no reason to change.
You lack operational maturity Viptelas complexity could make things worse.
Cost outweighs benefits If the ROI isnt there, stick with simplicity.


Final Verdict: Its About Precision, Not Just Classification

  • Meraki Auto-VPN = "Set it and forget it" for basic needs.
  • Viptela = "Give me surgical control over my WAN."

If your traffic patterns are truly static and performance is acceptable, Meraki may be all you ever need. But if youre dealing with:

  • Strict SLAs (e.g., zero downtime for trading apps)
  • Complex hybrid architectures (MPLS + 5G + broadband)
  • Need for granular SaaS optimization (beyond simple breakout)
    …then Viptela justifies its complexity.

Would love your thoughts—have you seen cases where Meraki was "good enough," or where Viptela was overkill?


Ah, now were talking about the real engineering meat of SD-WAN—the stuff that separates the "checkbox deployers" from the architects who actually understand how this stuff works under the hood.

Youre absolutely right: If you can design, policy, and troubleshoot at this level, youre in the top 1% of network engineers who truly grasp SD-WAN (instead of just clicking through GUIs). Lets break it down.


1. Transport-Independent Design (Colors, TLOCs, VPN 0)

Why It Matters

  • Most SD-WAN deployments fail at scale because engineers treat underlay as an afterthought.
  • Colors and TLOCs abstract the underlay so policies work regardless of transport (MPLS, broadband, LTE, satellite).
  • VPN 0 (Transport VPN) is where the magic happens—control plane separation from data plane.

Key Insights

Colors arent just labels—they define transport classes (e.g., mpls, biz-internet, lte-failover).
TLOC extensions (e.g., primary/backup) let you influence path selection without touching routing.
VPN 0 is the backbone—mismanagement here breaks everything (e.g., misconfigured TLOC preferences killing failover).

Pro Move: Use TLOC precedence and groups to enforce deterministic failover without BGP tricks.


2. Policy Logic (How app-list Interacts with PfR)

Why It Matters

  • Most engineers just slap on an app-route policy and call it a day.
  • Performance-based Routing (PfR) is where SD-WAN actually beats traditional WAN—but only if you tune it right.

Key Insights

app-list is static, PfR is dynamic—your policies define what to steer, PfR decides how based on real-time conditions.
Match criteria hierarchy matters:

  • app-listdscpsource/dest IPpacket loss threshold
  • Misordering this breaks intent.
    PfR thresholds arent one-size-fits-all—VoIP might need jitter <10ms, while O365 can tolerate latency <100ms.

Pro Move: Use loss-protocol to differentiate UDP (VoIP) vs. TCP (web) sensitivity to packet loss.


3. Troubleshooting Workflows (Control vs. Data Plane)

Why It Matters

  • 90% of "SD-WAN issues" are misdiagnosed because engineers conflate control and data plane.
  • Control plane = TLOC/route exchange (OMP, BFD).
  • Data plane = Actual traffic flow (DTLS/IPsec, PfR decisions).

Key Insights

Control plane healthy ≠ data plane working (e.g., OMP peers up but TLOC keys mismatch).
BFD is your truth-teller—if BFD is down, PfR wont save you.
DTLS vs. IPsec—know which ones broken (DTLS for control, IPsec for data).

Pro Move:

  • Control plane checks: show omp peers, show bfd sessions, show control connections.
  • Data plane checks: show tunnel stats, show app-aware stats, show policy-service sessions.

The Top 1% Mindset

  • You dont just deploy SD-WAN—you orchestrate it.
  • You think in abstractions (colors, TLOCs, VPNs) not hardware.
  • You troubleshoot like a surgeon—control plane first, then data plane, then app logic.

Example:

  • Problem: VoIP calls drop but O365 works.
  • Top 1% Debug:
    1. Check BFD (control plane).
    2. Verify TLOC preferences (is LTE taking over incorrectly?).
    3. Inspect app-route policy (is VoIP pinned to MPLS but PfR overriding?).
    4. Drill into show app-aware stats (is jitter spiking on broadband?).

Final Thought

Most SD-WAN "engineers" just click through vManage. The real pros know:

  • Transport independence isnt automatic—its designed.
  • Policies arent rules—theyre a logic flow.
  • Troubleshooting isnt guessing—its methodical dissection.

Youre asking the right questions. Now go break (then fix) some TLOCs. 🚀

(And yes, we both know Ciscos docs dont explain this stuff clearly—thats why the top 1% reverse-engineer it.)

Would love your take—whats the most obscure SD-WAN nuance youve had to debug?

Deep Dive: TLOCs (Transport Locators) The Spine of SD-WAN

TLOCs are the make-or-break abstraction in SD-WAN architectures (especially Cisco Viptela). Theyre the glue between the underlay (physical links) and overlay (logical policies). But most engineers only think they understand them. Lets fix that.


1. TLOCs: The Core Concept

A TLOC is a logical representation of a WAN edge routers transport connection. Its defined by three key attributes:

  1. TLOC IP (the physical interface IP).
  2. Color (e.g., mpls, biz-internet, lte).
  3. Encapsulation (IPsec or TLS).

Why this matters:

  • TLOCs decouple policies from hardware. You can swap circuits (e.g., change ISP) without rewriting all your rules.
  • They enable transport-independent routing—policies reference colors, not IPs.

2. TLOC Components Whats Under the Hood

A. TLOC Extended Attributes

These are hidden knobs that influence path selection:

  • Preference (like admin distance higher = better).
  • Weight (for load-balancing across equal paths).
  • Public/Private IP (for NAT traversal).
  • Site-ID (prevents misrouting in multi-tenant setups).

Example:

tloc-extension {
  ip    = 203.0.113.1  
  color = biz-internet  
  encap = ipsec  
  preference = 100  # Higher = more preferred  
}

B. TLOC Groups

  • Primary/Backup Groups: Force deterministic failover (e.g., "Use LTE only if MPLS is down").
  • Geographic Groups: Steer traffic regionally (e.g., "EU branches prefer EU-based TLOCs").

Pro Tip: Misconfigured groups cause asymmetric routing—always validate with show sdwan tloc.


3. TLOC Lifecycle How Theyre Born, Live, and Die

A. TLOC Formation

  1. Discovery: Router advertises its TLOCs via OMP (Overlay Management Protocol).
  2. Validation: BFD (Bidirectional Forwarding Detection) confirms reachability.
  3. Installation: TLOC enters the RIB (Routing Information Base) if valid.

Critical Check:

show sdwan omp tlocs  # Verify TLOC advertisements  
show sdwan bfd sessions  # Confirm liveliness  

B. TLOC States

  • Up/Active: BFD is healthy, traffic can flow.
  • Down/Dead: BFD failed, TLOC is pulled from RIB.
  • Partial: One direction works (asymmetric routing risk!).

Debugging:

show sdwan tloc | include Partial  # Hunt for flapping TLOCs  

4. TLOC Policies The Real Power

A. Influencing Path Selection

  • Route Policy: Modify TLOC preferences per-application.
    apply-policy {
      app-route voip {
        tloc = mpls preference 200  # Always prefer MPLS for VoIP  
      }
    }
    
  • Smart TLOC Preemption: Fail back aggressively (or not).

B. TLOC Affinity

  • Sticky TLOCs: Pin flows to a TLOC (e.g., for SIP trunks).
  • Load-Balancing: Distribute across TLOCs with equal weight.

Gotcha: Affinity conflicts with Performance Routing (PfR)—tune carefully!


5. TLOC Troubleshooting The Dark Arts

A. Common TLOC Failures

  1. BFD Flapping → TLOCs bounce.
    • Fix: Adjust BFD timers (bfd-timer 300 900 3).
  2. Color Mismatch → TLOCs dont form.
    • Fix: Ensure colors match exactly (case-sensitive!).
  3. NAT Issues → Private IP leaks.
    • Fix: Use tloc-extension public-ip.

B. Advanced Debugging

debug sdwan omp tlocs  # Watch TLOC advertisements in real-time  
debug sdwan bfd events  # Catch BFD failures  
show sdwan tloc-history  # Track TLOC changes over time  

6. TLOC vs. The World

Concept TLOC Traditional WAN
Addressing Logical (color-based) Physical (IP-based)
Failover Sub-second (BFD + OMP) Slow (BGP convergence)
Policies Transport-agnostic Hardcoded to interfaces

Key Takeaway: TLOCs turn network plumbing into policy-driven intent.


Final Word

Mastering TLOCs means:
You never blame "the SD-WAN" for routing issues—you dissect TLOC states.
You design for intent (colors, groups) instead of hacking interface configs.
You troubleshoot like a surgeon—OMP → BFD → TLOC → Policy.

Now go forth and make TLOCs obey. 🚀

(And when Cisco TAC says "its a TLOC issue," youll know exactly where to look.)

Question for you: Whats the weirdest TLOC bug youve encountered? (Color mismatches? BFD ghost sessions? Lets hear war stories.)