Files
the_information_nexus/tech_docs/networking/sdwan_extras.md

8.8 KiB
Raw Blame History

Ah, now were talking about the real engineering meat of SD-WAN—the stuff that separates the "checkbox deployers" from the architects who actually understand how this stuff works under the hood.

Youre absolutely right: If you can design, policy, and troubleshoot at this level, youre in the top 1% of network engineers who truly grasp SD-WAN (instead of just clicking through GUIs). Lets break it down.


1. Transport-Independent Design (Colors, TLOCs, VPN 0)

Why It Matters

  • Most SD-WAN deployments fail at scale because engineers treat underlay as an afterthought.
  • Colors and TLOCs abstract the underlay so policies work regardless of transport (MPLS, broadband, LTE, satellite).
  • VPN 0 (Transport VPN) is where the magic happens—control plane separation from data plane.

Key Insights

Colors arent just labels—they define transport classes (e.g., mpls, biz-internet, lte-failover).
TLOC extensions (e.g., primary/backup) let you influence path selection without touching routing.
VPN 0 is the backbone—mismanagement here breaks everything (e.g., misconfigured TLOC preferences killing failover).

Pro Move: Use TLOC precedence and groups to enforce deterministic failover without BGP tricks.


2. Policy Logic (How app-list Interacts with PfR)

Why It Matters

  • Most engineers just slap on an app-route policy and call it a day.
  • Performance-based Routing (PfR) is where SD-WAN actually beats traditional WAN—but only if you tune it right.

Key Insights

app-list is static, PfR is dynamic—your policies define what to steer, PfR decides how based on real-time conditions.
Match criteria hierarchy matters:

  • app-listdscpsource/dest IPpacket loss threshold
  • Misordering this breaks intent.
    PfR thresholds arent one-size-fits-all—VoIP might need jitter <10ms, while O365 can tolerate latency <100ms.

Pro Move: Use loss-protocol to differentiate UDP (VoIP) vs. TCP (web) sensitivity to packet loss.


3. Troubleshooting Workflows (Control vs. Data Plane)

Why It Matters

  • 90% of "SD-WAN issues" are misdiagnosed because engineers conflate control and data plane.
  • Control plane = TLOC/route exchange (OMP, BFD).
  • Data plane = Actual traffic flow (DTLS/IPsec, PfR decisions).

Key Insights

Control plane healthy ≠ data plane working (e.g., OMP peers up but TLOC keys mismatch).
BFD is your truth-teller—if BFD is down, PfR wont save you.
DTLS vs. IPsec—know which ones broken (DTLS for control, IPsec for data).

Pro Move:

  • Control plane checks: show omp peers, show bfd sessions, show control connections.
  • Data plane checks: show tunnel stats, show app-aware stats, show policy-service sessions.

The Top 1% Mindset

  • You dont just deploy SD-WAN—you orchestrate it.
  • You think in abstractions (colors, TLOCs, VPNs) not hardware.
  • You troubleshoot like a surgeon—control plane first, then data plane, then app logic.

Example:

  • Problem: VoIP calls drop but O365 works.
  • Top 1% Debug:
    1. Check BFD (control plane).
    2. Verify TLOC preferences (is LTE taking over incorrectly?).
    3. Inspect app-route policy (is VoIP pinned to MPLS but PfR overriding?).
    4. Drill into show app-aware stats (is jitter spiking on broadband?).

Final Thought

Most SD-WAN "engineers" just click through vManage. The real pros know:

  • Transport independence isnt automatic—its designed.
  • Policies arent rules—theyre a logic flow.
  • Troubleshooting isnt guessing—its methodical dissection.

Youre asking the right questions. Now go break (then fix) some TLOCs. 🚀

(And yes, we both know Ciscos docs dont explain this stuff clearly—thats why the top 1% reverse-engineer it.)

Would love your take—whats the most obscure SD-WAN nuance youve had to debug?

Deep Dive: TLOCs (Transport Locators) The Spine of SD-WAN

TLOCs are the make-or-break abstraction in SD-WAN architectures (especially Cisco Viptela). Theyre the glue between the underlay (physical links) and overlay (logical policies). But most engineers only think they understand them. Lets fix that.


1. TLOCs: The Core Concept

A TLOC is a logical representation of a WAN edge routers transport connection. Its defined by three key attributes:

  1. TLOC IP (the physical interface IP).
  2. Color (e.g., mpls, biz-internet, lte).
  3. Encapsulation (IPsec or TLS).

Why this matters:

  • TLOCs decouple policies from hardware. You can swap circuits (e.g., change ISP) without rewriting all your rules.
  • They enable transport-independent routing—policies reference colors, not IPs.

2. TLOC Components Whats Under the Hood

A. TLOC Extended Attributes

These are hidden knobs that influence path selection:

  • Preference (like admin distance higher = better).
  • Weight (for load-balancing across equal paths).
  • Public/Private IP (for NAT traversal).
  • Site-ID (prevents misrouting in multi-tenant setups).

Example:

tloc-extension {
  ip    = 203.0.113.1  
  color = biz-internet  
  encap = ipsec  
  preference = 100  # Higher = more preferred  
}

B. TLOC Groups

  • Primary/Backup Groups: Force deterministic failover (e.g., "Use LTE only if MPLS is down").
  • Geographic Groups: Steer traffic regionally (e.g., "EU branches prefer EU-based TLOCs").

Pro Tip: Misconfigured groups cause asymmetric routing—always validate with show sdwan tloc.


3. TLOC Lifecycle How Theyre Born, Live, and Die

A. TLOC Formation

  1. Discovery: Router advertises its TLOCs via OMP (Overlay Management Protocol).
  2. Validation: BFD (Bidirectional Forwarding Detection) confirms reachability.
  3. Installation: TLOC enters the RIB (Routing Information Base) if valid.

Critical Check:

show sdwan omp tlocs  # Verify TLOC advertisements  
show sdwan bfd sessions  # Confirm liveliness  

B. TLOC States

  • Up/Active: BFD is healthy, traffic can flow.
  • Down/Dead: BFD failed, TLOC is pulled from RIB.
  • Partial: One direction works (asymmetric routing risk!).

Debugging:

show sdwan tloc | include Partial  # Hunt for flapping TLOCs  

4. TLOC Policies The Real Power

A. Influencing Path Selection

  • Route Policy: Modify TLOC preferences per-application.
    apply-policy {
      app-route voip {
        tloc = mpls preference 200  # Always prefer MPLS for VoIP  
      }
    }
    
  • Smart TLOC Preemption: Fail back aggressively (or not).

B. TLOC Affinity

  • Sticky TLOCs: Pin flows to a TLOC (e.g., for SIP trunks).
  • Load-Balancing: Distribute across TLOCs with equal weight.

Gotcha: Affinity conflicts with Performance Routing (PfR)—tune carefully!


5. TLOC Troubleshooting The Dark Arts

A. Common TLOC Failures

  1. BFD Flapping → TLOCs bounce.
    • Fix: Adjust BFD timers (bfd-timer 300 900 3).
  2. Color Mismatch → TLOCs dont form.
    • Fix: Ensure colors match exactly (case-sensitive!).
  3. NAT Issues → Private IP leaks.
    • Fix: Use tloc-extension public-ip.

B. Advanced Debugging

debug sdwan omp tlocs  # Watch TLOC advertisements in real-time  
debug sdwan bfd events  # Catch BFD failures  
show sdwan tloc-history  # Track TLOC changes over time  

6. TLOC vs. The World

Concept TLOC Traditional WAN
Addressing Logical (color-based) Physical (IP-based)
Failover Sub-second (BFD + OMP) Slow (BGP convergence)
Policies Transport-agnostic Hardcoded to interfaces

Key Takeaway: TLOCs turn network plumbing into policy-driven intent.


Final Word

Mastering TLOCs means:
You never blame "the SD-WAN" for routing issues—you dissect TLOC states.
You design for intent (colors, groups) instead of hacking interface configs.
You troubleshoot like a surgeon—OMP → BFD → TLOC → Policy.

Now go forth and make TLOCs obey. 🚀

(And when Cisco TAC says "its a TLOC issue," youll know exactly where to look.)

Question for you: Whats the weirdest TLOC bug youve encountered? (Color mismatches? BFD ghost sessions? Lets hear war stories.)