When troubleshooting live production environments, minimizing disruption is critical. Here's where to exercise caution, along with best practices that avoid downtime and broken connections:
1. High-Risk Actions That Can Break Traffic
A. Security Group Rule Modifications
- Risk: Removing/updating rules can drop active connections.
- Example:
  - Revoking an inbound HTTPS (443) rule kills live sessions.
  - Changing egress rules can disrupt outbound API calls.
- Mitigation:
- Stage changes: Add new rules before removing old ones (see the sketch below).
- Use temporary rules: Grant short-lived, tightly scoped access, e.g.:
```bash
aws ec2 authorize-security-group-ingress --cidr 1.2.3.4/32 --port 443 --protocol tcp --group-id sg-123
```
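To expand on the "stage changes" mitigation, a minimal sketch with placeholder values (sg-123, the new 1.2.3.4/32 CIDR, and an old 203.0.113.0/24 CIDR are all illustrative): add the replacement rule, confirm traffic flows over it, and only then revoke the old rule.
```bash
# Add the replacement rule first (placeholder SG and CIDRs)
aws ec2 authorize-security-group-ingress --group-id sg-123 --protocol tcp --port 443 --cidr 1.2.3.4/32
# Verify via application checks / flow logs that the new path carries traffic, then remove the old rule
aws ec2 revoke-security-group-ingress --group-id sg-123 --protocol tcp --port 443 --cidr 203.0.113.0/24
```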
B. Network ACL (NACL) Updates
- Risk: NACLs are stateless—updates drop existing connections.
- Example:
  - Adding a deny rule for 10.0.1.0/24 kills active TCP sessions.
- Mitigation:
- Test in non-prod first.
- Modify NACLs during low-traffic windows.
C. Route Table Changes
- Risk: Misrouting traffic (e.g., removing a NAT Gateway route).
- Example:
  - Deleting the 0.0.0.0/0 → igw-123 route makes public subnets unreachable.
- Mitigation:
- Pre-validate routes:
```bash
aws ec2 describe-route-tables --route-table-ids rtb-123 --query 'RouteTables[*].Routes'
```
- Use weighted routing (e.g., Transit Gateway) for failover.
D. NAT Gateway Replacement
- Risk: Swapping NAT Gateways breaks long-lived connections (e.g., SFTP, WebSockets).
- Mitigation:
- Preserve Elastic IPs: if downstream allow-lists pin your EIP, plan its move carefully; an EIP can back only one NAT Gateway at a time.
- Warm standby: Deploy the new NAT Gateway before decommissioning the old one, then flip the route (see the sketch below).
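A warm-standby sketch under assumed placeholder IDs (subnet-123, eipalloc-456, rtb-789; nat-old and nat-new stand in for real NAT Gateway IDs): create the replacement, wait for it to become available, repoint the route atomically, and delete the old gateway only after traffic drains.
```bash
# Create the replacement NAT Gateway (placeholder subnet and EIP allocation)
aws ec2 create-nat-gateway --subnet-id subnet-123 --allocation-id eipalloc-456
# Wait until it is available before touching any routes
aws ec2 wait nat-gateway-available --nat-gateway-ids nat-new
# Atomically repoint the private subnets' default route at the new gateway
aws ec2 replace-route --route-table-id rtb-789 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-new
# Only once long-lived connections have drained:
aws ec2 delete-nat-gateway --nat-gateway-id nat-old
```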
2. Safe Troubleshooting Techniques
A. Passive Monitoring (Zero Impact)
- Flow Logs: Query logs without touching infrastructure.
```
# CloudWatch Logs Insights query (run in the console or via the CLI)
fields @timestamp, srcAddr, dstAddr, action
| filter dstAddr = "10.0.2.5" and action = "REJECT"
```
- VPC Traffic Mirroring: Copy traffic to a monitoring instance (no production impact); a CLI sketch follows below.
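A rough CLI sketch of wiring up Traffic Mirroring (all IDs are placeholders; the source must be an ENI on a supported Nitro-based instance, and nothing is mirrored until a filter rule is added):
```bash
# Target: the ENI of the monitoring instance (placeholder)
aws ec2 create-traffic-mirror-target --network-interface-id eni-mon123 --description "packet-analysis target"
# Filter plus a rule that accepts all inbound TCP (tmf-123 is the placeholder ID returned by the previous call)
aws ec2 create-traffic-mirror-filter --description "mirror-all"
aws ec2 create-traffic-mirror-filter-rule --traffic-mirror-filter-id tmf-123 --traffic-direction ingress \
  --rule-number 100 --rule-action accept --protocol 6 --source-cidr-block 0.0.0.0/0 --destination-cidr-block 0.0.0.0/0
# Session: copy traffic from the production ENI to the target
aws ec2 create-traffic-mirror-session --network-interface-id eni-prod456 --traffic-mirror-target-id tmt-789 \
  --traffic-mirror-filter-id tmf-123 --session-number 1
```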
B. Non-Destructive Testing
- Packet Captures on Test Instances:
```bash
sudo tcpdump -i eth0 -w /tmp/capture.pcap host 10.0.1.10  # No service restart needed
```
- Canary Deployments: Test changes on 1% of traffic (e.g., weighted ALB target groups; see the sketch below).
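A hedged sketch of weighted forwarding on an ALB listener (the ARNs are placeholders held in shell variables): 99% of requests stay on the current target group while 1% hits the canary.
```bash
# Placeholder ARNs
LISTENER_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/abc123/def456"
PROD_TG="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/prod/1111111111111111"
CANARY_TG="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/canary/2222222222222222"

# Send 1% of traffic to the canary target group, 99% to prod
aws elbv2 modify-listener --listener-arn "$LISTENER_ARN" --default-actions \
  "[{\"Type\":\"forward\",\"ForwardConfig\":{\"TargetGroups\":[{\"TargetGroupArn\":\"$PROD_TG\",\"Weight\":99},{\"TargetGroupArn\":\"$CANARY_TG\",\"Weight\":1}]}}]"
```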
C. Connection-Preserving Changes
- Security Groups:
- Add replacement rules first, verify traffic, and only then delete the old ones (SG rules have no priority or rule numbers; all allow rules are evaluated together).
- NACLs:
- Temporarily set ephemeral ports (32768-60999) to ALLOW during changes.
3. Redundancy Patterns to Reduce Risk
| Scenario | Failover Strategy |
|---|---|
| NAT Gateway Failure | Deploy NAT Gateway per AZ + test failover. |
| Route Table Corruption | Use version-controlled Terraform rollback. |
| SG Lockout | Pre-configure backup admin access (e.g., SSM). |
4. Worst-Case Recovery Plan
- Rollback Immediately:
- Revert NACLs/SGs to last-known-good state.
```bash
aws ec2 replace-network-acl-association --association-id aclassoc-123 --network-acl-id acl-456
```
- Bypass Troubleshooting:
- Use AWS Systems Manager (SSM) to debug instances without SSH (no SG changes needed).
- Post-Mortem:
- Check CloudTrail for who made changes:
```bash
aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-123
```
Key Takeaways
✅ Avoid Live Changes: Prefer read-only tools (Flow Logs, CloudWatch).
✅ Stage Changes: Test in non-prod, then deploy with canaries.
✅ Preserve State: Never drop NACL/SG rules without redundancy.
✅ Automate Rollbacks: Use Terraform/CloudFormation for quick recovery.
Pro Tip: For critical systems, run chaos engineering tests (e.g., simulate NAT failure) during off-hours to validate resilience.
Need a step-by-step rollback guide for a specific scenario? I can tailor it to your architecture.
You’re spot on—real-world war stories and battle-tested tricks are what separate theoretical knowledge from production-hardened expertise. Below is a concise technical guide distilled from AWS networking trenches, covering lessons learned, hidden pitfalls, and pro tips rarely found in docs.
AWS Networking War Stories: The Unwritten Guide
"Good judgment comes from experience. Experience comes from bad judgment."
1. Security Groups (SGs): The Silent Killers
War Story: The Case of the Phantom Timeouts
- Symptoms: Intermittent HTTP timeouts between microservices.
- Root Cause: Overlapping SG rules with different description fields but identical IP permissions. AWS silently dedupes them, causing random drops.
- Fix:
```bash
# Audit duplicate rules (CLI reveals what the GUI hides)
aws ec2 describe-security-groups --query 'SecurityGroups[*].IpPermissions' | \
  jq '.[] | group_by([.FromPort, .ToPort, .IpProtocol, .IpRanges])[] | select(length > 1)'
```
- Lesson: Never trust the GUI alone; use the CLI to audit SGs.
Pro Tip: The "Deny All" Egress Trap
- Mistake: Setting egress = [] in Terraform (defaults to deny all).
- Outcome: Instances lose SSM, patch management, and API connectivity.
- Fix: Always explicitly allow:
```hcl
egress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = ["0.0.0.0/0"]  # Or restrict to necessary IPs
}
```
2. NACLs: The Stateless Nightmare
War Story: The 5-Minute Outage
- Symptoms: Database replication breaks after NACL "minor update."
- Root Cause: NACL rule #100 allowed TCP/3306, but rule #200 denied ephemeral ports (32768-60999), breaking replies.
- Fix:
```bash
# Allow ephemeral ports INBOUND for responses
aws ec2 create-network-acl-entry --network-acl-id acl-123 --rule-number 150 --protocol tcp \
  --port-range From=32768,To=60999 --cidr-block 10.0.1.0/24 --rule-action allow --ingress
```
- Lesson: NACLs need mirror rules for ingress/egress. Test with telnet (or nc, as sketched below) before deploying.
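A quick way to exercise that round trip from a test client before touching production (host and port are placeholders; assumes netcat is installed):
```bash
# Succeeds only if the inbound 3306 rule AND the ephemeral-port return path are both allowed
nc -vz 10.0.2.5 3306
```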
Pro Tip: The Rule-Order Bomb
- Mistake: Adding a deny rule at #50 when the existing allow sits at #100.
- Outcome: Traffic silently drops (first match wins).
- Fix: Use describe-network-acls to audit rule ordering (a fuller audit is sketched below):
```bash
aws ec2 describe-network-acls --query 'NetworkAcls[*].Entries[?RuleNumber==`50`]'
```
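A fuller audit, sketched with a placeholder ACL ID: list every entry in evaluation order so the first match is obvious.
```bash
aws ec2 describe-network-acls --network-acl-ids acl-123 \
  --query 'NetworkAcls[0].Entries | sort_by(@, &RuleNumber)[].{Rule:RuleNumber,Action:RuleAction,CIDR:CidrBlock,Egress:Egress}' \
  --output table
```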
3. NAT Gateways: The $0.045/hr Landmine
War Story: The 4 AM Bill Shock
- Symptoms: $3k/month bill from "idle" NAT Gateways.
- Root Cause: Leftover NAT Gateways in unused AZs (auto-created by Terraform).
- Fix:
```bash
# List NAT Gateways still in the "available" (billing) state, with subnet/VPC context, to spot leftovers
aws ec2 describe-nat-gateways --filter "Name=state,Values=available" \
  --query 'NatGateways[*].{Id:NatGatewayId,Subnet:SubnetId,Vpc:VpcId}' --output table
```
- Lesson: Always tag NAT Gateways with Owner and Expiry (example below).
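A tagging sketch with illustrative values (the NAT Gateway ID and tag values are placeholders):
```bash
aws ec2 create-tags --resources nat-0123456789abcdef0 \
  --tags Key=Owner,Value=network-team Key=Expiry,Value=2025-12-31
```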
Pro Tip: The TCP Connection Black Hole
- Mistake: Replacing a NAT Gateway without draining connections.
- Outcome: Active sessions (SSH, RDP, DB) hang until TCP timeout (30+ mins).
- Fix:
- Before replacement: Reduce TCP timeouts/keepalives on clients (see the sketch below).
- Use Network Load Balancer (NLB) for stateful failover.
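One way to tighten client-side detection on Linux, assuming you control the client OS (values are illustrative and apply to sockets that enable SO_KEEPALIVE): dead connections get noticed in roughly 90 seconds instead of lingering for hours.
```bash
# Start probing after 60s idle, probe every 10s, give up after 3 failed probes
sudo sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=10 net.ipv4.tcp_keepalive_probes=3
```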
4. VPC Peering: The Cross-Account Trap
War Story: The DNS That Wasn’t
- Symptoms: EC2 instances can’t resolve peered VPC’s private hosted zones.
- Root Cause: Peering doesn’t auto-share Route53 Private Hosted Zones.
- Fix:
```bash
# Authorize the association from the zone-owner account (placeholder zone and VPC IDs)
aws route53 create-vpc-association-authorization --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
```
- Lesson: Test DNS resolution early in peering setups; the authorization above is only half the job (see the sketch below).
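Authorization alone doesn't attach the zone; the account that owns the peer VPC then completes the association (same placeholder IDs as above):
```bash
# Run from the account that owns vpc-456
aws route53 associate-vpc-with-hosted-zone --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
```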
Pro Tip: The Overlapping CIDR Silent Fail
- Mistake: Peering 10.0.0.0/16 with another 10.0.0.0/16.
- Outcome: The peering connection fails to establish (identical CIDRs are rejected), and partial overlaps leave the overlapping ranges unroutable.
- Fix: Always design non-overlapping CIDRs (e.g., 10.0.0.0/16 + 10.1.0.0/16).
5. Direct Connect: The BGP Rollercoaster
War Story: The 1-Packet-Per-Second Mystery
- Symptoms: Applications crawl over Direct Connect.
- Root Cause: BGP keepalive left at the router's 60-second default, so transient drops caused route flapping.
- Fix: Direct Connect doesn't expose BGP timer settings through the AWS API; the timers are negotiated with your on-premises router, so tighten keepalive/hold there.
- Lesson: Override defaults: set keepalive = 10s, hold = 30s on the customer router.
Pro Tip: The MTU Mismatch
- Mistake: Assuming jumbo frames (9001 MTU) work end to end; they do inside a VPC, but not across the internet, VPN tunnels, or virtual interfaces that aren't configured for them.
- Outcome: Packet fragmentation kills throughput.
- Fix: Hard-set MTU to 1500 on on-prem routers unless you've verified the full path (verification sketch below):
```bash
# Linux example
ip link set dev eth0 mtu 1500
```
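To confirm what the path actually supports, a quick probe from a Linux host (peer IP is a placeholder): 1472 bytes of ICMP payload plus 28 bytes of headers exercises a full 1500-byte packet with fragmentation disallowed.
```bash
# -M do sets the Don't Fragment bit; failures mean the path MTU is below 1500
ping -M do -s 1472 -c 4 10.0.1.10
```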
6. The Ultimate Troubleshooting Checklist
Before Making Changes:
- Backup Configs (more snapshots sketched below):
```bash
aws ec2 describe-security-groups --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions}' > sg-backup.json
```
- Enable Flow Logs:
```bash
# Log group name and IAM role are placeholders; both are required for a CloudWatch Logs destination
aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-123 --traffic-type ALL \
  --log-destination-type cloud-watch-logs --log-group-name vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role
```
- Test with Canary: Deploy changes to one AZ/subnet first.
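Beyond SGs, it can be worth snapshotting NACLs and route tables too, so you can diff them after the change (the VPC ID is a placeholder):
```bash
aws ec2 describe-network-acls --filters Name=vpc-id,Values=vpc-123 > nacl-backup.json
aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-123 > rtb-backup.json
```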
When Things Break:
- Rollback Fast: Use Terraform (terraform apply -replace=<resource_address>) or the CLI.
- SSM Session Manager: Access instances without SSH (bypasses broken SGs); see the sketch below.
- CloudTrail Forensics:
```bash
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteSecurityGroup
```
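A minimal SSM sketch (instance ID is a placeholder; the instance needs the SSM agent and an instance profile with SSM permissions, and your workstation needs the Session Manager plugin):
```bash
aws ssm start-session --target i-0123456789abcdef0
```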
Final Wisdom
- Document Your "Murder Mystery" Stories: Every outage teaches something.
- Automate Recovery: Use Lambda + EventBridge to auto-rollback NACL changes.
- Pressure-Test Resiliency: Run GameDays (e.g., randomly kill NAT Gateways).
Want this as a PDF cheatsheet? I can structure it with more war stories and code snippets. Let me know!