
When troubleshooting live production environments, minimizing disruption is critical. Here's where to exercise caution, along with best practices to avoid downtime or broken connections:


1. High-Risk Actions That Can Break Traffic

A. Security Group Rule Modifications

  • Risk: Removing/updating rules can drop active connections.
  • Example:
    • Revoking an inbound HTTPS (443) rule kills live sessions.
    • Changing egress rules can disrupt outbound API calls.
  • Mitigation:
    • Stage changes: Add new rules before removing old ones.
    • Use temporary rules: Set short-lived rules (e.g., aws ec2 authorize-security-group-ingress --cidr 1.2.3.4/32 --port 443 --protocol tcp --group-id sg-123).
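
A minimal staging sketch (the group ID and CIDRs are placeholders): add the replacement rule, confirm traffic still flows, and only then revoke the old one.

    # 1. Add the replacement rule first
    aws ec2 authorize-security-group-ingress --group-id sg-123 --protocol tcp --port 443 --cidr 203.0.113.0/24
    # 2. Verify live traffic (Flow Logs, app health checks), then remove the old rule
    aws ec2 revoke-security-group-ingress --group-id sg-123 --protocol tcp --port 443 --cidr 1.2.3.4/32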

B. Network ACL (NACL) Updates

  • Risk: NACLs are stateless—updates drop existing connections.
  • Example:
    • Adding a deny rule for 10.0.1.0/24 kills active TCP sessions.
  • Mitigation:
    • Test in non-prod first.
    • Modify NACLs during low-traffic windows.
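
Before touching a NACL, it helps to snapshot its current entries as a rollback reference (a sketch; acl-123 and the file name are placeholders):

    # Save the current entries so the last-known-good state is one file away
    aws ec2 describe-network-acls --network-acl-ids acl-123 --query 'NetworkAcls[*].Entries' > acl-123-backup.json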

C. Route Table Changes

  • Risk: Misrouting traffic (e.g., removing a NAT Gateway route).
  • Example:
    • Deleting 0.0.0.0/0 → igw-123 makes public subnets unreachable.
  • Mitigation:
    • Pre-validate routes:
      aws ec2 describe-route-tables --route-table-ids rtb-123 --query 'RouteTables[*].Routes'
      
    • Use redundant paths (e.g., Transit Gateway with ECMP) for failover.
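
To repoint a route without a delete/re-create gap, replace-route swaps the target in a single call (a sketch; IDs are placeholders):

    # Atomically retarget the default route instead of deleting and re-adding it
    aws ec2 replace-route --route-table-id rtb-123 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0abc1234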

D. NAT Gateway Replacement

  • Risk: Swapping NAT Gateways breaks long-lived connections (e.g., SFTP, WebSockets).
  • Mitigation:
    • Preserve Elastic IPs (attach to new NAT Gateway first).
    • Warm standby: Deploy new NAT Gateway before decommissioning old one.
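
A warm-standby sketch (subnet, EIP allocation, and gateway IDs are placeholders): create the replacement NAT Gateway, wait until it is available, repoint routes, and only then delete the old one.

    # Create the replacement gateway and wait for it to become usable
    aws ec2 create-nat-gateway --subnet-id subnet-123 --allocation-id eipalloc-123
    aws ec2 wait nat-gateway-available --nat-gateway-ids nat-0new1234
    # Update the route tables (e.g., with replace-route) before removing the old gateway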

2. Safe Troubleshooting Techniques

A. Passive Monitoring (Zero Impact)

  • Flow Logs: Query logs without touching infrastructure.
    # CloudWatch Logs Insights (GUI)  
    fields @timestamp, srcAddr, dstAddr, action  
    | filter dstAddr = "10.0.2.5" and action = "REJECT"  
    
  • VPC Traffic Mirroring: Copy traffic to a monitoring instance (no production impact).
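
The Flow Logs query above can also be run from the CLI instead of the console (a sketch; the log group name is a placeholder):

    # Start a Logs Insights query over the last hour, then fetch the results
    QUERY_ID=$(aws logs start-query --log-group-name vpc-flow-logs \
      --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) \
      --query-string 'fields @timestamp, srcAddr, dstAddr, action | filter dstAddr = "10.0.2.5" and action = "REJECT"' \
      --query queryId --output text)
    aws logs get-query-results --query-id "$QUERY_ID"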

B. Non-Destructive Testing

  • Packet Captures on Test Instances:
    sudo tcpdump -i eth0 -w /tmp/capture.pcap host 10.0.1.10  # No service restart needed  
    
  • Canary Deployments: Test changes on 1% of traffic (e.g., weighted ALB routes).
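
A weighted-forward sketch for an ALB listener (ARNs are placeholders): 99% of traffic stays on the current target group while 1% hits the canary.

    # Shift 1% of traffic to the canary target group
    aws elbv2 modify-listener --listener-arn arn:aws:elasticloadbalancing:...:listener/app/my-alb/abc/def \
      --default-actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[
        {"TargetGroupArn":"arn:aws:elasticloadbalancing:...:targetgroup/prod/111","Weight":99},
        {"TargetGroupArn":"arn:aws:elasticloadbalancing:...:targetgroup/canary/222","Weight":1}]}}]'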

C. Connection-Preserving Changes

  • Security Groups:
    • Add the new rules first and verify traffic, then delete the old ones (SG rules have no priority or ordering; every rule is evaluated).
  • NACLs:
    • Temporarily set Ephemeral Ports (32768-60999) to ALLOW during changes.

3. Redundancy Patterns to Reduce Risk

| Scenario | Failover Strategy |
| --- | --- |
| NAT Gateway failure | Deploy a NAT Gateway per AZ + test failover. |
| Route table corruption | Use version-controlled Terraform rollback. |
| SG lockout | Pre-configure backup admin access (e.g., SSM). |

4. Worst-Case Recovery Plan

  1. Rollback Immediately:
    • Revert NACLs/SGs to last-known-good state.
    aws ec2 replace-network-acl-association --association-id aclassoc-123 --network-acl-id acl-456  
    
  2. Bypass Troubleshooting:
    • Use AWS Systems Manager (SSM) to debug instances without SSH (no SG changes needed); see the one-liner after this list.
  3. Post-Mortem:
    • Check CloudTrail for who made changes:
      aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-123  
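
For step 2, a Session Manager connection is a one-liner (assuming the SSM agent and instance profile are already in place; the instance ID is a placeholder):

    # Open a shell with no SSH, no open inbound ports, and no SG changes
    aws ssm start-session --target i-0123456789abcdef0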
      

Key Takeaways

Avoid Live Changes: Prefer read-only tools (Flow Logs, CloudWatch).
Stage Changes: Test in non-prod, then deploy with canaries.
Preserve State: Never drop NACL/SG rules without redundancy.
Automate Rollbacks: Use Terraform/CloudFormation for quick recovery.

Pro Tip: For critical systems, run chaos engineering tests (e.g., simulate NAT failure) during off-hours to validate resilience.



You're spot on—real-world war stories and battle-tested tricks are what separate theoretical knowledge from production-hardened expertise. Below is a concise technical guide distilled from the AWS networking trenches, covering lessons learned, hidden pitfalls, and pro tips rarely found in the docs.


AWS Networking War Stories: The Unwritten Guide

"Good judgment comes from experience. Experience comes from bad judgment."


1. Security Groups (SGs): The Silent Killers

War Story: The Case of the Phantom Timeouts

  • Symptoms: Intermittent HTTP timeouts between microservices.
  • Root Cause: Overlapping SG rules with different description fields but identical IP permissions. AWS silently dedupes them, causing random drops.
  • Fix:
    # Audit duplicate rules (CLI reveals what GUI hides)
    aws ec2 describe-security-groups --query 'SecurityGroups[*].IpPermissions' | jq '.[] | group_by(.FromPort, .ToPort, .IpProtocol, .IpRanges)[] | select(length > 1)'
    
  • Lesson: Never trust the GUI alone—use CLI to audit SGs.

Pro Tip: The "Deny All" Egress Trap

  • Mistake: Setting egress = [] in Terraform (defaults to deny all).
  • Outcome: Instances lose SSM, patch management, and API connectivity.
  • Fix: Always explicitly allow:
    egress {
      from_port   = 0
      to_port     = 0
      protocol    = "-1"
      cidr_blocks = ["0.0.0.0/0"]  # Or restrict to necessary IPs
    }
    

2. NACLs: The Stateless Nightmare

War Story: The 5-Minute Outage

  • Symptoms: Database replication breaks after NACL "minor update."
  • Root Cause: NACL rule #100 allowed TCP/3306, but rule #200 denied Ephemeral Ports (32768-60999)—breaking replies.
  • Fix:
    # Allow ephemeral ports INBOUND for responses
    aws ec2 create-network-acl-entry --network-acl-id acl-123 --rule-number 150 --protocol tcp --port-range From=32768,To=60999 --cidr-block 10.0.1.0/24 --rule-action allow --ingress
    
  • Lesson: NACLs need mirror rules for ingress/egress. Test with telnet before deploying.
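
A quick pre/post-change reachability check from a client host (a sketch; nc works where telnet isn't installed, and the DB IP is a placeholder):

    # Confirm the database port still answers after the NACL change
    nc -zv 10.0.1.20 3306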

Pro Tip: The Rule-Order Bomb

  • Mistake: Adding a deny rule at #50 after allowing at #100.
  • Outcome: Traffic silently drops (first match wins).
  • Fix: Use describe-network-acls to audit rule ordering:
    aws ec2 describe-network-acls --query 'NetworkAcls[*].Entries[?RuleNumber==`50`]'
    

3. NAT Gateways: The $0.045/hr Landmine

War Story: The 4 AM Bill Shock

  • Symptoms: $3k/month bill from "idle" NAT Gateways.
  • Root Cause: Leftover NAT Gateways in unused AZs (auto-created by Terraform).
  • Fix:
    # List every active NAT Gateway with its VPC and subnet so orphans stand out
    aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --query 'NatGateways[*].{Id:NatGatewayId,Vpc:VpcId,Subnet:SubnetId}' --output table
    
  • Lesson: Always tag NAT Gateways with Owner and Expiry.
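
Tagging is one CLI call (a sketch; the gateway ID and tag values are placeholders):

    # Tag the gateway so orphans stand out in cost reports
    aws ec2 create-tags --resources nat-0123456789abcdef0 --tags Key=Owner,Value=net-team Key=Expiry,Value=2026-06-30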

Pro Tip: The TCP Connection Black Hole

  • Mistake: Replacing a NAT Gateway without draining connections.
  • Outcome: Active sessions (SSH, RDP, DB) hang until TCP timeout (30+ mins).
  • Fix:
    • Before replacement: Reduce TCP timeouts on clients.
    • Use Network Load Balancer (NLB) for stateful failover.
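
One way to shorten client-side hang time before the swap is to tighten TCP keepalives on Linux clients (a sketch; values are illustrative and only affect sockets that enable SO_KEEPALIVE):

    # Detect a dead path after ~60s + 3 x 10s probes instead of the 2h+ kernel default
    sudo sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=10 net.ipv4.tcp_keepalive_probes=3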

4. VPC Peering: The Cross-Account Trap

War Story: The DNS That Wasn't

  • Symptoms: EC2 instances can't resolve the peered VPC's private hosted zones.
  • Root Cause: Peering doesn't auto-share Route 53 Private Hosted Zones.
  • Fix:
    # Authorize the association from the account that owns the hosted zone
    aws route53 create-vpc-association-authorization --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
    # Then complete it from the account that owns the peer VPC
    aws route53 associate-vpc-with-hosted-zone --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
    
  • Lesson: Test DNS resolution early in peering setups.
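
A quick check from an instance in the peer VPC (the record name is a placeholder) confirms the association actually took effect:

    # Should return a private IP once the hosted zone is associated with this VPC
    dig +short db.internal.example.com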

Pro Tip: The Overlapping CIDR Silent Fail

  • Mistake: Peering 10.0.0.0/16 with another 10.0.0.0/16.
  • Outcome: Routes appear, but traffic fails.
  • Fix: Always design non-overlapping CIDRs (e.g., 10.0.0.0/16 + 10.1.0.0/16).

5. Direct Connect: The BGP Rollercoaster

War Story: The 1-Packet-Per-Second Mystery

  • Symptoms: Applications crawl over Direct Connect.
  • Root Cause: BGP keepalive set to 60s (default), causing route flapping.
  • Fix:
    # BGP timers can't be set through the AWS API; configure them on the on-prem router.
    # Cisco IOS example (keepalive 10s, hold time 30s; the session uses the lower hold time of the two peers)
    router bgp 65101
     neighbor 192.0.2.1 timers 10 30
    
  • Lesson: Don't rely on defaults; set keepalive = 10s and hold time = 30s on the customer router.

Pro Tip: The MTU Mismatch

  • Mistake: Assuming jumbo frames (9001 MTU) work end to end; they apply inside the VPC, but Direct Connect paths stay at 1500 unless jumbo frames are explicitly enabled everywhere.
  • Outcome: Packet fragmentation (or silent drops when DF is set) kills throughput.
  • Fix: Hard-set MTU to 1500 on on-prem routers and hosts until jumbo frames are verified end to end:
    # Linux example
    ip link set dev eth0 mtu 1500
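
To confirm the effective path MTU end to end, a don't-fragment ping is a quick check (a sketch; the target IP is a placeholder):

    # 1472-byte payload + 28 bytes of headers = 1500; errors here mean the path MTU is below 1500
    ping -M do -s 1472 -c 3 10.0.2.5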
    

6. The Ultimate Troubleshooting Checklist

Before Making Changes:

  1. Backup Configs:
    aws ec2 describe-security-groups --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions}' > sg-backup.json
    
  2. Enable Flow Logs:
    # --resource-ids is plural; a CloudWatch Logs destination also needs a log group and a delivery role (placeholders below)
    aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-123 --traffic-type ALL --log-destination-type cloud-watch-logs --log-group-name vpc-flow-logs --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role
    
  3. Test with Canary: Deploy changes to one AZ/subnet first.

When Things Break:

  1. Rollback Fast: Re-apply the last-known-good Terraform config (terraform apply, or -replace=<address> to force a resource rebuild), or revert via the CLI.
  2. SSM Session Manager: Access instances without SSH (bypass broken SGs).
  3. CloudTrail Forensics:
    aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteSecurityGroup
    

Final Wisdom

  • Document Your "Murder Mystery" Stories: Every outage teaches something.
  • Automate Recovery: Use Lambda + EventBridge to auto-rollback NACL changes.
  • Pressure-Test Resiliency: Run GameDays (e.g., randomly kill NAT Gateways).
