When troubleshooting live production environments, minimizing disruption is critical. Here's where to exercise caution, along with best practices that avoid downtime and broken connections:
1. High-Risk Actions That Can Break Traffic
A. Security Group Rule Modifications
- Risk: Removing/updating rules can drop active connections.
- Example:
  - Revoking an inbound HTTPS (443) rule kills live sessions.
  - Changing egress rules can disrupt outbound API calls.
- Mitigation:
- Stage changes: Add new rules before removing old ones (see the sketch below).
- Use temporary rules: Grant short-lived, tightly scoped access, e.g.:
```bash
aws ec2 authorize-security-group-ingress --cidr 1.2.3.4/32 --port 443 --protocol tcp --group-id sg-123
```
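To expand on the "stage changes" mitigation, a minimal sketch with placeholder values (sg-123, the new 1.2.3.4/32 CIDR, and an old 203.0.113.0/24 CIDR are all illustrative): add the replacement rule, confirm traffic flows over it, and only then revoke the old rule.
```bash
# Add the replacement rule first (placeholder SG and CIDRs)
aws ec2 authorize-security-group-ingress --group-id sg-123 --protocol tcp --port 443 --cidr 1.2.3.4/32
# Verify via application checks / flow logs that the new path carries traffic, then remove the old rule
aws ec2 revoke-security-group-ingress --group-id sg-123 --protocol tcp --port 443 --cidr 203.0.113.0/24
```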
B. Network ACL (NACL) Updates
- Risk: NACLs are stateless—updates drop existing connections.
- Example:
  - Adding a deny rule for 10.0.1.0/24 kills active TCP sessions.
- Mitigation:
- Test in non-prod first.
- Modify NACLs during low-traffic windows.
C. Route Table Changes
- Risk: Misrouting traffic (e.g., removing a NAT Gateway route).
- Example:
  - Deleting the 0.0.0.0/0 → igw-123 route makes public subnets unreachable.
- Mitigation:
- Pre-validate routes:
```bash
aws ec2 describe-route-tables --route-table-ids rtb-123 --query 'RouteTables[*].Routes'
```
- Use weighted routing (e.g., Transit Gateway) for failover.
D. NAT Gateway Replacement
- Risk: Swapping NAT Gateways breaks long-lived connections (e.g., SFTP, WebSockets).
- Mitigation:
- Preserve Elastic IPs: if downstream allow-lists pin your EIP, plan its move carefully; an EIP can back only one NAT Gateway at a time.
- Warm standby: Deploy the new NAT Gateway before decommissioning the old one, then flip the route (see the sketch below).
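A warm-standby sketch under assumed placeholder IDs (subnet-123, eipalloc-456, rtb-789; nat-old and nat-new stand in for real NAT Gateway IDs): create the replacement, wait for it to become available, repoint the route atomically, and delete the old gateway only after traffic drains.
```bash
# Create the replacement NAT Gateway (placeholder subnet and EIP allocation)
aws ec2 create-nat-gateway --subnet-id subnet-123 --allocation-id eipalloc-456
# Wait until it is available before touching any routes
aws ec2 wait nat-gateway-available --nat-gateway-ids nat-new
# Atomically repoint the private subnets' default route at the new gateway
aws ec2 replace-route --route-table-id rtb-789 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-new
# Only once long-lived connections have drained:
aws ec2 delete-nat-gateway --nat-gateway-id nat-old
```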
2. Safe Troubleshooting Techniques
A. Passive Monitoring (Zero Impact)
- Flow Logs: Query logs without touching infrastructure.
```
# CloudWatch Logs Insights query (run in the console or via the CLI)
fields @timestamp, srcAddr, dstAddr, action
| filter dstAddr = "10.0.2.5" and action = "REJECT"
```
- VPC Traffic Mirroring: Copy traffic to a monitoring instance (no production impact); a CLI sketch follows below.
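A rough CLI sketch of wiring up Traffic Mirroring (all IDs are placeholders; the source must be an ENI on a supported Nitro-based instance, and nothing is mirrored until a filter rule is added):
```bash
# Target: the ENI of the monitoring instance (placeholder)
aws ec2 create-traffic-mirror-target --network-interface-id eni-mon123 --description "packet-analysis target"
# Filter plus a rule that accepts all inbound TCP (tmf-123 is the placeholder ID returned by the previous call)
aws ec2 create-traffic-mirror-filter --description "mirror-all"
aws ec2 create-traffic-mirror-filter-rule --traffic-mirror-filter-id tmf-123 --traffic-direction ingress \
  --rule-number 100 --rule-action accept --protocol 6 --source-cidr-block 0.0.0.0/0 --destination-cidr-block 0.0.0.0/0
# Session: copy traffic from the production ENI to the target
aws ec2 create-traffic-mirror-session --network-interface-id eni-prod456 --traffic-mirror-target-id tmt-789 \
  --traffic-mirror-filter-id tmf-123 --session-number 1
```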
B. Non-Destructive Testing
- Packet Captures on Test Instances:
```bash
sudo tcpdump -i eth0 -w /tmp/capture.pcap host 10.0.1.10  # No service restart needed
```
- Canary Deployments: Test changes on 1% of traffic (e.g., weighted ALB target groups; see the sketch below).
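A hedged sketch of weighted forwarding on an ALB listener (the ARNs are placeholders held in shell variables): 99% of requests stay on the current target group while 1% hits the canary.
```bash
# Placeholder ARNs
LISTENER_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/abc123/def456"
PROD_TG="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/prod/1111111111111111"
CANARY_TG="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/canary/2222222222222222"

# Send 1% of traffic to the canary target group, 99% to prod
aws elbv2 modify-listener --listener-arn "$LISTENER_ARN" --default-actions \
  "[{\"Type\":\"forward\",\"ForwardConfig\":{\"TargetGroups\":[{\"TargetGroupArn\":\"$PROD_TG\",\"Weight\":99},{\"TargetGroupArn\":\"$CANARY_TG\",\"Weight\":1}]}}]"
```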
C. Connection-Preserving Changes
- Security Groups:
- Add replacement rules first, verify traffic, and only then delete the old ones (SG rules have no priority or rule numbers; all allow rules are evaluated together).
- NACLs:
- Temporarily set ephemeral ports (32768-60999) to ALLOW during changes.
3. Redundancy Patterns to Reduce Risk
| Scenario | Failover Strategy |
|---|---|
| NAT Gateway Failure | Deploy NAT Gateway per AZ + test failover. |
| Route Table Corruption | Use version-controlled Terraform rollback. |
| SG Lockout | Pre-configure backup admin access (e.g., SSM). |
4. Worst-Case Recovery Plan
- Rollback Immediately:
- Revert NACLs/SGs to last-known-good state.
```bash
aws ec2 replace-network-acl-association --association-id aclassoc-123 --network-acl-id acl-456
```
- Bypass Troubleshooting:
- Use AWS Systems Manager (SSM) to debug instances without SSH (no SG changes needed).
- Post-Mortem:
- Check CloudTrail for who made changes:
```bash
aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-123
```
Key Takeaways
✅ Avoid Live Changes: Prefer read-only tools (Flow Logs, CloudWatch).
✅ Stage Changes: Test in non-prod, then deploy with canaries.
✅ Preserve State: Never drop NACL/SG rules without redundancy.
✅ Automate Rollbacks: Use Terraform/CloudFormation for quick recovery.
Pro Tip: For critical systems, run chaos engineering tests (e.g., simulate NAT failure) during off-hours to validate resilience.
Need a step-by-step rollback guide for a specific scenario? I can tailor it to your architecture.
You’re spot on—real-world war stories and battle-tested tricks are what separate theoretical knowledge from production-hardened expertise. Below is a concise technical guide distilled from AWS networking trenches, covering lessons learned, hidden pitfalls, and pro tips rarely found in docs.
AWS Networking War Stories: The Unwritten Guide
"Good judgment comes from experience. Experience comes from bad judgment."
1. Security Groups (SGs): The Silent Killers
War Story: The Case of the Phantom Timeouts
- Symptoms: Intermittent HTTP timeouts between microservices.
- Root Cause: Overlapping SG rules with different description fields but identical IP permissions. AWS silently dedupes them, causing random drops.
- Fix:
```bash
# Audit duplicate rules (CLI reveals what the GUI hides)
aws ec2 describe-security-groups --query 'SecurityGroups[*].IpPermissions' | \
  jq '.[] | group_by([.FromPort, .ToPort, .IpProtocol, .IpRanges])[] | select(length > 1)'
```
- Lesson: Never trust the GUI alone; use the CLI to audit SGs.
Pro Tip: The "Deny All" Egress Trap
- Mistake: Setting egress = [] in Terraform (defaults to deny all).
- Outcome: Instances lose SSM, patch management, and API connectivity.
- Fix: Always explicitly allow:
```hcl
egress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = ["0.0.0.0/0"]  # Or restrict to necessary IPs
}
```
2. NACLs: The Stateless Nightmare
War Story: The 5-Minute Outage
- Symptoms: Database replication breaks after NACL "minor update."
- Root Cause: NACL rule #100 allowed TCP/3306, but rule #200 denied ephemeral ports (32768-60999), breaking replies.
- Fix:
```bash
# Allow ephemeral ports INBOUND for responses
aws ec2 create-network-acl-entry --network-acl-id acl-123 --rule-number 150 --protocol tcp \
  --port-range From=32768,To=60999 --cidr-block 10.0.1.0/24 --rule-action allow --ingress
```
- Lesson: NACLs need mirror rules for ingress/egress. Test with telnet (or nc, as sketched below) before deploying.
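A quick way to exercise that round trip from a test client before touching production (host and port are placeholders; assumes netcat is installed):
```bash
# Succeeds only if the inbound 3306 rule AND the ephemeral-port return path are both allowed
nc -vz 10.0.2.5 3306
```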
Pro Tip: The Rule-Order Bomb
- Mistake: Adding a deny rule at #50 when the existing allow sits at #100.
- Outcome: Traffic silently drops (first match wins).
- Fix: Use describe-network-acls to audit rule ordering (a fuller audit is sketched below):
```bash
aws ec2 describe-network-acls --query 'NetworkAcls[*].Entries[?RuleNumber==`50`]'
```
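A fuller audit, sketched with a placeholder ACL ID: list every entry in evaluation order so the first match is obvious.
```bash
aws ec2 describe-network-acls --network-acl-ids acl-123 \
  --query 'NetworkAcls[0].Entries | sort_by(@, &RuleNumber)[].{Rule:RuleNumber,Action:RuleAction,CIDR:CidrBlock,Egress:Egress}' \
  --output table
```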
3. NAT Gateways: The $0.045/hr Landmine
War Story: The 4 AM Bill Shock
- Symptoms: $3k/month bill from "idle" NAT Gateways.
- Root Cause: Leftover NAT Gateways in unused AZs (auto-created by Terraform).
- Fix:
```bash
# List NAT Gateways still in the "available" (billing) state, with subnet/VPC context, to spot leftovers
aws ec2 describe-nat-gateways --filter "Name=state,Values=available" \
  --query 'NatGateways[*].{Id:NatGatewayId,Subnet:SubnetId,Vpc:VpcId}' --output table
```
- Lesson: Always tag NAT Gateways with Owner and Expiry (example below).
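A tagging sketch with illustrative values (the NAT Gateway ID and tag values are placeholders):
```bash
aws ec2 create-tags --resources nat-0123456789abcdef0 \
  --tags Key=Owner,Value=network-team Key=Expiry,Value=2025-12-31
```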
Pro Tip: The TCP Connection Black Hole
- Mistake: Replacing a NAT Gateway without draining connections.
- Outcome: Active sessions (SSH, RDP, DB) hang until TCP timeout (30+ mins).
- Fix:
- Before replacement: Reduce TCP timeouts/keepalives on clients (see the sketch below).
- Use Network Load Balancer (NLB) for stateful failover.
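One way to tighten client-side detection on Linux, assuming you control the client OS (values are illustrative and apply to sockets that enable SO_KEEPALIVE): dead connections get noticed in roughly 90 seconds instead of lingering for hours.
```bash
# Start probing after 60s idle, probe every 10s, give up after 3 failed probes
sudo sysctl -w net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_intvl=10 net.ipv4.tcp_keepalive_probes=3
```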
4. VPC Peering: The Cross-Account Trap
War Story: The DNS That Wasn’t
- Symptoms: EC2 instances can’t resolve peered VPC’s private hosted zones.
- Root Cause: Peering doesn’t auto-share Route53 Private Hosted Zones.
- Fix:
```bash
# Authorize the association from the zone-owner account (placeholder zone and VPC IDs)
aws route53 create-vpc-association-authorization --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
```
- Lesson: Test DNS resolution early in peering setups; the authorization above is only half the job (see the sketch below).
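Authorization alone doesn't attach the zone; the account that owns the peer VPC then completes the association (same placeholder IDs as above):
```bash
# Run from the account that owns vpc-456
aws route53 associate-vpc-with-hosted-zone --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
```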
Pro Tip: The Overlapping CIDR Silent Fail
- Mistake: Peering 10.0.0.0/16 with another 10.0.0.0/16.
- Outcome: The peering connection fails to establish (identical CIDRs are rejected), and partial overlaps leave the overlapping ranges unroutable.
- Fix: Always design non-overlapping CIDRs (e.g., 10.0.0.0/16 + 10.1.0.0/16).
5. Direct Connect: The BGP Rollercoaster
War Story: The 1-Packet-Per-Second Mystery
- Symptoms: Applications crawl over Direct Connect.
- Root Cause: BGP keepalive left at the router's 60-second default, so transient drops caused route flapping.
- Fix: Direct Connect doesn't expose BGP timer settings through the AWS API; the timers are negotiated with your on-premises router, so tighten keepalive/hold there.
- Lesson: Override defaults: set keepalive = 10s, hold = 30s on the customer router.
Pro Tip: The MTU Mismatch
- Mistake: Assuming jumbo frames (9001 MTU) work end to end; they do inside a VPC, but not across the internet, VPN tunnels, or virtual interfaces that aren't configured for them.
- Outcome: Packet fragmentation kills throughput.
- Fix: Hard-set MTU to 1500 on on-prem routers unless you've verified the full path (verification sketch below):
```bash
# Linux example
ip link set dev eth0 mtu 1500
```
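To confirm what the path actually supports, a quick probe from a Linux host (peer IP is a placeholder): 1472 bytes of ICMP payload plus 28 bytes of headers exercises a full 1500-byte packet with fragmentation disallowed.
```bash
# -M do sets the Don't Fragment bit; failures mean the path MTU is below 1500
ping -M do -s 1472 -c 4 10.0.1.10
```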
6. The Ultimate Troubleshooting Checklist
Before Making Changes:
- Backup Configs (more snapshots sketched below):
```bash
aws ec2 describe-security-groups --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions}' > sg-backup.json
```
- Enable Flow Logs:
```bash
# Log group name and IAM role are placeholders; both are required for a CloudWatch Logs destination
aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-123 --traffic-type ALL \
  --log-destination-type cloud-watch-logs --log-group-name vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role
```
- Test with Canary: Deploy changes to one AZ/subnet first.
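Beyond SGs, it can be worth snapshotting NACLs and route tables too, so you can diff them after the change (the VPC ID is a placeholder):
```bash
aws ec2 describe-network-acls --filters Name=vpc-id,Values=vpc-123 > nacl-backup.json
aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-123 > rtb-backup.json
```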
When Things Break:
- Rollback Fast: Use Terraform (terraform apply -replace=<resource_address>) or the CLI.
- SSM Session Manager: Access instances without SSH (bypasses broken SGs); see the sketch below.
- CloudTrail Forensics:
```bash
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteSecurityGroup
```
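A minimal SSM sketch (instance ID is a placeholder; the instance needs the SSM agent and an instance profile with SSM permissions, and your workstation needs the Session Manager plugin):
```bash
aws ssm start-session --target i-0123456789abcdef0
```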
Final Wisdom
- Document Your "Murder Mystery" Stories: Every outage teaches something.
- Automate Recovery: Use Lambda + EventBridge to auto-rollback NACL changes.
- Pressure-Test Resiliency: Run GameDays (e.g., randomly kill NAT Gateways).
Want this as a PDF cheatsheet? I can structure it with more war stories and code snippets. Let me know!