diff --git a/tech_docs/cloud/aws_notes.md b/tech_docs/cloud/aws_notes.md new file mode 100644 index 0000000..0be4909 --- /dev/null +++ b/tech_docs/cloud/aws_notes.md @@ -0,0 +1,247 @@ +When troubleshooting live production environments, **minimizing disruption** is critical. Here’s where to exercise caution and best practices to avoid downtime or broken connections: + +--- + +### **1. High-Risk Actions That Can Break Traffic** +#### **A. Security Group Rule Modifications** +- **Risk**: Removing/updating rules can drop active connections. +- **Example**: + - Revoking an inbound `HTTPS (443)` rule kills live sessions. + - Changing egress rules can disrupt outbound API calls. +- **Mitigation**: + - **Stage changes**: Add new rules before removing old ones. + - **Use temporary rules**: Set short-lived rules (e.g., `aws ec2 authorize-security-group-ingress --cidr 1.2.3.4/32 --port 443 --protocol tcp --group-id sg-123`). + +#### **B. Network ACL (NACL) Updates** +- **Risk**: NACLs are stateless—updates drop **existing connections**. +- **Example**: + - Adding a deny rule for `10.0.1.0/24` kills active TCP sessions. +- **Mitigation**: + - **Test in non-prod first**. + - **Modify NACLs during low-traffic windows**. + +#### **C. Route Table Changes** +- **Risk**: Misrouting traffic (e.g., removing a NAT Gateway route). +- **Example**: + - Deleting `0.0.0.0/0 → igw-123` makes public subnets unreachable. +- **Mitigation**: + - **Pre-validate routes**: + ```bash + aws ec2 describe-route-tables --route-table-id rtb-123 --query 'RouteTables[*].Routes' + ``` + - **Use weighted routing** (e.g., Transit Gateway) for failover. + +#### **D. NAT Gateway Replacement** +- **Risk**: Swapping NAT Gateways breaks long-lived connections (e.g., SFTP, WebSockets). +- **Mitigation**: + - **Preserve Elastic IPs** (attach to new NAT Gateway first). + - **Warm standby**: Deploy new NAT Gateway before decommissioning old one. + +--- + +### **2. Safe Troubleshooting Techniques** +#### **A. Passive Monitoring (Zero Impact)** +- **Flow Logs**: Query logs without touching infrastructure. + ```sql + # CloudWatch Logs Insights (GUI) + fields @timestamp, srcAddr, dstAddr, action + | filter dstAddr = "10.0.2.5" and action = "REJECT" + ``` +- **VPC Traffic Mirroring**: Copy traffic to a monitoring instance (no production impact). + +#### **B. Non-Destructive Testing** +- **Packet Captures on Test Instances**: + ```bash + sudo tcpdump -i eth0 -w /tmp/capture.pcap host 10.0.1.10 # No service restart needed + ``` +- **Canary Deployments**: Test changes on 1% of traffic (e.g., weighted ALB routes). + +#### **C. Connection-Preserving Changes** +- **Security Groups**: + - Add new rules with higher priority (lower rule numbers) before deleting old ones. +- **NACLs**: + - Temporarily set `Ephemeral Ports (32768-60999)` to `ALLOW` during changes. + +--- + +### **3. Redundancy Patterns to Reduce Risk** +| **Scenario** | **Failover Strategy** | +|----------------------------|--------------------------------------------| +| **NAT Gateway Failure** | Deploy NAT Gateway per AZ + test failover. | +| **Route Table Corruption** | Use version-controlled Terraform rollback. | +| **SG Lockout** | Pre-configure backup admin access (e.g., SSM). | + +--- + +### **4. Worst-Case Recovery Plan** +1. **Rollback Immediately**: + - Revert NACLs/SGs to last-known-good state. + ```bash + aws ec2 replace-network-acl-association --association-id aclassoc-123 --network-acl-id acl-456 + ``` +2. **Bypass Troubleshooting**: + - Use **AWS Systems Manager (SSM)** to debug instances without SSH (no SG changes needed). +3. **Post-Mortem**: + - Check CloudTrail for who made changes: + ```bash + aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-123 + ``` + +--- + +### **Key Takeaways** +✅ **Avoid Live Changes**: Prefer read-only tools (Flow Logs, CloudWatch). +✅ **Stage Changes**: Test in non-prod, then deploy with canaries. +✅ **Preserve State**: Never drop NACL/SG rules without redundancy. +✅ **Automate Rollbacks**: Use Terraform/CloudFormation for quick recovery. + +**Pro Tip**: For critical systems, **run chaos engineering tests** (e.g., simulate NAT failure) during off-hours to validate resilience. + +Need a **step-by-step rollback guide** for a specific scenario? I can tailor it to your architecture. + +--- + +You’re spot on—**real-world war stories and battle-tested tricks** are what separate theoretical knowledge from production-hardened expertise. Below is a **concise technical guide** distilled from AWS networking trenches, covering **lessons learned, hidden pitfalls, and pro tips** rarely found in docs. + +--- + +# **AWS Networking War Stories: The Unwritten Guide** +*"Good judgment comes from experience. Experience comes from bad judgment."* + +--- + +## **1. Security Groups (SGs): The Silent Killers** +### **War Story: The Case of the Phantom Timeouts** +- **Symptoms**: Intermittent HTTP timeouts between microservices. +- **Root Cause**: Overlapping SG rules with different `description` fields but identical `IP permissions`. AWS silently dedupes them, causing random drops. +- **Fix**: + ```bash + # Audit duplicate rules (CLI reveals what GUI hides) + aws ec2 describe-security-groups --query 'SecurityGroups[*].IpPermissions' | jq '.[] | group_by(.FromPort, .ToPort, .IpProtocol, .IpRanges)[] | select(length > 1)' + ``` +- **Lesson**: Never trust the GUI alone—use CLI to audit SGs. + +### **Pro Tip: The "Deny All" Egress Trap** +- **Mistake**: Setting `egress = []` in Terraform (defaults to `deny all`). +- **Outcome**: Instances lose SSM, patch management, and API connectivity. +- **Fix**: Always explicitly allow: + ```hcl + egress { + from_port = 0 + to_port = 0 + protocol = "-1" + cidr_blocks = ["0.0.0.0/0"] # Or restrict to necessary IPs + } + ``` + +--- + +## **2. NACLs: The Stateless Nightmare** +### **War Story: The 5-Minute Outage** +- **Symptoms**: Database replication breaks after NACL "minor update." +- **Root Cause**: NACL rule #100 allowed `TCP/3306`, but rule #200 denied `Ephemeral Ports` (32768-60999)—breaking replies. +- **Fix**: + ```bash + # Allow ephemeral ports INBOUND for responses + aws ec2 create-network-acl-entry --network-acl-id acl-123 --rule-number 150 --protocol tcp --port-range From=32768,To=60999 --cidr-block 10.0.1.0/24 --rule-action allow --ingress + ``` +- **Lesson**: NACLs need **mirror rules** for ingress/egress. Test with `telnet` before deploying. + +### **Pro Tip: The Rule-Order Bomb** +- **Mistake**: Adding a `deny` rule at #50 *after* allowing at #100. +- **Outcome**: Traffic silently drops (first match wins). +- **Fix**: Use `describe-network-acls` to audit rule ordering: + ```bash + aws ec2 describe-network-acls --query 'NetworkAcls[*].Entries[?RuleNumber==`50`]' + ``` + +--- + +## **3. NAT Gateways: The $0.045/hr Landmine** +### **War Story: The 4 AM Bill Shock** +- **Symptoms**: $3k/month bill from "idle" NAT Gateways. +- **Root Cause**: Leftover NAT Gateways in unused AZs (auto-created by Terraform). +- **Fix**: + ```bash + # Find unattached NAT Gateways + aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --query 'NatGateways[?subnetId==`null`].NatGatewayId' + ``` +- **Lesson**: Always tag NAT Gateways with `Owner` and `Expiry`. + +### **Pro Tip: The TCP Connection Black Hole** +- **Mistake**: Replacing a NAT Gateway without draining connections. +- **Outcome**: Active sessions (SSH, RDP, DB) hang until TCP timeout (30+ mins). +- **Fix**: + - **Before replacement**: Reduce TCP timeouts on clients. + - **Use Network Load Balancer (NLB)** for stateful failover. + +--- + +## **4. VPC Peering: The Cross-Account Trap** +### **War Story: The DNS That Wasn’t** +- **Symptoms**: EC2 instances can’t resolve peered VPC’s private hosted zones. +- **Root Cause**: Peering doesn’t auto-share Route53 Private Hosted Zones. +- **Fix**: + ```bash + # Associate PHZ with peer VPC + aws route53 create-vpc-association-authorization --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456 + ``` +- **Lesson**: Test **DNS resolution** early in peering setups. + +### **Pro Tip: The Overlapping CIDR Silent Fail** +- **Mistake**: Peering `10.0.0.0/16` with another `10.0.0.0/16`. +- **Outcome**: Routes appear, but traffic fails. +- **Fix**: Always design non-overlapping CIDRs (e.g., `10.0.0.0/16` + `10.1.0.0/16`). + +--- + +## **5. Direct Connect: The BGP Rollercoaster** +### **War Story: The 1-Packet-Per-Second Mystery** +- **Symptoms**: Applications crawl over Direct Connect. +- **Root Cause**: BGP `keepalive` set to 60s (default), causing route flapping. +- **Fix**: + ```bash + # Adjust BGP timers (via AWS Console or CLI) + aws directconnect create-bgp-peer --virtual-interface-id dxvif-123 --bgp-peer 192.0.2.1,65000 --bgp-options '{"PeeringMode": "PRIVATE", "BgpAsn": 65101, "KeepaliveInterval": 10}' + ``` +- **Lesson**: Override defaults—set `keepalive = 10s`, `holddown = 30s`. + +### **Pro Tip: The MTU Mismatch** +- **Mistake**: Assuming AWS supports jumbo frames (9001 MTU). +- **Outcome**: Packet fragmentation kills throughput. +- **Fix**: Hard-set MTU to **1500** on on-prem routers: + ```bash + # Linux example + ip link set dev eth0 mtu 1500 + ``` + +--- + +## **6. The Ultimate Troubleshooting Checklist** +### **Before Making Changes:** +1. **Backup Configs**: + ```bash + aws ec2 describe-security-groups --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions}' > sg-backup.json + ``` +2. **Enable Flow Logs**: + ```bash + aws ec2 create-flow-logs --resource-type VPC --resource-id vpc-123 --traffic-type ALL --log-destination-type cloud-watch-logs + ``` +3. **Test with Canary**: Deploy changes to one AZ/subnet first. + +### **When Things Break:** +1. **Rollback Fast**: Use Terraform `terraform apply -replace` or CLI. +2. **SSM Session Manager**: Access instances without SSH (bypass broken SGs). +3. **CloudTrail Forensics**: + ```bash + aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteSecurityGroup + ``` + +--- + +## **Final Wisdom** +- **Document Your "Murder Mystery" Stories**: Every outage teaches something. +- **Automate Recovery**: Use Lambda + EventBridge to auto-rollback NACL changes. +- **Pressure-Test Resiliency**: Run GameDays (e.g., randomly kill NAT Gateways). + +Want this as a **PDF cheatsheet**? I can structure it with more war stories and code snippets. Let me know! \ No newline at end of file