Add tech_docs/cloud/aws_notes.md

2025-07-20 20:58:27 -05:00
parent 2e589e18bc
commit 8899fb9888
1 changed files with 247 additions and 0 deletions
--- a/tech_docs/cloud/aws_notes.md
+++ b/tech_docs/cloud/aws_notes.md
@@ -0,0 +1,247 @@
+When troubleshooting live production environments, **minimizing disruption** is critical. The sections below cover where to exercise caution and which practices help avoid downtime or broken connections:
+
+---
+
+### **1. High-Risk Actions That Can Break Traffic**  
+#### **A. Security Group Rule Modifications**  
+- **Risk**: Removing or narrowing rules can cut off traffic that is currently flowing.  
+- **Example**:  
+  - Revoking an inbound `HTTPS (443)` rule blocks new sessions immediately; tracked connections may linger until they go idle, while untracked ones are cut at once.  
+  - Changing egress rules can disrupt outbound API calls.  
+- **Mitigation**:  
+  - **Stage changes**: Add new rules before removing old ones.  
+  - **Use temporary rules**: Add narrowly scoped rules and revoke them when finished, since security group rules have no built-in expiry (e.g., `aws ec2 authorize-security-group-ingress --cidr 1.2.3.4/32 --port 443 --protocol tcp --group-id sg-123`).  
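+
+The staging advice above can be sketched as a small shell function: add the replacement rule first, and revoke the old one only if that succeeds. The function name `swap_ingress_cidr` and the `AWS_CMD` override are hypothetical helpers for illustration, and the group ID and CIDRs are placeholders.
+
+```bash
+# AWS_CMD is a hypothetical override: set AWS_CMD=echo to print the
+# commands instead of running them against a real account.
+AWS_CMD="${AWS_CMD:-aws}"
+
+# swap_ingress_cidr GROUP_ID PORT OLD_CIDR NEW_CIDR
+swap_ingress_cidr() {
+  group_id=$1; port=$2; old_cidr=$3; new_cidr=$4
+  # 1. Add the replacement rule first so there is never a gap.
+  $AWS_CMD ec2 authorize-security-group-ingress \
+    --group-id "$group_id" --protocol tcp --port "$port" --cidr "$new_cidr" || return 1
+  # 2. Only then remove the old rule.
+  $AWS_CMD ec2 revoke-security-group-ingress \
+    --group-id "$group_id" --protocol tcp --port "$port" --cidr "$old_cidr"
+}
+```
+
+A dry run such as `AWS_CMD=echo; swap_ingress_cidr sg-123 443 1.2.3.4/32 5.6.7.8/32` shows the two calls in order before anything touches production.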
+
+#### **B. Network ACL (NACL) Updates**  
+- **Risk**: NACLs are stateless, so a rule change applies to every packet at once, including packets of **existing connections**.  
+- **Example**:  
+  - Adding a deny rule for `10.0.1.0/24` kills active TCP sessions from that range.  
+- **Mitigation**:  
+  - **Test in non-prod first**.  
+  - **Modify NACLs during low-traffic windows**.  
+
+#### **C. Route Table Changes**  
+- **Risk**: Misrouting traffic (e.g., removing a NAT Gateway route).  
+- **Example**:  
+  - Deleting `0.0.0.0/0 → igw-123` makes public subnets unreachable.  
+- **Mitigation**:  
+  - **Pre-validate routes**:  
+    ```bash
+    aws ec2 describe-route-tables --route-table-ids rtb-123 --query 'RouteTables[*].Routes'
+    ```  
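+  - **Plan a failover path**: VPC route tables have no weighted routes, so keep a standby target (e.g., via Transit Gateway) and switch to it with a single `replace-route` call.  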
+
+#### **D. NAT Gateway Replacement**  
+- **Risk**: Swapping NAT Gateways breaks long-lived connections (e.g., SFTP, WebSockets).  
+- **Mitigation**:  
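+  - **Plan for Elastic IPs**: An Elastic IP stays bound to its NAT Gateway until that gateway is deleted, so the replacement needs its own EIP; update any partner allowlists before cutover.  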
+  - **Warm standby**: Deploy new NAT Gateway before decommissioning old one.  
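+
+The warm-standby cutover above might look like the following sketch. It assumes the new NAT Gateway has already been created; `cutover_nat` and the `AWS_CMD` override are hypothetical names, and the IDs are placeholders. Connections already flowing through the old gateway still break at the switch, so this only shortens the gap for new connections.
+
+```bash
+# AWS_CMD is a hypothetical override: set AWS_CMD=echo for a dry run.
+AWS_CMD="${AWS_CMD:-aws}"
+
+# cutover_nat ROUTE_TABLE_ID NEW_NAT_ID
+cutover_nat() {
+  rtb_id=$1; new_nat=$2
+  # Block until the standby gateway reports "available".
+  $AWS_CMD ec2 wait nat-gateway-available --nat-gateway-ids "$new_nat" || return 1
+  # Repoint the default route in one call; the old gateway stays up for rollback.
+  $AWS_CMD ec2 replace-route --route-table-id "$rtb_id" \
+    --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$new_nat"
+}
+```
+
+Rolling back is the same `replace-route` call pointed at the old gateway, which is why the old one should only be deleted after the new path has been verified.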
+
+---
+
+### **2. Safe Troubleshooting Techniques**  
+#### **A. Passive Monitoring (Zero Impact)**  
+- **Flow Logs**: Query logs without touching infrastructure.  
+  ```sql
+  # CloudWatch Logs Insights (GUI)  
+  fields @timestamp, srcAddr, dstAddr, action  
+  | filter dstAddr = "10.0.2.5" and action = "REJECT"  
+  ```  
+- **VPC Traffic Mirroring**: Copy traffic to a monitoring instance (no production impact).  
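+
+When flow logs are exported as text (for example from an S3 destination), the same rejected-traffic filter can be applied offline. This sketch assumes the default flow log record format, where field 4 is the source address, field 5 the destination address, field 7 the destination port and field 13 the action; `rejected_to` is a hypothetical helper name.
+
+```bash
+# rejected_to DST_ADDR < flow-log-file
+# Prints source address, destination port and action for rejected flows to DST_ADDR.
+rejected_to() {
+  awk -v dst="$1" '$5 == dst && $13 == "REJECT" { print $4, $7, $13 }'
+}
+```
+
+Usage: `rejected_to 10.0.2.5 < flow-logs.txt`. Custom log formats reorder the fields, so the column numbers would need adjusting.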
+
+#### **B. Non-Destructive Testing**  
+- **Packet Captures on Test Instances**:  
+  ```bash
+  sudo tcpdump -i eth0 -w /tmp/capture.pcap host 10.0.1.10  # No service restart needed  
+  ```  
+- **Canary Deployments**: Test changes on 1% of traffic (e.g., weighted target groups on an ALB).  
+
+#### **C. Connection-Preserving Changes**  
+- **Security Groups**:  
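+  - Security groups have no rule ordering or priority (all allow rules apply together), so add the replacement rule and confirm traffic before deleting the old one.  
+- **NACLs**:  
+  - Temporarily set ephemeral ports (`32768-60999` is the Linux default range; `1024-65535` covers other client types) to `ALLOW` during changes.  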
+
+---
+
+### **3. Redundancy Patterns to Reduce Risk**  
+| **Scenario**               | **Failover Strategy**                      |  
+|----------------------------|--------------------------------------------|  
+| **NAT Gateway Failure**    | Deploy NAT Gateway per AZ + test failover. |  
+| **Route Table Corruption** | Use version-controlled Terraform rollback. |  
+| **SG Lockout**             | Pre-configure backup admin access (e.g., SSM). |  
+
+---
+
+### **4. Worst-Case Recovery Plan**  
+1. **Rollback Immediately**:  
+   - Revert NACLs/SGs to last-known-good state.  
+   ```bash
+   aws ec2 replace-network-acl-association --association-id aclassoc-123 --network-acl-id acl-456  
+   ```  
+2. **Bypass Troubleshooting**:  
+   - Use **AWS Systems Manager (SSM)** to debug instances without SSH (no SG changes needed).  
+3. **Post-Mortem**:  
+   - Check CloudTrail for who made changes:  
+     ```bash
+     aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-123  
+     ```  
+
+---
+
+### **Key Takeaways**  
+✅ **Avoid Live Changes**: Prefer read-only tools (Flow Logs, CloudWatch).  
+✅ **Stage Changes**: Test in non-prod, then deploy with canaries.  
+✅ **Preserve State**: Never drop NACL/SG rules without redundancy.  
+✅ **Automate Rollbacks**: Use Terraform/CloudFormation for quick recovery.  
+
+**Pro Tip**: For critical systems, **run chaos engineering tests** (e.g., simulate NAT failure) during off-hours to validate resilience.  
+
+---
+
+Below is a **concise technical guide** distilled from AWS networking trenches, covering **lessons learned, hidden pitfalls, and pro tips** rarely found in docs.  
+
+---
+
+# **AWS Networking War Stories: The Unwritten Guide**  
+*"Good judgment comes from experience. Experience comes from bad judgment."*  
+
+---
+
+## **1. Security Groups (SGs): The Silent Killers**  
+### **War Story: The Case of the Phantom Timeouts**  
+- **Symptoms**: Intermittent HTTP timeouts between microservices.  
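+- **Root Cause** (as diagnosed at the time): Overlapping SG rules with different `description` fields but identical `IP permissions`. Note that SG rules are allow-only and evaluated as a union, so overlapping rules alone should not drop traffic; treat them as a sign that the group needs an audit.  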
+- **Fix**:  
+  ```bash
+  # Audit duplicate rules (CLI reveals what GUI hides)
+  aws ec2 describe-security-groups --query 'SecurityGroups[*].IpPermissions' | jq '.[] | group_by(.FromPort, .ToPort, .IpProtocol, .IpRanges)[] | select(length > 1)'
+  ```
+- **Lesson**: Never trust the GUI alone—use CLI to audit SGs.  
+
+### **Pro Tip: The "Deny All" Egress Trap**  
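+- **Mistake**: Setting `egress = []` (or omitting `egress`) on a Terraform `aws_security_group`. Terraform removes AWS's default allow-all egress rule, so the group ends up denying all outbound traffic.  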
+- **Outcome**: Instances lose SSM, patch management, and API connectivity.  
+- **Fix**: Always explicitly allow:  
+  ```hcl
+  egress {
+    from_port   = 0
+    to_port     = 0
+    protocol    = "-1"
+    cidr_blocks = ["0.0.0.0/0"]  # Or restrict to necessary IPs
+  }
+  ```
+
+---
+
+## **2. NACLs: The Stateless Nightmare**  
+### **War Story: The 5-Minute Outage**  
+- **Symptoms**: Database replication breaks after NACL "minor update."  
+- **Root Cause**: NACL rule #100 allowed `TCP/3306`, but rule #200 denied `Ephemeral Ports` (32768-60999)—breaking replies.  
+- **Fix**:  
+  ```bash
+  # Allow ephemeral ports INBOUND for responses
+  aws ec2 create-network-acl-entry --network-acl-id acl-123 --rule-number 150 --protocol tcp --port-range From=32768,To=60999 --cidr-block 10.0.1.0/24 --rule-action allow --ingress
+  ```
+- **Lesson**: NACLs need **mirror rules** for ingress/egress. Test with `telnet` before deploying.  
+
+### **Pro Tip: The Rule-Order Bomb**  
+- **Mistake**: Adding a `deny` rule at #50 when the matching `allow` sits at #100.  
+- **Outcome**: Traffic silently drops, because rules are evaluated from the lowest number up and the first match wins.  
+- **Fix**: Use `describe-network-acls` to audit rule ordering:  
+  ```bash
+  aws ec2 describe-network-acls --network-acl-ids acl-123 --query 'sort_by(NetworkAcls[0].Entries, &RuleNumber)[].[RuleNumber,Egress,RuleAction,Protocol,CidrBlock]' --output table
+  ```
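+
+To make the first-match behavior concrete, here is a minimal offline sketch that evaluates a port against a list of rules the way a NACL does. It is a simplification: rules are given as `NUMBER ACTION FROM_PORT TO_PORT`, CIDR and protocol matching are left out, and `nacl_verdict` is a hypothetical helper name.
+
+```bash
+# nacl_verdict PORT RULES
+# RULES is a newline-separated list of "NUMBER ACTION FROM_PORT TO_PORT".
+nacl_verdict() {
+  # Sort numerically so the lowest rule number is checked first,
+  # then stop at the first rule whose port range contains PORT.
+  printf '%s\n' "$2" | sort -n | awk -v port="$1" '
+    port >= $3 && port <= $4 { print $2 " (rule " $1 ")"; found = 1; exit }
+    END { if (!found) print "deny (rule *)" }'
+}
+
+rules='100 allow 3306 3306
+50 deny 3306 3306'
+nacl_verdict 3306 "$rules"   # prints: deny (rule 50)
+```
+
+The allow at #100 never gets a chance because #50 matches first; with no matching rule at all, the implicit `*` deny applies.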
+
+---
+
+## **3. NAT Gateways: The $0.045/hr Landmine**  
+### **War Story: The 4 AM Bill Shock**  
+- **Symptoms**: $3k/month bill from "idle" NAT Gateways.  
+- **Root Cause**: Leftover NAT Gateways in unused AZs (auto-created by Terraform).  
+- **Fix**:  
+  ```bash
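+  # List available NAT Gateways with their VPC and subnet
+  aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --query 'NatGateways[].[NatGatewayId,VpcId,SubnetId]' --output table
+  # A gateway that no route table references is a deletion candidate (empty output = unused)
+  aws ec2 describe-route-tables --filters "Name=route.nat-gateway-id,Values=nat-123" --query 'RouteTables[].RouteTableId'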
+  ```
+- **Lesson**: Always tag NAT Gateways with `Owner` and `Expiry`.  
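+
+A quick back-of-the-envelope check helps when sizing this kind of bill. The sketch below multiplies the $0.045/hr rate quoted in the section heading by 730 hours (an average month); `nat_monthly_cost` is a hypothetical helper, the rate varies by region, and data processing charges are not included.
+
+```bash
+# nat_monthly_cost GATEWAY_COUNT [HOURLY_RATE_USD]
+nat_monthly_cost() {
+  # 730 = average hours per month; the rate defaults to the figure quoted above.
+  awk -v n="$1" -v rate="${2:-0.045}" 'BEGIN { printf "%.2f\n", n * rate * 730 }'
+}
+
+nat_monthly_cost 1    # prints: 32.85
+```
+
+At that rate, hourly charges alone only reach $3k/month with roughly 91 gateways (`nat_monthly_cost 91` gives 2989.35), so a bill of that size usually also includes data processing.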
+
+### **Pro Tip: The TCP Connection Black Hole**  
+- **Mistake**: Replacing a NAT Gateway without draining connections.  
+- **Outcome**: Active sessions (SSH, RDP, DB) hang until TCP timeout (30+ mins).  
+- **Fix**:  
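+  - **Before replacement**: Shorten TCP timeouts or enable keepalives on clients so hung sessions fail fast and reconnect.  
+  - **Use Network Load Balancer (NLB)** where inbound traffic needs a stable endpoint; it does not preserve outbound sessions through a replaced NAT Gateway.  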
+
+---
+
+## **4. VPC Peering: The Cross-Account Trap**  
+### **War Story: The DNS That Wasn’t**  
+- **Symptoms**: EC2 instances can’t resolve peered VPC’s private hosted zones.  
+- **Root Cause**: Peering doesn’t auto-share Route53 Private Hosted Zones.  
+- **Fix**:  
+  ```bash
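+  # Cross-account: the zone owner first authorizes the peer VPC...
+  aws route53 create-vpc-association-authorization --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
+  # ...then the peer VPC's account completes the association
+  aws route53 associate-vpc-with-hosted-zone --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456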
+  ```
+- **Lesson**: Test **DNS resolution** early in peering setups.  
+
+### **Pro Tip: The Overlapping CIDR Silent Fail**  
+- **Mistake**: Peering `10.0.0.0/16` with another `10.0.0.0/16`.  
+- **Outcome**: AWS rejects the peering request when the CIDRs match or overlap, so the conflict only surfaces once you try to connect the VPCs.  
+- **Fix**: Always design non-overlapping CIDRs (e.g., `10.0.0.0/16` + `10.1.0.0/16`).  
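+
+Overlap is easy to check before a peering request is ever made. The sketch below is a plain-shell check for IPv4 only; `ip_to_int` and `cidrs_overlap` are hypothetical helper names, and input validation is omitted.
+
+```bash
+# ip_to_int A.B.C.D -> 32-bit integer
+ip_to_int() {
+  old_ifs=$IFS; IFS=.
+  set -- $1          # split the dotted quad into $1..$4
+  IFS=$old_ifs
+  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
+}
+
+# cidrs_overlap CIDR_A CIDR_B -> exit status 0 if the ranges overlap
+cidrs_overlap() {
+  a_ip=${1%/*}; a_len=${1#*/}
+  b_ip=${2%/*}; b_len=${2#*/}
+  # Two blocks overlap exactly when they agree on the shorter prefix.
+  len=$(( a_len < b_len ? a_len : b_len ))
+  mask=$(( len == 0 ? 0 : (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
+  [ $(( $(ip_to_int "$a_ip") & mask )) -eq $(( $(ip_to_int "$b_ip") & mask )) ]
+}
+```
+
+For example, `cidrs_overlap 10.0.0.0/16 10.1.0.0/16 || echo "safe to peer"` prints the message, while `10.0.0.0/8` against `10.1.0.0/16` counts as an overlap.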
+
+---
+
+## **5. Direct Connect: The BGP Rollercoaster**  
+### **War Story: The 1-Packet-Per-Second Mystery**  
+- **Symptoms**: Applications crawl over Direct Connect.  
+- **Root Cause**: BGP timers left at a 60s keepalive (a common router default), so link failures were detected slowly and routes flapped.  
+- **Fix**:  
+  ```bash
+  # BGP timers are negotiated from the customer router, not set through the AWS CLI; use the CLI to confirm peer state
+  aws directconnect describe-virtual-interfaces --virtual-interface-id dxvif-123 --query 'virtualInterfaces[].bgpPeers[].[bgpPeerId,bgpStatus]'
+  ```
+- **Lesson**: Override router defaults: set `keepalive = 10s` and `hold time = 30s` on the on-prem side.  
+
+### **Pro Tip: The MTU Mismatch**  
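+- **Mistake**: Assuming jumbo frames (9001 MTU) work end to end without enabling them on the virtual interface and on every on-prem hop.  
+- **Outcome**: Packet fragmentation or silent drops kill throughput.  
+- **Fix**: Unless jumbo frames are configured along the whole path, hard-set MTU to **1500** on on-prem routers:  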
+  ```bash
+  # Linux example
+  ip link set dev eth0 mtu 1500
+  ```
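+
+To confirm which MTU a path actually carries, a don't-fragment ping with the right payload size is a common check. The payload is the MTU minus 28 bytes (20-byte IPv4 header plus 8-byte ICMP header); `icmp_payload_for_mtu` is a hypothetical helper and `10.0.1.10` is a placeholder target.
+
+```bash
+# icmp_payload_for_mtu MTU -> largest ping payload that fits without fragmentation
+icmp_payload_for_mtu() {
+  # 20-byte IPv4 header + 8-byte ICMP header
+  echo $(( $1 - 28 ))
+}
+
+icmp_payload_for_mtu 1500   # prints: 1472
+```
+
+On Linux, `ping -M do -s "$(icmp_payload_for_mtu 1500)" -c 3 10.0.1.10` sets the don't-fragment bit, so replies only come back if every hop carries the full frame.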
+
+---
+
+## **6. The Ultimate Troubleshooting Checklist**  
+### **Before Making Changes:**  
+1. **Backup Configs**:  
+   ```bash
+   aws ec2 describe-security-groups --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions}' > sg-backup.json
+   ```
+2. **Enable Flow Logs**:  
+   ```bash
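+   # Log group name and IAM role ARN are placeholders; CloudWatch Logs delivery requires both
+   aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-123 --traffic-type ALL --log-destination-type cloud-watch-logs --log-group-name vpc-flow-logs --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role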
+   ```
+3. **Test with Canary**: Deploy changes to one AZ/subnet first.  
+
+### **When Things Break:**  
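+1. **Rollback Fast**: Check out the last-known-good Terraform configuration and run `terraform apply`, or revert with the CLI. Avoid `-replace` here, since it forces resources to be destroyed and recreated.  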
+2. **SSM Session Manager**: Access instances without SSH (bypass broken SGs).  
+3. **CloudTrail Forensics**:  
+   ```bash
+   aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteSecurityGroup
+   ```
+
+---
+
+## **Final Wisdom**  
+- **Document Your "Murder Mystery" Stories**: Every outage teaches something.  
+- **Automate Recovery**: Use Lambda + EventBridge to auto-rollback NACL changes.  
+- **Pressure-Test Resiliency**: Run GameDays (e.g., randomly kill NAT Gateways).  