Add tech_docs/cloud/aws_notes.md

2025-07-20 20:58:27 -05:00
parent 2e589e18bc
commit 8899fb9888

When troubleshooting live production environments, **minimizing disruption** is critical. Here's where to exercise caution, along with best practices to avoid downtime or broken connections:
---
### **1. High-Risk Actions That Can Break Traffic**
#### **A. Security Group Rule Modifications**
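- **Risk**: Removing or narrowing rules can drop active connections, depending on how the connection is tracked.
- **Example**:
- Revoking an inbound `HTTPS (443)` rule immediately cuts untracked flows (e.g., a `0.0.0.0/0` rule paired with an allow-all rule in the other direction). Tracked connections persist until they close or time out, but no new connections are accepted.
- Changing egress rules can disrupt outbound API calls.
- **Mitigation**:
- **Stage changes**: Add new rules before removing old ones.
- **Use temporary rules**: Add a narrowly scoped rule for the duration of the debugging session (e.g., `aws ec2 authorize-security-group-ingress --cidr 1.2.3.4/32 --port 443 --protocol tcp --group-id sg-123`). Security group rules have no built-in expiry, so remove the rule afterwards with `aws ec2 revoke-security-group-ingress` and the same arguments.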
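
The "add before remove" ordering can be sketched in a few lines of Python. This is a minimal illustration, not an AWS API call: the `(protocol, port, cidr)` tuple format and the `plan_rule_change` helper are assumptions made for the example.

```python
def plan_rule_change(current, desired):
    """Return ordered (action, rule) steps: every authorize comes before any revoke."""
    current, desired = set(current), set(desired)
    adds = sorted(desired - current)      # rules that must be created
    removes = sorted(current - desired)   # rules that can go once the adds are live
    return [("authorize", r) for r in adds] + [("revoke", r) for r in removes]


# Hypothetical rules as (protocol, port, cidr) tuples.
current = [("tcp", 443, "0.0.0.0/0")]
desired = [("tcp", 443, "10.0.0.0/8"), ("tcp", 443, "1.2.3.4/32")]

for action, rule in plan_rule_change(current, desired):
    print(action, rule)
```

Each step would map to one `authorize-security-group-ingress` or `revoke-security-group-ingress` call, so traffic is always covered by at least one rule while the change is in flight.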
#### **B. Network ACL (NACL) Updates**
- **Risk**: NACLs are stateless—updates drop **existing connections**.
- **Example**:
- Adding a deny rule for `10.0.1.0/24` kills active TCP sessions.
- **Mitigation**:
- **Test in non-prod first**.
- **Modify NACLs during low-traffic windows**.
#### **C. Route Table Changes**
- **Risk**: Misrouting traffic (e.g., removing a NAT Gateway route).
- **Example**:
- Deleting `0.0.0.0/0 → igw-123` makes public subnets unreachable.
- **Mitigation**:
- **Pre-validate routes**:
```bash
aws ec2 describe-route-tables --route-table-ids rtb-123 --query 'RouteTables[*].Routes'
```
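- **Keep a standby path for failover**: VPC route tables have no weighted routing, so prepare the alternate target in advance (e.g., a second NAT Gateway or a Transit Gateway attachment) and switch with `aws ec2 replace-route`, which changes the target in a single call.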
#### **D. NAT Gateway Replacement**
- **Risk**: Swapping NAT Gateways breaks long-lived connections (e.g., SFTP, WebSockets).
- **Mitigation**:
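- **Plan around Elastic IPs**: An Elastic IP can be associated with only one NAT Gateway at a time, so the new gateway needs its own EIP while both are running. If downstream allowlists depend on the source IP, add the new EIP to them before cutting over.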
- **Warm standby**: Deploy new NAT Gateway before decommissioning old one.
---
### **2. Safe Troubleshooting Techniques**
#### **A. Passive Monitoring (Zero Impact)**
- **Flow Logs**: Query logs without touching infrastructure.
```sql
# CloudWatch Logs Insights (GUI)
fields @timestamp, srcAddr, dstAddr, action
| filter dstAddr = "10.0.2.5" and action = "REJECT"
```
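- **VPC Traffic Mirroring**: Copy traffic to a monitoring instance. The workload itself is untouched, but mirrored traffic counts toward the source instance's bandwidth, so be careful with instances already near their network limit.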
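
If flow logs are exported as raw text rather than queried in Logs Insights, the same `REJECT` filter can be applied offline. The sketch below assumes the default (version 2) flow log record format; the sample records and helper names are made up for illustration.

```python
# Field order of the default (version 2) VPC flow log format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line):
    """Split one space-separated flow log record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected_to(lines, dstaddr):
    """Return parsed records that were rejected on their way to dstaddr."""
    records = (parse_record(line) for line in lines)
    return [r for r in records if r["dstaddr"] == dstaddr and r["action"] == "REJECT"]


# Illustrative records only; account ID, ENI and addresses are placeholders.
sample = [
    "2 123456789012 eni-0abc 10.0.1.10 10.0.2.5 49152 443 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0abc 10.0.1.11 10.0.2.5 49153 443 6 5 300 1700000000 1700000060 ACCEPT OK",
]

for record in rejected_to(sample, "10.0.2.5"):
    print(record["srcaddr"], "->", record["dstaddr"], record["dstport"], record["action"])
```

Custom flow log formats use a different field order, so `FIELDS` would need to match whatever format the log was created with.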
#### **B. Non-Destructive Testing**
- **Packet Captures on Test Instances**:
```bash
sudo tcpdump -i eth0 -w /tmp/capture.pcap host 10.0.1.10 # No service restart needed
```
- **Canary Deployments**: Test changes on 1% of traffic (e.g., weighted ALB target groups).
#### **C. Connection-Preserving Changes**
- **Security Groups**:
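- Add the new rules before deleting the old ones. Security groups have no rule numbers or priorities: every rule is an allow, and traffic is permitted if any rule matches.
- **NACLs**:
- Keep an `ALLOW` for ephemeral ports in place during changes. `32768-60999` covers the Linux default range; AWS documentation suggests `1024-65535` when clients run a mix of operating systems.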
---
### **3. Redundancy Patterns to Reduce Risk**
| **Scenario** | **Failover Strategy** |
|----------------------------|--------------------------------------------|
| **NAT Gateway Failure** | Deploy NAT Gateway per AZ + test failover. |
| **Route Table Corruption** | Use version-controlled Terraform rollback. |
| **SG Lockout** | Pre-configure backup admin access (e.g., SSM). |
---
### **4. Worst-Case Recovery Plan**
1. **Rollback Immediately**:
- Revert NACLs/SGs to last-known-good state.
```bash
aws ec2 replace-network-acl-association --association-id aclassoc-123 --network-acl-id acl-456
```
2. **Bypass Troubleshooting**:
- Use **AWS Systems Manager (SSM)** to debug instances without SSH (no SG changes needed).
3. **Post-Mortem**:
- Check CloudTrail for who made changes:
```bash
aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-123
```
---
### **Key Takeaways**
✅ **Avoid Live Changes**: Prefer read-only tools (Flow Logs, CloudWatch).
✅ **Stage Changes**: Test in non-prod, then deploy with canaries.
✅ **Preserve State**: Never drop NACL/SG rules without redundancy.
✅ **Automate Rollbacks**: Use Terraform/CloudFormation for quick recovery.
**Pro Tip**: For critical systems, **run chaos engineering tests** (e.g., simulate NAT failure) during off-hours to validate resilience.
---
Real-world war stories and battle-tested tricks are what separate theoretical knowledge from production-hardened expertise. Below is a **concise technical guide** distilled from the AWS networking trenches, covering **lessons learned, hidden pitfalls, and pro tips** rarely found in docs.
---
# **AWS Networking War Stories: The Unwritten Guide**
*"Good judgment comes from experience. Experience comes from bad judgment."*
---
## **1. Security Groups (SGs): The Silent Killers**
### **War Story: The Case of the Phantom Timeouts**
- **Symptoms**: Intermittent HTTP timeouts between microservices.
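- **Root Cause**: Attributed at the time to overlapping SG rules with different `description` fields but identical `IP permissions`. Treat that diagnosis with caution: security groups are allow-only and evaluate the union of all rules, so overlapping rules by themselves should not drop traffic. The practical risk is deleting one of the overlapping rules on the assumption that another still covers the traffic.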
- **Fix**:
```bash
# Audit duplicate rules (CLI reveals what GUI hides)
aws ec2 describe-security-groups --query 'SecurityGroups[*].IpPermissions' | jq '.[] | group_by(.FromPort, .ToPort, .IpProtocol, .IpRanges)[] | select(length > 1)'
```
- **Lesson**: Never trust the GUI alone—use CLI to audit SGs.
### **Pro Tip: The "Deny All" Egress Trap**
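- **Mistake**: Setting `egress = []` in Terraform. Terraform removes the allow-all egress rule that AWS creates by default, so the group ends up with no egress rules and all outbound traffic is denied.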
- **Outcome**: Instances lose SSM, patch management, and API connectivity.
- **Fix**: Always explicitly allow:
```hcl
egress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = ["0.0.0.0/0"] # Or restrict to necessary IPs
}
```
---
## **2. NACLs: The Stateless Nightmare**
### **War Story: The 5-Minute Outage**
- **Symptoms**: Database replication breaks after NACL "minor update."
- **Root Cause**: NACL rule #100 allowed `TCP/3306`, but rule #200 denied `Ephemeral Ports` (32768-60999)—breaking replies.
- **Fix**:
```bash
# Allow ephemeral ports INBOUND for responses
aws ec2 create-network-acl-entry --network-acl-id acl-123 --rule-number 150 --protocol tcp --port-range From=32768,To=60999 --cidr-block 10.0.1.0/24 --rule-action allow --ingress
```
- **Lesson**: NACLs need **mirror rules** for ingress/egress. Test with `telnet` before deploying.
### **Pro Tip: The Rule-Order Bomb**
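- **Mistake**: Adding a `deny` rule at #50 when the matching `allow` sits at #100.
- **Outcome**: Traffic silently drops. Rules are evaluated from the lowest rule number upward and the first match wins, regardless of the order in which the rules were created.
- **Fix**: Use `describe-network-acls` to audit rule ordering:
```bash
# List entries in evaluation order (lowest rule number first)
aws ec2 describe-network-acls --network-acl-ids acl-123 --query 'NetworkAcls[0].Entries | sort_by(@, &RuleNumber)[].[RuleNumber,Egress,RuleAction,CidrBlock]' --output table
```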
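
To see why rule #50 shadows rule #100, it can help to model the first-match evaluation locally. The sketch below is a simplified model, not AWS code: the entry dictionaries and the `evaluate_nacl` helper are hypothetical, and only one traffic direction is considered.

```python
import ipaddress


def evaluate_nacl(entries, src_ip, port, protocol="tcp"):
    """Return (rule_number, action) for the first matching entry, lowest number first."""
    addr = ipaddress.ip_address(src_ip)
    for entry in sorted(entries, key=lambda e: e["rule_number"]):
        low, high = entry["port_range"]
        if (
            entry["protocol"] in (protocol, "all")
            and low <= port <= high
            and addr in ipaddress.ip_network(entry["cidr"])
        ):
            return entry["rule_number"], entry["action"]
    return None, "deny"  # nothing matched: the implicit default deny applies


# Hypothetical inbound entries: the allow was created first, the deny added later.
entries = [
    {"rule_number": 100, "protocol": "tcp", "port_range": (3306, 3306),
     "cidr": "10.0.0.0/16", "action": "allow"},
    {"rule_number": 50, "protocol": "tcp", "port_range": (0, 65535),
     "cidr": "10.0.1.0/24", "action": "deny"},
]

print(evaluate_nacl(entries, "10.0.1.7", 3306))  # matched by rule 50
print(evaluate_nacl(entries, "10.0.2.7", 3306))  # falls through to rule 100
```

Running a planned rule set through a model like this before applying it is a cheap way to catch a shadowing deny, although it is no substitute for testing in a non-production VPC.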
---
## **3. NAT Gateways: The $0.045/hr Landmine**
### **War Story: The 4 AM Bill Shock**
- **Symptoms**: $3k/month bill from "idle" NAT Gateways.
- **Root Cause**: Leftover NAT Gateways in unused AZs (auto-created by Terraform).
- **Fix**:
```bash
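# NAT Gateways always live in a subnet, so "unused" means no route points at them.
# 1. List available NAT Gateways
aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --query 'NatGateways[].[NatGatewayId,SubnetId,VpcId]' --output text
# 2. List NAT Gateways referenced by route tables; IDs missing from this output are cleanup candidates
aws ec2 describe-route-tables --query 'RouteTables[].Routes[].NatGatewayId' --output text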
```
- **Lesson**: Always tag NAT Gateways with `Owner` and `Expiry`.
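
A rough estimate makes the cost of forgotten gateways concrete. The sketch below uses the $0.045/hr figure from this section's heading and assumes 730 hours per month; data processing charges are excluded, and both helper names are hypothetical.

```python
HOURLY_RATE_USD = 0.045  # from the heading above; actual pricing varies by region
HOURS_PER_MONTH = 730    # assumption: average hours in a month


def unused_nat_gateways(all_ids, routed_ids):
    """Return NAT Gateway IDs that no route table references."""
    return sorted(set(all_ids) - set(routed_ids))


def idle_monthly_cost(gateway_count, hourly_rate=HOURLY_RATE_USD, hours=HOURS_PER_MONTH):
    """Hourly charge only, in USD, for gateways that carry no traffic."""
    return round(gateway_count * hourly_rate * hours, 2)


# Placeholder IDs standing in for the output of the two CLI calls.
all_ids = ["nat-aaa", "nat-bbb", "nat-ccc"]
routed_ids = ["nat-aaa"]

idle = unused_nat_gateways(all_ids, routed_ids)
print(idle, idle_monthly_cost(len(idle)))
```

At this rate each idle gateway costs roughly $33 per month before any data processing charges.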
### **Pro Tip: The TCP Connection Black Hole**
- **Mistake**: Replacing a NAT Gateway without draining connections.
- **Outcome**: Active sessions (SSH, RDP, DB) hang until TCP timeout (30+ mins).
- **Fix**:
- **Before replacement**: Reduce TCP timeouts on clients.
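- **Plan for reconnects**: A NAT Gateway cannot hand connection state to its replacement, so schedule the swap for a low-traffic window and make sure clients retry. A Network Load Balancer (NLB) does not help here, since it fronts inbound traffic rather than NAT egress.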
---
## **4. VPC Peering: The Cross-Account Trap**
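### **War Story: The DNS That Wasn't**
- **Symptoms**: EC2 instances can't resolve names in the peered VPC's private hosted zones.
- **Root Cause**: Peering doesn't automatically share Route 53 Private Hosted Zones (PHZs).
- **Fix**:
```bash
# Step 1 (account that owns the hosted zone): authorize the peer VPC
aws route53 create-vpc-association-authorization --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
# Step 2 (account that owns the VPC): complete the association
aws route53 associate-vpc-with-hosted-zone --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456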
```
- **Lesson**: Test **DNS resolution** early in peering setups.
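### **Pro Tip: The Overlapping CIDR Trap**
- **Mistake**: Trying to peer `10.0.0.0/16` with another `10.0.0.0/16`.
- **Outcome**: AWS does not allow peering between VPCs with matching or overlapping CIDR blocks, so the peering connection fails and the address plan has to change before the VPCs can be connected.
- **Fix**: Always design non-overlapping CIDRs (e.g., `10.0.0.0/16` + `10.1.0.0/16`).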
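
Overlaps are easy to catch before any peering request is made. This is a minimal sketch using Python's standard `ipaddress` module; the VPC names and the `overlapping_pairs` helper are illustrative.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs):
    """Return pairs of names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan.
plan = {
    "vpc-a": "10.0.0.0/16",
    "vpc-b": "10.0.0.0/16",
    "vpc-c": "10.1.0.0/16",
}

print(overlapping_pairs(plan))
```

A check like this fits naturally into a CI step that validates the address plan before any Terraform run.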
---
## **5. Direct Connect: The BGP Rollercoaster**
### **War Story: The 1-Packet-Per-Second Mystery**
- **Symptoms**: Applications crawl over Direct Connect.
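- **Root Cause**: BGP `keepalive` left at 60s (a common router default), which the original diagnosis blamed for route flapping and slow recovery.
- **Fix**: BGP timers are configured on the customer router and negotiated with AWS. The AWS CLI has no timer option, so set them on-prem and use the CLI only to confirm session state:
```bash
# Confirm BGP session state for the virtual interface
aws directconnect describe-virtual-interfaces --virtual-interface-id dxvif-123 --query 'virtualInterfaces[].bgpPeers[].[bgpPeerId,bgpPeerState,bgpStatus]'
# Timers are set on the customer router, e.g. Cisco IOS: neighbor 192.0.2.1 timers 10 30
```
- **Lesson**: Do not rely on default timers. Set `keepalive = 10s` and `hold time = 30s` on the router, or configure BFD for faster failure detection.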
### **Pro Tip: The MTU Mismatch**
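- **Mistake**: Assuming jumbo frames (9001 MTU) work end to end. Direct Connect virtual interfaces default to 1500 MTU, and jumbo frames must be enabled on the virtual interface and supported on every on-prem hop.
- **Outcome**: Packet fragmentation or dropped oversized packets kill throughput.
- **Fix**: Unless jumbo frames are enabled end to end, hard-set MTU to **1500** on on-prem routers: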
```bash
# Linux example
ip link set dev eth0 mtu 1500
```
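
The cost of a mismatch can be estimated with simple arithmetic. The sketch below counts IPv4 fragments for a packet crossing a smaller-MTU hop, assuming a 20-byte IP header with no options; the function name is hypothetical.

```python
import math

IPV4_HEADER_BYTES = 20  # assumption: no IP options


def ipv4_fragment_count(packet_bytes, path_mtu):
    """Number of IPv4 fragments needed to carry one packet across path_mtu."""
    if packet_bytes <= path_mtu:
        return 1
    payload = packet_bytes - IPV4_HEADER_BYTES
    # Every fragment except the last carries a payload that is a multiple of 8 bytes.
    per_fragment = (path_mtu - IPV4_HEADER_BYTES) // 8 * 8
    return math.ceil(payload / per_fragment)


print(ipv4_fragment_count(9001, 1500))  # jumbo frame hitting a 1500-byte hop
print(ipv4_fragment_count(1500, 1500))  # fits without fragmentation
```

In practice most TCP traffic sets the Don't Fragment bit, so oversized packets are dropped and the sender has to rely on path MTU discovery rather than fragmentation, which is often worse for throughput.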
---
## **6. The Ultimate Troubleshooting Checklist**
### **Before Making Changes:**
1. **Backup Configs**:
```bash
aws ec2 describe-security-groups --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions}' > sg-backup.json
```
2. **Enable Flow Logs**:
```bash
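# Log group name and IAM role ARN are placeholders; CloudWatch Logs delivery requires both
aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-123 --traffic-type ALL --log-destination-type cloud-watch-logs --log-group-name vpc-flow-logs --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role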
```
3. **Test with Canary**: Deploy changes to one AZ/subnet first.
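
A backup is most useful when it can be compared against the live state. The sketch below diffs two snapshots in the shape produced by the `describe-security-groups` backup query (a list of objects with `GroupId` and `IpPermissions`); the `changed_groups` helper and the sample data are hypothetical.

```python
import json


def changed_groups(before, after):
    """Compare two security group snapshots and report added, removed and modified groups."""
    def index(snapshot):
        # Serialize rules with sorted keys so dict key order does not cause false diffs.
        return {g["GroupId"]: json.dumps(g["IpPermissions"], sort_keys=True) for g in snapshot}

    old, new = index(before), index(after)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(g for g in old.keys() & new.keys() if old[g] != new[g]),
    }


# Illustrative snapshots; in practice load them with json.load() from the backup files.
before = [
    {"GroupId": "sg-1", "IpPermissions": [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443}]},
    {"GroupId": "sg-2", "IpPermissions": []},
]
after = [
    {"GroupId": "sg-1", "IpPermissions": [{"IpProtocol": "tcp", "FromPort": 8443, "ToPort": 8443}]},
    {"GroupId": "sg-3", "IpPermissions": []},
]

print(changed_groups(before, after))
```

The comparison is order-sensitive for the rule lists themselves, so a reordered but otherwise identical `IpPermissions` array would be reported as modified.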
### **When Things Break:**
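1. **Rollback Fast**: Check out the last known-good Terraform commit and run `terraform apply`, or revert with the CLI. Avoid `terraform apply -replace` for rollback, since it destroys and recreates the targeted resource.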
2. **SSM Session Manager**: Access instances without SSH (bypass broken SGs).
3. **CloudTrail Forensics**:
```bash
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteSecurityGroup
```
---
## **Final Wisdom**
- **Document Your "Murder Mystery" Stories**: Every outage teaches something.
- **Automate Recovery**: Use Lambda + EventBridge to auto-rollback NACL changes.
- **Pressure-Test Resiliency**: Run GameDays (e.g., randomly kill NAT Gateways).