Add tech_docs/cloud/aws_notes.md
When troubleshooting live production environments, **minimizing disruption** is critical. The sections below cover where to exercise caution and which practices help avoid downtime or broken connections.

---

### **1. High-Risk Actions That Can Break Traffic**
#### **A. Security Group Rule Modifications**
- **Risk**: Removing or narrowing a rule blocks new connections immediately and can also cut established ones.
- **Example**:
  - Revoking an inbound `HTTPS (443)` rule blocks new sessions at once. Established flows that are tracked may continue until they close or time out, while untracked flows are interrupted immediately.
  - Changing egress rules can disrupt outbound API calls.
- **Mitigation**:
  - **Stage changes**: Add new rules before removing old ones.
  - **Use temporary rules**: Add a narrowly scoped rule for the duration of the investigation (e.g., `aws ec2 authorize-security-group-ingress --cidr 1.2.3.4/32 --port 443 --protocol tcp --group-id sg-123`). Security group rules do not expire on their own, so remove it afterwards with `aws ec2 revoke-security-group-ingress`.

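
The "add before you remove" ordering can be sketched as a small planning step. This is a minimal illustration, not an AWS API: the `Rule` tuple and `plan_staged_change` function are hypothetical names, and a rule is reduced to protocol, port, and CIDR.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """One ingress rule, reduced to the fields this sketch compares (hypothetical type)."""
    protocol: str
    port: int
    cidr: str


def plan_staged_change(current, desired):
    """Return ordered steps in which every 'authorize' precedes any 'revoke'."""
    to_add = sorted(desired - current)      # rules that must exist after the change
    to_remove = sorted(current - desired)   # rules that should disappear
    return [("authorize", r) for r in to_add] + [("revoke", r) for r in to_remove]


current = {Rule("tcp", 443, "0.0.0.0/0")}
desired = {Rule("tcp", 443, "10.0.0.0/8")}
steps = plan_staged_change(current, desired)
for action, rule in steps:
    print(action, rule.protocol, rule.port, rule.cidr)
```

Each step maps onto one `authorize-security-group-ingress` or `revoke-security-group-ingress` call, so traffic covered by both the old and new rule is never left without a matching allow.
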
#### **B. Network ACL (NACL) Updates**
- **Risk**: NACLs are stateless, so every packet is checked against the current rules and a new deny takes effect on **existing connections** immediately.
- **Example**:
  - Adding a deny rule for `10.0.1.0/24` kills active TCP sessions from that range, provided its rule number is lower than the matching allow.
- **Mitigation**:
  - **Test in non-prod first**.
  - **Modify NACLs during low-traffic windows**.

#### **C. Route Table Changes**
- **Risk**: Misrouting traffic (e.g., removing a NAT Gateway route).
- **Example**:
  - Deleting `0.0.0.0/0 → igw-123` cuts internet access to and from the public subnets that use the table.
- **Mitigation**:
  - **Pre-validate routes**:
    ```bash
    # Confirm the current routes and their targets before changing anything
    aws ec2 describe-route-tables --route-table-ids rtb-123 --query 'RouteTables[*].Routes'
    ```
  - **Swap routes in one call**: VPC route tables have no weighted routes, so prefer `aws ec2 replace-route` over delete-then-create, and keep the reverse command ready for rollback.

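
To pre-validate a route change offline, the saved `describe-route-tables` output can be checked with a longest-prefix match. This is a minimal sketch under stated assumptions: `resolve_route` is a hypothetical helper, it handles only IPv4 CIDR routes, and it ignores route state (for example `blackhole`).

```python
import ipaddress


def resolve_route(routes, destination):
    """Return the target of the most specific route covering destination, or None."""
    dest = ipaddress.ip_address(destination)
    best = None
    for route in routes:
        cidr = route.get("DestinationCidrBlock")
        if cidr is None:  # IPv6 and prefix-list routes are skipped in this sketch
            continue
        network = ipaddress.ip_network(cidr)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            target = (route.get("GatewayId")
                      or route.get("NatGatewayId")
                      or route.get("TransitGatewayId"))
            best = (network, target)
    return best[1] if best else None


# Sample shaped like one entry of RouteTables[*].Routes
routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-123"},
]
print(resolve_route(routes, "10.0.1.5"))
print(resolve_route(routes, "8.8.8.8"))
print(resolve_route(routes[:1], "8.8.8.8"))  # default route removed
```

Running the same check against the planned route list shows, before anything is applied, which destinations would lose their path.
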
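#### **D. NAT Gateway Replacement**
- **Risk**: Swapping NAT Gateways breaks long-lived connections (e.g., SFTP, WebSockets), because connection state is not carried over to the new gateway.
- **Mitigation**:
  - **Plan for Elastic IPs**: An Elastic IP can be associated with only one NAT Gateway at a time. Reusing the same address means deleting the old gateway first, so if downtime is unacceptable, allocate a new Elastic IP and update partner allowlists before the cutover.
  - **Warm standby**: Deploy the new NAT Gateway before decommissioning the old one.

---
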
### **2. Safe Troubleshooting Techniques**
#### **A. Passive Monitoring (Little to No Impact)**
- **Flow Logs**: Query logs without touching infrastructure.
  ```sql
  # CloudWatch Logs Insights (GUI)
  fields @timestamp, srcAddr, dstAddr, action
  | filter dstAddr = "10.0.2.5" and action = "REJECT"
  ```
- **VPC Traffic Mirroring**: Copy traffic to a monitoring instance. Packet processing in production is untouched, but mirrored traffic counts toward the source instance's bandwidth.

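
When flow logs are exported as raw text rather than queried in Logs Insights, the same filter can be applied offline. The sketch below assumes the default (version 2) flow log format; the function names and sample records are hypothetical.

```python
# Field order of the default (version 2) VPC Flow Logs record format
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_log(line):
    """Split one record into a dict; return None if the field count is unexpected."""
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


def rejected_to(lines, dstaddr):
    """Return parsed records that were rejected on their way to dstaddr."""
    records = (parse_flow_log(line) for line in lines)
    return [r for r in records if r and r["dstaddr"] == dstaddr and r["action"] == "REJECT"]


# Hypothetical sample records
sample = [
    "2 123456789012 eni-0abc 10.0.1.10 10.0.2.5 49152 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc 10.0.1.11 10.0.2.5 49153 22 6 1 40 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0abc - - - - - - - 1700000000 1700000060 - NODATA",
]
hits = rejected_to(sample, "10.0.2.5")
for hit in hits:
    print(hit["srcaddr"], "->", hit["dstaddr"], "port", hit["dstport"])
```
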
#### **B. Non-Destructive Testing**
- **Packet Captures on Test Instances**:
  ```bash
  # No service restart needed; the interface name varies by instance type and OS (e.g., ens5)
  sudo tcpdump -i eth0 -w /tmp/capture.pcap host 10.0.1.10
  ```
- **Canary Deployments**: Test changes on 1% of traffic (e.g., weighted ALB target groups).

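#### **C. Connection-Preserving Changes**
- **Security Groups**:
  - Add the replacement rules before deleting the old ones. Security group rules are allow-only and have no priority or rule numbers, so the order in which they are added does not matter.
- **NACLs**:
  - Temporarily `ALLOW` the ephemeral port range during changes. `32768-60999` is the Linux default; other clients use different ranges, so `1024-65535` is the safer span when client types are mixed.

---
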
### **3. Redundancy Patterns to Reduce Risk**
| **Scenario**               | **Failover Strategy**                          |
|----------------------------|------------------------------------------------|
| **NAT Gateway Failure**    | Deploy NAT Gateway per AZ + test failover.     |
| **Route Table Corruption** | Use version-controlled Terraform rollback.     |
| **SG Lockout**             | Pre-configure backup admin access (e.g., SSM). |

---

### **4. Worst-Case Recovery Plan**
1. **Rollback Immediately**:
   - Revert NACLs/SGs to last-known-good state.
   ```bash
   # Re-point the subnet at the last-known-good NACL (the call returns a new association ID)
   aws ec2 replace-network-acl-association --association-id aclassoc-123 --network-acl-id acl-456
   ```
2. **Bypass Troubleshooting**:
   - Use **AWS Systems Manager (SSM)** to debug instances without SSH (no inbound SG changes needed; the instance still needs outbound HTTPS to the SSM endpoints).
3. **Post-Mortem**:
   - Check CloudTrail for who made changes:
   ```bash
   aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-123
   ```

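
For the post-mortem step, the JSON returned by `lookup-events` can be reduced to a timeline of who did what. This is a rough sketch: `summarize_events` is a hypothetical helper, the sample payload is invented, and it assumes `EventTime` is an ISO 8601 string (other CLI configurations may emit epoch numbers instead).

```python
import json


def summarize_events(lookup_output):
    """Return (time, user, event name) tuples, oldest first."""
    events = json.loads(lookup_output)["Events"]
    rows = [(e["EventTime"], e.get("Username", "unknown"), e["EventName"]) for e in events]
    return sorted(rows)  # ISO 8601 strings in the same offset sort chronologically


# Hypothetical payload shaped like `aws cloudtrail lookup-events --output json`
sample = json.dumps({"Events": [
    {"EventName": "RevokeSecurityGroupIngress", "Username": "alice",
     "EventTime": "2024-01-02T03:04:05+00:00"},
    {"EventName": "AuthorizeSecurityGroupIngress", "Username": "bob",
     "EventTime": "2024-01-01T00:00:00+00:00"},
]})
timeline = summarize_events(sample)
for when, who, what in timeline:
    print(when, who, what)
```
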

---

### **Key Takeaways**
✅ **Avoid Live Changes**: Prefer read-only tools (Flow Logs, CloudWatch).
✅ **Stage Changes**: Test in non-prod, then deploy with canaries.
✅ **Preserve State**: Never drop NACL/SG rules without redundancy.
✅ **Automate Rollbacks**: Use Terraform/CloudFormation for quick recovery.

**Pro Tip**: For critical systems, **run chaos engineering tests** (e.g., simulate NAT failure) during off-hours to validate resilience.

---

The guide below distills **lessons learned, hidden pitfalls, and pro tips** from production AWS networking incidents that are rarely found in the docs.

---

# **AWS Networking War Stories: The Unwritten Guide**
*"Good judgment comes from experience. Experience comes from bad judgment."*

---

## **1. Security Groups (SGs): The Silent Killers**
### **War Story: The Case of the Phantom Timeouts**
- **Symptoms**: Intermittent HTTP timeouts between microservices.
- **Root Cause (as recorded)**: Attributed at the time to overlapping SG rules with different `description` fields but identical `IP permissions`, which AWS was thought to dedupe silently. Treat this diagnosis with caution: SG rules are allow-only and evaluated as a union, so overlapping rules alone should not drop traffic. Rule out NACLs and connection-tracking limits before accepting it.
- **Fix**:
  ```bash
  # Audit duplicate rules (CLI reveals what GUI hides)
  aws ec2 describe-security-groups --query 'SecurityGroups[*].IpPermissions' | jq '.[] | group_by(.FromPort, .ToPort, .IpProtocol, .IpRanges)[] | select(length > 1)'
  ```
- **Lesson**: Never trust the GUI alone; use the CLI to audit SGs.

### **Pro Tip: The "Deny All" Egress Trap**
- **Mistake**: Setting `egress = []` (or omitting `egress` entirely) on a Terraform `aws_security_group`. Terraform removes the allow-all egress rule that AWS creates by default, so no outbound traffic is allowed.
- **Outcome**: Instances lose SSM, patch management, and API connectivity.
- **Fix**: Always explicitly allow:
  ```hcl
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"] # Or restrict to necessary IPs
  }
  ```

---

## **2. NACLs: The Stateless Nightmare**
### **War Story: The 5-Minute Outage**
- **Symptoms**: Database replication breaks after a "minor" NACL update.
- **Root Cause**: NACL rule #100 allowed `TCP/3306`, but rule #200 denied the ephemeral ports (`32768-60999`), so reply packets never made it back.
- **Fix**:
  ```bash
  # Allow ephemeral ports INBOUND for responses (rule 150 is evaluated before the deny at 200)
  aws ec2 create-network-acl-entry --network-acl-id acl-123 --rule-number 150 --protocol tcp --port-range From=32768,To=60999 --cidr-block 10.0.1.0/24 --rule-action allow --ingress
  ```
- **Lesson**: NACLs need **mirror rules** for ingress/egress. Test with `telnet` before deploying.

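
The outage above can be reproduced on paper with a small first-match evaluator. This is a simplified model, not AWS's implementation: `NaclRule` and `evaluate` are hypothetical names, only inbound TCP is modeled, and the deny at rule 200 is assumed to cover `0.0.0.0/0`.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int
    action: str      # "allow" or "deny"
    cidr: str        # source CIDR for inbound rules
    port_from: int
    port_to: int


def evaluate(rules, src_ip, dst_port):
    """Return the action of the lowest-numbered matching rule; deny if none match."""
    src = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = src in ipaddress.ip_network(rule.cidr)
        in_ports = rule.port_from <= dst_port <= rule.port_to
        if in_cidr and in_ports:
            return rule.action
    return "deny"  # mirrors the implicit final "*" deny entry


broken = [
    NaclRule(100, "allow", "10.0.1.0/24", 3306, 3306),
    NaclRule(200, "deny", "0.0.0.0/0", 32768, 60999),
]
fixed = broken + [NaclRule(150, "allow", "10.0.1.0/24", 32768, 60999)]

# A reply from the database (10.0.1.10) arrives on the client's ephemeral port 40000.
print(evaluate(broken, "10.0.1.10", 40000))
print(evaluate(fixed, "10.0.1.10", 40000))
```

Because the reply is addressed to an ephemeral port rather than 3306, it only passes once an allow with a lower rule number than the deny covers that range.
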
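### **Pro Tip: The Rule-Order Bomb**
- **Mistake**: Adding a `deny` rule at #50 *after* allowing at #100.
- **Outcome**: Traffic silently drops. Rules are evaluated in ascending rule-number order and the first match wins, regardless of when each rule was created.
- **Fix**: Use `describe-network-acls` to audit rule ordering:
  ```bash
  # List every entry in evaluation order instead of inspecting a single rule number
  aws ec2 describe-network-acls --network-acl-ids acl-123 --query 'sort_by(NetworkAcls[0].Entries, &RuleNumber)[].[RuleNumber,Egress,RuleAction,Protocol,CidrBlock]' --output table
  ```

---
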
## **3. NAT Gateways: The $0.045/hr Landmine**
### **War Story: The 4 AM Bill Shock**
- **Symptoms**: $3k/month bill from "idle" NAT Gateways.
- **Root Cause**: Leftover NAT Gateways in unused AZs (auto-created by Terraform).
- **Fix**:
  ```bash
  # List available NAT Gateways with their subnet and VPC (a NAT Gateway always has a subnet)
  aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --query 'NatGateways[].[NatGatewayId,SubnetId,VpcId]' --output text
  # A gateway that no route table points at is a removal candidate (nat-123 is a placeholder ID)
  aws ec2 describe-route-tables --filters "Name=route.nat-gateway-id,Values=nat-123" --query 'RouteTables[].RouteTableId'
  ```
- **Lesson**: Always tag NAT Gateways with `Owner` and `Expiry`.

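
A quick back-of-the-envelope check puts the bill in perspective. The sketch below uses the $0.045/hr figure from the heading and assumes a 730-hour month; it covers the hourly charge only, and per-GB data processing charges are deliberately left out. The function name is hypothetical.

```python
HOURLY_RATE_USD = 0.045   # figure from the section heading; actual rates vary by region
HOURS_PER_MONTH = 730     # assumption: average month length


def monthly_hourly_charge(gateway_count, hourly_rate=HOURLY_RATE_USD, hours=HOURS_PER_MONTH):
    """Hourly charge only, in USD; data processing charges are excluded."""
    return round(gateway_count * hourly_rate * hours, 2)


per_gateway = monthly_hourly_charge(1)
gateways_for_3k = 3000 / per_gateway
print(per_gateway)
print(round(gateways_for_3k, 1))
```

At roughly $33 per gateway per month, hourly charges alone would need around 91 idle gateways to reach $3k, so a bill of that size probably also includes data processing charges or a large fleet of forgotten gateways.
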
### **Pro Tip: The TCP Connection Black Hole**
- **Mistake**: Replacing a NAT Gateway without draining connections.
- **Outcome**: Active sessions (SSH, RDP, DB) hang until TCP timeout (30+ mins).
- **Fix**:
  - **Before replacement**: Reduce TCP timeouts on clients and enable TCP keepalives so dead sessions are detected quickly.
  - **Plan for reconnects**: NAT Gateway connection state cannot be moved to another gateway, and a Network Load Balancer does not change that for outbound flows. Schedule the swap in a maintenance window and rely on client retry logic.

---

## **4. VPC Peering: The Cross-Account Trap**
### **War Story: The DNS That Wasn’t**
- **Symptoms**: EC2 instances can’t resolve the peered VPC’s private hosted zones.
- **Root Cause**: Peering doesn’t auto-share Route 53 Private Hosted Zones.
- **Fix**:
  ```bash
  # Step 1 (account that owns the zone): authorize the peer VPC
  aws route53 create-vpc-association-authorization --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
  # Step 2 (account that owns the peer VPC): complete the association
  aws route53 associate-vpc-with-hosted-zone --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456
  ```
- **Lesson**: Test **DNS resolution** early in peering setups.

### **Pro Tip: The Overlapping CIDR Silent Fail**
- **Mistake**: Peering `10.0.0.0/16` with another `10.0.0.0/16`.
- **Outcome**: AWS refuses to create a peering connection between VPCs whose CIDR blocks match or overlap, so the conflict tends to surface late, when the address plan is hard to change.
- **Fix**: Always design non-overlapping CIDRs (e.g., `10.0.0.0/16` + `10.1.0.0/16`).

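
Overlaps are easy to catch at design time with the standard library. In this minimal sketch the VPC names and CIDRs are hypothetical, and `overlapping_pairs` is an illustrative helper rather than part of any AWS tooling.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs):
    """Return name pairs whose CIDR blocks overlap, in sorted name order."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


# Hypothetical address plan
vpcs = {"prod": "10.0.0.0/16", "staging": "10.1.0.0/16", "legacy": "10.0.128.0/17"}
conflicts = overlapping_pairs(vpcs)
print(conflicts)
```
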

---

## **5. Direct Connect: The BGP Rollercoaster**
### **War Story: The 1-Packet-Per-Second Mystery**
- **Symptoms**: Applications crawl over Direct Connect.
- **Root Cause**: BGP `keepalive` left at 60s (a common router default), which was blamed for route flapping.
- **Fix**:
  ```bash
  # BGP timers are set on the customer router, not through the AWS CLI.
  # From the AWS side, confirm that the peer stays up after the router change:
  aws directconnect describe-virtual-interfaces --virtual-interface-id dxvif-123 --query 'virtualInterfaces[].bgpPeers[].[bgpPeerId,bgpPeerState,bgpStatus]'
  ```
- **Lesson**: Override defaults on the router: set `keepalive = 10s` and `holddown = 30s`.

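### **Pro Tip: The MTU Mismatch**
- **Mistake**: Assuming jumbo frames (9001 MTU) work end to end. Direct Connect carries them only when jumbo frames are enabled on the virtual interface (9001 on private VIFs, 8500 on transit VIFs), and VPN or internet gateway paths are limited to 1500.
- **Outcome**: Packet fragmentation kills throughput.
- **Fix**: Match the MTU across the whole path. When in doubt, hard-set it to **1500** on on-prem routers:
  ```bash
  # Linux example
  ip link set dev eth0 mtu 1500
  ```
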
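
The cost of a mismatch can be estimated with simple header arithmetic. The sketch assumes IPv4 and TCP headers without options (20 bytes each) and that fragmentation is permitted; when the Don't Fragment bit is set, oversized packets are dropped instead and the sender depends on path MTU discovery. Function names are hypothetical.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def tcp_mss(mtu):
    """Largest TCP payload that fits in one packet at this MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fragment_count(packet_size, path_mtu):
    """Number of IPv4 fragments needed for a packet crossing a link with path_mtu."""
    if packet_size <= path_mtu:
        return 1
    payload = packet_size - IPV4_HEADER
    # Every fragment except the last carries a payload that is a multiple of 8 bytes
    per_fragment = (path_mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


print(tcp_mss(1500), tcp_mss(9001))
print(fragment_count(9001, 1500))
```
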

---

## **6. The Ultimate Troubleshooting Checklist**
### **Before Making Changes:**
1. **Backup Configs**:
   ```bash
   # Capture both ingress and egress rules so the backup is enough to restore from
   aws ec2 describe-security-groups --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions,IpPermissionsEgress:IpPermissionsEgress}' > sg-backup.json
   ```
2. **Enable Flow Logs**:
   ```bash
   # The log group name and IAM role ARN are placeholders; CloudWatch Logs delivery needs both
   aws ec2 create-flow-logs --resource-type VPC --resource-ids vpc-123 --traffic-type ALL --log-destination-type cloud-watch-logs --log-group-name vpc-flow-logs --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role
   ```
3. **Test with Canary**: Deploy changes to one AZ/subnet first.

### **When Things Break:**
1. **Rollback Fast**: Check out the last-known-good Terraform configuration and run `terraform apply`, or revert with the CLI. (`terraform apply -replace` recreates a resource; it does not roll anything back.)
2. **SSM Session Manager**: Access instances without SSH (bypass broken inbound SG rules).
3. **CloudTrail Forensics**:
   ```bash
   aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteSecurityGroup
   ```

---

## **Final Wisdom**
- **Document Your "Murder Mystery" Stories**: Every outage teaches something.
- **Automate Recovery**: Use Lambda + EventBridge to auto-rollback NACL changes.
- **Pressure-Test Resiliency**: Run GameDays (e.g., randomly kill NAT Gateways).