Update tech_docs/cloud/aws_notes.md

2025-07-20 21:02:24 -05:00
parent 45bc59ac44
commit 133b478404


@@ -1,3 +1,194 @@
---
# **AWS Networking: The Production Survival Guide**
*Battle-tested strategies for troubleshooting and maintaining resilient networks*
---
## **I. Flow Log Mastery: The GUI-CLI Hybrid Approach**
### **1. Enabling Flow Logs (GUI Method)**
**Steps:**
1. Navigate to **VPC Dashboard** → Select target VPC → **Actions** → **Create Flow Log**
2. Configure:
- **Filter**: `ALL` (full visibility), `REJECT` (security focus), or `ACCEPT` (performance)
- **Destination**:
- CloudWatch Logs for real-time analysis
- S3 for compliance/archiving
- **Advanced**: Add custom fields like `${tcp-flags}` for packet analysis
**Pro Tip:**
Enable flow logs in all environments. They're cheap insurance, and they record only traffic that occurs after creation, so turn them on before an incident rather than during one.
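The console steps above correspond to a single EC2 `CreateFlowLogs` call. The following is a minimal sketch, assuming boto3 is used for automation; the helper name `build_flow_log_request` and every identifier (VPC ID, log group, role ARN) are hypothetical. It only assembles the arguments, so they can be reviewed before anything is created.

```python
def build_flow_log_request(vpc_id, log_group, role_arn, traffic_type="ALL"):
    """Assemble keyword arguments for the EC2 CreateFlowLogs API call."""
    if traffic_type not in ("ALL", "ACCEPT", "REJECT"):
        raise ValueError(f"unsupported traffic type: {traffic_type}")
    return {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": traffic_type,
        "LogDestinationType": "cloud-watch-logs",
        "LogGroupName": log_group,
        "DeliverLogsPermissionArn": role_arn,
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only
    request = build_flow_log_request(
        "vpc-0123456789abcdef0",
        "/vpc/flow-logs",
        "arn:aws:iam::123456789012:role/flow-logs-role",
        traffic_type="REJECT",
    )
    print(request)
    # With credentials configured, the request could be sent with:
    #   import boto3
    #   boto3.client("ec2").create_flow_logs(**request)
```

Keeping the request as plain data makes it easy to diff across environments before applying it.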
### **2. CloudWatch Logs Insights Deep Dive**
**Key Queries:**
```sql
# Run each query separately; Logs Insights uses "#" for comments

# Basic traffic analysis: top HTTPS talkers by bytes
fields @timestamp, srcAddr, dstAddr, action, bytes
| filter dstPort = 443
| stats sum(bytes) as totalTraffic by srcAddr
| sort totalTraffic desc

# Security investigation: rejected SSH attempts
fields @timestamp, srcAddr, dstAddr, dstPort
| filter action = "REJECT" and dstPort = 22
| limit 50

# NAT Gateway health check: traffic from 10.0.1.x to 8.8.8.0/24 in 5-minute bins
fields @timestamp, srcAddr, dstAddr
| filter srcAddr like "10.0.1." and isIpv4InSubnet(dstAddr, "8.8.8.0/24")
| stats count() by bin(5m)
```
**Visualization Tricks:**
1. Use **time series** graphs to spot traffic patterns
2. Create **bar charts** of top talkers
3. Save frequent queries as dashboard widgets
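When logs are delivered to S3 instead of CloudWatch, the same "top talkers" analysis can be done offline. The sketch below assumes records in the default (version 2) flow log format; the function names and sample records are made up for illustration.

```python
from collections import Counter

# Field order of the default (version 2) flow log record format
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_record(line):
    """Split one space-separated record into a dict, or None if malformed."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def top_talkers(lines, dst_port=443):
    """Sum bytes per source address for traffic to dst_port, largest first."""
    totals = Counter()
    for line in lines:
        rec = parse_record(line)
        # NODATA / SKIPDATA records carry "-" in place of the traffic fields
        if rec is None or rec["log_status"] != "OK":
            continue
        if int(rec["dstport"]) == dst_port:
            totals[rec["srcaddr"]] += int(rec["bytes"])
    return totals.most_common()


if __name__ == "__main__":
    # Made-up sample records
    sample = [
        "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 443 6 10 8400 1700000000 1700000060 ACCEPT OK",
        "2 123456789012 eni-0a1b2c3d 10.0.1.7 10.0.2.9 49153 443 6 4 1200 1700000000 1700000060 ACCEPT OK",
        "2 123456789012 eni-0a1b2c3d - - - - - - - 1700000000 1700000060 - NODATA",
    ]
    for src, total in top_talkers(sample):
        print(src, total)
```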
---
## **II. High-Risk Operations Playbook**
### **Danger Zone: Actions That Break Connections**
| Operation | Risk | Safe Approach |
|-----------|------|---------------|
| SG Modifications | Removing a rule can drop active connections | Add new rules first, then remove old |
| NACL Updates | Stateless: changes apply to existing flows immediately | Test in staging first |
| Route Changes | Misroutes critical traffic | Build and verify a replacement route table, then swap the subnet association (VPC route tables have no weighted routes) |
| NAT Replacement | Breaks long-lived sessions | Warm standby + EIP preservation |
**Real-World Example:**
A financial firm caused a 37-minute outage by modifying NACLs during trading hours. The fix? Now they:
1. Test all changes in a replica environment
2. Implement change windows
3. Run `terraform plan` as a dry run before every `terraform apply`
### **Safe Troubleshooting Techniques**
1. **Passive Monitoring**
- Flow logs (connection metadata)
- Traffic mirroring (packet-level)
- CloudWatch Metrics (trend spotting)
2. **Non-Destructive Testing**
```bash
# Packet capture without service impact
# -C 100 rotates the capture file at roughly 100 MB; -W 5 keeps at most 5 files
sudo tcpdump -i eth0 -C 100 -W 5 -w debug.pcap 'host 10.0.1.5 and port 3306'
```
3. **Change Management**
- Canary deployments (1% traffic first)
- Automated rollback hooks
- SSM Session Manager for emergency access
---
## **III. War Stories: Lessons From the Trenches**
### **1. The Case of the Vanishing Packets**
**Symptoms:** Intermittent database timeouts
**Root Cause:** Overlapping security group rules being silently deduped
**Fix:**
```bash
# Find rules within each security group that share ports and CIDR ranges
aws ec2 describe-security-groups \
  --query 'SecurityGroups[*].IpPermissions' \
  --output json \
  | jq '.[] | group_by(.FromPort, .ToPort, .IpRanges)[] | select(length > 1)'
```
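The `jq` filter above only groups rules whose ports and ranges are identical. To catch rules that merely overlap (for example a `/24` inside a `/16`), a small script can compare CIDRs directly. This is a sketch, assuming input shaped like one group's `IpPermissions` list from `describe-security-groups`; the function name and sample rules are hypothetical, and only IPv4 `IpRanges` are considered.

```python
import ipaddress
from itertools import combinations


def overlapping_rules(ip_permissions):
    """Return pairs of flattened rules whose protocol, ports and CIDRs overlap."""
    flat = []
    for perm in ip_permissions:
        for rng in perm.get("IpRanges", []):
            flat.append((
                perm.get("IpProtocol"),
                perm.get("FromPort"),   # absent when the protocol is "-1" (all)
                perm.get("ToPort"),
                ipaddress.ip_network(rng["CidrIp"]),
            ))

    pairs = []
    for a, b in combinations(flat, 2):
        proto_overlap = a[0] == b[0] or "-1" in (a[0], b[0])
        if None in (a[1], a[2], b[1], b[2]):
            ports_overlap = True  # a rule without ports covers all ports
        else:
            ports_overlap = a[1] <= b[2] and b[1] <= a[2]
        if proto_overlap and ports_overlap and a[3].overlaps(b[3]):
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical rules for illustration
    rules = [
        {"IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
        {"IpProtocol": "tcp", "FromPort": 3000, "ToPort": 4000,
         "IpRanges": [{"CidrIp": "10.0.1.0/24"}]},
    ]
    for a, b in overlapping_rules(rules):
        print(a, "overlaps", b)
```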
### **2. The $15,000 NAT Surprise**
**Symptoms:** Unexpected bill spike
**Discovery:**
```bash
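# List available NAT Gateways (this call alone cannot tell whether one is idle)
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query 'NatGateways[*].[NatGatewayId,SubnetId,VpcId]' \
  --output table
# Then check each gateway's traffic in CloudWatch (AWS/NATGateway namespace),
# for example BytesOutToDestination; sums near zero suggest an idle gateway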
```
**Prevention:** Tag all resources with Owner and Purpose
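Whether a NAT Gateway is idle is usually judged from its CloudWatch traffic metrics rather than from its EC2 description. The sketch below assumes datapoints shaped like the `Datapoints` list returned by CloudWatch `GetMetricStatistics` with the `Sum` statistic (for instance on `BytesOutToDestination`); the function name and threshold are illustrative choices, not an AWS-defined rule.

```python
def is_idle(datapoints, threshold_bytes=0):
    """True when total traffic across all datapoints is at or below the threshold.

    An empty list (no data reported for the period) is treated as idle.
    """
    total = sum(point.get("Sum", 0) for point in datapoints)
    return total <= threshold_bytes


if __name__ == "__main__":
    # Made-up datapoints for two hypothetical gateways
    quiet = [{"Sum": 0.0}, {"Sum": 0.0}]
    busy = [{"Sum": 52428800.0}, {"Sum": 1048576.0}]
    print("quiet gateway idle:", is_idle(quiet))
    print("busy gateway idle:", is_idle(busy))
```

Pick the lookback window deliberately: a gateway used only by a weekly batch job looks idle over a single day.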
### **3. The Peering Paradox**
**Issue:** Cross-account VPC peering with broken DNS
**Solution:**
```bash
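# Step 1 (account that owns the private hosted zone): authorize the association
aws route53 create-vpc-association-authorization \
  --hosted-zone-id Z123 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-456
# Step 2 (account that owns the VPC): complete the association
aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z123 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-456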
```
---
## **IV. The Resiliency Toolkit**
### **Must-Have Automation**
1. **Auto-Rollback Systems**
```python
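# Lambda function reacting to CloudTrail events for dangerous changes
def revert_nacl(nacl_id):
    # Hypothetical placeholder: restore the NACL from your own saved configuration
    raise NotImplementedError(f"no restore procedure defined for {nacl_id}")

def lambda_handler(event, context):
    detail = event.get("detail", {})
    if detail.get("eventName") == "DeleteNetworkAcl":
        revert_nacl(detail["requestParameters"]["networkAclId"])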
```
2. **Chaos Engineering Tests**
- Scheduled NAT failure drills
- AZ isolation simulations
- Route table corruption tests
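The auto-rollback handler in item 1 needs an event source. One common wiring, assuming CloudTrail management events reach EventBridge, is a rule whose pattern matches the API calls of interest. The sketch below builds such a pattern; apart from `DeleteNetworkAcl`, the event names listed are additional EC2 actions chosen as an assumption, and the helper name is hypothetical.

```python
import json

# DeleteNetworkAcl comes from the handler above; the other two are assumptions
DANGEROUS_EVENTS = [
    "DeleteNetworkAcl",
    "DeleteNetworkAclEntry",
    "ReplaceNetworkAclEntry",
]


def build_event_pattern(event_names):
    """Build an EventBridge pattern matching EC2 API calls recorded by CloudTrail."""
    return {
        "source": ["aws.ec2"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["ec2.amazonaws.com"],
            "eventName": list(event_names),
        },
    }


if __name__ == "__main__":
    # Print the JSON to paste into the rule definition
    print(json.dumps(build_event_pattern(DANGEROUS_EVENTS), indent=2))
```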
### **The 5-Minute Recovery Checklist**
1. **Diagnose**
```bash
aws ec2 describe-network-interfaces --filters "Name=status,Values=available"
```
2. **Contain**
- Freeze CI/CD pipelines
- Swap problematic security groups off the affected interfaces (security groups cannot be disabled in place)
3. **Restore**
- Re-apply the last known-good Terraform configuration (Terraform has no built-in rollback command)
- Route table replacement
---
## **V. Pro Tips Archive**
### **Security Group Wisdom**
```hcl
# Terraform best practice
resource "aws_security_group" "example" {
  egress {
    # Never leave empty - defaults to deny all!
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"] # Restrict in prod
  }
}
```
### **NACL Gotchas**
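- Ephemeral ports must be allowed in both directions for return traffic; 32768-60999 is the Linux default, but the range depends on the client (1024-65535 covers NAT gateways and load balancers as well)
- Rule evaluation order matters (lowest number first; the first match wins)
- The default NACL allows all traffic; a custom NACL denies everything until you add rules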
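The evaluation-order and ephemeral-port points are easier to see in a small simulation. This is a simplified sketch of first-match-wins evaluation with an implicit final deny, not AWS's implementation; the rule dictionary layout is made up for illustration.

```python
import ipaddress


def evaluate_nacl(rules, src_ip, port, protocol="tcp"):
    """Return 'allow' or 'deny' for a packet; the lowest-numbered match wins."""
    addr = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        if rule["protocol"] not in (protocol, "all"):
            continue
        if addr not in ipaddress.ip_network(rule["cidr"]):
            continue
        low, high = rule["ports"]
        if low <= port <= high:
            return rule["action"]
    return "deny"  # implicit final "*" rule


if __name__ == "__main__":
    # Hypothetical inbound rules: HTTPS is allowed, return traffic is forgotten
    inbound = [
        {"number": 100, "protocol": "tcp", "cidr": "0.0.0.0/0",
         "ports": (443, 443), "action": "allow"},
    ]
    print("inbound 443:", evaluate_nacl(inbound, "203.0.113.10", 443))
    # Replies to outbound connections arrive on an ephemeral port and are dropped
    print("reply on 50000:", evaluate_nacl(inbound, "203.0.113.10", 50000))
```

Adding an inbound allow rule for the ephemeral range makes the second check pass, which is why a stateless NACL needs rules for both directions of a flow.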
### **Direct Connect Pro Tips**
- Set BGP timers to 10s keepalive/30s holddown
- Hardcode MTU to 1500
- Monitor with:
```bash
aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]'
```
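The CLI query above reports only the first BGP peer of each virtual interface. For alerting, it can help to check every peer. The sketch below assumes a response shaped like the output of `describe-virtual-interfaces`; the function name and sample data are hypothetical.

```python
def down_bgp_peers(response):
    """List (virtualInterfaceId, bgpPeerId, bgpStatus) for peers that are not 'up'."""
    problems = []
    for vif in response.get("virtualInterfaces", []):
        for peer in vif.get("bgpPeers", []):
            if peer.get("bgpStatus") != "up":
                problems.append((
                    vif.get("virtualInterfaceId"),
                    peer.get("bgpPeerId"),
                    peer.get("bgpStatus"),
                ))
    return problems


if __name__ == "__main__":
    # Made-up response for illustration
    sample = {
        "virtualInterfaces": [
            {"virtualInterfaceId": "dxvif-aaaa1111", "bgpPeers": [
                {"bgpPeerId": "dxpeer-1", "bgpStatus": "up"},
                {"bgpPeerId": "dxpeer-2", "bgpStatus": "down"},
            ]},
        ]
    }
    for vif_id, peer_id, status in down_bgp_peers(sample):
        print(f"{vif_id} {peer_id} is {status}")
```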
---
## **Final Checklist for Production Safety**
1. [ ] Enable flow logs in all VPCs
2. [ ] Document rollback procedures
3. [ ] Test failure scenarios regularly
4. [ ] Implement change controls
5. [ ] Tag all network resources
**Remember:** The best troubleshooting is avoiding problems through design. Use this guide as your playbook for building and maintaining bulletproof AWS networks.
---
You're absolutely right—**using the AWS Console (GUI) is often the fastest and most intuitive way to analyze Flow Logs**, especially for SMEs who need quick answers. Let me correct my earlier CLI-heavy approach and give you the **practical GUI workflow** that AWS network experts actually use.
---