Update tech_docs/cloud/aws_notes.md
# **AWS Networking: The Production Survival Guide**
*Battle-tested strategies for troubleshooting and maintaining resilient networks*

---

## **I. Flow Log Mastery: The GUI-CLI Hybrid Approach**
### **1. Enabling Flow Logs (GUI Method)**
**Steps:**
1. Navigate to **VPC Dashboard** → Select target VPC → **Actions** → **Create Flow Log**
2. Configure:
   - **Filter**: `ALL` (full visibility), `REJECT` (security focus), or `ACCEPT` (performance)
   - **Destination**:
     - CloudWatch Logs for real-time analysis
     - S3 for compliance/archiving
   - **Advanced**: Choose a custom log format to add fields such as `${tcp-flags}` for TCP flag analysis

**Pro Tip:**
Enable flow logs in all environments. They're cheap insurance, and they only capture traffic from the moment they are enabled, so turn them on before an incident rather than after.
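
The console steps above map onto a single API request. The following is a minimal sketch, assuming you would pass the resulting dictionary to boto3's `ec2.create_flow_logs`; the helper name `build_flow_log_request`, the VPC ID, and the bucket ARN are hypothetical, and nothing here contacts AWS.

```python
# Default (version 2) flow log fields, in their documented order.
DEFAULT_FIELDS = (
    "${version}", "${account-id}", "${interface-id}", "${srcaddr}", "${dstaddr}",
    "${srcport}", "${dstport}", "${protocol}", "${packets}", "${bytes}",
    "${start}", "${end}", "${action}", "${log-status}",
)


def build_flow_log_request(vpc_id, traffic_type="ALL", destination="cloud-watch-logs",
                           log_group=None, role_arn=None, bucket_arn=None,
                           extra_fields=()):
    """Return keyword arguments for ec2.create_flow_logs; makes no AWS call."""
    if traffic_type not in {"ALL", "ACCEPT", "REJECT"}:
        raise ValueError(f"unsupported filter: {traffic_type}")
    request = {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": traffic_type,
        "LogDestinationType": destination,
    }
    if destination == "cloud-watch-logs":
        if not (log_group and role_arn):
            raise ValueError("CloudWatch Logs needs a log group and an IAM role")
        request["LogGroupName"] = log_group
        request["DeliverLogsPermissionArn"] = role_arn
    elif destination == "s3":
        if not bucket_arn:
            raise ValueError("S3 needs a bucket ARN")
        request["LogDestination"] = bucket_arn
    else:
        raise ValueError(f"unsupported destination: {destination}")
    if extra_fields:
        # A custom format is the default field list plus the extra fields.
        request["LogFormat"] = " ".join(DEFAULT_FIELDS + tuple(extra_fields))
    return request


# Hypothetical identifiers, for illustration only.
request = build_flow_log_request(
    "vpc-0123456789abcdef0", traffic_type="REJECT", destination="s3",
    bucket_arn="arn:aws:s3:::example-flow-logs", extra_fields=("${tcp-flags}",),
)
print(request["LogFormat"])
```

Building the request separately from sending it makes the filter and destination choices easy to review before anything changes in the account.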

### **2. CloudWatch Logs Insights Deep Dive**
**Key Queries:**
```sql
# Run each query on its own; Logs Insights comments start with "#"

# Basic Traffic Analysis: top HTTPS talkers by bytes
fields @timestamp, srcAddr, dstAddr, action, bytes
| filter dstPort = 443
| stats sum(bytes) as totalTraffic by srcAddr
| sort totalTraffic desc

# Security Investigation: rejected SSH attempts
fields @timestamp, srcAddr, dstAddr, dstPort
| filter action = "REJECT" and dstPort = 22
| limit 50

# NAT Gateway Health Check: outbound flows from 10.0.1.x to 8.8.8.0/24
fields @timestamp, srcAddr, dstAddr
| filter srcAddr like "10.0.1." and isIpv4InSubnet(dstAddr, "8.8.8.0/24")
| stats count() by bin(5m)
```
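
When the logs land in S3 instead of CloudWatch, the same top-talker analysis can be done offline. This is a rough sketch, assuming records in the default (version 2) space-separated format; `parse_record`, `top_talkers`, and the sample lines are illustrative, not real traffic.

```python
from collections import Counter

# Field order of the default (version 2) flow log format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr", "srcport",
    "dstport", "protocol", "packets", "bytes", "start", "end", "action", "log_status",
)


def parse_record(line):
    """Split one default-format record into a dict keyed by field name."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def top_talkers(lines, dst_port=443):
    """Sum bytes per source address for one destination port, largest first."""
    totals = Counter()
    for line in lines:
        record = parse_record(line)
        # NODATA and SKIPDATA records carry "-" placeholders instead of numbers.
        if record["log_status"] != "OK":
            continue
        if int(record["dstport"]) == dst_port:
            totals[record["srcaddr"]] += int(record["bytes"])
    return totals.most_common()


# Made-up sample records.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 443 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.7 10.0.2.9 49153 443 6 4 1200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49154 443 6 2 600 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.7 10.0.2.9 49155 22 6 1 60 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d - - - - - - - 1700000000 1700000060 - NODATA",
]

print(top_talkers(SAMPLE))  # [('10.0.1.5', 9000), ('10.0.1.7', 1200)]
```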

**Visualization Tricks:**
1. Use **time series** graphs to spot traffic patterns
2. Create **bar charts** of top talkers
3. Save frequent queries as dashboard widgets

---

## **II. High-Risk Operations Playbook**
### **Danger Zone: Actions That Break Connections**
| Operation | Risk | Safe Approach |
|-----------|------|---------------|
| SG Modifications | Removing a rule can drop active connections | Add new rules first, then remove old |
| NACL Updates | Stateless: changes hit existing flows immediately | Test in staging first |
| Route Changes | Misroutes critical traffic | Prepare a replacement route table and swap the association (VPC route tables have no weighted routing) |
| NAT Replacement | Breaks long-lived sessions | Warm standby + EIP preservation |
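
The "add new rules first, then remove old" approach for security groups can be expressed as an ordering problem. The sketch below is one way to plan it, assuming rules are represented as `(protocol, from_port, to_port, cidr)` tuples; `plan_sg_migration` is a hypothetical helper that only produces the plan and applies nothing.

```python
def plan_sg_migration(current_rules, desired_rules):
    """Order rule changes so every addition happens before any removal."""
    current, desired = set(current_rules), set(desired_rules)
    additions = sorted(desired - current)
    removals = sorted(current - desired)
    # Authorizing first keeps traffic covered by a rule throughout the change.
    return [("authorize", rule) for rule in additions] + \
           [("revoke", rule) for rule in removals]


# Illustrative rules: tighten SSH from the whole internet to one subnet.
current = [("tcp", 443, 443, "10.0.0.0/16"), ("tcp", 22, 22, "0.0.0.0/0")]
desired = [("tcp", 443, 443, "10.0.0.0/16"), ("tcp", 22, 22, "10.0.5.0/24")]

for step in plan_sg_migration(current, desired):
    print(step)
```

Rules present on both sides never appear in the plan, so unchanged traffic paths are left alone.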

**Real-World Example:**
A financial firm caused a 37-minute outage by modifying NACLs during trading hours. The fix? Now they:
1. Test all changes in a replica environment
2. Implement change windows
3. Use `terraform plan` as a dry run before every `terraform apply`

### **Safe Troubleshooting Techniques**
1. **Passive Monitoring**
   - Flow logs (metadata analysis)
   - Traffic mirroring (packet-level)
   - CloudWatch Metrics (trend spotting)

2. **Non-Destructive Testing**
   ```bash
   # Packet capture without service impact
   # -C 100 rotates the file at roughly 100 MB, -W 5 keeps at most 5 files;
   # options go before the filter expression. The interface may be ens5 instead of eth0.
   sudo tcpdump -i eth0 -C 100 -W 5 -w debug.pcap host 10.0.1.5 and port 3306
   ```

3. **Change Management**
   - Canary deployments (1% traffic first)
   - Automated rollback hooks
   - SSM Session Manager for emergency access

---

## **III. War Stories: Lessons From the Trenches**
### **1. The Case of the Vanishing Packets**
**Symptoms:** Intermittent database timeouts

**Root Cause:** Overlapping security group rules being silently deduped

**Fix:**
```bash
# Find duplicate SG rules (same protocol, port range, and CIDR list)
aws ec2 describe-security-groups \
  --query 'SecurityGroups[*].IpPermissions' \
  --output json \
  | jq '.[] | group_by([.IpProtocol, .FromPort, .ToPort, .IpRanges])[] | select(length > 1)'
```
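
Exact duplicates are only part of the problem: two rules can also overlap when one CIDR contains the other. The following sketch checks for that with the standard `ipaddress` module, assuming input shaped like one entry of `describe-security-groups` JSON output. `overlapping_rules` and the sample group are hypothetical, only IPv4 `IpRanges` are examined, and ICMP type/code values get no special handling.

```python
from ipaddress import ip_network
from itertools import combinations


def overlapping_rules(security_group):
    """Return pairs of ingress rules whose protocol, ports, and CIDRs overlap."""
    flat = []
    for perm in security_group.get("IpPermissions", []):
        # Protocol "-1" carries no port keys; treat it as the full port range.
        ports = (perm.get("FromPort", 0), perm.get("ToPort", 65535))
        for ip_range in perm.get("IpRanges", []):
            flat.append((perm["IpProtocol"], ports, ip_network(ip_range["CidrIp"])))

    findings = []
    for (proto_a, ports_a, net_a), (proto_b, ports_b, net_b) in combinations(flat, 2):
        same_proto = proto_a == proto_b or "-1" in (proto_a, proto_b)
        ports_overlap = ports_a[0] <= ports_b[1] and ports_b[0] <= ports_a[1]
        if same_proto and ports_overlap and net_a.overlaps(net_b):
            findings.append((str(net_a), str(net_b), ports_a, ports_b))
    return findings


# Made-up security group: the /24 on port 3306 is already covered by the /16.
SAMPLE_GROUP = {
    "GroupId": "sg-0example",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}, {"CidrIp": "10.0.1.0/24"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    ],
}

for finding in overlapping_rules(SAMPLE_GROUP):
    print(finding)
```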

### **2. The $15,000 NAT Surprise**
**Symptoms:** Unexpected bill spike

**Discovery:**
```bash
# List available NAT gateways. The API does not report idleness, so check each ID
# against CloudWatch traffic metrics (e.g. BytesOutToDestination in AWS/NATGateway).
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query 'NatGateways[*].[NatGatewayId,SubnetId,CreateTime]' \
  --output table
```
**Prevention:** Tag all resources with Owner and Purpose
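
Deciding whether a NAT gateway is idle comes down to its traffic metrics. As a minimal sketch, assume you have already fetched per-period byte sums for each gateway (for example from the CloudWatch `BytesOutToDestination` metric); `find_idle_nat_gateways`, the threshold, and the gateway IDs below are hypothetical.

```python
def find_idle_nat_gateways(bytes_out_by_gateway, threshold_bytes=0):
    """Return gateway IDs whose total outbound bytes do not exceed the threshold."""
    idle = []
    for nat_id, datapoints in sorted(bytes_out_by_gateway.items()):
        # An empty list means no datapoints were reported for the window.
        if sum(datapoints) <= threshold_bytes:
            idle.append(nat_id)
    return idle


# Made-up byte sums per gateway over the observation window.
observed = {
    "nat-0aaa": [0, 0, 0],
    "nat-0bbb": [1024, 0, 2048],
    "nat-0ccc": [],
}

print(find_idle_nat_gateways(observed))  # ['nat-0aaa', 'nat-0ccc']
```

Pick an observation window long enough to cover batch jobs that run rarely, otherwise a gateway used once a week looks idle.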

### **3. The Peering Paradox**
**Issue:** Cross-account VPC peering with broken DNS

**Solution:**
```bash
# Step 1 (zone owner's account): authorize the peer VPC for the private hosted zone
aws route53 create-vpc-association-authorization \
  --hosted-zone-id Z123 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-456
# Step 2 (VPC owner's account): run associate-vpc-with-hosted-zone with the same arguments
```

---

## **IV. The Resiliency Toolkit**
### **Must-Have Automation**
1. **Auto-Rollback Systems**
   ```python
   # Lambda function monitoring CloudTrail for dangerous changes
   def revert_nacl(nacl_id):
       # Hypothetical helper: restore the NACL from your own saved definition
       raise NotImplementedError(f"no restore procedure defined for {nacl_id}")

   def lambda_handler(event, context):
       detail = event.get('detail', {})
       if detail.get('eventName') == 'DeleteNetworkAcl':
           revert_nacl(detail['requestParameters']['networkAclId'])
   ```

2. **Chaos Engineering Tests**
   - Scheduled NAT failure drills
   - AZ isolation simulations
   - Route table corruption tests
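
For the auto-rollback idea above, the first step is deciding which CloudTrail events deserve a reaction at all. This sketch assumes events arrive in the usual EventBridge "AWS API Call via CloudTrail" shape; the watch list in `DANGEROUS_EVENTS` is an illustrative selection of EC2 API actions, and `classify_event` is a hypothetical helper.

```python
# EC2 API actions that can cut network paths; extend to suit your environment.
DANGEROUS_EVENTS = {
    "DeleteNetworkAcl", "DeleteNetworkAclEntry", "ReplaceNetworkAclEntry",
    "DeleteRoute", "ReplaceRoute",
    "RevokeSecurityGroupIngress", "RevokeSecurityGroupEgress",
}


def classify_event(event):
    """Return a summary for watched EC2 changes, or None for everything else."""
    detail = event.get("detail", {})
    name = detail.get("eventName")
    if detail.get("eventSource") != "ec2.amazonaws.com" or name not in DANGEROUS_EVENTS:
        return None
    return {
        "action": name,
        "actor": detail.get("userIdentity", {}).get("arn"),
        "parameters": detail.get("requestParameters", {}),
    }


# Made-up event for illustration.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DeleteNetworkAcl",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/example"},
        "requestParameters": {"networkAclId": "acl-0example"},
    },
}

print(classify_event(sample_event))
```

Keeping classification separate from remediation lets you run the detector in alert-only mode before trusting it to change anything.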

### **The 5-Minute Recovery Checklist**
1. **Diagnose**
   ```bash
   # List unattached network interfaces (status "available")
   aws ec2 describe-network-interfaces --filters "Name=status,Values=available"
   ```
2. **Contain**
   - Freeze CI/CD pipelines
   - Detach or swap out problematic security groups (they cannot be disabled in place)
3. **Restore**
   - Terraform rollback
   - Route table replacement

---

## **V. Pro Tips Archive**
### **Security Group Wisdom**
```hcl
# Terraform best practice
resource "aws_security_group" "example" {
  egress {
    # Never omit this block: Terraform removes AWS's default allow-all egress rule,
    # so a group without egress rules blocks all outbound traffic.
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"] # Restrict in prod
  }
}
```
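
### **NACL Gotchas**
- Ephemeral ports must be allowed for return traffic in the opposite direction. Linux defaults to 32768-60999, but other clients use different ranges, so 1024-65535 is the safe choice
- Rule evaluation order matters (lowest number first, and the first match wins)
- The default NACL allows all traffic; a custom NACL denies everything until you add rules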
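
The evaluation order and the ephemeral-port gotcha are easier to see in a small simulation. This is a simplified model rather than AWS's implementation: rules are plain dictionaries, `evaluate_nacl` is a hypothetical helper, and the addresses come from documentation ranges.

```python
from ipaddress import ip_address, ip_network


def evaluate_nacl(rules, remote_ip, port, protocol="tcp"):
    """Apply rules in ascending rule-number order; the first match decides."""
    for rule in sorted(rules, key=lambda r: r["number"]):
        if rule["protocol"] not in (protocol, "all"):
            continue
        low, high = rule["ports"]
        if low <= port <= high and ip_address(remote_ip) in ip_network(rule["cidr"]):
            return rule["action"]
    return "deny"  # the implicit final "*" rule


INBOUND = [
    {"number": 100, "protocol": "tcp", "ports": (443, 443),
     "cidr": "0.0.0.0/0", "action": "allow"},
    {"number": 90, "protocol": "tcp", "ports": (443, 443),
     "cidr": "203.0.113.0/24", "action": "deny"},
]

# Outbound rule sized for Linux clients only.
OUTBOUND_LINUX_ONLY = [
    {"number": 100, "protocol": "tcp", "ports": (32768, 60999),
     "cidr": "0.0.0.0/0", "action": "allow"},
]

print(evaluate_nacl(INBOUND, "203.0.113.7", 443))   # deny: rule 90 is checked before rule 100
print(evaluate_nacl(INBOUND, "198.51.100.1", 443))  # allow
# A reply to a client using source port 61000 falls outside the Linux range.
print(evaluate_nacl(OUTBOUND_LINUX_ONLY, "198.51.100.1", 61000))  # deny
```

The last call shows why a return-traffic rule limited to the Linux range silently drops replies to clients that pick ports above 60999.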

### **Direct Connect Pro Tips**
- Set BGP timers to 10s keepalive/30s hold time
- Hardcode MTU to 1500 unless jumbo frames are enabled end to end
- Monitor with:
  ```bash
  aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]'
  ```
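
The CLI query above only looks at the first BGP peer of each virtual interface. A small sketch that checks every peer, assuming the JSON returned by `describe-virtual-interfaces` has been loaded into a dictionary; `vifs_with_bgp_down` and the sample response are hypothetical.

```python
def vifs_with_bgp_down(response):
    """List (interface ID, peer ID, status) for every BGP peer that is not up."""
    down = []
    for vif in response.get("virtualInterfaces", []):
        for peer in vif.get("bgpPeers", []):
            if peer.get("bgpStatus") != "up":
                down.append((vif["virtualInterfaceId"],
                             peer.get("bgpPeerId"),
                             peer.get("bgpStatus")))
    return down


# Made-up response: the second interface has one healthy and one failed peer.
sample_response = {
    "virtualInterfaces": [
        {"virtualInterfaceId": "dxvif-aaaa",
         "bgpPeers": [{"bgpPeerId": "dxpeer-1", "bgpStatus": "up"}]},
        {"virtualInterfaceId": "dxvif-bbbb",
         "bgpPeers": [{"bgpPeerId": "dxpeer-2", "bgpStatus": "up"},
                      {"bgpPeerId": "dxpeer-3", "bgpStatus": "down"}]},
    ]
}

print(vifs_with_bgp_down(sample_response))  # [('dxvif-bbbb', 'dxpeer-3', 'down')]
```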

---

## **Final Checklist for Production Safety**
1. [ ] Enable flow logs in all VPCs
2. [ ] Document rollback procedures
3. [ ] Test failure scenarios regularly
4. [ ] Implement change controls
5. [ ] Tag all network resources

**Remember:** The best troubleshooting is avoiding problems through design. Use this guide as your playbook for building and maintaining bulletproof AWS networks.