Update tech_docs/cloud/aws_notes.md
@@ -1,3 +1,194 @@
Here's a polished, cohesive version of your notes with improved flow, filled-in gaps, and tighter organization while preserving all critical details:

---

# **AWS Networking: The Production Survival Guide**

*Battle-tested strategies for troubleshooting and maintaining resilient networks*

---

## **I. Flow Log Mastery: The GUI-CLI Hybrid Approach**

### **1. Enabling Flow Logs (GUI Method)**

**Steps:**

1. Navigate to **VPC Dashboard** → Select target VPC → **Actions** → **Create Flow Log**
2. Configure:
   - **Filter**: `ALL` (full visibility), `REJECT` (security focus), or `ACCEPT` (performance)
   - **Destination**:
     - CloudWatch Logs for real-time analysis
     - S3 for compliance/archiving
   - **Advanced**: Add custom fields like `${tcp-flags}` to see which TCP flags (SYN, FIN, RST) each flow carried

**Pro Tip:**
Enable flow logs in every environment before you need them: they are cheap insurance, and they only capture traffic that occurs after they are enabled.

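
The `${tcp-flags}` field is recorded as a bitmask, so it helps to decode it when reading records by hand. The following is a minimal sketch, assuming a hypothetical five-field custom format (`${srcaddr} ${dstaddr} ${dstport} ${tcp-flags} ${action}`); the function names and the sample record are illustrative and not part of any AWS SDK.

```python
# Standard TCP header flag bits; the flow log tcp-flags value is a bitmask of these.
TCP_FLAGS = {1: "FIN", 2: "SYN", 4: "RST", 8: "PSH", 16: "ACK", 32: "URG"}


def decode_tcp_flags(value: int) -> list[str]:
    """Return the names of the flags set in a tcp-flags bitmask."""
    return [name for bit, name in TCP_FLAGS.items() if value & bit]


def parse_record(line: str, field_names: list[str]) -> dict[str, str]:
    """Split one space-separated flow log record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(field_names):
        raise ValueError(f"expected {len(field_names)} fields, got {len(values)}")
    return dict(zip(field_names, values))


# Assumed custom format; adjust to match the format chosen when creating the flow log.
FIELDS = ["srcaddr", "dstaddr", "dstport", "tcp-flags", "action"]

record = parse_record("10.0.1.5 10.0.2.9 3306 18 ACCEPT", FIELDS)
flags = decode_tcp_flags(int(record["tcp-flags"]))
print(record["srcaddr"], "->", record["dstaddr"], flags)
```

A value of 18 decodes to SYN plus ACK, which is the server's half of a handshake. Flags seen during one aggregation interval can be OR-ed together, so a single record may show several at once.
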
### **2. CloudWatch Logs Insights Deep Dive**

**Key Queries:**

Run each query separately. Lines starting with `#` are comments, and the field names below are the ones Logs Insights discovers automatically for default-format flow logs.

```sql
# Basic Traffic Analysis
fields @timestamp, srcAddr, dstAddr, action, bytes
| filter dstPort = 443
| stats sum(bytes) as totalTraffic by srcAddr
| sort totalTraffic desc

# Security Investigation
fields @timestamp, srcAddr, dstAddr, dstPort
| filter action = "REJECT" and dstPort = 22
| limit 50

# NAT Gateway Health Check
fields @timestamp, srcAddr, dstAddr
| filter srcAddr like "10.0.1." and isIpv4InSubnet(dstAddr, "8.8.8.0/24")
| stats count(*) by bin(5m)
```

**Visualization Tricks:**

1. Use **time series** graphs to spot traffic patterns
2. Create **bar charts** of top talkers
3. Save frequent queries as dashboard widgets

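
When flow logs go to S3 instead of CloudWatch Logs, the "Basic Traffic Analysis" query above can be approximated offline. This is a rough sketch, assuming default-format (version 2) records that have already been downloaded and decompressed; `top_talkers` and the sample lines are hypothetical.

```python
from collections import Counter

# Field order of the default (version 2) flow log format.
DEFAULT_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr", "srcport",
    "dstport", "protocol", "packets", "bytes", "start", "end", "action", "log_status",
]


def top_talkers(lines, dst_port=443, limit=10):
    """Sum bytes per source address for records sent to dst_port, largest first."""
    totals = Counter()
    for line in lines:
        values = line.split()
        if len(values) != len(DEFAULT_FIELDS):
            continue  # skip header, malformed, or custom-format lines
        rec = dict(zip(DEFAULT_FIELDS, values))
        # NODATA and SKIPDATA records carry "-" instead of numbers.
        if rec["dstport"] == str(dst_port) and rec["bytes"].isdigit():
            totals[rec["srcaddr"]] += int(rec["bytes"])
    return totals.most_common(limit)


# Made-up sample records for illustration only.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 443 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.7 10.0.2.9 49153 443 6 4 1200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49154 443 6 2 600 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.7 10.0.2.9 49155 22 6 1 60 1700000060 1700000120 REJECT OK",
]

result = top_talkers(SAMPLE)
print(result)
```
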
---

## **II. High-Risk Operations Playbook**

### **Danger Zone: Actions That Break Connections**

| Operation | Risk | Safe Approach |
|-----------|------|---------------|
| SG Modifications | Removing a rule can drop active connections (untracked flows are cut immediately) | Add new rules first, then remove old |
| NACL Updates | Stateless: changes apply immediately, including to established flows | Test in staging first |
| Route Changes | Misroutes critical traffic | Stage and verify the replacement route before removing the old one (VPC route tables have no weighted routing; that is a Route 53 DNS feature) |
| NAT Replacement | Breaks long-lived sessions | Warm standby + EIP preservation |

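
The "add new rules first, then remove old" approach from the table can be sketched as an ordering problem. The example below is a simplified illustration, not a deployment tool: the `Rule` type and `plan_rule_change` are hypothetical, and the steps would still have to be mapped to real authorize and revoke API calls.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Simplified ingress rule; real rules also carry descriptions, IPv6 ranges, etc."""
    protocol: str
    from_port: int
    to_port: int
    cidr: str


def plan_rule_change(current: set[Rule], desired: set[Rule]) -> list[tuple[str, Rule]]:
    """Return ordered steps with every 'authorize' before any 'revoke'."""
    to_add = sorted(desired - current)
    to_remove = sorted(current - desired)
    return [("authorize", r) for r in to_add] + [("revoke", r) for r in to_remove]


current = {Rule("tcp", 3306, 3306, "10.0.0.0/16"), Rule("tcp", 22, 22, "0.0.0.0/0")}
desired = {Rule("tcp", 3306, 3306, "10.0.0.0/16"), Rule("tcp", 22, 22, "10.0.9.0/24")}

steps = plan_rule_change(current, desired)
for action, rule in steps:
    print(action, rule)
```

Rules present in both sets are never touched, so flows that depend on them are not affected by the change.
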
**Real-World Example:**
A financial firm caused a 37-minute outage by modifying NACLs during trading hours. The fix? Now they:

1. Test all changes in a replica environment
2. Implement change windows
3. Use `terraform plan` as a dry run before every `terraform apply`

### **Safe Troubleshooting Techniques**

1. **Passive Monitoring**
   - Flow logs (metadata analysis)
   - Traffic mirroring (packet-level)
   - CloudWatch Metrics (trend spotting)

2. **Non-Destructive Testing**
   ```bash
   # Packet capture with minimal service impact: rotates across 5 files of about 100 MB each
   # (replace eth0 with the instance's interface name, e.g. ens5)
   sudo tcpdump -i eth0 -C 100 -W 5 -w debug.pcap 'host 10.0.1.5 and port 3306'
   ```

3. **Change Management**
   - Canary deployments (1% traffic first)
   - Automated rollback hooks
   - SSM Session Manager for emergency access

---

## **III. War Stories: Lessons From the Trenches**

### **1. The Case of the Vanishing Packets**

**Symptoms:** Intermittent database timeouts

**Root Cause:** Overlapping security group rules being silently deduped

**Fix:**

```bash
# Find permission entries that share protocol, port range, and CIDRs within a group
aws ec2 describe-security-groups \
  --query 'SecurityGroups[*].IpPermissions' \
  --output json \
  | jq '.[] | group_by([.IpProtocol, .FromPort, .ToPort, .IpRanges])[] | select(length > 1)'
```

### **2. The $15,000 NAT Surprise**

**Symptoms:** Unexpected bill spike

**Discovery:**

```bash
# List available NAT Gateways as candidates for an idle check.
# Idleness itself shows up in CloudWatch (AWS/NATGateway metrics such as
# BytesOutToDestination), not in this output.
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query 'NatGateways[*].[NatGatewayId,VpcId,SubnetId,CreateTime]' \
  --output table
```

**Prevention:** Tag all resources with Owner and Purpose

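
Both the discovery and the prevention step can be expressed as simple checks once the data has been collected. The sketch below assumes the gateway list and daily `BytesOutToDestination` sums were already fetched; the `DailyBytesOut` key, the threshold, and the sample gateways are hypothetical.

```python
REQUIRED_TAGS = {"Owner", "Purpose"}


def missing_tags(tags: list[dict]) -> set[str]:
    """Return required tag keys absent from an EC2-style [{'Key': ..., 'Value': ...}] list."""
    present = {tag["Key"] for tag in tags}
    return REQUIRED_TAGS - present


def is_idle(daily_bytes_out: list[float], threshold_bytes: float = 1_000_000) -> bool:
    """Treat a gateway as idle if every daily outbound byte sum is under the threshold."""
    return bool(daily_bytes_out) and all(b < threshold_bytes for b in daily_bytes_out)


# Made-up inventory; DailyBytesOut is an assumed, pre-aggregated metric series.
gateways = [
    {
        "NatGatewayId": "nat-aaa",
        "Tags": [{"Key": "Owner", "Value": "net-team"}],
        "DailyBytesOut": [0.0, 512.0, 0.0],
    },
    {
        "NatGatewayId": "nat-bbb",
        "Tags": [{"Key": "Owner", "Value": "data"}, {"Key": "Purpose", "Value": "etl"}],
        "DailyBytesOut": [4.2e9, 3.9e9, 5.1e9],
    },
]

report = [
    (g["NatGatewayId"], is_idle(g["DailyBytesOut"]), sorted(missing_tags(g["Tags"])))
    for g in gateways
]
print(report)
```

A gateway with no datapoints is deliberately not reported as idle, since missing metrics are more likely a collection problem than proof of zero traffic.
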
### **3. The Peering Paradox**

**Issue:** Cross-account VPC peering with broken DNS

**Solution:**

```bash
# Step 1 (hosted zone owner's account): authorize the other account's VPC
aws route53 create-vpc-association-authorization \
  --hosted-zone-id Z123 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-456

# Step 2 (VPC owner's account): complete the association
aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z123 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-456
```

---

## **IV. The Resiliency Toolkit**

### **Must-Have Automation**

1. **Auto-Rollback Systems**
   ```python
   # Lambda function reacting to CloudTrail events for dangerous changes
   def revert_nacl(nacl_id):
       # Placeholder: recreate the NACL from a saved copy of its rules
       raise NotImplementedError(f"no restore logic for {nacl_id}")

   def lambda_handler(event, context):
       detail = event.get("detail", {})
       if detail.get("eventName") == "DeleteNetworkAcl":
           revert_nacl(detail["requestParameters"]["networkAclId"])
   ```

2. **Chaos Engineering Tests**
   - Scheduled NAT failure drills
   - AZ isolation simulations
   - Route table corruption tests

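
The auto-rollback handler above reacts to a single event name. A slightly broader filter can be sketched as follows, assuming the events arrive in the usual EventBridge shape with the CloudTrail record under `detail`. The choice of which actions count as dangerous is an assumption, and `classify_event` and the sample event are hypothetical.

```python
# EC2 API actions treated as high-risk in this sketch; adjust the set to your environment.
DANGEROUS_EVENTS = {
    "DeleteNetworkAcl",
    "DeleteNetworkAclEntry",
    "ReplaceNetworkAclEntry",
    "DeleteRoute",
    "ReplaceRoute",
    "RevokeSecurityGroupIngress",
    "RevokeSecurityGroupEgress",
    "DeleteNatGateway",
}


def classify_event(event: dict) -> tuple[bool, str]:
    """Return (is_dangerous, event_name) for a CloudTrail-style event."""
    name = event.get("detail", {}).get("eventName", "")
    return (name in DANGEROUS_EVENTS, name)


# Illustrative event; real events carry many more fields.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventName": "DeleteNetworkAcl",
        "requestParameters": {"networkAclId": "acl-0123456789abcdef0"},
    },
}

print(classify_event(sample_event))
```

Keeping the classification separate from the revert logic makes it easy to start in alert-only mode before enabling any automatic rollback.
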
### **The 5-Minute Recovery Checklist**

1. **Diagnose**
   ```bash
   # List unattached network interfaces (status "available")
   aws ec2 describe-network-interfaces --filters "Name=status,Values=available"
   ```
2. **Contain**
   - Freeze CI/CD pipelines
   - Detach or replace problematic security groups (a security group cannot be disabled in place)
3. **Restore**
   - Terraform rollback
   - Route table replacement

---

## **V. Pro Tips Archive**

### **Security Group Wisdom**

```hcl
# Terraform best practice
resource "aws_security_group" "example" {
  egress {
    # Never omit egress: Terraform removes AWS's default allow-all egress rule,
    # so a group without an egress block allows no outbound traffic
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"] # Restrict in prod
  }
}
```

### **NACL Gotchas**

- Ephemeral ports must be allowed for return traffic in both directions; 32768-60999 covers Linux clients only, so use 1024-65535 when client types are mixed
- Rule evaluation order matters (lowest number first, and the first match wins)
- The default NACL allows all traffic; a custom NACL denies everything until you add rules

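
The evaluation-order gotcha is easier to see in a small model. The sketch below is a simplification that ignores protocol and handles only IPv4 source addresses and a single port; `NaclRule` and `evaluate` are hypothetical names, and the addresses come from documentation ranges.

```python
from ipaddress import ip_address, ip_network
from typing import NamedTuple


class NaclRule(NamedTuple):
    number: int
    cidr: str
    port_from: int
    port_to: int
    allow: bool


def evaluate(rules: list[NaclRule], src_ip: str, port: int) -> str:
    """Apply rules in ascending rule-number order; the first match decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = ip_address(src_ip) in ip_network(rule.cidr)
        in_ports = rule.port_from <= port <= rule.port_to
        if in_cidr and in_ports:
            return "ALLOW" if rule.allow else "DENY"
    return "DENY"  # the implicit catch-all rule at the end of every NACL


# Rules listed out of order on purpose; evaluation sorts by rule number.
inbound = [
    NaclRule(200, "0.0.0.0/0", 1024, 65535, True),     # ephemeral return traffic
    NaclRule(100, "203.0.113.0/24", 0, 65535, False),  # block one range entirely
    NaclRule(110, "0.0.0.0/0", 443, 443, True),        # HTTPS from anywhere
]

print(evaluate(inbound, "203.0.113.10", 443))   # matched by rule 100
print(evaluate(inbound, "198.51.100.7", 443))   # matched by rule 110
print(evaluate(inbound, "198.51.100.7", 22))    # no rule matches
print(evaluate(inbound, "198.51.100.7", 40000)) # matched by rule 200
```

Because rule 100 is evaluated before rule 110, the blocked range stays blocked even for HTTPS. Swapping the two rule numbers would let that traffic through.
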
### **Direct Connect Pro Tips**

- Set BGP timers to 10s keepalive/30s hold time (the session uses the lower hold time proposed by the two peers)
- Hardcode MTU to 1500 unless jumbo frames are enabled end to end
- Monitor with:
  ```bash
  aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]'
  ```

---

## **Final Checklist for Production Safety**

1. [ ] Enable flow logs in all VPCs
2. [ ] Document rollback procedures
3. [ ] Test failure scenarios regularly
4. [ ] Implement change controls
5. [ ] Tag all network resources

**Remember:** The best troubleshooting is avoiding problems through design. Use this guide as your playbook for building and maintaining bulletproof AWS networks.

Would you like me to develop any specific section further with more technical depth or real-world examples?

---

You're absolutely right: **using the AWS Console (GUI) is often the fastest and most intuitive way to analyze Flow Logs**, especially for SMEs who need quick answers. Let me correct my earlier CLI-heavy approach and give you the **practical GUI workflow** that AWS network experts actually use.

---