Update tech_docs/cloud/aws_notes.md

2025-07-20 21:03:54 -05:00
parent 133b478404
commit 5a719ed2b6


@@ -1,4 +1,244 @@
# **Deep Dive: Mastering AWS Flow Logs for Advanced Troubleshooting**
## **1. Flow Logs Fundamentals**
### **What Flow Logs Capture**
Flow Logs record **IP traffic metadata** (not payload data) for:
- **VPCs**
- **Subnets**
- **Elastic Network Interfaces (ENIs)**
**Key Fields:**
| Field | Description | Example |
|-------|-------------|---------|
| `version` | Flow log version | `2` |
| `account-id` | AWS account ID | `123456789012` |
| `interface-id` | ENI ID | `eni-12345abc` |
| `srcaddr` | Source IP | `10.0.1.5` |
| `dstaddr` | Destination IP | `8.8.8.8` |
| `srcport` | Source port | `32768` |
| `dstport` | Destination port | `443` |
| `protocol` | IP protocol number | `6` (TCP) |
| `packets` | Packets in flow | `5` |
| `bytes` | Bytes transferred | `1024` |
| `start` | Flow start (Unix epoch) | `1625097600` |
| `end` | Flow end (Unix epoch) | `1625097605` |
| `action` | `ACCEPT` or `REJECT` | `REJECT` |
| `log-status` | Logging status | `OK` |
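A record in the default format is a single space-delimited line with the fields in the order shown above. The following is a minimal parsing sketch, not an official client: the function name `parse_flow_record` and the choice to return a plain dict are illustrative assumptions. It treats `-` as a missing value, which is what records with a `NODATA` or `SKIPDATA` status contain.

```python
# Field order of the default flow log format, matching the table above.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}


def parse_flow_record(line: str) -> dict:
    """Split one default-format record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(
            f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}"
        )
    record = {}
    for name, value in zip(DEFAULT_FIELDS, values):
        if value == "-":
            # NODATA / SKIPDATA records carry "-" instead of a value.
            record[name] = None
        elif name in INT_FIELDS:
            record[name] = int(value)
        else:
            record[name] = value
    return record


sample = ("2 123456789012 eni-12345abc 10.0.1.5 8.8.8.8 32768 443 6 "
          "5 1024 1625097600 1625097605 REJECT OK")
rec = parse_flow_record(sample)
print(rec["srcaddr"], rec["dstport"], rec["action"], rec["end"] - rec["start"])
# prints: 10.0.1.5 443 REJECT 5
```

Custom log formats change the field order, so `DEFAULT_FIELDS` would need to mirror whatever `--log-format` string was used.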
### **When to Use Flow Logs**
- **Troubleshooting connectivity issues**
- **Security incident investigations**
- **Network performance analysis**
- **Compliance auditing**
---
## **2. Enabling & Configuring Flow Logs**
### **GUI Method (Quick Setup)**
1. **VPC Dashboard** → Select VPC → **Actions** → **Create Flow Log**
2. Configure:
   - **Filter**: `ALL` (recommended), `ACCEPT`, or `REJECT`
   - **Destination**:
     - **CloudWatch Logs** (real-time analysis)
     - **S3** (long-term storage)
   - **Log Format**: Default or custom (e.g., add `${tcp-flags}`)
### **CLI Method (Automation-Friendly)**
```bash
# Send to CloudWatch Logs.
# Delivery to CloudWatch Logs needs an IAM role; the role ARN below is a placeholder.
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-123abc \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name "VPCFlowLogs" \
  --deliver-logs-permission-arn "arn:aws:iam::123456789012:role/FlowLogsRole" \
  --log-format '${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status}'

# Send to S3 (for compliance); REJECT logs only blocked traffic
aws ec2 create-flow-logs \
  --resource-type Subnet \
  --resource-ids subnet-456def \
  --traffic-type REJECT \
  --log-destination-type s3 \
  --log-destination "arn:aws:s3:::my-flow-logs-bucket"
```
### **Advanced Custom Fields**
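Add these to `--log-format` for deeper insights (they require flow log version 3 or later):
- `${pkt-srcaddr}` / `${pkt-dstaddr}` (original packet-level IPs, which differ from `srcaddr` / `dstaddr` when traffic passes through an intermediate interface such as a NAT gateway)
- `${tcp-flags}` (integer bitmask of the TCP flags seen in the flow, e.g. FIN = 1, SYN = 2, RST = 4, SYN-ACK = 18)
- `${type}` (traffic type: `IPv4`, `IPv6`, or `EFA`)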
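The `${tcp-flags}` field is logged as an integer bitmask, not as flag names, so it usually needs decoding before it is readable. The sketch below uses the standard TCP header bit values; the helper name `decode_tcp_flags` is an illustrative assumption. Flags observed during one aggregation interval can be OR-ed together in a single record, so a value may combine flags from several packets.

```python
# Standard TCP header flag bits, lowest bit first.
TCP_FLAG_BITS = {1: "FIN", 2: "SYN", 4: "RST", 8: "PSH", 16: "ACK", 32: "URG"}


def decode_tcp_flags(value: int) -> list[str]:
    """Return the names of all flags set in a tcp-flags bitmask."""
    return [name for bit, name in TCP_FLAG_BITS.items() if value & bit]


print(decode_tcp_flags(2))   # ['SYN']
print(decode_tcp_flags(18))  # ['SYN', 'ACK']
print(decode_tcp_flags(4))   # ['RST']
```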
---
## **3. Analyzing Flow Logs**
### **CloudWatch Logs Insights (GUI)**
**Best for:** Ad-hoc troubleshooting
**Key Queries:**
#### **1. Top Talkers (Bandwidth Analysis)**
```sql
fields @timestamp, srcAddr, dstAddr, bytes
| stats sum(bytes) as totalBytes by srcAddr, dstAddr
| sort totalBytes desc
| limit 20
```
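For log files already downloaded from S3, the same top-talkers aggregation can be approximated offline. This is a rough sketch that assumes the default 14-field format; `top_talkers` is a hypothetical helper name, not part of any AWS tooling.

```python
from collections import Counter


def top_talkers(lines, limit=20):
    """Sum bytes per (srcaddr, dstaddr) pair for default-format records."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        # Default format positions (0-based): 3 = srcaddr, 4 = dstaddr, 9 = bytes.
        # Header rows and NODATA/SKIPDATA records have no numeric byte count.
        if len(fields) != 14 or not fields[9].isdigit():
            continue
        totals[(fields[3], fields[4])] += int(fields[9])
    return totals.most_common(limit)


sample_lines = [
    "2 123456789012 eni-12345abc 10.0.1.5 8.8.8.8 32768 443 6 5 1024 1625097600 1625097605 ACCEPT OK",
    "2 123456789012 eni-12345abc 10.0.1.5 8.8.8.8 32769 443 6 9 4096 1625097610 1625097620 ACCEPT OK",
    "2 123456789012 eni-12345abc 10.0.1.9 10.0.2.10 40000 3306 6 3 512 1625097600 1625097605 ACCEPT OK",
]
for (src, dst), total in top_talkers(sample_lines):
    print(src, dst, total)
```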
#### **2. Blocked Traffic Investigation**
```sql
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 50
```
#### **3. NAT Gateway Health Check**
```sql
fields @timestamp, srcAddr, dstAddr, action
| filter srcAddr like "10.0.1." and dstAddr like "8.8.8."
| stats count(*) as attempts by bin(5m) as window
| sort window desc
```
#### **4. Suspicious Port Scanning**
```sql
fields @timestamp, srcAddr, dstPort
| filter dstPort >= 3000 and dstPort <= 4000
| stats count(*) as hits by srcAddr, dstPort
| sort hits desc
```
### **Athena (S3-Based SQL Analysis)**
**Best for:** Large-scale historical analysis
**Setup:**
1. Create Athena table:
```sql
CREATE EXTERNAL TABLE vpc_flow_logs (
version int,
account_id string,
interface_id string,
srcaddr string,
dstaddr string,
srcport int,
dstport int,
protocol int,
packets bigint,
bytes bigint,
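-- start and end are reserved words, so they are quoted with backticks
`start` bigint,
`end` bigint,
action string,
log_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://my-flow-logs-bucket/AWSLogs/123456789012/vpcflowlogs/us-east-1/'
TBLPROPERTIES ("skip.header.line.count"="1");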
```
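A partitioned table returns no rows until its partitions are registered. Flow logs delivered to S3 are stored under `year/month/day/` prefixes, so each `dt` partition can point at one day's prefix. The sketch below only builds the `ALTER TABLE` statement text; the helper name `add_partition_statement` is hypothetical, and running the statement in Athena is left to the reader.

```python
from datetime import date


def add_partition_statement(table: str, location_root: str, day: date) -> str:
    """Build an Athena DDL statement that registers one day's partition."""
    root = location_root.rstrip("/")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{day.isoformat()}') "
        f"LOCATION '{root}/{day:%Y/%m/%d}/';"
    )


stmt = add_partition_statement(
    "vpc_flow_logs",
    "s3://my-flow-logs-bucket/AWSLogs/123456789012/vpcflowlogs/us-east-1/",
    date(2021, 7, 1),
)
print(stmt)
```

Queries can then restrict the scan with a predicate such as `WHERE dt = '2021-07-01'`, which keeps the amount of data Athena reads (and bills for) down.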
**Query Example:**
```sql
-- Find all blocked SSH attempts
SELECT srcaddr, COUNT(*) as block_count
FROM vpc_flow_logs
WHERE dstport = 22 AND action = 'REJECT'
GROUP BY srcaddr
ORDER BY block_count DESC
```
---
## **4. Real-World Troubleshooting Scenarios**
### **Case 1: "Why Can't My Instance Reach the Internet?"**
**Steps:**
1. **Check Flow Logs for Rejects:**
```sql
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter srcAddr = "10.0.1.5" and dstAddr like "8.8.8."
| sort @timestamp desc
```
2. **If `REJECT`:**
   - Check **NACLs** and **Security Groups**
3. **If No Logs:**
   - Verify **route tables** (`0.0.0.0/0 → nat-xxx`)
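The decision steps above can be sketched as a small triage helper. This is an illustration only: `triage_outbound` is a hypothetical name, the input is assumed to be already-parsed records (dicts with `srcaddr` and `action` keys), and the third branch for accepted traffic is an addition that the steps above do not cover.

```python
def triage_outbound(records, src_ip):
    """Suggest the next check based on the actions logged for src_ip."""
    actions = {r["action"] for r in records if r["srcaddr"] == src_ip}
    if not actions:
        # Nothing logged for this source: traffic may never leave the subnet.
        return "no records: verify route tables (0.0.0.0/0 -> NAT gateway)"
    if "REJECT" in actions:
        return "rejected: check NACLs and security groups"
    # Assumption beyond the steps above: only ACCEPT records were seen.
    return "accepted at this interface: look further along the path"


records = [
    {"srcaddr": "10.0.1.5", "action": "REJECT"},
    {"srcaddr": "10.0.1.9", "action": "ACCEPT"},
]
print(triage_outbound(records, "10.0.1.5"))
print(triage_outbound(records, "10.0.1.7"))
```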
### **Case 2: "Who's Accessing My Database?"**
```sql
fields @timestamp, srcAddr, dstAddr, dstPort
| filter dstAddr = "10.0.2.10" and dstPort = 3306
| stats count(*) as connections by srcAddr
| sort connections desc
```
### **Case 3: "Is My Application Generating Excessive Traffic?"**
```sql
fields @timestamp, srcAddr, dstAddr, bytes
| filter dstAddr like "10.0.3."
| stats sum(bytes) as totalBytes by bin(1h)
| sort totalBytes desc
```
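The hourly bucketing that `bin(1h)` performs can be reproduced on exported records by flooring each timestamp to the hour. A minimal sketch, assuming records are available as `(start_epoch, byte_count)` pairs; note that it buckets on the record's `start` field, whereas Logs Insights bins on `@timestamp`, so the two can differ slightly at bucket edges. The name `bytes_per_hour` is illustrative.

```python
from collections import Counter


def bytes_per_hour(records):
    """Sum bytes per hour bucket; returns (hour_epoch, total) busiest first."""
    totals = Counter()
    for start, nbytes in records:
        hour = start - (start % 3600)  # floor the epoch to the hour
        totals[hour] += nbytes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


records = [
    (1625097600, 1024),
    (1625097900, 2048),
    (1625101200, 512),
]
print(bytes_per_hour(records))
# prints: [(1625097600, 3072), (1625101200, 512)]
```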
---
## **5. Pro Tips for Production**
### **1. Optimize Costs**
- Use **S3 + Athena** for long-term storage (cheaper than CloudWatch)
- Filter `REJECT`-only logs for security use cases
### **2. Automate Alerts**
```bash
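# 1) Flow logs publish no CloudWatch metric on their own, so create one with a
#    metric filter. The "Custom/VPCFlowLogs" namespace is an illustrative choice.
aws logs put-metric-filter \
  --log-group-name "VPCFlowLogs" \
  --filter-name "RejectedPackets" \
  --filter-pattern '[version, account, eni, source, destination, srcport, destport, protocol, packets, bytes, windowstart, windowend, action="REJECT", flowlogstatus]' \
  --metric-transformations 'metricName=RejectedPackets,metricNamespace=Custom/VPCFlowLogs,metricValue=$packets'

# 2) CloudWatch Alarm for DDoS-like traffic
#    (add --alarm-actions with an SNS topic ARN to be notified)
aws cloudwatch put-metric-alarm \
  --alarm-name "High-Reject-Rate" \
  --metric-name "RejectedPackets" \
  --namespace "Custom/VPCFlowLogs" \
  --statistic "Sum" \
  --period 300 \
  --threshold 1000 \
  --comparison-operator "GreaterThanThreshold" \
  --evaluation-periods 1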
```
### **3. Centralized Logging**
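Aggregate logs from multiple accounts by streaming each log group to a destination in the central account. The destination must already exist there (for example, created with `aws logs put-destination`), and its access policy must allow the sending account: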
```bash
aws logs put-subscription-filter \
--log-group-name "VPCFlowLogs" \
--filter-name "CrossAccountStream" \
--filter-pattern "" \
--destination-arn "arn:aws:logs:us-east-1:123456789012:destination:CentralAccount"
```
### **4. Security Hardening**
```sql
-- Detect port scanning
fields @timestamp, srcAddr, dstPort
| filter dstPort >= 0 and dstPort <= 1024
| stats count_distinct(dstPort) as portsScanned by srcAddr
| filter portsScanned > 5
| sort portsScanned desc
```
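The same distinct-port heuristic can be applied to exported records. This sketch mirrors the query's thresholds (ports 0 to 1024, more than 5 distinct ports); the function name `find_port_scanners` and the `(srcaddr, dstport)` input shape are assumptions made for the example.

```python
from collections import defaultdict


def find_port_scanners(records, max_port=1024, threshold=5):
    """Return (srcaddr, distinct_port_count) for sources above the threshold."""
    ports_by_src = defaultdict(set)
    for srcaddr, dstport in records:
        if 0 <= dstport <= max_port:
            ports_by_src[srcaddr].add(dstport)
    hits = {
        src: len(ports)
        for src, ports in ports_by_src.items()
        if len(ports) > threshold
    }
    # Most ports scanned first, like "sort portsScanned desc".
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)


records = [("198.51.100.7", port) for port in (21, 22, 23, 25, 80, 443, 445)]
records += [("10.0.1.5", 443), ("10.0.1.5", 443), ("10.0.1.5", 8443)]
print(find_port_scanners(records))
# prints: [('198.51.100.7', 7)]
```

A fixed threshold like this will also flag legitimate clients that talk to many low ports, so treat the output as a list of candidates to review.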
---
## **6. Limitations & Workarounds**
| Limitation | Workaround |
|------------|------------|
| No payload data | Use **Traffic Mirroring** + `tcpdump` |
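| Delivery delay (default 10-minute aggregation interval plus delivery time) | Set `--max-aggregation-interval 60`; use **CloudWatch Metrics** for near-real-time signals |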
| No MAC addresses | Correlate with `describe-network-interfaces` |
---
## **Final Checklist**
1. [ ] Enable flow logs on all critical VPCs
2. [ ] Set up CloudWatch dashboards for key queries
3. [ ] Configure S3 archiving for compliance
4. [ ] Automate security alerts (e.g., port scans)
5. [ ] Document common troubleshooting queries
**Flow logs are your network's black box recorder: enable them before you need them!**
---