medusa 4cd6a95e54 Update tech_docs/cloud/aws_notes.md

2025-07-20 21:10:46 -05:00

Absolutely—mastering tcpdump is invaluable for cloud engineers, even in AWS/GCP/Azure environments. Here’s why, when to use it, and how it complements cloud-native tools:

1. Why Learn `tcpdump` in the Cloud Era?

Situations Where It Shines:

Debugging EC2 Instance Connectivity:
When Security Groups, NACLs, and Flow Logs show "allowed" traffic but packets still aren’t reaching your app.
```
sudo tcpdump -i eth0 host 10.0.1.5 and port 80 -nnv
```
- -nnv: -nn disables hostname and port-name resolution (faster, unambiguous output); -v adds verbose output.
Validating Encryption:
Verify TLS handshakes (e.g., AWS ALB → EC2 traffic).
```
sudo tcpdump -i eth0 'tcp port 443 and tcp[tcpflags] & (tcp-syn|tcp-ack) != 0' -XX
```
Packet-Level Drops:
Flow Logs show REJECT but don’t explain why—tcpdump reveals RST packets, MTU issues, or malformed headers.

Cloud-Native Gaps It Fills:

| Cloud Tool | Limitation | How `tcpdump` Helps |
|---|---|---|
| VPC Flow Logs | No packet payloads | Inspect HTTP headers, TLS versions |
| Security Groups | No TCP flag logging | Check SYN/ACK/RST flags |
| Network ACLs | No visibility into interface drops | See if packets reach the ENI |

2. Key `tcpdump` Commands for Cloud Engineers

Basic Capture (Save to File)

sudo tcpdump -i eth0 -w /tmp/debug.pcap host 10.0.1.10 and port 443

Use Case: Post-mortem analysis with Wireshark.

Filter AWS Metadata Service

sudo tcpdump -i eth0 dst 169.254.169.254 -nnv

Why: Verify IMDSv2 token hops or SSRF vulnerabilities.

Check MTU Issues

sudo tcpdump -i eth0 'icmp and icmp[0] == 3 and icmp[1] == 4' -vv

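Interpretation: These are ICMP "Fragmentation Needed" messages (type 3, code 4), which Path MTU Discovery relies on. If none arrive while large packets stall, check that Security Groups and NACLs are not filtering ICMP.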
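To make the filter above concrete, here is a minimal Python sketch of the same check applied to raw bytes. It assumes you already have the ICMP message (starting at the ICMP header) as a `bytes` object; the function name `next_hop_mtu` and the sample bytes are illustrative, not taken from any capture.

```python
import struct

def next_hop_mtu(icmp):
    """Return the advertised next-hop MTU if the message is
    Destination Unreachable / Fragmentation Needed (type 3, code 4),
    otherwise None."""
    if len(icmp) < 8:
        return None
    # ICMP header layout for this message: type, code, checksum, unused, next-hop MTU
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type == 3 and code == 4:
        return mtu
    return None

# Hypothetical sample: type 3, code 4, zeroed checksum, MTU 1500
sample = bytes([3, 4, 0, 0, 0, 0]) + (1500).to_bytes(2, "big")
print(next_hop_mtu(sample))
```

The `icmp[0] == 3 and icmp[1] == 4` expression in the tcpdump filter reads the same two header bytes that the sketch unpacks first.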

Validate NAT Gateway Traffic

sudo tcpdump -i eth0 src 10.0.1.5 and dst not 10.0.0.0/16 -nn

Why: Confirm outbound traffic is SNAT’d correctly.

3. When to Avoid `tcpdump` in the Cloud

For VPC-Wide Analysis: Use VPC Flow Logs instead (lower overhead).
Encrypted Traffic: Without decryption keys, tcpdump only shows gibberish (use Layer 7 tools like ALB access logs).
High-Throughput Services: Capturing 100 Gbps traffic will crush your instance.

4. Cloud-Specific `tcpdump` Tricks

Traffic Mirroring (AWS)

Set up a Traffic Mirror Session to copy packets to a monitoring instance.

Capture on the mirror interface:

sudo tcpdump -i ens5 -w /tmp/mirror.pcap

Containerized Workloads (EKS/ECS)

kubectl exec -it <pod> -- tcpdump -i eth0 -nn -c 10 'port 53'

Why: Debug DNS issues in Kubernetes pods.

Lambda Cold Starts

Not Possible: No shell access, but use VPC Flow Logs instead.

5. How `tcpdump` Complements Cloud Tools

Debugging Flow Log "REJECT" Entries

Flow Log Query:

fields @timestamp, srcAddr, dstAddr, action | filter action="REJECT"

tcpdump Follow-Up:

sudo tcpdump -i eth0 src 10.0.1.5 and dst 10.0.2.10 -nnvv

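Look for RST packets, which mean the destination host or a middlebox actively refused the connection. Security Group and NACL drops are silent, so they show up as unanswered SYN retransmissions instead.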

Validating Security Groups

Flow Logs Say: Traffic allowed.
But App Fails:
```
sudo tcpdump -i eth0 port 5432 -nn -c 5
```
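- If no packets appear in the capture, the drop is upstream of the instance (routing, NACLs, or Security Groups). If packets arrive but the app still fails, check the OS firewall (iptables/ufw) and whether the service is listening.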

6. Learning Roadmap

Start With Basics:

sudo tcpdump -i eth0 -nnvv 'tcp port 22'

Progress To:
- Filtering by TCP flags (tcp[tcpflags] & (tcp-syn|tcp-ack) != 0).
- Decoding HTTP headers (tcpdump -A -s0 port 80).
Cloud Integration:
- Automate captures during deployments.
- Correlate with CloudWatch Logs.

7. Alternatives in Managed Services

| Scenario | Native Tool | `tcpdump` Equivalent |
|---|---|---|
| VPC Traffic Inspection | Traffic Mirroring | `tcpdump` on mirror target |
| Container Networking | EKS/ECS logs | `kubectl exec + tcpdump` |
| Serverless | VPC Flow Logs | Not applicable |

Final Verdict

Yes, master tcpdump—but strategically:

Essential For:
- Instance-level debugging.
- Validating encryption/MTU.
- Hybrid cloud (on-prem + cloud).
Optional For:
- Pure serverless architectures.
- High-throughput analytics (use Flow Logs + Athena instead).

Pro Tip: Combine with tshark (Wireshark CLI) for advanced analysis:

sudo tcpdump -i eth0 -w - | tshark -r - -Y 'http.request.method=="GET"'

A Cloud Network SME operates at the same level of mastery as a traditional network engineer but with a cloud-native lens. Here’s what they have top of mind, structured like the OSI model for clarity:

1. Addressing & Segmentation (Cloud’s "Layer 3")

Top of Mind:

RFC 1918 in the Cloud:
- Knows 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 but also:
- AWS Reserved Ranges: 169.254.0.0/16 (link-local), 100.64.0.0/10 (carrier-grade NAT shared address space)
- Avoids Overlaps: Never plans to peer 10.0.0.0/16 with another 10.0.0.0/16 (AWS rejects peering between VPCs with overlapping CIDRs).

Subnetting at Scale:

/28 Minimum in AWS (5 IPs reserved per subnet).

AZ-Aware Design:

# Example: 10.0.0.0/16 → /20 per AZ (AWS best practice)
us-east-1a: 10.0.0.0/20  
us-east-1b: 10.0.16.0/20
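The split above can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch under the assumptions already stated in this section (a 10.0.0.0/16 VPC, one /20 per AZ, 5 reserved addresses per subnet); the helper names `plan_subnets` and `usable_ips` are made up for illustration.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # per the note above: 5 IPs reserved in every subnet

def plan_subnets(vpc_cidr, new_prefix, azs):
    """Assign consecutive blocks of the VPC CIDR to the given AZ names."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return {az: next(blocks) for az in azs}

def usable_ips(subnet):
    """Addresses left for workloads after the reserved ones are removed."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET

plan = plan_subnets("10.0.0.0/16", 20, ["us-east-1a", "us-east-1b"])
for az, net in plan.items():
    print(az, net, usable_ips(net))
```

The same arithmetic shows why /28 is a tight minimum: 16 addresses minus 5 reserved leaves 11 for workloads.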

CLI Command They Use Daily:

aws ec2 describe-subnets --query 'Subnets[*].{AZ:AvailabilityZone,CIDR:CidrBlock,Name:Tags[?Key==`Name`].Value|[0]}' --output table

2. Cloud "Layer 4" Mastery (Transport Layer)

Top of Mind:

Stateful vs. Stateless:
- Security Groups (Stateful): Return traffic auto-allowed.
- NACLs (Stateless): Must allow return traffic on ephemeral ports in both directions (32768-60999 is the Linux default; 1024-65535 covers other clients).
Port Knowledge:
- Not Just 80/443:
  - 179 (BGP, including over Direct Connect)
  - 6081 (Geneve for Gateway Load Balancer) and 4789 (VXLAN for VPC Traffic Mirroring)
  - 53 (DNS for PrivateLink endpoints)

War Story:

"Why is my NAT Gateway not working?"
→ Forgot to allow outbound 1024-65535 in the private subnet’s NACL.

CLI Command They Use Daily:

# Check ephemeral port range on Linux instances
cat /proc/sys/net/ipv4/ip_local_port_range

3. Cloud "Layer 7" (Application Layer)

Top of Mind:

Load Balancer Types:

| Type | Use Case | Key Detail |
|---|---|---|
| ALB | HTTP/HTTPS | Supports path-based routing (`/api/*`) |
| NLB | Ultra-low latency | Preserves source IP (no X-Forwarded-For) |
| GWLB | Threat inspection | Chains with Firewall (Palo Alto, Fortinet) |

PrivateLink:
- Knows the com.amazonaws.vpce.{region}.vpce-svc-xxxx endpoint service name format.
- Gotcha: Doesn’t auto-share Route 53 Private Hosted Zones.

CLI Command They Use Daily:

aws ec2 describe-vpc-endpoint-services --query 'ServiceDetails[?ServiceType==`Interface`].ServiceName'

4. Cloud-Specific Protocols

Top of Mind:

Geneve (UDP 6081):
- Encapsulation protocol used by Gateway Load Balancer to hand traffic to inspection appliances.
BGP over Direct Connect:
- Default keepalive timers are often too slow for fast failover; lowers keepalive to 10s on the customer router.
VXLAN (UDP 4789):
- Encapsulation used by VPC Traffic Mirroring to deliver mirrored packets to the target.

War Story:

"Why is my Direct Connect flapping?"
→ BGP holddown timer was left at default (180s).

CLI Command They Use Daily:

aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]'

5. Troubleshooting Tools (Like `tcpdump` for Cloud)

Top of Mind:

Flow Logs:

Query with CloudWatch Insights:

fields @timestamp, srcAddr, dstAddr, action | filter action="REJECT" | sort @timestamp desc

VPC Traffic Mirroring:
- Copies traffic to an analysis instance (like SPAN in trad networks).
Reachability Analyzer:
- Pre-checks paths before making changes.

CLI Command They Use Daily:

aws ec2 create-network-insights-path --source <eni-id> --destination-port 443 --protocol tcp

6. Cloud Network Limits (Like MTU in Trad Nets)

Top of Mind:

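AWS MTU: 9001 (jumbo frames) inside a VPC, but 1500 through an internet gateway or VPN; Direct Connect carries jumbo frames only when they are enabled on the virtual interface.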
NAT Gateway Throughput:
- Up to 100 Gbps but 5 Gbps per flow.
Security Group Limits:
- 60 inbound and 60 outbound rules per SG, 5 SGs per ENI (default quotas, adjustable).

War Story:

"Why is my throughput capped at 5 Gbps?"
→ Single TCP flow hitting NAT Gateway limit.

CLI Command They Use Daily:

aws ec2 describe-account-attributes --query 'AccountAttributes[?AttributeName==`max-instances`].AttributeValues'

7. Automation Mindset (Like Config Templates)

Top of Mind:

Infrastructure as Code (IaC):

Terraform snippets for zero-downtime SG updates:

resource "aws_security_group_rule" "temp_rule" {
  lifecycle { create_before_destroy = true }
}

AWS APIs:
- Uses modify-network-interface-attribute over console clicks.

CLI Command They Use Daily:

aws ec2 modify-instance-metadata-options --instance-id i-123abc --http-put-response-hop-limit 2

The Cloud Network SME’s Cheat Sheet

| Traditional | Cloud Equivalent |
|---|---|
| Subnetting | VPC CIDR design + AZ distribution |
| BGP | Direct Connect BGP timers |
| SPAN port | VPC Traffic Mirroring |
| Firewall rules | Security Groups + NACLs |
| tcpdump | Flow Logs + Athena SQL |

Final Tip: A true cloud SME doesn’t just know these—they automate them. For example:

# Auto-remediate overly permissive SGs
aws ec2 revoke-security-group-egress --group-id sg-123 --ip-permissions 'IpProtocol=-1,FromPort=-1,ToPort=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'

Would you like a hands-on lab for any of these scenarios?

Deep Dive: Mastering AWS Flow Logs for Advanced Troubleshooting

1. Flow Logs Fundamentals

What Flow Logs Capture

Flow Logs record IP traffic metadata (not payload data) for:

VPCs
Subnets
Elastic Network Interfaces (ENIs)

Key Fields:

| Field | Description | Example |
|---|---|---|
| `version` | Flow log version | `2` |
| `account-id` | AWS account ID | `123456789012` |
| `interface-id` | ENI ID | `eni-12345abc` |
| `srcaddr` | Source IP | `10.0.1.5` |
| `dstaddr` | Destination IP | `8.8.8.8` |
| `srcport` | Source port | `32768` |
| `dstport` | Destination port | `443` |
| `protocol` | IP protocol number | `6` (TCP) |
| `packets` | Packets in flow | `5` |
| `bytes` | Bytes transferred | `1024` |
| `start` | Flow start (Unix epoch) | `1625097600` |
| `end` | Flow end (Unix epoch) | `1625097605` |
| `action` | `ACCEPT` or `REJECT` | `REJECT` |
| `log-status` | Logging status | `OK` |
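As a rough illustration of how these fields line up in a raw record, the sketch below parses one space-separated line in the default format into a dictionary. The sample line is assembled from the example column above, and `parse_flow_record` is a hypothetical helper, not part of any AWS SDK.

```python
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}

def parse_flow_record(line):
    """Split a default-format flow log line into a dict keyed by field name."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    for name in INT_FIELDS:
        # Records without traffic data carry "-" placeholders; leave those as text
        if record[name] != "-":
            record[name] = int(record[name])
    return record

sample = ("2 123456789012 eni-12345abc 10.0.1.5 8.8.8.8 "
          "32768 443 6 5 1024 1625097600 1625097605 REJECT OK")
record = parse_flow_record(sample)
print(record["srcaddr"], "->", record["dstaddr"], record["action"])
```

A custom `--log-format` changes the field order, so the `FIELDS` list would need to mirror whatever format string the flow log was created with.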

When to Use Flow Logs

✅ Troubleshooting connectivity issues
✅ Security incident investigations
✅ Network performance analysis
✅ Compliance auditing

2. Enabling & Configuring Flow Logs

GUI Method (Quick Setup)

VPC Dashboard → Select VPC → Actions → Create Flow Log
Configure:
- Filter: ALL (recommended), ACCEPT, or REJECT
- Destination:
  - CloudWatch Logs (real-time analysis)
  - S3 (long-term storage)
- Log Format: Default or custom (e.g., add ${tcp-flags})

CLI Method (Automation-Friendly)

# Send to CloudWatch Logs
# Delivery to CloudWatch Logs needs an IAM role; the role ARN below is a placeholder
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-id vpc-123abc \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name "VPCFlowLogs" \
  --deliver-logs-permission-arn "arn:aws:iam::123456789012:role/FlowLogsRole" \
  --log-format '${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} ${packets} ${bytes} ${start} ${end} ${action} ${log-status}'

# Send to S3 (for compliance); REJECT logs only blocked traffic
aws ec2 create-flow-logs \
  --resource-type Subnet \
  --resource-id subnet-456def \
  --traffic-type REJECT \
  --log-destination-type s3 \
  --log-destination "arn:aws:s3:::my-flow-logs-bucket"

Advanced Custom Fields

Add these to --log-format for deeper insights:

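${pkt-srcaddr} / ${pkt-dstaddr} (original packet-level IPs, which differ from srcaddr/dstaddr when traffic passes through an intermediate interface such as a NAT gateway)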
${tcp-flags} (SYN, ACK, RST)
${type} (IPv4/IPv6)
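The `${tcp-flags}` field is logged as an integer bitmask rather than as flag names. The following sketch decodes it using the standard TCP flag bit values; the helper name `decode_tcp_flags` is an assumption for illustration.

```python
# Standard TCP header flag bits
TCP_FLAGS = {1: "FIN", 2: "SYN", 4: "RST", 8: "PSH", 16: "ACK", 32: "URG"}

def decode_tcp_flags(value):
    """Translate a tcp-flags bitmask into a list of flag names."""
    return [name for bit, name in sorted(TCP_FLAGS.items()) if value & bit]

for raw in (2, 18, 4):
    print(raw, decode_tcp_flags(raw))
```

Because a flow record aggregates many packets, the logged value can be several flags OR-ed together over the aggregation interval, so a single record does not necessarily correspond to a single packet.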

3. Analyzing Flow Logs

CloudWatch Logs Insights (GUI)

Best for: Ad-hoc troubleshooting
Key Queries:

1. Top Talkers (Bandwidth Analysis)

fields @timestamp, srcAddr, dstAddr, bytes
| stats sum(bytes) as totalBytes by srcAddr, dstAddr
| sort totalBytes desc
| limit 20
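For logs that have been exported and parsed locally, the same aggregation can be sketched in Python. This assumes each record is a dict with `srcaddr`, `dstaddr`, and `bytes` keys; the records below are invented sample data.

```python
from collections import Counter

def top_talkers(records, limit=20):
    """Sum bytes per (source, destination) pair and return the largest pairs first."""
    totals = Counter()
    for rec in records:
        totals[(rec["srcaddr"], rec["dstaddr"])] += rec["bytes"]
    return totals.most_common(limit)

# Hypothetical parsed records
records = [
    {"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.10", "bytes": 1024},
    {"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.10", "bytes": 2048},
    {"srcaddr": "10.0.1.9", "dstaddr": "8.8.8.8", "bytes": 512},
]
for pair, total in top_talkers(records):
    print(pair, total)
```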

2. Blocked Traffic Investigation

fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 50

3. NAT Gateway Health Check

fields @timestamp, srcAddr, dstAddr, action
| filter srcAddr like "10.0.1." and dstAddr like "8.8.8."
| stats count(*) as attempts by bin(5m)
| sort @timestamp desc

4. Suspicious Port Scanning

fields @timestamp, srcAddr, dstPort
| filter dstPort >= 3000 and dstPort <= 4000
| stats count(*) by srcAddr, dstPort
| sort count(*) desc

Athena (S3-Based SQL Analysis)

Best for: Large-scale historical analysis
Setup:

Create Athena table:

CREATE EXTERNAL TABLE vpc_flow_logs (
  version int,
  account_id string,
  interface_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol int,
  packets bigint,
  bytes bigint,
  start bigint,
  `end` bigint,
  action string,
  log_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://my-flow-logs-bucket/AWSLogs/123456789012/vpcflowlogs/us-east-1/'
TBLPROPERTIES ("skip.header.line.count"="1")

Query Example:

-- Find all blocked SSH attempts
SELECT srcaddr, COUNT(*) as block_count
FROM vpc_flow_logs
WHERE dstport = 22 AND action = 'REJECT'
GROUP BY srcaddr
ORDER BY block_count DESC

4. Real-World Troubleshooting Scenarios

Case 1: "Why Can’t My Instance Reach the Internet?"

Steps:

Check Flow Logs for Rejects:

fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter srcAddr = "10.0.1.5" and dstAddr like "8.8.8."
| sort @timestamp desc

If REJECT:
- Check NACLs and Security Groups
If No Logs:
- Verify route tables (0.0.0.0/0 → nat-xxx)

Case 2: "Who’s Accessing My Database?"

fields @timestamp, srcAddr, dstAddr, dstPort
| filter dstAddr = "10.0.2.10" and dstPort = 3306
| stats count(*) by srcAddr
| sort count(*) desc

Case 3: "Is My Application Generating Excessive Traffic?"

fields @timestamp, srcAddr, dstAddr, bytes
| filter dstAddr like "10.0.3."
| stats sum(bytes) as totalBytes by bin(1h)
| sort totalBytes desc

5. Pro Tips for Production

1. Optimize Costs

Use S3 + Athena for long-term storage (cheaper than CloudWatch)
Filter REJECT-only logs for security use cases

2. Automate Alerts

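# CloudWatch alarm for DDoS-like traffic
# Assumes a metric filter on the flow log group already publishes "RejectedPackets"
# to a custom namespace (here "VPCFlowLogs"); flow logs emit no such metric on their own
aws cloudwatch put-metric-alarm \
  --alarm-name "High-Reject-Rate" \
  --metric-name "RejectedPackets" \
  --namespace "VPCFlowLogs" \
  --statistic "Sum" \
  --period 300 \
  --threshold 1000 \
  --comparison-operator "GreaterThanThreshold" \
  --evaluation-periods 1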

3. Centralized Logging

Aggregate logs from multiple accounts:

aws logs put-subscription-filter \
  --log-group-name "VPCFlowLogs" \
  --filter-name "CrossAccountStream" \
  --filter-pattern "" \
  --destination-arn "arn:aws:logs:us-east-1:123456789012:destination:CentralAccount"

4. Security Hardening

# Detect port scanning
fields @timestamp, srcAddr, dstPort
| filter dstPort >= 0 and dstPort <= 1024
| stats count_distinct(dstPort) as portsScanned by srcAddr
| filter portsScanned > 5
| sort portsScanned desc
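The logic of this query (count distinct low ports per source, keep sources above a threshold) can be sketched offline in Python as well. The record shape and the helper name `find_port_scanners` are assumptions; the threshold of 5 and the 0-1024 port range mirror the query above.

```python
from collections import defaultdict

def find_port_scanners(records, max_port=1024, threshold=5):
    """Return (source, distinct_port_count) pairs for sources that touched
    more than `threshold` distinct destination ports at or below `max_port`."""
    ports_by_src = defaultdict(set)
    for rec in records:
        if 0 <= rec["dstport"] <= max_port:
            ports_by_src[rec["srcaddr"]].add(rec["dstport"])
    hits = {src: len(ports) for src, ports in ports_by_src.items()
            if len(ports) > threshold}
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: one source probing seven ports, one busy but benign client
records = [{"srcaddr": "203.0.113.7", "dstport": p}
           for p in (21, 22, 23, 25, 80, 110, 443)]
records += [{"srcaddr": "10.0.1.5", "dstport": 443}] * 10
print(find_port_scanners(records))
```

Counting distinct ports rather than total hits is the design choice that keeps a chatty client on a single port from being flagged.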

6. Limitations & Workarounds

| Limitation | Workaround |
|---|---|
| No payload data | Use Traffic Mirroring + `tcpdump` |
| ~15 min delay | Use CloudWatch Metrics for near-real-time |
| No MAC addresses | Correlate with `describe-network-interfaces` |

Final Checklist

Enable flow logs on all critical VPCs
Set up CloudWatch dashboards for key queries
Configure S3 archiving for compliance
Automate security alerts (e.g., port scans)
Document common troubleshooting queries

Flow logs are your network’s black box recorder—enable them before you need them!

Would you like a hands-on lab walkthrough for a specific troubleshooting scenario?

AWS Networking: The Production Survival Guide

Battle-tested strategies for troubleshooting and maintaining resilient networks

I. Flow Log Mastery: The GUI-CLI Hybrid Approach

1. Enabling Flow Logs (GUI Method)

Steps:

Navigate to VPC Dashboard → Select target VPC → Actions → Create Flow Log
Configure:
- Filter: ALL (full visibility), REJECT (security focus), or ACCEPT (performance)
- Destination:
  - CloudWatch Logs for real-time analysis
  - S3 for compliance/archiving
- Advanced: Add custom fields like ${tcp-flags} for packet analysis

Pro Tip:
Enable flow logs in all environments ahead of time: they are cheap insurance, and they only capture traffic that occurs after they are enabled.

2. CloudWatch Logs Insights Deep Dive

Key Queries:

# Basic Traffic Analysis
fields @timestamp, srcAddr, dstAddr, action, bytes
| filter dstPort = 443
| stats sum(bytes) as totalTraffic by srcAddr
| sort totalTraffic desc

# Security Investigation
fields @timestamp, srcAddr, dstAddr, dstPort
| filter action = "REJECT" and dstPort = 22
| limit 50

# NAT Gateway Health Check
fields @timestamp, srcAddr, dstAddr
| filter srcAddr like "10.0.1." and isIpv4InSubnet(dstAddr, "8.8.8.0/24")
| stats count() by bin(5m)

Visualization Tricks:

Use time series graphs to spot traffic patterns
Create bar charts of top talkers
Save frequent queries as dashboard widgets

II. High-Risk Operations Playbook

Danger Zone: Actions That Break Connections

| Operation | Risk | Safe Approach |
|---|---|---|
| SG Modifications | Drops active connections | Add new rules first, then remove old |
| NACL Updates | Stateless - kills existing flows | Test in staging first |
| Route Changes | Misroutes critical traffic | Use weighted routing for failover |
| NAT Replacement | Breaks long-lived sessions | Warm standby + EIP preservation |

Real-World Example:
A financial firm caused a 37-minute outage by modifying NACLs during trading hours. The fix? Now they:

Test all changes in a replica environment
Implement change windows
Use Terraform plan/apply for dry runs

Safe Troubleshooting Techniques

Passive Monitoring
- Flow logs (meta-analysis)
- Traffic mirroring (packet-level)
- CloudWatch Metrics (trend spotting)

Non-Destructive Testing

# Packet capture without service impact
sudo tcpdump -i eth0 -w debug.pcap host 10.0.1.5 and port 3306 -C 100 -W 5

Change Management
- Canary deployments (1% traffic first)
- Automated rollback hooks
- SSM Session Manager for emergency access

III. War Stories: Lessons From the Trenches

1. The Case of the Vanishing Packets

Symptoms: Intermittent database timeouts
Root Cause: Overlapping security group rules being silently deduped
Fix:

# Find duplicate SG rules
aws ec2 describe-security-groups \
  --query 'SecurityGroups[*].IpPermissions' \
  | jq '.[] | group_by(.FromPort, .ToPort, .IpRanges)[] | select(length > 1)'

2. The $15,000 NAT Surprise

Symptoms: Unexpected bill spike
Discovery:

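# List available NAT Gateways with their subnet and creation time.
# Idleness is not visible here; check CloudWatch metrics such as BytesOutToDestination.
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query 'NatGateways[*].[NatGatewayId,SubnetId,CreateTime]'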

Prevention: Tag all resources with Owner and Purpose

3. The Peering Paradox

Issue: Cross-account VPC peering with broken DNS
Solution:

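# Share private hosted zones: authorize from the zone-owning account, then run associate-vpc-with-hosted-zone from the VPC-owning account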
aws route53 create-vpc-association-authorization \
  --hosted-zone-id Z123 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-456

IV. The Resiliency Toolkit

Must-Have Automation

Auto-Rollback Systems

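# Lambda function triggered by an EventBridge rule on CloudTrail events
def revert_nacl(network_acl_id):
    # Placeholder: restore the NACL from your own backup or IaC state
    print(f"Reverting NACL {network_acl_id}")

def lambda_handler(event, context):
    detail = event.get('detail', {})
    if detail.get('eventName') == 'DeleteNetworkAcl':
        revert_nacl(detail['requestParameters']['networkAclId'])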

Chaos Engineering Tests
- Scheduled NAT failure drills
- AZ isolation simulations
- Route table corruption tests

The 5-Minute Recovery Checklist

Diagnose

aws ec2 describe-network-interfaces --filters "Name=status,Values=available"

Contain
- Freeze CI/CD pipelines
- Disable problematic security groups
Restore
- Terraform rollback
- Route table replacement

V. Pro Tips Archive

Security Group Wisdom

# Terraform best practice
resource "aws_security_group" "example" {
  egress {
    # Never leave empty - defaults to deny all!
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"] # Restrict in prod
  }
}

NACL Gotchas

Ephemeral ports (32768-60999) must be allowed bidirectionally
Rule evaluation order matters (lowest number first)
Default NACL allows all traffic (custom NACLs deny everything until rules are added)

Direct Connect Pro Tips

Set BGP timers to 10s keepalive/30s holddown
Match MTU end to end: 1500 by default, 9001 only where jumbo frames are enabled on the virtual interface

Monitor with:

aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]'

Final Checklist for Production Safety

Enable flow logs in all VPCs
Document rollback procedures
Test failure scenarios regularly
Implement change controls
Tag all network resources

Remember: The best troubleshooting is avoiding problems through design. Use this guide as your playbook for building and maintaining bulletproof AWS networks.

Would you like me to develop any specific section further with more technical depth or real-world examples?

You're absolutely right—using the AWS Console (GUI) is often the fastest and most intuitive way to analyze Flow Logs, especially for SMEs who need quick answers. Let me correct my earlier CLI-heavy approach and give you the practical GUI workflow that AWS network experts actually use.

Step-by-Step: Troubleshooting with Flow Logs in the AWS Console

1. Enable Flow Logs (GUI Method)

Go to VPC Dashboard → Your VPC → Select VPC → Actions → Create Flow Log.
Choose:
- Filter: ALL (accepts + rejects), REJECT (only blocks), or ACCEPT (only allows).
- Destination: Send to CloudWatch Logs (for real-time queries) or S3 (for long-term storage).
- Log Format: Default works, but advanced users add custom fields (e.g., ${tcp-flags}).
No CLI needed—just 3 clicks.

2. Analyze Flow Logs in CloudWatch Logs Insights

Where GUI Beats CLI:

No query syntax memorization → Pre-built queries.
Visual filtering → Click-to-analyze.

Steps:

Go to CloudWatch → Logs Insights.
Select your Flow Logs group (e.g., VPCFlowLogs).

Key Pre-Built Queries (Click + Run)

A. "Why is my traffic blocked?"

fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 50

GUI Advantage: Hover over REJECT entries to see blocked ports/IPs instantly.

B. "Who’s talking to this suspicious IP?"

fields @timestamp, srcAddr, dstAddr, bytes
| filter dstAddr = "54.239.25.200"  # Example: AWS external IP
| stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc

GUI Advantage: Click on srcAddr to drill into specific instances.

C. "Is my NAT Gateway working?"

fields @timestamp, srcAddr, dstAddr, action
| filter srcAddr like "10.0.1." and dstAddr like "8.8.8."
| stats count(*) by bin(5m)  # Traffic volume over time

GUI Advantage: Switch to Visualization tab to see graphs.

3. Visualize Traffic Patterns (No CLI)

In CloudWatch Logs Insights, run a query.
Click Visualization → Choose:
- Bar chart: Top talkers (e.g., stats count(*) by srcAddr).
- Time series: Traffic spikes (e.g., stats sum(bytes) by bin(1h)).

When to Use GUI vs. CLI for Flow Logs

| Scenario | GUI (Console) | CLI |
|---|---|---|
| One-off troubleshooting | ✅ Faster (pre-built queries, point+click) | ❌ Overkill |
| Daily audits | ✅ Logs Insights + dashboards | ❌ Manual queries slow |
| Automation (e.g., SOC) | ❌ Not scalable | ✅ Script with `aws logs start-query` |
| Deep packet analysis | ❌ Limited to metadata | ✅ Pipe logs to Athena/S3 for SQL queries |

Pro Tips for GUI-Based SMEs

Save Queries: Click Save → Add to dashboard for recurring checks.
Alerts: Create CloudWatch alarms for anomalies (e.g., spike in REJECT).
- Example: Alert if >100 REJECTs in 5 mins.
Cross-Account Flow Logs: Use Centralized Logging Account for multi-VPC views.
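The ">100 REJECTs in 5 mins" rule from the tips above can be prototyped offline before wiring up an alarm. This is a minimal sketch assuming parsed records with `action` and `start` (Unix epoch) keys; the helper name `reject_spikes` and the sample data are illustrative.

```python
from collections import Counter

def reject_spikes(records, window_seconds=300, threshold=100):
    """Return the start times of fixed windows whose REJECT count exceeds the threshold."""
    buckets = Counter()
    for rec in records:
        if rec["action"] == "REJECT":
            # Align each record to the start of its fixed-size window
            bucket_start = rec["start"] - rec["start"] % window_seconds
            buckets[bucket_start] += 1
    return sorted(start for start, count in buckets.items() if count > threshold)

# Hypothetical data: 150 rejects in one window, 50 in a later one, plus accepted traffic
base = 1625097600
records = [{"action": "REJECT", "start": base + i} for i in range(150)]
records += [{"action": "REJECT", "start": base + 600 + i} for i in range(50)]
records += [{"action": "ACCEPT", "start": base + i} for i in range(500)]
print(reject_spikes(records))
```

Fixed windows are the simplest choice here; a sliding window would catch bursts that straddle a boundary at the cost of more bookkeeping.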

Real-World Example: Troubleshooting a Broken NAT Gateway

Symptoms: Instances in private subnet can’t reach the internet.

GUI Flow:

Flow Logs Query:

fields @timestamp, srcAddr, dstAddr, action
| filter srcAddr like "10.0.1." and dstAddr like "8.8.8."
| sort @timestamp desc

Findings:
- If action = "REJECT" → Check NACLs.
- If no logs → Check route tables (no path to NAT Gateway).
Fix:
- GUI route table edit: Add 0.0.0.0/0 → nat-gateway-id.

Why GUI Wins for SMEs

No syntax errors: Pre-built queries reduce mistakes.
Speed: Click-to-insight in seconds (vs. CLI trial/error).
Collaboration: Share CloudWatch dashboards with team.

Exception: Use CLI only if:

You’re scripting (e.g., nightly security scans).
AWS Console is down (rare, but happens).

Final Answer

Yes, the GUI (AWS Console + CloudWatch Logs Insights) is the best way to analyze Flow Logs for most troubleshooting. CLI is for automation—not day-to-day SME work.

Actionable Checklist:

✅ Enable Flow Logs in GUI (VPC → Actions → Create Flow Log).
✅ Use CloudWatch Logs Insights for point+click queries.
✅ Save key queries to dashboards (e.g., REJECTed traffic).
✅ Set alerts for traffic anomalies.

Want a screenshot-guided walkthrough for a specific scenario (e.g., investigating a DDoS attempt)? I can mock up the exact GUI steps.

1. High-Risk Actions That Can Break Traffic

A. Security Group Rule Modifications

Risk: Removing/updating rules can drop active connections.
Example:
- Revoking an inbound HTTPS (443) rule blocks new connections at once; tracked sessions may linger until they time out, while untracked ones drop immediately.
- Changing egress rules can disrupt outbound API calls.
Mitigation:
- Stage changes: Add new rules before removing old ones.
- Use temporary rules: Set short-lived rules (e.g., aws ec2 authorize-security-group-ingress --cidr 1.2.3.4/32 --port 443 --protocol tcp --group-id sg-123).

B. Network ACL (NACL) Updates

Risk: NACLs are stateless—updates drop existing connections.
Example:
- Adding a deny rule for 10.0.1.0/24 kills active TCP sessions.
Mitigation:
- Test in non-prod first.
- Modify NACLs during low-traffic windows.

C. Route Table Changes

Risk: Misrouting traffic (e.g., removing a NAT Gateway route).
Example:
- Deleting 0.0.0.0/0 → igw-123 makes public subnets unreachable.

Mitigation:

Pre-validate routes:

aws ec2 describe-route-tables --route-table-id rtb-123 --query 'RouteTables[*].Routes'

Use weighted routing (e.g., Transit Gateway) for failover.

D. NAT Gateway Replacement

Risk: Swapping NAT Gateways breaks long-lived connections (e.g., SFTP, WebSockets).
Mitigation:
- Preserve Elastic IPs (attach to new NAT Gateway first).
- Warm standby: Deploy new NAT Gateway before decommissioning old one.

2. Safe Troubleshooting Techniques

A. Passive Monitoring (Zero Impact)

Flow Logs: Query logs without touching infrastructure.

# CloudWatch Logs Insights (GUI)  
fields @timestamp, srcAddr, dstAddr, action  
| filter dstAddr = "10.0.2.5" and action = "REJECT"

VPC Traffic Mirroring: Copy traffic to a monitoring instance (no production impact).

B. Non-Destructive Testing

Packet Captures on Test Instances:

sudo tcpdump -i eth0 -w /tmp/capture.pcap host 10.0.1.10  # No service restart needed

Canary Deployments: Test changes on 1% of traffic (e.g., weighted ALB routes).

C. Connection-Preserving Changes

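Security Groups:
- Add the replacement rules before deleting the old ones. Security groups have no rule priority; all rules are evaluated together.
NACLs:
- Temporarily set Ephemeral Ports (32768-60999) to ALLOW during changes, using a lower rule number than any deny so the allow is evaluated first.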

3. Redundancy Patterns to Reduce Risk

| Scenario | Failover Strategy |
|---|---|
| NAT Gateway Failure | Deploy NAT Gateway per AZ + test failover. |
| Route Table Corruption | Use version-controlled Terraform rollback. |
| SG Lockout | Pre-configure backup admin access (e.g., SSM). |

4. Worst-Case Recovery Plan

Rollback Immediately:

Revert NACLs/SGs to last-known-good state.

aws ec2 replace-network-acl-association --association-id aclassoc-123 --network-acl-id acl-456

Bypass Troubleshooting:
- Use AWS Systems Manager (SSM) to debug instances without SSH (no SG changes needed).

Post-Mortem:

Check CloudTrail for who made changes:

aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-123

Key Takeaways

✅ Avoid Live Changes: Prefer read-only tools (Flow Logs, CloudWatch).
✅ Stage Changes: Test in non-prod, then deploy with canaries.
✅ Preserve State: Never drop NACL/SG rules without redundancy.
✅ Automate Rollbacks: Use Terraform/CloudFormation for quick recovery.

Pro Tip: For critical systems, run chaos engineering tests (e.g., simulate NAT failure) during off-hours to validate resilience.

Need a step-by-step rollback guide for a specific scenario? I can tailor it to your architecture.

You’re spot on—real-world war stories and battle-tested tricks are what separate theoretical knowledge from production-hardened expertise. Below is a concise technical guide distilled from AWS networking trenches, covering lessons learned, hidden pitfalls, and pro tips rarely found in docs.

AWS Networking War Stories: The Unwritten Guide

"Good judgment comes from experience. Experience comes from bad judgment."

1. Security Groups (SGs): The Silent Killers

War Story: The Case of the Phantom Timeouts

Symptoms: Intermittent HTTP timeouts between microservices.
Root Cause: Overlapping SG rules with different description fields but identical IP permissions. AWS silently dedupes them, causing random drops.

Fix:

# Audit duplicate rules (CLI reveals what GUI hides)
aws ec2 describe-security-groups --query 'SecurityGroups[*].IpPermissions' | jq '.[] | group_by(.FromPort, .ToPort, .IpProtocol, .IpRanges)[] | select(length > 1)'

Lesson: Never trust the GUI alone—use CLI to audit SGs.

Pro Tip: The "Deny All" Egress Trap

Mistake: Setting egress = [] in Terraform (defaults to deny all).
Outcome: Instances lose SSM, patch management, and API connectivity.

Fix: Always explicitly allow:

egress {
  from_port   = 0
  to_port     = 0
  protocol    = "-1"
  cidr_blocks = ["0.0.0.0/0"]  # Or restrict to necessary IPs
}

2. NACLs: The Stateless Nightmare

War Story: The 5-Minute Outage

Symptoms: Database replication breaks after NACL "minor update."
Root Cause: NACL rule #100 allowed TCP/3306, but rule #200 denied Ephemeral Ports (32768-60999)—breaking replies.

Fix:

# Allow ephemeral ports INBOUND for responses
aws ec2 create-network-acl-entry --network-acl-id acl-123 --rule-number 150 --protocol tcp --port-range From=32768,To=60999 --cidr-block 10.0.1.0/24 --rule-action allow --ingress

Lesson: NACLs need mirror rules for ingress/egress. Test with telnet before deploying.

Pro Tip: The Rule-Order Bomb

Mistake: Adding a deny rule at #50 after allowing at #100.
Outcome: Traffic silently drops (first match wins).

Fix: Use describe-network-acls to audit rule ordering:

aws ec2 describe-network-acls --query 'NetworkAcls[*].Entries[?RuleNumber==`50`]'
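The first-match behavior behind this pitfall can be modeled in a few lines. The sketch below is a simplified evaluator (single direction, TCP only, implicit deny at the end); the rule dictionaries and the name `evaluate_nacl` are hypothetical and do not reflect the AWS API shape.

```python
import ipaddress

def evaluate_nacl(rules, src_ip, port):
    """Evaluate rules in ascending rule number; the first match decides.
    Falls through to an implicit deny when nothing matches."""
    addr = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        in_cidr = addr in ipaddress.ip_network(rule["cidr"])
        in_ports = rule["from_port"] <= port <= rule["to_port"]
        if in_cidr and in_ports:
            return rule["action"]
    return "deny"

# Hypothetical rule set: a broad deny at #50 added after the allow at #100
rules = [
    {"number": 100, "cidr": "10.0.1.0/24", "from_port": 3306, "to_port": 3306, "action": "allow"},
    {"number": 50, "cidr": "10.0.0.0/16", "from_port": 0, "to_port": 65535, "action": "deny"},
]
print(evaluate_nacl(rules, "10.0.1.7", 3306))
```

Dropping rule #50 from the list flips the result for the same packet, which is a quick way to confirm that ordering, not the allow rule itself, is the problem.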

3. NAT Gateways: The $0.045/hr Landmine

War Story: The 4 AM Bill Shock

Symptoms: $3k/month bill from "idle" NAT Gateways.
Root Cause: Leftover NAT Gateways in unused AZs (auto-created by Terraform).

Fix:

# List NAT Gateways to review (idle ones show near-zero BytesOutToDestination in CloudWatch)
aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --query 'NatGateways[*].[NatGatewayId,SubnetId,CreateTime]'

Lesson: Always tag NAT Gateways with Owner and Expiry.
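A back-of-the-envelope estimate helps put idle gateways in perspective. The sketch below uses the $0.045/hr figure from the heading above and assumes roughly 730 hours per month; it covers only the hourly charge, not per-GB data processing, and the function name is made up for illustration.

```python
HOURLY_RATE_USD = 0.045  # hourly rate quoted in the heading above
HOURS_PER_MONTH = 730    # assumption: average hours in a month

def idle_nat_monthly_cost(gateway_count, hourly_rate=HOURLY_RATE_USD,
                          hours=HOURS_PER_MONTH):
    """Hourly-charge-only cost of leaving NAT Gateways running for a month."""
    return round(gateway_count * hourly_rate * hours, 2)

for count in (1, 4, 12):
    print(count, "gateways:", idle_nat_monthly_cost(count), "USD/month")
```

Actual bills depend on region and on data processing volume, so this is a floor for the idle cost rather than a forecast.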

Pro Tip: The TCP Connection Black Hole

Mistake: Replacing a NAT Gateway without draining connections.
Outcome: Active sessions (SSH, RDP, DB) hang until TCP timeout (30+ mins).
Fix:
- Before replacement: Reduce TCP timeouts on clients.
- Use Network Load Balancer (NLB) for stateful failover.

4. VPC Peering: The Cross-Account Trap

War Story: The DNS That Wasn’t

Symptoms: EC2 instances can’t resolve peered VPC’s private hosted zones.
Root Cause: Peering doesn’t auto-share Route53 Private Hosted Zones.

Fix:

# Authorize the PHZ association from the zone-owning account; the peer account then runs associate-vpc-with-hosted-zone
aws route53 create-vpc-association-authorization --hosted-zone-id Z123 --vpc VPCRegion=us-east-1,VPCId=vpc-456

Lesson: Test DNS resolution early in peering setups.

Pro Tip: The Overlapping CIDR Trap

Mistake: Peering 10.0.0.0/16 with another 10.0.0.0/16.
Outcome: AWS refuses to create the peering connection, and the conflict usually surfaces late, after both VPCs are already in use.
Fix: Always design non-overlapping CIDRs (e.g., 10.0.0.0/16 + 10.1.0.0/16).
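A pre-flight overlap check is easy to script with the standard `ipaddress` module. This sketch assumes a simple mapping of VPC names to CIDRs; the names and the helper `find_overlaps` are invented for the example.

```python
import ipaddress
from itertools import combinations

def find_overlaps(cidrs):
    """Return pairs of names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2)
            if nets[a].overlaps(nets[b])]

# Hypothetical inventory: "legacy" sits inside the "prod" range
vpcs = {"prod": "10.0.0.0/16", "dev": "10.1.0.0/16", "legacy": "10.0.128.0/17"}
print(find_overlaps(vpcs))
```

Running a check like this over the full CIDR inventory before allocating a new VPC catches partial overlaps (such as a /17 inside a /16), not just identical ranges.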

5. Direct Connect: The BGP Rollercoaster

War Story: The 1-Packet-Per-Second Mystery

Symptoms: Applications crawl over Direct Connect.
Root Cause: BGP keepalive set to 60s (default), causing route flapping.

Fix:

# BGP timers are set on the customer router; the AWS CLI does not expose them.
# Use the CLI to confirm the peer state after changing the router configuration:
aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]'

Lesson: Override defaults—set keepalive = 10s, holddown = 30s.

Pro Tip: The MTU Mismatch

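Mistake: Assuming jumbo frames (9001 MTU) work on every path. They work inside a VPC and on Direct Connect virtual interfaces with jumbo frames enabled, but traffic through an internet gateway or VPN is limited to 1500.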
Outcome: Packet fragmentation kills throughput.

Fix: Hard-set MTU to 1500 on on-prem routers:

# Linux example
ip link set dev eth0 mtu 1500

6. The Ultimate Troubleshooting Checklist

Before Making Changes:

Backup Configs:

aws ec2 describe-security-groups --query 'SecurityGroups[*].{GroupId:GroupId,IpPermissions:IpPermissions}' > sg-backup.json

Enable Flow Logs:

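# The log group name and IAM role ARN below are placeholders; CloudWatch Logs delivery requires both
aws ec2 create-flow-logs --resource-type VPC --resource-id vpc-123 --traffic-type ALL --log-destination-type cloud-watch-logs --log-group-name VPCFlowLogs --deliver-logs-permission-arn arn:aws:iam::123456789012:role/FlowLogsRole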

Test with Canary: Deploy changes to one AZ/subnet first.

When Things Break:

Rollback Fast: Use Terraform terraform apply -replace or CLI.
SSM Session Manager: Access instances without SSH (bypass broken SGs).

CloudTrail Forensics:

aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteSecurityGroup

Final Wisdom

Document Your "Murder Mystery" Stories: Every outage teaches something.
Automate Recovery: Use Lambda + EventBridge to auto-rollback NACL changes.
Pressure-Test Resiliency: Run GameDays (e.g., randomly kill NAT Gateways).

Want this as a PDF cheatsheet? I can structure it with more war stories and code snippets. Let me know!
