Update tech_docs/cloud/aws_notes.md

This commit is contained in:
2025-07-20 21:07:56 -05:00
parent 5a719ed2b6
commit 947b14ec5f

@@ -1,3 +1,170 @@
A **Cloud Network SME** operates at the same level of mastery as a traditional network engineer, but with a cloud-native lens. Here's what they have **top of mind**, structured like the OSI model for clarity:

---
### **1. Addressing & Segmentation (Cloud's "Layer 3")**
#### **Top of Mind:**
- **RFC 1918 in the Cloud**:
  - Knows `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, but also:
    - **Special-Purpose Ranges**: `169.254.0.0/16` (link-local, used for instance metadata and other AWS services), `100.64.0.0/10` (carrier-grade NAT shared space, RFC 6598)
  - **Avoids Overlaps**: Never tries to peer `10.0.0.0/16` with another `10.0.0.0/16`; VPC peering cannot be established between VPCs with overlapping CIDRs.
- **Subnetting at Scale**:
  - **/28 Minimum in AWS** (5 IPs reserved per subnet, leaving 11 usable).
  - **AZ-Aware Design**:
```bash
# Example: 10.0.0.0/16 → /20 per AZ (AWS best practice)
us-east-1a: 10.0.0.0/20
us-east-1b: 10.0.16.0/20
```
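
The addressing rules above can be checked offline before any VPC exists. The following is a minimal sketch using Python's standard `ipaddress` module; the helper names (`carve_per_az`, `usable_ips`, `overlaps`) are illustrative and not part of any AWS tooling, and the /20-per-AZ split simply mirrors the example layout above.

```python
import ipaddress

# AWS reserves 5 addresses in every subnet (first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def carve_per_az(vpc_cidr, azs, new_prefix=20):
    """Assign one /new_prefix subnet to each AZ, in the order given."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)
    return {az: next(subnets) for az in azs}


def usable_ips(subnet):
    """Addresses left for workloads after the AWS-reserved ones."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


def overlaps(cidr_a, cidr_b):
    """True if the two CIDR blocks share any address (peering would be refused)."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


plan = carve_per_az("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
for az, subnet in plan.items():
    print(az, subnet, usable_ips(subnet))
```

Running `overlaps()` against every CIDR already in use is a cheap pre-check before requesting a peering connection or attaching a VPC to a shared transit hub.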
#### **CLI Command They Use Daily:**
```bash
aws ec2 describe-subnets --query 'Subnets[*].{AZ:AvailabilityZone,CIDR:CidrBlock,Name:Tags[?Key==`Name`].Value|[0]}' --output table
```
---
### **2. Cloud "Layer 4" Mastery (Transport Layer)**
#### **Top of Mind:**
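- **Stateful vs. Stateless**:
  - **Security Groups (Stateful)**: Return traffic auto-allowed.
  - **NACLs (Stateless)**: Must explicitly allow return traffic on ephemeral ports. The Linux default is `32768-60999`, but allowing `1024-65535` covers other clients and NAT gateways.
- **Port Knowledge**:
  - **Not Just 80/443**:
    - `179` (TCP, BGP over Direct Connect)
    - `6081` (UDP, Geneve for Gateway Load Balancer)
    - `4789` (UDP, VXLAN for VPC Traffic Mirroring)
    - `53` (UDP/TCP, DNS, including private DNS names for PrivateLink endpoints)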
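
Because NACLs are stateless, the return path has to be covered by explicit allow rules. The sketch below checks whether a set of allowed port ranges fully covers an ephemeral range. It is a simplified model under stated assumptions: it looks only at allow rules for one direction and ignores rule numbering, deny rules, protocols and CIDRs. The names `covers`, `EPHEMERAL_LINUX` and `EPHEMERAL_WIDE` are illustrative.

```python
EPHEMERAL_LINUX = (32768, 60999)  # common Linux default range
EPHEMERAL_WIDE = (1024, 65535)    # wide range that covers varied clients


def covers(allowed_ranges, needed):
    """True if the union of allowed (low, high) port ranges covers all of `needed`."""
    low, high = needed
    next_port = low  # lowest port in `needed` not yet known to be allowed
    for a_low, a_high in sorted(allowed_ranges):
        if a_low > next_port:
            break  # gap: next_port is not covered by any range
        next_port = max(next_port, a_high + 1)
        if next_port > high:
            return True
    return next_port > high


# A NACL that only allows the Linux default range misses part of the wide range.
print(covers([EPHEMERAL_LINUX], EPHEMERAL_WIDE))
print(covers([EPHEMERAL_WIDE], EPHEMERAL_LINUX))
```

A check like this could run against the port ranges pulled from a NACL's entries before a change is applied.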
#### **War Story:**
*"Why is my NAT Gateway not working?"*
→ Forgot to allow inbound return traffic on `1024-65535` in the private subnet's NACL.
#### **CLI Command They Use Daily:**
```bash
# Check ephemeral port range on Linux instances
cat /proc/sys/net/ipv4/ip_local_port_range
```
---
### **3. Cloud "Layer 7" (Application Layer)**
#### **Top of Mind:**
- **Load Balancer Types**:
| Type | Use Case | Key Detail |
|------|----------|------------|
| ALB | HTTP/HTTPS | Supports path-based routing (`/api/*`) |
| NLB | Ultra-low latency | Preserves source IP (no X-Forwarded-For) |
| GWLB | Threat inspection | Chains with Firewall (Palo Alto, Fortinet) |
- **PrivateLink**:
  - Knows the `com.amazonaws.vpce.{region}.vpce-svc-xxxx` endpoint service name format.
  - **Gotcha**: Doesn't auto-share Route 53 Private Hosted Zones.
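
The ALB row in the table above mentions path-based routing. As a rough illustration of how first-match rule evaluation behaves, the sketch below uses Python's `fnmatchcase` as a stand-in for the ALB pattern matcher. That is an approximation: `fnmatch` also accepts `[seq]` character classes, which are not part of ALB path patterns. The rule list and target names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical listener rules, evaluated in priority order.
RULES = [
    ("/api/*", "api-targets"),
    ("/static/*", "static-targets"),
]


def route(path, rules=RULES, default="default-targets"):
    """Return the target group of the first rule whose pattern matches the path."""
    for pattern, target in rules:
        if fnmatchcase(path, pattern):  # case-sensitive wildcard match
            return target
    return default


print(route("/api/users"))
print(route("/index.html"))
```

Note that in this sketch `/api` without a trailing slash falls through to the default, since the pattern `/api/*` requires the slash.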
#### **CLI Command They Use Daily:**
```bash
aws ec2 describe-vpc-endpoint-services --query 'ServiceDetails[?ServiceType[0].ServiceType==`Interface`].ServiceName'
```
---
### **4. Cloud-Specific Protocols**
#### **Top of Mind:**
- **Geneve (UDP 6081)**:
  - Encapsulation protocol between a Gateway Load Balancer and its appliances.
- **BGP over Direct Connect**:
  - The common router default of `keepalive=60s` (hold time `180s`) is too slow for failure detection; sets keepalive to `10s`.
- **VXLAN (UDP 4789)**:
  - Encapsulation used by VPC Traffic Mirroring. Transit Gateway Connect attachments, by contrast, use GRE tunnels.
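
To make the BGP timer point concrete: per RFC 4271, two peers use the smaller of their proposed hold times, and a keepalive interval of one third of the hold time is the usual convention. The sketch below captures only that arithmetic; the function names are illustrative, and it does not model BFD or any vendor-specific behavior.

```python
def negotiated_hold_time(local_hold, peer_hold):
    """BGP peers use the smaller of the two proposed hold times (RFC 4271)."""
    return min(local_hold, peer_hold)


def keepalive_for(hold_time):
    """Conventional keepalive interval: one third of the hold time."""
    return hold_time // 3


# A router proposing 180s against a peer proposing 90s ends up at 90s,
# which is also roughly the worst-case time to notice a dead peer.
hold = negotiated_hold_time(180, 90)
print(hold, keepalive_for(hold))
```

Lowering the hold time to 30 seconds in this model gives a 10 second keepalive, which matches the tuning described above.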
#### **War Story:**
*"Why is my Direct Connect flapping?"*
→ BGP hold timer was left at the default (`180s`).
#### **CLI Command They Use Daily:**
```bash
aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]'
```
---
### **5. Troubleshooting Tools (Like `tcpdump` for Cloud)**
#### **Top of Mind:**
- **Flow Logs**:
  - Query with CloudWatch Logs Insights:
```sql
fields @timestamp, srcAddr, dstAddr, action | filter action="REJECT" | sort @timestamp desc
```
- **VPC Traffic Mirroring**:
  - Copies traffic to an analysis instance (like SPAN in traditional networks).
- **Reachability Analyzer**:
  - Pre-checks paths before making changes.
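
The same "show me the rejects, newest first" filter can be applied to raw flow log lines, for example ones pulled from S3. The sketch below assumes the default version 2 record format (space-separated fields in the order listed in `FIELDS`); the sample records are illustrative, and the function names are not from any AWS library.

```python
# Field order of the default (version 2) flow log record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

# Illustrative records, not real traffic.
SAMPLE = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]


def parse_record(line):
    """Split one space-separated record into a dict keyed by field name."""
    return dict(zip(FIELDS, line.split()))


def rejected(lines):
    """Return REJECT records, newest window first (by the start timestamp)."""
    records = (parse_record(line) for line in lines)
    hits = [r for r in records if r["action"] == "REJECT"]
    return sorted(hits, key=lambda r: int(r["start"]), reverse=True)


for record in rejected(SAMPLE):
    print(record["srcaddr"], "->", record["dstaddr"], record["dstport"])
```

Custom log formats change the field order, so `FIELDS` would need to match whatever format the flow log was created with.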
#### **CLI Command They Use Daily:**
```bash
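# Define the path to test; run it afterwards with start-network-insights-analysis
aws ec2 create-network-insights-path --source <source-eni-id> --destination <destination-eni-id> --destination-port 443 --protocol tcp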
```
---
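### **6. Cloud Network Limits (Like MTU in Traditional Networks)**
#### **Top of Mind:**
- **AWS MTU**: **9001** (jumbo frames) within a VPC, but **1500** through an internet gateway or VPN. Direct Connect supports jumbo frames only when enabled on the virtual interface.
- **NAT Gateway Throughput**:
  - Starts at 5 Gbps and scales automatically up to **100 Gbps**, but a single flow is limited to about 5 Gbps.
- **Security Group Limits**:
  - Default quotas: 60 inbound and 60 outbound rules per SG, 5 SGs per ENI (both adjustable).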
#### **War Story:**
*"Why is my throughput capped at 5 Gbps?"*
→ Single TCP flow hitting NAT Gateway limit.
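
The arithmetic behind this war story is simple enough to sketch. The snippet below is a back-of-the-envelope model only: it takes the 5 Gbps per-flow and 100 Gbps gateway figures quoted above as constants and ignores instance bandwidth limits and traffic burstiness. The function names are illustrative.

```python
import math

PER_FLOW_GBPS = 5      # single-flow ceiling quoted above
NAT_GW_MAX_GBPS = 100  # NAT gateway ceiling quoted above


def max_throughput_gbps(flows):
    """Upper bound on aggregate throughput for a given number of parallel flows."""
    return min(flows * PER_FLOW_GBPS, NAT_GW_MAX_GBPS)


def flows_needed(target_gbps):
    """Minimum number of parallel flows needed to reach the target rate."""
    if target_gbps > NAT_GW_MAX_GBPS:
        raise ValueError("target exceeds what a single NAT gateway can carry")
    return math.ceil(target_gbps / PER_FLOW_GBPS)


print(max_throughput_gbps(1))  # one flow cannot exceed the per-flow ceiling
print(flows_needed(20))
```

In practice this means spreading a large transfer across several connections rather than expecting one TCP stream to go faster.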
#### **CLI Command They Use Daily:**
```bash
aws ec2 describe-account-attributes --query 'AccountAttributes[?AttributeName==`max-instances`].AttributeValues'
```
---
### **7. Automation Mindset (Like Config Templates)**
#### **Top of Mind:**
- **Infrastructure as Code (IaC)**:
  - Terraform snippets for zero-downtime SG updates:
```hcl
resource "aws_security_group_rule" "temp_rule" {
  # Rule arguments (type, ports, protocol, security_group_id) omitted for brevity
  lifecycle { create_before_destroy = true }
}
```
- **AWS APIs**:
  - Uses `modify-network-interface-attribute` over console clicks.
#### **CLI Command They Use Daily:**
```bash
aws ec2 modify-instance-metadata-options --instance-id i-123abc --http-put-response-hop-limit 2
```
---
### **The Cloud Network SME's Cheat Sheet**
| **Traditional** | **Cloud Equivalent** |
|-----------------------|------------------------------------|
| Subnetting | VPC CIDR design + AZ distribution |
| BGP | Direct Connect BGP timers |
| SPAN port | VPC Traffic Mirroring |
| Firewall rules | Security Groups + NACLs |
| tcpdump | Flow Logs + Athena SQL |

**Final Tip:** A true cloud SME doesn't just *know* these; they automate them. For example:
```bash
# Auto-remediate overly permissive SGs
aws ec2 revoke-security-group-egress --group-id sg-123 --ip-permissions 'IpProtocol=-1,FromPort=-1,ToPort=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
```
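
Before revoking anything automatically, it helps to detect which groups actually carry the allow-all egress rule. The sketch below works on a plain dictionary shaped like one entry of `describe-security-groups` output (`IpPermissionsEgress`, `IpRanges`, `Ipv6Ranges`); the sample group and the function name `open_egress_rules` are hypothetical, and no AWS call is made.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_egress_rules(security_group):
    """Return egress rules that allow all protocols to any address."""
    findings = []
    for rule in security_group.get("IpPermissionsEgress", []):
        cidrs = {r["CidrIp"] for r in rule.get("IpRanges", [])}
        cidrs |= {r["CidrIpv6"] for r in rule.get("Ipv6Ranges", [])}
        if rule.get("IpProtocol") == "-1" and cidrs & OPEN_CIDRS:
            findings.append(rule)
    return findings


# Hypothetical group: one allow-all rule and one scoped HTTPS rule.
sample_group = {
    "GroupId": "sg-123",
    "IpPermissionsEgress": [
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "10.0.0.0/8"}]},
    ],
}

for rule in open_egress_rules(sample_group):
    print(sample_group["GroupId"], "has allow-all egress:", rule)
```

Only the groups this function flags would then be passed to a revoke command like the one above, ideally after a dry run or review step.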
---
# **Deep Dive: Mastering AWS Flow Logs for Advanced Troubleshooting**
## **1. Flow Logs Fundamentals**