Update tech_docs/cloud/aws_notes.md

2025-07-20 21:29:18 -05:00
parent 22959c1090
commit 9d367f4f46

@@ -1,3 +1,171 @@
Here's a **mini-lab** to practice the killer skills from our discussion, using only AWS Free Tier resources where possible. You'll diagnose a real-world scenario, optimize costs, and enforce tagging, just like a cloud network SME would.

---
### **Lab: "The Case of the Phantom Bill"**
**Scenario**: Your company's AWS bill spiked by \$2,000 last month. The CFO is furious. You've been tasked to find and fix the issue.
#### **Lab Objectives**
1. **Find** the cost culprit using AWS tools
2. **Fix** the issue with zero downtime
3. **Prevent** recurrence via automation
---
### **Step 1: Set Up the Crime Scene**
**Deploy the problem environment (5 minutes)**:
```bash
# Look up a subnet and the configured region
# (this picks the first subnet returned; a public NAT Gateway belongs in a public subnet)
SUBNET_ID=$(aws ec2 describe-subnets --query 'Subnets[0].SubnetId' --output text)
AWS_REGION=$(aws configure get region)
# A public NAT Gateway requires an Elastic IP allocation
ALLOC_ID=$(aws ec2 allocate-address --domain vpc --region "$AWS_REGION" --query 'AllocationId' --output text)
# Deploy the "phantom bill" culprit: a NAT Gateway with no Owner or CostCenter tag
aws ec2 create-nat-gateway \
  --subnet-id "$SUBNET_ID" \
  --allocation-id "$ALLOC_ID" \
  --region "$AWS_REGION" \
  --tag-specifications 'ResourceType=natgateway,Tags=[{Key=Name,Value=UNUSED_NAT}]'
# Simulate poorly tagged dev resources (common brownfield mess)
# ami-0abcdef1234567890 is a placeholder; substitute a valid AMI ID for your region
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t2.micro \
  --subnet-id "$SUBNET_ID" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Environment,Value=dev}]'
```
---
### **Step 2: Investigate Like a SME**
#### **Skill 1: Cost Forensics with AWS CLI**
```bash
# Find the top 5 cost drivers for one month (replace the dates; End is exclusive)
# Amount is returned as a string, so sort with to_number() instead of filtering with "> 0"
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'sort_by(ResultsByTime[0].Groups, &to_number(Metrics.UnblendedCost.Amount))[-5:].{Service: Keys[0], USD: Metrics.UnblendedCost.Amount}' \
  --output table
```
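**Expected Finding**: NAT Gateway charges are abnormally high. In Cost Explorer they are typically grouped under `EC2 - Other` (usage types containing `NatGateway`) rather than under a service literally named `AmazonVPC`.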
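
The same ranking can be done in Python once the response is in hand. This is a minimal sketch, assuming the dictionary has the shape returned by boto3's Cost Explorer `get_cost_and_usage` call grouped by `SERVICE`; the `sample` data and the helper name `top_cost_drivers` are made up for illustration.

```python
def top_cost_drivers(response, n=5):
    """Return the n most expensive (service, cost) pairs from a
    Cost Explorer response grouped by SERVICE."""
    totals = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            # Amount arrives as a string, so convert before summing
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical sample shaped like a Cost Explorer response (values are invented)
sample = {
    "ResultsByTime": [
        {
            "Groups": [
                {"Keys": ["Amazon Simple Storage Service"],
                 "Metrics": {"UnblendedCost": {"Amount": "12.50", "Unit": "USD"}}},
                {"Keys": ["EC2 - Other"],
                 "Metrics": {"UnblendedCost": {"Amount": "2032.40", "Unit": "USD"}}},
                {"Keys": ["AWS Lambda"],
                 "Metrics": {"UnblendedCost": {"Amount": "0.00", "Unit": "USD"}}},
            ]
        }
    ]
}

for service, cost in top_cost_drivers(sample, n=2):
    print(f"{service}: ${cost:.2f}")
```

Summing across periods means the helper also works when the query spans several months.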
#### **Skill 2: Packet-Level Verification**
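Check whether the NAT Gateway is actually used. Packets leaving a private instance carry the real destination address, not the NAT's public IP, so filter on egress traffic rather than on the NAT's address:
```bash
# Get the NAT Gateway ID
NAT_ID=$(aws ec2 describe-nat-gateways --filter Name=state,Values=available --query 'NatGateways[0].NatGatewayId' --output text)
# Check whether any route table sends traffic to it
aws ec2 describe-route-tables --filters Name=route.nat-gateway-id,Values="$NAT_ID" \
  --query 'RouteTables[].RouteTableId' --output text
# On an EC2 instance in a private subnet: capture traffic leaving the VPC
# (replace 10.0.0.0/16 with your VPC CIDR; the interface may be ens5 instead of eth0)
sudo tcpdump -i eth0 -nn -c 10 -w nat_traffic.pcap 'not dst net 10.0.0.0/16'
```
**Analysis**: No route table references the NAT and no egress packets show up? The NAT is unused.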
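
A packet capture only shows what one instance sends; a NAT Gateway can carry traffic only if some route table points at it. A hedged sketch of that complementary check, assuming `route_tables` is the `RouteTables` list from boto3's `describe_route_tables`. The helper name and all IDs below are hypothetical.

```python
def route_tables_using_nat(route_tables, nat_gateway_id):
    """Return IDs of route tables that have at least one route
    targeting the given NAT Gateway."""
    return [
        table["RouteTableId"]
        for table in route_tables
        if any(
            route.get("NatGatewayId") == nat_gateway_id
            for route in table.get("Routes", [])
        )
    ]


# Hypothetical data shaped like describe_route_tables()["RouteTables"]
sample_tables = [
    {"RouteTableId": "rtb-public",
     "Routes": [{"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-111"}]},
    {"RouteTableId": "rtb-private",
     "Routes": [{"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-aaa"}]},
]

in_use = route_tables_using_nat(sample_tables, "nat-bbb")
print("NAT is unreferenced" if not in_use else f"Referenced by: {in_use}")
```

An empty result is strong evidence the gateway is idle, though it is worth confirming before deleting anything.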
---
### **Step 3: Fix & Automate**
#### **Skill 3: Zero-Downtime Remediation**
```bash
# Step 1: Tag the NAT for deletion (avoid killing active resources)
# date -d is GNU date syntax; on macOS use: date -v+7d +%Y-%m-%d
aws ec2 create-tags \
  --resources $(aws ec2 describe-nat-gateways --filter Name=state,Values=available --query 'NatGateways[0].NatGatewayId' --output text) \
  --tags Key=ExpirationDate,Value=$(date -d "+7 days" +%Y-%m-%d)
# Step 2: Deploy Lambda auto-cleanup (prevents future issues)
cat > lambda_function.py <<'EOF'
import datetime

import boto3


def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    today = datetime.datetime.now().strftime('%Y-%m-%d')
    # describe_nat_gateways names this parameter "Filter" (singular)
    expired = ec2.describe_nat_gateways(Filter=[
        {'Name': 'tag:ExpirationDate', 'Values': [today]},
        # Skip gateways that are already deleted or deleting
        {'Name': 'state', 'Values': ['available']},
    ])
    for nat in expired['NatGateways']:
        ec2.delete_nat_gateway(NatGatewayId=nat['NatGatewayId'])
EOF
# Package the code: --zip-file expects a .zip archive, not a raw .py file
zip function.zip lambda_function.py
# Deploy Lambda (Python 3.9); assumes the role lambda-execution-role already exists
aws lambda create-function \
  --function-name CleanupNATs \
  --runtime python3.9 \
  --handler lambda_function.lambda_handler \
  --role arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/lambda-execution-role \
  --zip-file fileb://function.zip
```
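
Matching only today's date means a gateway survives indefinitely if the cleanup misses its exact expiration day. One possible variation, sketched here as a pure function so it can be checked without AWS access, treats any `ExpirationDate` on or before today as expired. It assumes `nat_gateways` is the `NatGateways` list from `describe_nat_gateways`; the helper name and sample IDs are hypothetical.

```python
import datetime


def expired_nat_ids(nat_gateways, today):
    """Return IDs of available NAT Gateways whose ExpirationDate tag
    is on or before `today` (a datetime.date)."""
    expired = []
    for nat in nat_gateways:
        if nat.get("State") != "available":
            continue  # already deleted or still being created
        tags = {tag["Key"]: tag["Value"] for tag in nat.get("Tags", [])}
        raw = tags.get("ExpirationDate")
        if raw is None:
            continue  # untagged gateways are left alone
        try:
            expires = datetime.date.fromisoformat(raw)
        except ValueError:
            continue  # ignore malformed dates instead of deleting
        if expires <= today:
            expired.append(nat["NatGatewayId"])
    return expired


# Hypothetical data shaped like describe_nat_gateways()["NatGateways"]
sample_nats = [
    {"NatGatewayId": "nat-old", "State": "available",
     "Tags": [{"Key": "ExpirationDate", "Value": "2024-01-10"}]},
    {"NatGatewayId": "nat-future", "State": "available",
     "Tags": [{"Key": "ExpirationDate", "Value": "2024-02-01"}]},
    {"NatGatewayId": "nat-gone", "State": "deleted",
     "Tags": [{"Key": "ExpirationDate", "Value": "2024-01-01"}]},
    {"NatGatewayId": "nat-untagged", "State": "available", "Tags": []},
]

print(expired_nat_ids(sample_nats, datetime.date(2024, 1, 15)))
```

Inside the handler, the returned IDs would be passed to `delete_nat_gateway` one at a time.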
#### **Skill 4: Tag Enforcement**
```bash
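# Deploy an SCP that blocks untagged resources.
# One statement per tag: keys inside a single condition block are ANDed,
# so a combined block would deny only requests missing BOTH tags.
# Resource is scoped to the taggable resource types; "*" would also match
# the AMI and subnet in a RunInstances call, which never carry request tags.
aws organizations create-policy \
  --name "RequireTags" \
  --description "No tags, no resources" \
  --type SERVICE_CONTROL_POLICY \
  --content '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "RequireOwnerTag",
        "Effect": "Deny",
        "Action": ["ec2:RunInstances", "ec2:CreateNatGateway"],
        "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:natgateway/*"],
        "Condition": {"Null": {"aws:RequestTag/Owner": "true"}}
      },
      {
        "Sid": "RequireCostCenterTag",
        "Effect": "Deny",
        "Action": ["ec2:RunInstances", "ec2:CreateNatGateway"],
        "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:natgateway/*"],
        "Condition": {"Null": {"aws:RequestTag/CostCenter": "true"}}
      }
    ]
  }'
# The policy has no effect until it is attached to an OU or account (aws organizations attach-policy)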
```
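
In IAM policy evaluation, multiple keys under one condition operator are combined with AND, so a single `Null` block listing two tags denies only requests that are missing both. A small sketch of generating one `Deny` statement per required tag instead; the function name `require_tags_policy` is hypothetical, and the resource ARNs are an assumption about which resource types should be covered.

```python
import json


def require_tags_policy(tag_keys, actions, resources):
    """Build an SCP document with one Deny statement per required tag,
    so that a request missing ANY of the tags is rejected."""
    statements = [
        {
            "Sid": f"Require{key}Tag",
            "Effect": "Deny",
            "Action": list(actions),
            "Resource": list(resources),
            # Null == "true" matches when the tag is absent from the request
            "Condition": {"Null": {f"aws:RequestTag/{key}": "true"}},
        }
        for key in tag_keys
    ]
    return {"Version": "2012-10-17", "Statement": statements}


policy = require_tags_policy(
    tag_keys=["Owner", "CostCenter"],
    actions=["ec2:RunInstances", "ec2:CreateNatGateway"],
    # Assumed scope: only the taggable resources these actions create
    resources=["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:natgateway/*"],
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be passed as the policy content; adding a third required tag is then a one-word change.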
---
### **Step 4: Prove Your Value**
**Generate a Cost Savings Report**:
```bash
# Calculate savings (NAT Gateway: $0.045/hr * 24 * 30 = $32.40/month)
echo "## Monthly Savings Report" > report.md
echo "- **Deleted Unused NAT Gateway**: \$32.40/month" >> report.md
echo "- **Prevented Future Waste**: \$100+/month (estimated)" >> report.md
echo "**Total Annualized Savings**: \$1,588.80" >> report.md
# Share with leadership
aws ses send-email \
--from "you@company.com" \
--to "boss@company.com" \
--subject "Cost Optimization Results" \
--text file://report.md
```
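
The figures in the report follow from simple arithmetic, sketched below so the numbers can be recomputed for a different rate or a different waste estimate. The \$0.045/hr rate and the \$100/month estimate come from the report itself; actual NAT Gateway pricing varies by region and excludes data processing charges.

```python
NAT_HOURLY_RATE_USD = 0.045   # rate quoted in the report; varies by region
HOURS_PER_MONTH = 24 * 30     # 30-day month, as in the report


def monthly_nat_cost(hourly_rate=NAT_HOURLY_RATE_USD, hours=HOURS_PER_MONTH):
    """Hourly charge for one idle NAT Gateway over a month."""
    return hourly_rate * hours


def annualized_savings(monthly_amounts):
    """Sum the monthly savings items and project them over 12 months."""
    return sum(monthly_amounts) * 12


nat_monthly = monthly_nat_cost()
# 100.0 is the report's rough estimate for prevented future waste
total_annual = annualized_savings([nat_monthly, 100.0])

print(f"Deleted unused NAT Gateway: ${nat_monthly:.2f}/month")
print(f"Total annualized savings: ${total_annual:.2f}")
```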
---
### **Lab Extensions (Bonus Points)**
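1. **Find Cross-AZ Traffic**:
```bash
# Cross-AZ transfer is billed as DataTransfer-Regional-Bytes; outside us-east-1 the name has a region prefix (e.g. USE2-)
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 --granularity MONTHLY --metrics "UnblendedCost" \
--filter '{"Dimensions": {"Key": "USAGE_TYPE", "Values": ["DataTransfer-Regional-Bytes"]}}'
```
2. **Set Up Budget Alarms**:
```bash
# The Service filter value must match the service name your account reports in Cost Explorer
aws budgets create-budget \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget '{"BudgetName": "NAT-Gateway-Alert", "BudgetLimit": {"Amount": "50", "Unit": "USD"}, "TimeUnit": "MONTHLY", "BudgetType": "COST", "CostFilters": {"Service": ["AmazonVPC"]}}'
```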
---
### **Why This Lab Matters**
- **Real AWS Resources**: Uses actual billable services. The t2.micro instance can be Free Tier eligible; the NAT Gateway is not, so it accrues hourly charges until deleted.
- **SME Skills Practiced**:
- Cost analysis via CLI
- Packet-level verification
- Zero-downtime fixes
- Tag governance
- **Career Impact**: These exact skills have gotten engineers promoted.
**Time to Complete**: ~30 minutes.
**Cost**: < \$0.50, provided you delete the NAT Gateway (and release any Elastic IP attached to it) right after the lab instead of waiting for the 7-day expiration tag.

---
Here's the **killer skill set** that combines "boring" fundamentals with cloud-native expertise to make you the **unquestioned SME**, the one who fixes what others can't, optimizes what others overlook, and becomes indispensable:

---