Update tech_docs/cloud/aws_notes.md
@@ -1,3 +1,171 @@
Here’s a **mini-lab** to practice the killer skills from our discussion, using only AWS Free Tier resources where possible. You’ll diagnose a real-world scenario, optimize costs, and enforce tagging, just like a cloud network SME would.

---

### **Lab: "The Case of the Phantom Bill"**
**Scenario**: Your company’s AWS bill spiked by \$2,000 last month. The CFO is furious. You’ve been tasked with finding and fixing the issue.

#### **Lab Objectives**
1. **Find** the cost culprit using AWS tools
2. **Fix** the issue with zero downtime
3. **Prevent** recurrence via automation

---

### **Step 1: Set Up the Crime Scene**
**Deploy the problem environment (5 minutes)**:
```bash
# Pick a VPC and a subnet inside it (assumes a public subnet with an internet gateway route)
VPC_ID=$(aws ec2 describe-vpcs --query 'Vpcs[0].VpcId' --output text)
SUBNET_ID=$(aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" --query 'Subnets[0].SubnetId' --output text)
AWS_REGION=$(aws configure get region)

# A public NAT Gateway requires an Elastic IP allocation
ALLOC_ID=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)

# Deploy the rogue NAT Gateway (the "phantom bill" culprit; note the missing Owner/CostCenter tags)
aws ec2 create-nat-gateway \
  --subnet-id "$SUBNET_ID" \
  --allocation-id "$ALLOC_ID" \
  --region "$AWS_REGION" \
  --tag-specifications 'ResourceType=natgateway,Tags=[{Key=Name,Value=UNUSED_NAT}]'

# Simulate a poorly tagged dev instance (common brownfield mess)
# ami-0abcdef1234567890 is a placeholder; substitute a valid AMI ID for your region
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t2.micro \
  --subnet-id "$SUBNET_ID" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Environment,Value=dev}]'
```

---

### **Step 2: Investigate Like a SME**
#### **Skill 1: Cost Forensics with AWS CLI**
```bash
# Find the top 5 cost drivers for the month (replace dates; the End date is exclusive)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'sort_by(ResultsByTime[0].Groups, &to_number(Metrics.UnblendedCost.Amount))[-5:].{Service: Keys[0], USD: Metrics.UnblendedCost.Amount}' \
  --output table
```
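**Expected Finding**: One service stands out as abnormally high. When grouping by service, Cost Explorer typically reports NAT Gateway hours under `EC2 - Other` rather than under a dedicated VPC line item.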
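
To see what the `--query` expression is doing, the same top-5 ranking can be sketched offline in Python. The `sample_response` below is made-up data shaped like a `get-cost-and-usage` response (the service names and amounts are illustrative, not real billing output), and `top_cost_drivers` is a hypothetical helper name. The main point is that Cost Explorer returns amounts as strings, so they need converting before any sort or comparison.

```python
from decimal import Decimal

# Made-up data in the shape returned by `aws ce get-cost-and-usage`
# grouped by SERVICE; the names and amounts are illustrative only.
sample_response = {
    "ResultsByTime": [{
        "Groups": [
            {"Keys": ["Amazon Simple Storage Service"],
             "Metrics": {"UnblendedCost": {"Amount": "12.10", "Unit": "USD"}}},
            {"Keys": ["EC2 - Other"],
             "Metrics": {"UnblendedCost": {"Amount": "2032.40", "Unit": "USD"}}},
            {"Keys": ["AWS Lambda"],
             "Metrics": {"UnblendedCost": {"Amount": "0", "Unit": "USD"}}},
            {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
             "Metrics": {"UnblendedCost": {"Amount": "9.5", "Unit": "USD"}}},
        ]
    }]
}


def top_cost_drivers(response, n=5):
    """Return up to n (service, cost) pairs, most expensive first."""
    rows = []
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            # Amounts arrive as strings; sorted as text, "9.5" would rank above "2032.40".
            cost = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            if cost > 0:
                rows.append((group["Keys"][0], cost))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


for service, cost in top_cost_drivers(sample_response):
    print(f"{service:45s} ${cost:>10.2f}")
```

Zero-cost services are dropped before ranking, which keeps the table focused on line items that actually contribute to the bill.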

#### **Skill 2: Packet-Level Verification**
Check whether the NAT Gateway is actually carrying traffic:
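```bash
# Get the lab NAT Gateway ID
NAT_ID=$(aws ec2 describe-nat-gateways --filter Name=tag:Name,Values=UNUSED_NAT Name=state,Values=available --query 'NatGateways[0].NatGatewayId' --output text)

# A tcpdump filter on the NAT's public IP never matches on a private instance (the NAT is a next hop,
# not a packet endpoint), so read the gateway's own packet counters for the last 7 days (GNU date syntax)
aws cloudwatch get-metric-statistics --namespace AWS/NATGateway --metric-name PacketsOutToDestination \
  --dimensions Name=NatGatewayId,Value="$NAT_ID" --period 86400 --statistics Sum \
  --start-time "$(date -u -d '-7 days' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)"
```
**Analysis**: If no packets are recorded, the NAT Gateway is unused.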
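
A single observation can mislead, so one hedged way to turn "is it used?" into a repeatable check is to sum the gateway's packet counters over several days. The sketch below assumes datapoints shaped like CloudWatch `get-metric-statistics` output for the `AWS/NATGateway` metric `PacketsOutToDestination`; the sample values and the `is_idle` helper are hypothetical.

```python
def is_idle(datapoints, threshold=0):
    """Treat a NAT Gateway as idle when its summed packet count does not exceed threshold.

    An empty list (no datapoints reported at all) also counts as idle.
    """
    total = sum(point["Sum"] for point in datapoints)
    return total <= threshold


# Hypothetical samples: a gateway reporting zero packets vs. one doing real work.
quiet = [{"Timestamp": "2024-01-01T00:00:00Z", "Sum": 0.0, "Unit": "Count"}]
busy = [{"Timestamp": "2024-01-01T00:00:00Z", "Sum": 18234.0, "Unit": "Count"},
        {"Timestamp": "2024-01-02T00:00:00Z", "Sum": 912.0, "Unit": "Count"}]

print(is_idle(quiet))
print(is_idle(busy))
print(is_idle([]))
```

A non-zero `threshold` can be useful if background noise (health checks, stray DNS lookups) should not count as real usage; the right value is a judgment call for your environment.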

---

### **Step 3: Fix & Automate**
#### **Skill 3: Zero-Downtime Remediation**
```bash
# Step 1: Tag the NAT for deletion (avoid killing active resources)
aws ec2 create-tags \
  --resources "$(aws ec2 describe-nat-gateways --filter Name=tag:Name,Values=UNUSED_NAT Name=state,Values=available --query 'NatGateways[0].NatGatewayId' --output text)" \
  --tags Key=ExpirationDate,Value="$(date -u -d "+7 days" +%Y-%m-%d)"

# Step 2: Deploy Lambda auto-cleanup (prevents future issues)
cat > lambda_function.py <<'EOF'
import datetime

import boto3


def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    # Use UTC so the comparison matches the UTC date written to the tag
    today = datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%d')
    expired = ec2.describe_nat_gateways(Filters=[{
        'Name': 'tag:ExpirationDate',
        'Values': [today]
    }])
    for nat in expired['NatGateways']:
        ec2.delete_nat_gateway(NatGatewayId=nat['NatGatewayId'])
EOF

# Package and deploy the Lambda (Python 3.12; create-function expects a zip archive).
# Assumes lambda-execution-role already exists and allows ec2:DescribeNatGateways and ec2:DeleteNatGateway.
# Schedule the function daily (for example with an EventBridge rule) so the handler actually runs.
zip function.zip lambda_function.py
aws lambda create-function \
  --function-name CleanupNATs \
  --runtime python3.12 \
  --handler lambda_function.lambda_handler \
  --role "arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/lambda-execution-role" \
  --zip-file fileb://function.zip
```
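
The cleanup hinges on comparing the `ExpirationDate` tag with the current date. An exact match on today's date skips any gateway whose expiry day passed without the function running, so a more forgiving variant deletes on or after the tagged date. The following is a minimal sketch of that comparison only; `is_expired` is a hypothetical helper, not part of boto3, and malformed tag values are deliberately treated as not expired.

```python
import datetime


def is_expired(tag_value, today=None):
    """Return True when tag_value (YYYY-MM-DD) is today or earlier, in UTC."""
    if today is None:
        today = datetime.datetime.now(datetime.timezone.utc).date()
    try:
        expires = datetime.date.fromisoformat(tag_value)
    except (TypeError, ValueError):
        # Assumption: a missing or malformed tag should never trigger a deletion.
        return False
    return expires <= today


reference = datetime.date(2024, 1, 15)
print(is_expired("2024-01-08", reference))  # expiry date already passed
print(is_expired("2024-01-15", reference))  # expires on the reference date
print(is_expired("2024-02-01", reference))  # still in the future
print(is_expired("next week", reference))   # malformed value
```

Passing `today` explicitly keeps the function easy to test; inside a Lambda handler it could be left as `None` so the current UTC date is used.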
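
#### **Skill 4: Tag Enforcement**
```bash
# Create an SCP to block untagged resources (attach it afterwards with `aws organizations attach-policy`).
# Keys under one condition operator are ANDed, so each required tag gets its own Deny statement.
aws organizations create-policy \
  --name "RequireTags" \
  --description "No tags, no resources" \
  --type SERVICE_CONTROL_POLICY \
  --content '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "RequireOwnerTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateNatGateway"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:natgateway/*"],
      "Condition": {"Null": {"aws:RequestTag/Owner": "true"}}
    }, {
      "Sid": "RequireCostCenterTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateNatGateway"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:natgateway/*"],
      "Condition": {"Null": {"aws:RequestTag/CostCenter": "true"}}
    }]
  }'
```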
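
One subtlety is worth checking before rolling out a tag policy: when several keys sit under a single condition operator, IAM requires all of them to match before the statement applies. With two `aws:RequestTag/...` keys under one `Null` block, a `Deny` therefore fires only when both tags are missing. The sketch below is a simplified simulation of that logic, not the real policy engine, and the function names are hypothetical.

```python
def null_block_matches(request_tags, keys):
    """Simulate one {"Null": {key: "true", ...}} block: every key must be absent."""
    return all(key not in request_tags for key in keys)


def denied_single_statement(request_tags):
    """One Deny statement with both tag keys under the same Null operator."""
    return null_block_matches(request_tags, ["Owner", "CostCenter"])


def denied_split_statements(request_tags):
    """Two Deny statements, one per tag key; either one matching denies the request."""
    return (null_block_matches(request_tags, ["Owner"])
            or null_block_matches(request_tags, ["CostCenter"]))


requests = {
    "no tags": {},
    "Owner only": {"Owner": "alice"},
    "both tags": {"Owner": "alice", "CostCenter": "1234"},
}
for label, tags in requests.items():
    print(f"{label:11s} single={denied_single_statement(tags)} "
          f"split={denied_split_statements(tags)}")
```

The "Owner only" request slips past the single-statement form but is denied by the split form, which is usually the intent of a rule that requires both tags.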

---

### **Step 4: Prove Your Value**
**Generate a Cost Savings Report**:
```bash
# Calculate savings (NAT Gateway: $0.045/hr * 24 * 30 = $32.40/month)
echo "## Monthly Savings Report" > report.md
echo "- **Deleted Unused NAT Gateway**: \$32.40/month" >> report.md
echo "- **Prevented Future Waste**: \$100+/month (estimated)" >> report.md
echo "" >> report.md
echo "**Total Annualized Savings**: \$1,588.80" >> report.md

# Share with leadership (while SES is in sandbox mode, both addresses must be verified identities)
aws ses send-email \
  --from "you@company.com" \
  --to "boss@company.com" \
  --subject "Cost Optimization Results" \
  --text file://report.md
```
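
The figures in the report follow from simple arithmetic, which can be reproduced as a quick check. The sketch uses the \$0.045/hr rate and the \$100/month estimate quoted in the report and assumes a 30-day month; data processing and Elastic IP charges are left out, and the function names are hypothetical.

```python
from decimal import Decimal

NAT_HOURLY_RATE = Decimal("0.045")  # USD per hour, the rate quoted in the report
PREVENTED_WASTE = Decimal("100")    # USD per month, the report's own estimate


def monthly_nat_cost(hours_per_day=24, days=30):
    """Hourly charge for one always-on NAT Gateway over a month."""
    return NAT_HOURLY_RATE * hours_per_day * days


def annualized_savings():
    """Twelve months of NAT cost plus the estimated prevented waste."""
    return (monthly_nat_cost() + PREVENTED_WASTE) * 12


print(f"Monthly NAT cost:   ${monthly_nat_cost():.2f}")
print(f"Annualized savings: ${annualized_savings():,.2f}")
```

`Decimal` is used instead of floats so the currency values come out exact rather than picking up binary rounding noise.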

---

### **Lab Extensions (Bonus Points)**
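1. **Find Cross-AZ Traffic**:
   ```bash
   # Usage types carry a region prefix outside us-east-1 (e.g. USE2-DataTransfer-Regional-Bytes)
   aws ce get-cost-and-usage \
     --time-period Start=2024-01-01,End=2024-02-01 --granularity MONTHLY --metrics "UnblendedCost" \
     --filter '{"Dimensions": {"Key": "USAGE_TYPE", "Values": ["DataTransfer-Regional-Bytes"]}}'
   ```
2. **Set Up Budget Alarms**:
   ```bash
   # Assumes NAT Gateway charges are grouped under "EC2 - Other"; add --notifications-with-subscribers to get alerts
   aws budgets create-budget \
     --account-id "$(aws sts get-caller-identity --query Account --output text)" \
     --budget '{"BudgetName": "NAT-Gateway-Alert", "BudgetLimit": {"Amount": "50", "Unit": "USD"}, "TimeUnit": "MONTHLY", "BudgetType": "COST", "CostFilters": {"Service": ["EC2 - Other"]}}'
   ```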

---

### **Why This Lab Matters**
- **Real AWS Resources**: Uses actual billable services while keeping spend minimal (NAT Gateways are not covered by the Free Tier).
- **SME Skills Practiced**:
  - Cost analysis via CLI
  - Packet-level verification
  - Zero-downtime fixes
  - Tag governance
- **Career Impact**: These skills tie engineering work directly to the bill, which is the kind of impact leadership notices.

**Time to Complete**: ~30 minutes.
**Cost**: < \$0.50, provided you delete the NAT Gateway, release its Elastic IP, and terminate the test instance immediately after the lab.

Want a **more advanced version** with Direct Connect or hybrid cloud scenarios? Let me know!

---

Here’s the **killer skill set** that combines "boring" fundamentals with cloud-native expertise to make you the **unquestioned SME**, the one who fixes what others can’t, optimizes what others overlook, and becomes indispensable:

---