Update tech_docs/cloud/aws_notes.md
@@ -1,3 +1,174 @@
Here’s the **killer skill set** that combines "boring" fundamentals with cloud-native expertise to make you the **unquestioned SME**, the one who fixes what others can’t, optimizes what others overlook, and becomes indispensable:

---
### **1. The "Boring" Fundamentals That Make You Dangerous**
#### **A. Packet-Level Kung Fu**
- **Mastery**: `tcpdump`, `Wireshark`, `mtr`
- **Cloud Application**:
  - Diagnose HTTPS handshake failures between ALB and EC2 when Security Groups "look fine."
  - Prove MTU issues causing packet drops in VPN tunnels.

**Pro Move**:
```bash
# Capture full packets on port 443 so the TCP and TLS handshakes can be inspected in Wireshark
sudo tcpdump -i eth0 -s 0 -w tls.pcap 'tcp port 443'
# Read the capture back, showing only SYN and RST segments (connection attempts and resets)
tcpdump -nn -r tls.pcap 'tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'
```
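When reading a capture like the one above, the TCP flags byte is what separates "refused" from "silently dropped": a Security Group drop produces no reply at all, while a RST means something on the path answered and said no. The following is a minimal sketch for decoding that byte; the function names are illustrative and not part of any tool mentioned here.

```python
# Flag bit values from the TCP header.
TCP_FLAGS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}


def decode_tcp_flags(flags_byte: int) -> list[str]:
    """Return the names of the flags set in a TCP flags byte."""
    return [name for name, bit in TCP_FLAGS.items() if flags_byte & bit]


def looks_like_refusal(flags_byte: int) -> bool:
    """True for RST segments, the usual sign of a refused or torn-down connection."""
    return bool(flags_byte & TCP_FLAGS["RST"])


if __name__ == "__main__":
    print(decode_tcp_flags(0x12))  # SYN+ACK: second step of the handshake
    print(decode_tcp_flags(0x14))  # RST+ACK: typical reply to a refused SYN
```

The same bit test is what the `tcp[tcpflags] & ... != 0` capture filter expresses on the command line.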
#### **B. DNS & Routing Wizardry**
- **Mastery**: `dig`, route tables, BGP
- **Cloud Application**:
  - Explain why PrivateLink endpoints resolve but don’t connect (common causes: the endpoint’s security group blocks the port, or the Route53 private zone isn’t associated with the VPC, so the name resolves to public IPs instead of the endpoint).
  - Fix Direct Connect flapping caused by mismatched or overly aggressive BGP timers by setting sane values on your router (e.g. `keepalive=10`, `hold=30`).

**Pro Move**:
```bash
# Find DNS leaks in hybrid cloud: print any answer outside the RFC 1918 ranges
dig +short myapp.internal | grep -Ev '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)'  # any output = bad
```
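The same leak check can be done without regex edge cases by testing each answer against the three RFC 1918 networks. This is a sketch under the assumption that the input is the line-by-line output of `dig +short`; `non_rfc1918` is a hypothetical helper name.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def non_rfc1918(answers):
    """Return the IPv4 answers that fall outside RFC 1918 space.

    Lines that are not IP addresses (for example CNAME targets) are ignored.
    """
    leaked = []
    for line in answers:
        try:
            addr = ipaddress.ip_address(line.strip())
        except ValueError:
            continue  # not an IP address
        if addr.version == 4 and not any(addr in net for net in RFC1918_NETWORKS):
            leaked.append(str(addr))
    return leaked


if __name__ == "__main__":
    sample = ["10.0.4.17", "110.2.3.4", "172.32.0.9", "lb.example.com."]
    print(non_rfc1918(sample))  # ['110.2.3.4', '172.32.0.9']
```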
---

### **2. Cloud-Native Cost Surgery**
#### **A. Billable Event Forensics**
- **Mastery**: AWS Cost Explorer, CUR, OpenCost
- **Cloud Application**:
  - Trace a $15k/month spike to orphaned NAT Gateways in unused AZs.
  - Prove dev teams are routing traffic cross-AZ ($$$) when same-AZ paths exist.

**Pro Move**:
```sql
-- Find cross-AZ traffic in CUR (billed under the *-DataTransfer-Regional-Bytes usage types)
SELECT line_item_usage_type, SUM(line_item_unblended_cost) AS cost
FROM aws_cur
WHERE line_item_usage_type LIKE '%DataTransfer-Regional-Bytes%'
GROUP BY 1;
```
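To turn a traffic volume into a dollar figure for the cross-AZ argument, a back-of-the-envelope estimate is usually enough. The sketch below assumes a rate of $0.01 per GB charged in each direction, which should be checked against current pricing for your region; the function name is illustrative.

```python
def cross_az_monthly_cost(gb_per_month: float, rate_per_gb_each_way: float = 0.01) -> float:
    """Estimate the monthly cross-AZ data transfer cost in dollars.

    The same bytes are billed twice: once leaving the source AZ and once
    entering the destination AZ. The default rate is an assumption.
    """
    return gb_per_month * rate_per_gb_each_way * 2


if __name__ == "__main__":
    # 50 TB per month between two AZs
    print(round(cross_az_monthly_cost(50_000), 2))
```

Comparing this estimate with the CUR total is a quick sanity check that the query is picking up the right usage types.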
#### **B. Tagging Dictatorship**
- **Mastery**: AWS SCPs, AWS Config, Resource Groups
- **Cloud Application**:
  - Force 100% tagging compliance by denying untagged resource creation.
  - Automatically nuke resources whose `ExpirationDate` tag has passed (e.g. `ExpirationDate=2023-12-31`).

**Pro Move**:
```bash
# Monthly cost of everything missing an Owner tag, grouped by service
# (Owner must be an activated cost allocation tag; End is exclusive)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --filter '{"Tags": {"Key": "Owner", "MatchOptions": ["ABSENT"]}}'
```
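Denying untagged resource creation is typically done with an SCP that tests whether the tag key is missing from the create request. The following sketch builds such a policy document for two EC2 actions; the helper name and the choice of actions are assumptions, and a real rollout would need to cover every resource type you care about.

```python
import json


def require_tag_scp(tag_key: str = "Owner") -> dict:
    """Build an SCP that denies launching instances or creating volumes without `tag_key`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCreateWithoutTag",
                "Effect": "Deny",
                "Action": ["ec2:RunInstances", "ec2:CreateVolume"],
                "Resource": [
                    "arn:aws:ec2:*:*:instance/*",
                    "arn:aws:ec2:*:*:volume/*",
                ],
                # Null == "true" matches requests where the tag key is absent.
                "Condition": {"Null": {f"aws:RequestTag/{tag_key}": "true"}},
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON document, ready to paste into an SCP
    print(json.dumps(require_tag_scp("Owner"), indent=2))
```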
---

### **3. Hybrid Cloud Debugging**
#### **A. VPN/DC Troubleshooting**
- **Mastery**: `ping -s`, `aws directconnect describe-virtual-interfaces`
- **Cloud Application**:
  - Prove the on-prem firewall drops ICMP "fragmentation needed" messages (type 3, code 4), which breaks path MTU discovery on a 1500-byte path.
  - Diagnose BGP route flapping with `route -n` and the AWS CLI.

**Pro Move**:
```bash
# Test MTU end-to-end (AWS → on-prem); -M do sets Don't Fragment (Linux ping)
ping -M do -s 1472 10.1.1.1  # 1472 payload + 28 bytes of IP/ICMP headers = 1500 bytes
```
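The `1472 + 28` arithmetic generalizes to any path MTU, and a binary search over payload sizes finds the real limit faster than guessing. Below is a small sketch; `probe` is a hypothetical callable you would supply (for example a wrapper around `ping -M do -s <size>`), and the search assumes that if a given size passes, every smaller size passes too.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, echo request header


def max_ping_payload(path_mtu: int) -> int:
    """Largest `ping -s` payload that fits in one unfragmented IPv4 packet."""
    return path_mtu - IPV4_HEADER - ICMP_HEADER


def find_path_mtu(probe, low: int = 576, high: int = 1500):
    """Binary-search the largest MTU in [low, high] whose payload gets through.

    `probe(payload_size)` must return True when the ping succeeds.
    Returns None if even the smallest size fails.
    """
    if not probe(max_ping_payload(low)):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(max_ping_payload(mid)):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    print(max_ping_payload(1500))  # 1472
    # Simulated tunnel that only passes packets up to 1420 bytes
    print(find_path_mtu(lambda payload: payload + 28 <= 1420))  # 1420
```

A result below 1500 on a path that should carry full-size frames points at tunnel overhead or a device dropping the ICMP replies.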
#### **B. Traffic Mirroring + IDS**
- **Mastery**: `tcpdump`, Zeek, Suricata
- **Cloud Application**:
  - Mirror suspicious ENI traffic to a security VPC for analysis.
  - Detect cryptojacking via anomalous outbound connections.

**Pro Move**:
```bash
# Register the security appliance's ENI as a mirror target
# (a mirror filter and a mirror session are still needed before traffic flows)
aws ec2 create-traffic-mirror-target --network-interface-id eni-123abc
```
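Creating the target is only the first of several calls. A rough sketch of the full sequence follows, written against an object that behaves like `boto3.client("ec2")`; the function name, the single accept-everything ingress rule, and the rule and session numbers are all illustrative assumptions rather than a recommended production setup.

```python
def mirror_eni(ec2, source_eni: str, target_eni: str, session_number: int = 1) -> str:
    """Mirror all inbound IPv4 traffic of `source_eni` to `target_eni`.

    `ec2` is expected to behave like boto3.client("ec2").
    Returns the ID of the created mirror session.
    """
    # 1. Register the appliance ENI as the mirror target.
    target = ec2.create_traffic_mirror_target(NetworkInterfaceId=target_eni)
    target_id = target["TrafficMirrorTarget"]["TrafficMirrorTargetId"]

    # 2. Create a filter; without at least one accept rule it matches nothing.
    mirror_filter = ec2.create_traffic_mirror_filter(Description="mirror inbound traffic")
    filter_id = mirror_filter["TrafficMirrorFilter"]["TrafficMirrorFilterId"]
    ec2.create_traffic_mirror_filter_rule(
        TrafficMirrorFilterId=filter_id,
        TrafficDirection="ingress",
        RuleNumber=100,
        RuleAction="accept",
        SourceCidrBlock="0.0.0.0/0",
        DestinationCidrBlock="0.0.0.0/0",
    )

    # 3. Tie source ENI, target, and filter together in a session.
    session = ec2.create_traffic_mirror_session(
        NetworkInterfaceId=source_eni,
        TrafficMirrorTargetId=target_id,
        TrafficMirrorFilterId=filter_id,
        SessionNumber=session_number,
    )
    return session["TrafficMirrorSession"]["TrafficMirrorSessionId"]
```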
---

### **4. Automation That Scares People**
#### **A. CLI-Fu**
- **Mastery**: AWS CLI + `jq` + `xargs`
- **Cloud Application**:
  - Delete all untagged, unattached EBS volumes older than 30 days:
    ```bash
    # Cutoff uses GNU date; swap `aws ec2 delete-volume` for `echo` to dry-run first
    CUTOFF=$(date -u -d '30 days ago' +%Y-%m-%d)
    aws ec2 describe-volumes \
      --filters Name=status,Values=available \
      --query "Volumes[?Tags==null && CreateTime<\`${CUTOFF}\`].VolumeId" \
      --output text | xargs -r -n 1 aws ec2 delete-volume --volume-id
    ```
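Before letting a pipeline like this delete anything, it helps to express the selection rule as a plain function you can test against sample data. The sketch below assumes input shaped like the `Volumes` list from `describe-volumes` with `CreateTime` already parsed into timezone-aware datetimes (as boto3 returns it); the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_untagged_volume_ids(volumes, now, max_age_days=30):
    """Pick volumes that are unattached, untagged, and older than `max_age_days`."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v.get("State") == "available"   # unattached only
        and not v.get("Tags")              # missing or empty tag list
        and v["CreateTime"] < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2024, 3, 1, tzinfo=timezone.utc)
    sample = [
        {"VolumeId": "vol-old", "State": "available",
         "CreateTime": datetime(2023, 12, 1, tzinfo=timezone.utc)},
        {"VolumeId": "vol-new", "State": "available",
         "CreateTime": datetime(2024, 2, 20, tzinfo=timezone.utc)},
        {"VolumeId": "vol-tagged", "State": "available",
         "Tags": [{"Key": "Owner", "Value": "data"}],
         "CreateTime": datetime(2023, 12, 1, tzinfo=timezone.utc)},
    ]
    print(stale_untagged_volume_ids(sample, now))  # ['vol-old']
```

Keeping the rule separate from the delete call makes a dry run trivial: print the IDs, review them, then delete.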
#### **B. Terraform Modules for Zero-Downtime Changes**
- **Mastery**: `create_before_destroy`, `count`
- **Cloud Application**:
  - Swap NACL rules without a window where traffic is denied:
    ```hcl
    resource "aws_network_acl_rule" "new" {
      # Required arguments (network_acl_id, rule_number, protocol, rule_action) omitted.
      # Give the new rule a different rule_number so it can coexist with the old one.
      lifecycle { create_before_destroy = true }
    }
    ```
---

### **5. The "Soft" Skills That Lock In Your Authority**
#### **A. Cost Attribution Storytelling**
- **Mastery**: PowerBI/Grafana dashboards showing cost by team
- **Pro Move**:
  ```sql
  -- PowerBI Query for Team Accountability
  SELECT
    resource_tags_user_team,
    SUM(line_item_unblended_cost) AS cost
  FROM aws_cur
  WHERE line_item_product_code = 'AmazonVPC'
  GROUP BY 1
  ORDER BY cost DESC;
  ```
#### **B. Post-Mortem Brutal Honesty**
- **Template**:
  ```markdown
  ## Root Cause:
  Untagged NAT Gateway left running in us-west-2 ($1,200/month wasted).
  ## Fix:
  SCP enforcing `Owner` tag on all EC2 resources.
  ## Savings:
  $14,400/year.
  ```
---

### **Why This Works When Others Fail**
1. **You Speak Packet-Level Truth**: When the "cloud-native" team says "the Security Groups are open," you show the `tcpdump` capture full of RST packets.
2. **You Attribute Costs Ruthlessly**: Finance teams will love you when you prove Team X caused a $50k spike.
3. **You Automate the Pain Away**: Your scripts make you look like a wizard.

---
### **Skill Acquisition Roadmap**
1. **Week 1-2**: Master `tcpdump` + VPC Flow Logs.
2. **Week 3-4**: Build a Cost Dashboard with CUR.
3. **Week 5-6**: Enforce Tagging with SCPs.
4. **Week 7+**: Automate NACL/SG changes with Terraform.

---
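### **Final Boss Move**
Deploy this **NAT Gateway Hunter-Killer Lambda**:
```python
import boto3
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    # describe_nat_gateways takes `Filter` (singular), and tag filters only match exact
    # values, so LastUsed (assumed to be an ISO date kept up to date elsewhere) is compared here.
    pages = ec2.get_paginator('describe_nat_gateways').paginate(
        Filter=[{'Name': 'state', 'Values': ['available']}])
    for page in pages:
        for nat in page['NatGateways']:
            tags = {t['Key']: t['Value'] for t in nat.get('Tags', [])}
            if 'LastUsed' not in tags:
                continue  # no tag, no evidence: leave the gateway alone
            last_used = datetime.fromisoformat(tags['LastUsed']).replace(tzinfo=timezone.utc)
            if last_used < cutoff:
                ec2.delete_nat_gateway(NatGatewayId=nat['NatGatewayId'])
```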
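If nothing maintains a `LastUsed` tag, idleness can instead be judged from CloudWatch traffic metrics such as `BytesOutToDestination` in the `AWS/NATGateway` namespace. The sketch below only covers the decision step and assumes input shaped like the `Datapoints` list returned by `get_metric_statistics` with `Statistics=["Sum"]`; the function name and threshold are illustrative.

```python
def is_idle(datapoints, byte_threshold: float = 0.0) -> bool:
    """Decide whether a NAT Gateway looks idle over the sampled window.

    Missing data is treated as "unknown", never as idle, so a failed or
    mis-scoped metric query cannot trigger a deletion.
    """
    if not datapoints:
        return False
    return sum(dp["Sum"] for dp in datapoints) <= byte_threshold


if __name__ == "__main__":
    quiet = [{"Sum": 0.0}, {"Sum": 0.0}]
    busy = [{"Sum": 0.0}, {"Sum": 52_428_800.0}]
    print(is_idle(quiet), is_idle(busy), is_idle([]))  # True False False
```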
*(An idle NAT Gateway still bills its hourly charge, roughly $33/month at $0.045/hour in us-east-1, so the savings scale with every gateway removed.)*

---
You’re not just another cloud engineer; you’re the **cloud network surgeon** who cuts costs, fixes outages, and owns the untouchable skills. Want me to drill into a specific skill with a hands-on lab?

---
Here’s a **FinOps-focused battle plan** to master cloud cost optimization, with specific AWS billable events to hunt down and tools to control them, ensuring your salary stays funded by savings you generate:

---