Update tech_docs/cloud/aws_notes.md

This commit is contained in:
2025-07-20 21:21:27 -05:00
parent 4cd6a95e54
commit 3216570537


@@ -1,3 +1,493 @@
Here's a **FinOps-focused battle plan** to master cloud cost optimization, with specific AWS billable events to hunt down and tools to control them, so that your salary stays funded by the savings you generate:
---
### **1. Network-Specific Cost Killers**
#### **A. NAT Gateways ($0.045/hr + $0.045/GB)**
- **Key Actions**:
- **Find idle NATs** (list the gateways, then confirm idleness from the CloudWatch `BytesOutToDestination` metric):
```bash
aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --query "NatGateways[].[NatGatewayId,SubnetId,VpcId]" --output table
```
- **Replace with NAT Instances** for dev workloads (~70% cheaper).
- **Use VPC Gateway Endpoints** for S3/DynamoDB (no hourly or per-GB charge).
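To see why an idle NAT Gateway is worth hunting down, the rates quoted in the heading can be turned into a rough monthly estimate. This is a minimal sketch, not a billing calculator: the 730-hour month is an assumption, and `nat_gateway_monthly_cost` is a hypothetical helper name.

```python
HOURLY_RATE_USD = 0.045   # per NAT Gateway hour, as quoted in the heading above
PER_GB_RATE_USD = 0.045   # per GB processed, as quoted in the heading above
HOURS_PER_MONTH = 730     # assumption: average number of hours in a month


def nat_gateway_monthly_cost(gb_processed: float, gateways: int = 1) -> float:
    """Estimate the monthly NAT Gateway bill in USD (hourly charge plus data processing)."""
    fixed = gateways * HOURLY_RATE_USD * HOURS_PER_MONTH
    variable = gb_processed * PER_GB_RATE_USD
    return round(fixed + variable, 2)


if __name__ == "__main__":
    # An idle gateway still pays the full hourly charge.
    print("idle gateway:", nat_gateway_monthly_cost(gb_processed=0))
    # One gateway processing 1000 GB in a month.
    print("1000 GB gateway:", nat_gateway_monthly_cost(gb_processed=1000))
```

The fixed part is what an idle gateway costs, which is why deleting unused gateways pays off even when no traffic flows through them.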
#### **B. Cross-AZ Data Transfer ($0.01/GB in each direction)**
- **Hotspots**:
- ALBs routing between AZs
- RDS read replicas in different AZs
- **Fix**:
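```bash
# CloudWatch Logs Insights query over VPC Flow Logs.
# Flow logs carry no srcAZ/dstAZ fields, so rank address pairs by bytes and map each address to its AZ via subnet CIDRs.
fields @timestamp, srcAddr, dstAddr, bytes | stats sum(bytes) as totalBytes by srcAddr, dstAddr | sort totalBytes desc
```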
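Flow log records identify endpoints by address, so attributing bytes to AZ pairs means joining each address against the subnet CIDRs. The sketch below shows one way to do that join offline. The `SUBNET_AZ` map and both function names are hypothetical; in practice the map would be built from `aws ec2 describe-subnets`.

```python
import ipaddress
from collections import defaultdict

# Hypothetical subnet-to-AZ map for illustration only.
SUBNET_AZ = {
    "10.0.1.0/24": "us-east-1a",
    "10.0.2.0/24": "us-east-1b",
}


def az_for(ip, subnet_az=SUBNET_AZ):
    """Return the AZ whose subnet contains ip, or None if no subnet matches."""
    addr = ipaddress.ip_address(ip)
    for cidr, az in subnet_az.items():
        if addr in ipaddress.ip_network(cidr):
            return az
    return None


def cross_az_bytes(records, subnet_az=SUBNET_AZ):
    """Sum bytes per (src, dst) pair for flows whose endpoints sit in different AZs.

    records is an iterable of (src_addr, dst_addr, byte_count) tuples.
    """
    totals = defaultdict(int)
    for src, dst, nbytes in records:
        src_az, dst_az = az_for(src, subnet_az), az_for(dst, subnet_az)
        if src_az and dst_az and src_az != dst_az:
            totals[(src, dst)] += nbytes
    return dict(totals)


if __name__ == "__main__":
    sample = [
        ("10.0.1.5", "10.0.2.9", 1000),  # 1a -> 1b, billable cross-AZ
        ("10.0.1.5", "10.0.1.7", 500),   # same AZ, ignored
        ("10.0.1.5", "10.0.2.9", 200),   # 1a -> 1b again
    ]
    print(cross_az_bytes(sample))
```

Addresses outside every known subnet (internet or on-prem peers) are skipped rather than guessed at.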
#### **C. Direct Connect ($0.03-$0.12/GB)**
- **Optimize**:
- Use **compression** for repetitive data (e.g., database syncs).
- Set up **BGP communities** to prefer cheaper routes.
---
### **2. Hidden Billable Events**
#### **A. VPC Flow Logs ($0.50/GB ingested)**
- **Optimize**:
- Filter to `REJECT` only for security use cases.
- Send to S3 instead of CloudWatch for long-term storage.
#### **B. Elastic IPs ($0.005/hr per public IPv4 address, attached or not, since February 2024)**
- **Nuke orphaned IPs** (VPC addresses are released by allocation ID):
```bash
aws ec2 describe-addresses --query 'Addresses[?AssociationId==`null`].AllocationId' --output text | xargs -n 1 aws ec2 release-address --allocation-id
```
#### **C. Traffic Mirroring ($0.015 per mirrored ENI-hour)**
- **Only enable for forensic investigations**, not 24/7.
---
### **3. FinOps Tools Mastery**
#### **A. AWS Cost Explorer**
- **Pro Query**:
`Service=EC2, Group By=Usage Type` → Look for `DataTransfer-Out-Bytes`.
- **Set Alerts**: For sudden spikes in `AWSDataTransfer`.
#### **B. AWS Cost & Usage Report (CUR)**
- **Critical Fields**:
```sql
SELECT line_item_usage_type, sum(line_item_unblended_cost)
FROM cur
WHERE product_product_name='Amazon Virtual Private Cloud'
GROUP BY line_item_usage_type
```
#### **C. OpenCost (Kubernetes)**
- **Install**:
```bash
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost --namespace opencost --create-namespace
```
- **Find**: Pods with high egress costs to internet.
---
### **4. Prevention Framework**
#### **A. Tagging Strategy (Non-Negotiable)**
- **Mandatory Tags**:
```plaintext
Owner, CostCenter, Environment (prod/dev), ExpirationDate
```
- **Enforce via SCP** (inside a `Deny` statement, `Null` set to `"true"` matches requests where the tag is missing):
```json
{
  "Condition": {
    "Null": {
      "aws:RequestTag/CostCenter": "true"
    }
  }
}
```
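The same mandatory-tag rule can also be checked after the fact, for example in an audit script. The following is a small sketch under stated assumptions: `missing_tags` and `tags_to_dict` are hypothetical helpers, and the input is the `[{"Key": ..., "Value": ...}]` tag list that EC2 describe calls return.

```python
REQUIRED_TAGS = ("Owner", "CostCenter", "Environment", "ExpirationDate")


def tags_to_dict(tag_list):
    """Convert an AWS-style [{'Key': k, 'Value': v}, ...] list into a plain dict."""
    return {tag["Key"]: tag["Value"] for tag in (tag_list or [])}


def missing_tags(tag_list, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank, in declaration order."""
    tags = tags_to_dict(tag_list)
    return [key for key in required if not tags.get(key, "").strip()]


if __name__ == "__main__":
    instance_tags = [
        {"Key": "Owner", "Value": "team-networking@company.com"},
        {"Key": "Environment", "Value": "dev"},
        {"Key": "CostCenter", "Value": " "},  # blank values count as missing
    ]
    print(missing_tags(instance_tags))
```

Treating blank values as missing is a design choice here; a tag that exists but carries no value is useless for cost allocation.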
#### **B. Automated Cleanup**
- **Lambda to kill old resources**:
```python
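import boto3
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    cutoff = (datetime.now(timezone.utc) - timedelta(days=90)).date().isoformat()
    tagged = ec2.describe_images(Owners=['self'], Filters=[{'Name': 'tag-key', 'Values': ['LastUsed']}])
    for image in tagged['Images']:  # LastUsed is assumed to hold an ISO date (YYYY-MM-DD)
        if next(t['Value'] for t in image['Tags'] if t['Key'] == 'LastUsed') < cutoff:
            ec2.deregister_image(ImageId=image['ImageId'])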
```
#### **C. Budget Alerts**
```bash
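# Creates the budget only; add --notifications-with-subscribers to actually receive alerts
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "network-monthly",
    "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {"Service": ["Amazon Virtual Private Cloud", "EC2 - Other"]}
  }'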
```
---
### **5. Cost Attribution**
#### **A. Chargeback Models**
- **Network Cost Allocation**:
- Bill teams by **VPC ID** or **Security Group** usage.
- Use **AWS Tags** + **Cost Categories**.
#### **B. Showback Reports**
- **Sample PowerBI Query**:
```sql
SELECT [Product], [UsageType], SUM([Cost]) AS TotalCost
FROM aws_cur
WHERE [ResourceTags.CostCenter] = 'NetworkingTeam'
GROUP BY [Product], [UsageType]
```
---
### **6. Pro Tips from Cloud Economists**
1. **Reserved Capacity**:
- **Savings Plans** cover EC2, Fargate, and Lambda compute, not NAT Gateways; they only help with 24/7 NAT traffic if it runs through NAT instances.
2. **Shutdown Schedules**:
- Use **AWS Instance Scheduler** for non-prod resources.
3. **Negotiate Discounts**:
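- Negotiate data transfer pricing through a private pricing agreement such as the Enterprise Discount Program (the **10-15% off** figure depends on committed spend; Enterprise Support by itself carries no discount).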
---
### **Your FinOps Cheat Sheet**
| **Cost Trap** | **Detection Method** | **Fix** |
|------------------------|-----------------------------------------------|------------------------------------------|
| Unused NAT Gateway | `aws ec2 describe-nat-gateways` + Cost Explorer | Replace with NAT instance or delete |
| Cross-AZ ALB traffic   | Flow Logs grouped by `srcAddr`, `dstAddr`, mapped to AZs via subnet CIDRs | Enable AZ affinity |
| Orphaned EIPs | `aws ec2 describe-addresses` | Automated nightly cleanup Lambda |
| S3-to-EC2 data transfer| CUR filtered by `UsageType` containing `NatGateway-Bytes` | Deploy VPC Gateway Endpoint (free) |
---
### **Final Step: Prove Your Value**
- **Monthly Report Template**:
```markdown
## Cost Savings Achieved (Month: YYYY-MM)
- **NAT Gateway Optimization**: $1,200/mo
- **Cross-AZ Reduction**: $450/mo
- **Tag-Based Cleanup**: $800/mo
**Total Saved**: $2,450/mo → **$29,400/year**
```
This directly ties your work to **protecting your team's budget**—and your salary.
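The arithmetic in the template (monthly items summed, then multiplied by 12) is easy to automate so the report never drifts from its line items. A minimal sketch, assuming savings are tracked as whole dollars per month; `savings_report` is a hypothetical helper name.

```python
def savings_report(month, items):
    """Render the monthly savings report as markdown.

    items maps a line-item name to its monthly saving in whole USD.
    """
    total = sum(items.values())
    lines = [f"## Cost Savings Achieved (Month: {month})"]
    lines += [f"- **{name}**: ${amount:,}/mo" for name, amount in items.items()]
    lines.append(f"**Total Saved**: ${total:,}/mo → **${total * 12:,}/year**")
    return "\n".join(lines)


if __name__ == "__main__":
    print(savings_report("2025-07", {
        "NAT Gateway Optimization": 1200,
        "Cross-AZ Reduction": 450,
        "Tag-Based Cleanup": 800,
    }))
```

With the three line items from the template, the total line reproduces the $2,450/mo and $29,400/year figures shown above.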
Want me to **drill into a specific cost-saving tactic** with exact CLI commands? For example, how to force all S3 traffic through VPC endpoints?
---
You're speaking my language! Tagging is the **ultimate power move** for cost accountability, whether you're walking into a pristine greenfield setup or a chaotic brownfield spaghetti mess. Here's how to **enforce tagging like a dictator** and bill teams with surgical precision, no matter the environment's complexity:
---
### **1. The Tagging "Iron Fist" Strategy**
#### **A. Mandatory Tags (Non-Negotiable)**
Every resource must have these **4 tags** (enforced via SCPs/IAM):
| Tag Key | Example Value | Why It Matters |
|---------|---------------|----------------|
| `Owner` | `team-networking@company.com` | Who to shame in cost reports |
| `CostCenter` | `CC-12345` | Finance department's best friend |
| `Environment` | `prod`/`dev`/`staging` | Never let dev costs bleed into prod |
| `ExpirationDate` | `2024-12-31` | Auto-nuke stale resources |
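**Enforcement via AWS SCP (Service Control Policy).** Condition keys inside a single `Null` block are ANDed, so each required tag gets its own statement, and `Resource` is scoped to the resource types that are tagged at creation:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreateWithoutOwnerTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVpc"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:vpc/*"],
      "Condition": {
        "Null": { "aws:RequestTag/Owner": "true" }
      }
    },
    {
      "Sid": "DenyCreateWithoutCostCenterTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "ec2:CreateVpc"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:vpc/*"],
      "Condition": {
        "Null": { "aws:RequestTag/CostCenter": "true" }
      }
    }
  ]
}
```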
*(Attach this to OUs in AWS Organizations)*
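The `ExpirationDate` row in the table above only pays off if something actually compares the tag against the calendar. A sketch of that comparison, assuming the tag holds an ISO date such as `2024-12-31`; `is_expired` is a hypothetical helper, and resources with a missing or malformed tag are deliberately left alone.

```python
from datetime import date


def is_expired(tags, today):
    """Return True if the ExpirationDate tag is a valid ISO date strictly before today.

    tags is a plain dict of tag key to value. Missing or unparseable dates
    return False so that a typo never triggers a deletion.
    """
    raw = tags.get("ExpirationDate")
    if not raw:
        return False
    try:
        return date.fromisoformat(raw) < today
    except ValueError:
        return False


if __name__ == "__main__":
    tags = {"Owner": "team-networking@company.com", "ExpirationDate": "2024-12-31"}
    print(is_expired(tags, today=date(2025, 1, 1)))    # day after expiry
    print(is_expired(tags, today=date(2024, 12, 31)))  # still valid on the expiry date itself
```

Passing `today` explicitly keeps the function deterministic and easy to test; a scheduled job would call it with `date.today()`.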
---
### **2. Cost Attribution Like a Boss**
#### **A. Bill Back by VPC/Security Group**
**Step 1:** Isolate teams into dedicated VPCs or tag SGs:
```bash
# Tag SGs to teams (even in shared VPCs)
aws ec2 create-tags \
--resources sg-123abc \
--tags Key=Team,Value=marketing
```
**Step 2:** Use AWS **Cost Categories** to group costs:
1. **Console**: AWS Cost Explorer → **Cost Categories** → Define rules like:
- `Team = ${aws:ResourceTag/Team}`
- `Project = ${aws:ResourceTag/Project}`
**Step 3:** Generate team-specific invoices:
```sql
-- AWS CUR SQL Query (Athena/PowerBI)
SELECT
line_item_usage_account_id,
resource_tags_user_team, -- Extracted from tags
SUM(line_item_unblended_cost) AS cost
FROM cost_and_usage_report
WHERE line_item_product_code = 'AmazonVPC'
GROUP BY 1, 2
ORDER BY cost DESC
```
#### **B. Chargeback for Network Services**
- **NAT Gateway Costs**: Bill teams by **private subnet usage** (tag subnets to teams).
- **Data Transfer**: Use **Cost Explorer** → Filter by `UsageType=DataTransfer-Out-Bytes` and group by `ResourceTag/Team`.
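Charging a shared NAT Gateway back to teams comes down to splitting one bill in proportion to each team's usage. The sketch below assumes per-team byte counts are already available (for example from flow logs on team-tagged subnets); `allocate_shared_cost` is a hypothetical helper name.

```python
def allocate_shared_cost(total_cost, usage_by_team):
    """Split total_cost across teams in proportion to their usage.

    usage_by_team maps a team name to bytes (or any comparable usage unit).
    If no usage was recorded, the cost is split evenly instead.
    """
    if not usage_by_team:
        return {}
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        even_share = round(total_cost / len(usage_by_team), 2)
        return {team: even_share for team in usage_by_team}
    return {
        team: round(total_cost * usage / total_usage, 2)
        for team, usage in usage_by_team.items()
    }


if __name__ == "__main__":
    # A $100 NAT Gateway bill shared by two teams with a 3:1 traffic ratio.
    print(allocate_shared_cost(100.0, {"marketing": 300, "data": 100}))
```

Rounding each share to cents can leave the shares a cent off the total; a production chargeback would assign that remainder to one team explicitly.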
---
### **3. Brownfield Tagging Triage**
#### **A. Tag Existing Chaos**
**Option 1:** CLI Mass-Tagging
```bash
# Tag every EC2 instance that lacks a Team tag with 'Team=Unassigned'
aws ec2 describe-instances \
  --query 'Reservations[].Instances[] | [?!not_null(Tags[?Key==`Team`].Value)].InstanceId' \
  --output text | xargs -n 1 aws ec2 create-tags \
  --tags Key=Team,Value=Unassigned \
  --resources
```
**Option 2:** AWS **Resource Groups** + Tag Editor
1. **Console**: AWS Resource Groups → **Tag Editor** → Bulk tag by:
- Resource type (e.g., all EC2 instances)
- Region (e.g., `us-east-1`)
#### **B. Find Untagged Billable Resources**
```bash
# Monthly spend on resources with no Team tag, broken down by service (Cost Explorer API; the End date is exclusive)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --filter '{
    "Tags": { "Key": "Team", "MatchOptions": ["ABSENT"] }
  }'
```
---
### **4. Pro Tips for Tagging Dominance**
#### **A. Automate Tag Governance**
- **AWS Config Rules**: Auto-remediate untagged resources:
```bash
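# put-remediation-configurations takes a JSON list; confirm that the target SSM Automation document and its parameter names exist in your account
aws configservice put-remediation-configurations \
  --remediation-configurations '[{
    "ConfigRuleName": "tag-compliance-rule",
    "TargetType": "SSM_DOCUMENT",
    "TargetId": "AWS-AddTagsToResource",
    "Parameters": {
      "ResourceType": {"StaticValue": {"Values": ["AWS::EC2::Instance"]}},
      "Tags": {"StaticValue": {"Values": ["{\"Key\": \"Team\", \"Value\": \"Unassigned\"}"]}}
    }
  }]'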
```
#### **B. Tag-Based Resource Nuking**
```bash
# Terminate dev instances launched more than 30 days ago (GNU date syntax; review the ID list before running)
CUTOFF=$(date -u -d '30 days ago' +%Y-%m-%d)
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=dev" \
  --query "Reservations[].Instances[] | [?LaunchTime<\`${CUTOFF}\`].InstanceId" \
  --output text | xargs -n 1 aws ec2 terminate-instances --instance-ids
```
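When this cleanup moves into a Lambda, the selection logic is worth keeping as a pure function so it can be tested without touching real instances. The sketch below works on a response shaped like the output of boto3's `describe_instances` (where `LaunchTime` is a timezone-aware `datetime`); `stale_instance_ids` is a hypothetical helper, and the actual terminate call is left out on purpose.

```python
from datetime import datetime, timedelta, timezone


def stale_instance_ids(response, max_age_days, now, environment="dev"):
    """Return IDs of instances tagged Environment=<environment> launched before the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if tags.get("Environment") == environment and instance["LaunchTime"] < cutoff:
                stale.append(instance["InstanceId"])
    return stale


if __name__ == "__main__":
    utc = timezone.utc
    fake_response = {"Reservations": [{"Instances": [
        {"InstanceId": "i-old", "LaunchTime": datetime(2024, 1, 1, tzinfo=utc),
         "Tags": [{"Key": "Environment", "Value": "dev"}]},
        {"InstanceId": "i-new", "LaunchTime": datetime(2024, 2, 20, tzinfo=utc),
         "Tags": [{"Key": "Environment", "Value": "dev"}]},
        {"InstanceId": "i-prod", "LaunchTime": datetime(2023, 6, 1, tzinfo=utc),
         "Tags": [{"Key": "Environment", "Value": "prod"}]},
    ]}]}
    print(stale_instance_ids(fake_response, 30, now=datetime(2024, 3, 1, tzinfo=utc)))
```

Untagged instances are never selected, so a missing `Environment` tag fails safe.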
#### **C. Cost Transparency Dashboards**
- **Grafana + AWS CUR**: Visualize costs by team/tag:
```sql
SELECT resource_tags_user_team, SUM(line_item_unblended_cost)
FROM aws_cur
WHERE line_item_usage_start_date >= '2024-01-01'
GROUP BY 1
```
---
### **5. Real-World Tagging War Story**
**Problem**: A 300-VPC brownfield environment with **$250k/month untagged spend**.
**Solution**:
1. Used **AWS Resource Groups** to tag all resources with `Team=Unknown`.
2. Created **SCPs** blocking new untagged resources.
3. Published **weekly cost reports** by tag to executives.
**Result**:
- **$80k/month reclaimed** from orphaned resources.
- Teams suddenly started tagging properly when their budgets were on the line.
---
### **Your Tagging Command Cheat Sheet**
| Task | Command |
|------|---------|
| Find untagged EC2 instances | `aws ec2 describe-instances --query 'Reservations[].Instances[?!not_null(Tags[])]'` |
| Bulk tag VPCs | `aws ec2 describe-vpcs --query 'Vpcs[].VpcId' --output text \| xargs aws ec2 create-tags --tags Key=Team,Value=networking --resources` |
| Enforce tags via SCP | [See Section 1A](#1-the-tagging-iron-fist-strategy) |
---
### **Final Move: Become the Tagging Hero**
1. **Phase 1**: Deploy mandatory tags (SCPs + automation).
2. **Phase 2**: Report costs by team/project (CEO loves this).
3. **Phase 3**: Profit when you **tie infrastructure savings to your bonus**.
Want me to **build a custom tagging enforcement script** for your specific environment? I can generate Terraform/CLI snippets that fit your org's structure.
---
Here's a **curated list of essential tools** for cloud network engineers, categorized by use case and paired with real-world scenarios:
---
### **1. Core Networking Tools**
#### **A. Packet Analysis**
- **`tcpdump`**
- **Why**: Debug EC2 instance traffic when Security Groups/Flow Logs aren't enough.
- **Pro Command**:
```bash
sudo tcpdump -i eth0 'host 10.0.1.5 and port 443' -nnvv -w debug.pcap
```
- **Wireshark**
- **Why**: GUI analysis of `tcpdump` captures (TLS handshakes, retransmits).
#### **B. DNS & Connectivity**
- **`dig`/`nslookup`**
- **Why**: Validate PrivateLink endpoints, Route53 resolver issues.
- **Pro Command**:
```bash
dig +short myapp.privatesvc.us-east-1.vpce.amazonaws.com
```
- **`mtr` (My Traceroute)**
- **Why**: Hybrid cloud latency diagnosis (AWS → on-prem).
---
### **2. Cloud-Native Tools**
#### **A. AWS-Centric**
| Tool | Use Case | Pro Tip |
|------|----------|---------|
| **VPC Flow Logs + CloudWatch Insights** | Detect REJECTed traffic | `filter action="REJECT" \| stats count(*) by srcAddr` |
| **AWS Reachability Analyzer** | Pre-check route table changes | `aws ec2 create-network-insights-path` |
| **Traffic Mirroring** | Capture ENI traffic for IDS | Mirror to an ENI or NLB in front of the IDS sensors |
| **PrivateLink** | Secure cross-account services | Always check DNS resolution (`vpce-xxx-123.region.vpce.amazonaws.com`) |
#### **B. Multi-Cloud**
- **Terraform**
- **Why**: Automate NACL/SG changes with zero-downtime rollouts.
- **Pro Tip**: Use `create_before_destroy` for rule updates.
- **Pulumi**
- **Why**: Code-based networking (Python/TypeScript) for complex TGW designs.
---
### **3. Automation & Scripting**
#### **A. CLI Mastery**
- **AWS CLI**
- **Key Command**:
```bash
aws ec2 describe-security-groups --query 'SecurityGroups[?length(IpPermissions)>`5`].GroupId'
```
- **`jq`**
- **Why**: Parse JSON outputs (e.g., filter Flow Logs for anomalies).
- **Example**:
```bash
aws ec2 describe-network-acls | jq '.NetworkAcls[] | select(.IsDefault==true)'
```
#### **B. Infrastructure as Code (IaC)**
- **Ansible**
- **Why**: Bulk EC2 instance configs (iptables, sysctl tuning).
- **CDK (Cloud Development Kit)**
- **Why**: Programmatically build VPC peering with failover.
---
### **4. Security & Compliance**
| Tool | Use Case | Pro Tip |
|------|----------|---------|
| **Zeek (formerly Bro)** | IDS for Traffic Mirroring | Pair with **Suricata** for signature-based rules |
| **OpenVPN/AWS Client VPN** | Secure access to private subnets | Enforce MFA in the identity provider referenced by `aws ec2 create-client-vpn-endpoint --authentication-options` |
| **AWS Network Firewall** | Layer 7 protection | Deploy a **domain list rule group** for egress filtering |
---
### **5. Performance & Monitoring**
#### **A. Real-Time**
- **Grafana + Prometheus**
- **Why**: Visualize NAT Gateway throughput drops.
- **Pro Setup**: Export CloudWatch metrics to Prometheus through a CloudWatch exporter, then scrape that exporter.
- **ELK Stack**
- **Why**: Index Flow Logs for threat hunting.
#### **B. Synthetic Testing**
- **CloudWatch Synthetics**
- **Why**: Simulate user traffic through TGW attachments.
- **Pingdom**
- **Why**: Monitor hybrid cloud (AWS → on-prem VPN).
---
### **6. Hybrid & On-Prem Integration**
- **Megaport/AWS Direct Connect**
- **Key Command**:
```bash
aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[?virtualGatewayId==`null`]'
```
- **Cisco CSR/VM-Series**
- **Why**: Site-to-cloud VPN with BGP failover.
---
### **7. Must-Know Concepts (Not Tools, But Critical)**
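1. **Ephemeral Ports (1024-65535; Linux defaults to 32768-60999)**:
- Must be allowed in NACL rules for return traffic, because NACLs are stateless; Security Groups are stateful and track replies automatically.
2. **MTU 1500**:
- Path MTU Discovery breaks when ICMP "fragmentation needed" messages are blocked, so always test with: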
```bash
ping -s 1472 -M do 8.8.8.8 # (1472 + 28 header = 1500)
```
3. **BGP Timers**:
- Direct Connect defaults to keepalive 30 s / hold 90 s; lower them on your router (e.g. `keepalive=10`, `holddown=30`) or enable BFD for faster failover.
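The `1472 + 28 = 1500` arithmetic in the MTU item generalizes to any link MTU: the ping payload is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. A minimal sketch, assuming IPv4 without options; `ping_payload_for_mtu` is a hypothetical helper name.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def ping_payload_for_mtu(mtu):
    """Return the largest `ping -s` payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU must be larger than {overhead} bytes")
    return mtu - overhead


if __name__ == "__main__":
    for mtu in (1500, 9001):  # standard Ethernet and the jumbo MTU used inside a VPC
        print(f"MTU {mtu}: ping -s {ping_payload_for_mtu(mtu)} -M do <target>")
```

For an MTU of 1500 this yields the 1472-byte payload used in the command above.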
---
### **Real-World Toolchain Example**
**Scenario**: "Why is my ALB dropping HTTPS traffic?"
1. **Flow Logs**:
```sql
fields @timestamp, srcAddr, dstAddr, action | filter dstPort=443 and action="REJECT"
```
2. **`tcpdump` on EC2**:
```bash
sudo tcpdump -i eth0 'tcp port 443 and tcp[tcpflags] & (tcp-syn|tcp-ack) != 0' -XX
```
3. **Terraform Rollback**:
```hcl
lifecycle { ignore_changes = [security_group_ids] } # Preserve temp rules
```
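When flow logs are delivered to S3 rather than CloudWatch, the same "REJECT on port 443" check from step 1 can be run over the raw records. The sketch below assumes the default (version 2) flow log format with its 14 space-separated fields; the function names are hypothetical.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_log(line):
    """Parse one default-format flow log line into a dict, or None if malformed."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def rejected_sources(lines, dst_port=443):
    """Count REJECTed flows to dst_port, keyed by source address."""
    counts = Counter()
    for line in lines:
        record = parse_flow_log(line)
        if record and record["action"] == "REJECT" and record["dstport"] == str(dst_port):
            counts[record["srcaddr"]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 443 6 20 4249 1418530010 1418530070 REJECT OK",
        "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20642 443 6 5 300 1418530010 1418530070 REJECT OK",
        "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.16.21 49761 443 6 10 840 1418530010 1418530070 ACCEPT OK",
    ]
    print(rejected_sources(sample).most_common())
```

Custom-format flow logs have a different field order, so `FIELDS` would need to match the format configured on the log.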
---
### **Cheat Sheet: Top Commands**
1. Find overly permissive SGs (ingress open to `0.0.0.0/0`):
```bash
aws ec2 describe-security-groups --filters Name=ip-permission.cidr,Values=0.0.0.0/0 --query 'SecurityGroups[].GroupId'
```
2. Check NAT Gateway health:
```bash
aws ec2 describe-nat-gateways --filter Name=state,Values=failed
```
3. Test Direct Connect BGP:
```bash
aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]'
```
---
### **Final Advice**
- **Master 20% of tools that handle 80% of issues** (Flow Logs, `tcpdump`, AWS CLI).
- **Automate repetitive checks** (e.g., nightly NACL audits with Lambda).
- **Bookmark the Unusual**:
- [AWS VPC Flow Logs Field Meanings](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records)
- [IANA Port Assignments](https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml)
Want a **deep dive on a specific tool**? I can provide battle-tested config templates or troubleshooting playbooks.
---
Absolutely: **mastering `tcpdump` is invaluable** for cloud engineers, even in AWS/GCP/Azure environments. Here's why, when to use it, and how it complements cloud-native tools:
---