diff --git a/tech_docs/cloud/aws_notes.md b/tech_docs/cloud/aws_notes.md index 36e206e..01a67b6 100644 --- a/tech_docs/cloud/aws_notes.md +++ b/tech_docs/cloud/aws_notes.md @@ -1,3 +1,493 @@ +Here’s a **FinOps-focused battle plan** to master cloud cost optimization, with specific AWS billable events to hunt down and tools to control them, so that the savings you generate keep your salary funded: + +--- + +### **1. Network-Specific Cost Killers** +#### **A. NAT Gateways ($0.045/hr + $0.045/GB)** +- **Key Actions**: + - **Find idle NATs**: + ```bash + # List active NAT gateways, then check each one's CloudWatch BytesOutToDestination metric; near-zero traffic means idle + aws ec2 describe-nat-gateways --filter "Name=state,Values=available" --query "NatGateways[].[NatGatewayId,VpcId,SubnetId]" --output text + ``` + - **Replace with NAT Instances** for dev workloads (can be ~70% cheaper at low traffic). + - **Use VPC Gateway Endpoints** for S3/DynamoDB (no hourly or per-GB charge). + +#### **B. Cross-AZ Data Transfer ($0.01/GB in each direction)** +- **Hotspots**: + - ALBs routing between AZs + - RDS read replicas in different AZs +- **Fix**: + ```bash + # CloudWatch Logs Insights query. Flow Logs have no srcAZ/dstAZ fields, so rank address pairs by volume and map each address to its subnet's AZ afterwards + fields @timestamp, srcAddr, dstAddr, bytes | stats sum(bytes) as totalBytes by srcAddr, dstAddr | sort totalBytes desc + ``` + +#### **C. Direct Connect ($0.03-$0.12/GB)** +- **Optimize**: + - Use **compression** for repetitive data (e.g., database syncs). + - Set up **BGP communities** to prefer cheaper routes. + +--- + +### **2. Hidden Billable Events** +#### **A. VPC Flow Logs ($0.50/GB ingested)** +- **Optimize**: + - Filter to `REJECT` only for security use cases. + - Send to S3 instead of CloudWatch for long-term storage. + +#### **B. Elastic IPs ($0.005/hr, charged even when unattached)** +- **Nuke orphaned IPs**: + ```bash + # VPC Elastic IPs are released by allocation ID, not by public IP + aws ec2 describe-addresses --query 'Addresses[?AssociationId==`null`].AllocationId' --output text | xargs -n1 aws ec2 release-address --allocation-id + ``` + +#### **C. Traffic Mirroring (billed hourly for each mirrored ENI)** +- **Only enable for forensic investigations**, not 24/7. + +--- + +### **3. FinOps Tools Mastery** +#### **A. 
AWS Cost Explorer** +- **Pro Query**: + `Service=EC2, Group By=Usage Type` → Look for `DataTransfer-Out-Bytes`. +- **Set Alerts**: For sudden spikes in `AWSDataTransfer`. + +#### **B. AWS Cost & Usage Report (CUR)** +- **Critical Fields**: + ```sql + SELECT line_item_usage_type, sum(line_item_unblended_cost) + FROM cur + WHERE product_product_name='Amazon Virtual Private Cloud' + GROUP BY line_item_usage_type + ``` + +#### **C. OpenCost (Kubernetes)** +- **Install**: + ```bash + # Assumes the opencost chart repository has already been added with `helm repo add` + helm install opencost opencost/opencost --namespace opencost --create-namespace + ``` +- **Find**: Pods with high egress costs to the internet. + +--- + +### **4. Prevention Framework** +#### **A. Tagging Strategy (Non-Negotiable)** +- **Mandatory Tags**: + ```plaintext + Owner, CostCenter, Environment (prod/dev), ExpirationDate + ``` +- **Enforce via SCP** (condition fragment for a `Deny` statement; `"true"` matches requests that omit the tag): + ```json + { + "Condition": { + "Null": { + "aws:RequestTag/CostCenter": "true" + } + } + } + ``` + +#### **B. Automated Cleanup** +- **Lambda to kill old resources**: + ```python + import boto3 + def lambda_handler(event, context): + ec2 = boto3.client('ec2') + # Only AMIs owned by this account; '<90days-ago>' is a placeholder (tag filters match exact values, not date ranges) + old_amis = ec2.describe_images(Owners=['self'], Filters=[{'Name': 'tag:LastUsed', 'Values': ['<90days-ago>']}]) + # Deregister every match, not just the first; an empty result is a no-op + for image in old_amis['Images']: + ec2.deregister_image(ImageId=image['ImageId']) + ``` + +#### **C. Budget Alerts** +```bash +# Creates the budget only; add --notifications-with-subscribers to be alerted when spend crosses a threshold +aws budgets create-budget \ + --account-id 123456789012 \ + --budget '{ + "BudgetName": "network-monthly", + "BudgetType": "COST", + "TimeUnit": "MONTHLY", + "BudgetLimit": {"Amount": "1000", "Unit": "USD"}, + "CostFilters": {"Service": ["Amazon Virtual Private Cloud", "Amazon Elastic Compute Cloud - Compute"]} + }' +``` + +--- + +### **5. Cost Attribution** +#### **A. Chargeback Models** +- **Network Cost Allocation**: + - Bill teams by **VPC ID** or **Security Group** usage. + - Use **AWS Tags** + **Cost Categories**. + +#### **B. Showback Reports** +- **Sample PowerBI Query**: + ```sql + SELECT [Product], [UsageType], SUM([Cost]) + FROM aws_cur + WHERE [ResourceTags.CostCenter] = 'NetworkingTeam' + GROUP BY [Product], [UsageType] + ``` + +--- + +### **6. Pro Tips from Cloud Economists** +1. 
**Reserved Capacity**: + - **Savings Plans** cover compute (EC2, Fargate, Lambda), not NAT Gateways; for 24/7 NAT usage, cut per-GB charges with VPC endpoints instead. +2. **Shutdown Schedules**: + - Use **AWS Instance Scheduler** for non-prod resources. +3. **Negotiate Discounts**: + - Discounts come from committed-spend agreements (e.g., an Enterprise Discount Program), not from the Enterprise Support plan itself; **10-15% off** data transfer is sometimes quoted but depends on the contract. + +--- + +### **Your FinOps Cheat Sheet** +| **Cost Trap** | **Detection Method** | **Fix** | +|------------------------|-----------------------------------------------|------------------------------------------| +| Unused NAT Gateway | `aws ec2 describe-nat-gateways` + Cost Explorer | Replace with NAT instance or delete | +| Cross-AZ ALB traffic | Flow Logs grouped by address pair, mapped to subnet AZs | Enable AZ affinity | +| Orphaned EIPs | `aws ec2 describe-addresses` | Automated nightly cleanup Lambda | +| S3-to-EC2 data transfer| CUR filtered by `NatGateway-Bytes` usage types | Deploy VPC Gateway Endpoint (free) | + +--- + +### **Final Step: Prove Your Value** +- **Monthly Report Template**: + ```markdown + ## Cost Savings Achieved (Month: YYYY-MM) + - **NAT Gateway Optimization**: $1,200/mo + - **Cross-AZ Reduction**: $450/mo + - **Tag-Based Cleanup**: $800/mo + **Total Saved**: $2,450/mo → **$29,400/year** + ``` +This directly ties your work to **protecting your team's budget** and your salary. + +Want me to **drill into a specific cost-saving tactic** with exact CLI commands? For example, how to force all S3 traffic through VPC endpoints? + +--- + +You’re speaking my language! Tagging is the **ultimate power move** for cost accountability, whether you’re walking into a pristine greenfield setup or a chaotic brownfield spaghetti mess. Here’s how to **enforce tagging like a dictator** and bill teams with surgical precision, no matter the environment’s complexity: + +--- + +### **1. The Tagging "Iron Fist" Strategy** +#### **A. 
Mandatory Tags (Non-Negotiable)** +Every resource must have these **4 tags** (enforced via SCPs/IAM): +| Tag Key | Example Value | Why It Matters | +|---------|---------------|----------------| +| `Owner` | `team-networking@company.com` | Who to shame in cost reports | +| `CostCenter` | `CC-12345` | Finance department’s best friend | +| `Environment` | `prod`/`dev`/`staging` | Never let dev costs bleed into prod | +| `ExpirationDate` | `2024-12-31` | Auto-nuke stale resources | + +**Enforcement via AWS SCP (Service Control Policy):** keys inside one `Null` block are ANDed, so each tag gets its own statement, and the resource ARNs limit the check to the resources being created. +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "DenyCreateWithoutOwnerTag", + "Effect": "Deny", + "Action": ["ec2:RunInstances", "ec2:CreateVpc"], + "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:vpc/*"], + "Condition": { "Null": { "aws:RequestTag/Owner": "true" } } + }, + { + "Sid": "DenyCreateWithoutCostCenterTag", + "Effect": "Deny", + "Action": ["ec2:RunInstances", "ec2:CreateVpc"], + "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:vpc/*"], + "Condition": { "Null": { "aws:RequestTag/CostCenter": "true" } } + } + ] +} +``` +*(Attach this to OUs in AWS Organizations; add one statement per extra mandatory tag)* + +--- + +### **2. Cost Attribution Like a Boss** +#### **A. Bill Back by VPC/Security Group** +**Step 1:** Isolate teams into dedicated VPCs or tag SGs: +```bash +# Tag SGs to teams (even in shared VPCs) +aws ec2 create-tags \ + --resources sg-123abc \ + --tags Key=Team,Value=marketing +``` + +**Step 2:** Use AWS **Cost Categories** to group costs: +1. **Console**: AWS Billing and Cost Management → **Cost Categories** → Define rules like: + - `Team = ${aws:ResourceTag/Team}` + - `Project = ${aws:ResourceTag/Project}` + +**Step 3:** Generate team-specific invoices: +```sql +-- AWS CUR SQL Query (Athena/PowerBI) +SELECT + line_item_usage_account_id, + resource_tags_user_team, -- Extracted from tags + SUM(line_item_unblended_cost) AS cost +FROM cost_and_usage_report +WHERE line_item_product_code = 'AmazonVPC' +GROUP BY 1, 2 +ORDER BY cost DESC +``` + +#### **B. Chargeback for Network Services** +- **NAT Gateway Costs**: Bill teams by **private subnet usage** (tag subnets to teams). 
+- **Data Transfer**: Use **Cost Explorer** → Filter by `UsageType=DataTransfer-Out-Bytes` and group by the `Team` tag. + +--- + +### **3. Brownfield Tagging Triage** +#### **A. Tag Existing Chaos** +**Option 1:** CLI Mass-Tagging +```bash +# Tag every EC2 instance that lacks a Team tag as 'Team=Unassigned' +aws ec2 describe-instances \ + --query 'Reservations[].Instances[?!not_null(Tags[?Key==`Team`].Value)][].InstanceId' \ + --output text | xargs -n1 aws ec2 create-tags \ + --tags Key=Team,Value=Unassigned \ + --resources +``` + +**Option 2:** AWS **Resource Groups** + Tag Editor +1. **Console**: AWS Resource Groups → **Tag Editor** → Bulk tag by: + - Resource type (e.g., all EC2 instances) + - Region (e.g., `us-east-1`) + +#### **B. Find Untagged Billable Resources** +```bash +# Total January spend on resources with no Team tag (Cost Explorer API) +# Team must be an activated cost allocation tag; the End date is exclusive +aws ce get-cost-and-usage \ + --time-period Start=2024-01-01,End=2024-02-01 \ + --granularity MONTHLY \ + --metrics UnblendedCost \ + --filter '{ + "Tags": { "Key": "Team", "MatchOptions": ["ABSENT"] } + }' +``` + +--- + +### **4. Pro Tips for Tagging Dominance** +#### **A. Automate Tag Governance** +- **AWS Config Rules**: Auto-remediate untagged resources: + ```bash + # The rule name, target document, and parameter keys come from the original note; verify them against the SSM document you actually use + aws configservice put-remediation-configurations \ + --remediation-configurations '[{ + "ConfigRuleName": "tag-compliance-rule", + "TargetType": "SSM_DOCUMENT", + "TargetId": "AWS-AddTagsToResource", + "Parameters": { + "ResourceType": {"StaticValue": {"Values": ["AWS::EC2::Instance"]}}, + "Tags": {"StaticValue": {"Values": ["{\"Team\": \"Unassigned\"}"]}} + } + }]' + ``` + +#### **B. Tag-Based Resource Nuking** +```bash +# Terminate dev instances launched before a cutoff date (set it to 30 days ago; run from a scheduled job or Lambda) +aws ec2 describe-instances \ + --filters "Name=tag:Environment,Values=dev" \ + --query 'Reservations[].Instances[?LaunchTime<`2024-01-01`][].InstanceId' \ + --output text | xargs -n1 aws ec2 terminate-instances --instance-ids +``` + +#### **C. 
Cost Transparency Dashboards** +- **Grafana + AWS CUR**: Visualize costs by team/tag: + ```sql + SELECT resource_tags_user_team, SUM(line_item_unblended_cost) + FROM aws_cur + WHERE line_item_usage_start_date >= '2024-01-01' + GROUP BY 1 + ``` + +--- + +### **5. Real-World Tagging War Story** +**Problem**: A 300-VPC brownfield environment with **$250k/month untagged spend**. +**Solution**: +1. Used **AWS Resource Groups** to tag all resources with `Team=Unknown`. +2. Created **SCPs** blocking new untagged resources. +3. Published **weekly cost reports** by tag to executives. +**Result**: +- **$80k/month reclaimed** from orphaned resources. +- Teams suddenly started tagging properly when their budgets were on the line. + +--- + +### **Your Tagging Command Cheat Sheet** +| Task | Command | +|------|---------| +| Find untagged EC2 instances | `aws ec2 describe-instances --query 'Reservations[].Instances[?!not_null(Tags[])]'` | +| Bulk tag VPCs | `aws ec2 describe-vpcs --query 'Vpcs[].VpcId' --output text \| xargs -I {} aws ec2 create-tags --resources {} --tags Key=Team,Value=networking` | +| Enforce tags via SCP | [See Section 1A](#1-the-tagging-iron-fist-strategy) | + +--- + +### **Final Move: Become the Tagging Hero** +1. **Phase 1**: Deploy mandatory tags (SCPs + automation). +2. **Phase 2**: Report costs by team/project (CEO loves this). +3. **Phase 3**: Profit when you **tie infrastructure savings to your bonus**. + +Want me to **build a custom tagging enforcement script** for your specific environment? I can generate Terraform/CLI snippets that fit your org’s structure. + +--- + +Here’s a **curated list of essential tools** for cloud network engineers, categorized by use case and paired with real-world scenarios: + +--- + +### **1. Core Networking Tools** +#### **A. Packet Analysis** +- **`tcpdump`** + - **Why**: Debug EC2 instance traffic when Security Groups/Flow Logs aren’t enough. 
+ - **Pro Command**: + ```bash + sudo tcpdump -i eth0 'host 10.0.1.5 and port 443' -nnvv -w debug.pcap + ``` +- **Wireshark** + - **Why**: GUI analysis of `tcpdump` captures (TLS handshakes, retransmits). + +#### **B. DNS & Connectivity** +- **`dig`/`nslookup`** + - **Why**: Validate PrivateLink endpoints, Route53 resolver issues. + - **Pro Command**: + ```bash + dig +short myapp.privatesvc.us-east-1.vpce.amazonaws.com + ``` +- **`mtr` (My Traceroute)** + - **Why**: Hybrid cloud latency diagnosis (AWS → on-prem). + +--- + +### **2. Cloud-Native Tools** +#### **A. AWS-Centric** +| Tool | Use Case | Pro Tip | +|------|----------|---------| +| **VPC Flow Logs + CloudWatch Insights** | Detect REJECTed traffic | `filter action="REJECT" \| stats count(*) by srcAddr` | +| **AWS Reachability Analyzer** | Pre-check route table changes | `aws ec2 create-network-insights-path` | +| **Traffic Mirroring** | Capture ENI traffic for IDS | Mirror to an ENI or NLB in front of an IDS (e.g., Zeek or Suricata) | +| **PrivateLink** | Secure cross-account services | Always check DNS resolution (`vpce-xxx-123.region.vpce.amazonaws.com`) | + +#### **B. Multi-Cloud** +- **Terraform** + - **Why**: Automate NACL/SG changes with zero-downtime rollouts. + - **Pro Tip**: Use `create_before_destroy` for rule updates. +- **Pulumi** + - **Why**: Code-based networking (Python/TypeScript) for complex TGW designs. + +--- + +### **3. Automation & Scripting** +#### **A. CLI Mastery** +- **AWS CLI** + - **Key Command**: + ```bash + aws ec2 describe-security-groups --query 'SecurityGroups[?length(IpPermissions)>`5`].GroupId' + ``` +- **`jq`** + - **Why**: Parse JSON outputs (e.g., filter Flow Logs for anomalies). + - **Example**: + ```bash + aws ec2 describe-network-acls | jq '.NetworkAcls[] | select(.IsDefault==true)' + ``` + +#### **B. Infrastructure as Code (IaC)** +- **Ansible** + - **Why**: Bulk EC2 instance configs (iptables, sysctl tuning). 
+- **CDK (Cloud Development Kit)** + - **Why**: Programmatically build VPC peering with failover. + +--- + +### **4. Security & Compliance** +| Tool | Use Case | Pro Tip | +|------|----------|---------| +| **Zeek (formerly Bro)** | IDS for Traffic Mirroring | Pair with **Suricata** for signature-based rules | +| **OpenVPN/AWS Client VPN** | Secure access to private subnets | Enforce MFA through the directory or SAML IdP set in `--authentication-options` of `aws ec2 create-client-vpn-endpoint` | +| **AWS Network Firewall** | Layer 7 protection | Use a stateful **domain list** rule group for egress filtering | + +--- + +### **5. Performance & Monitoring** +#### **A. Real-Time** +- **Grafana + Prometheus** + - **Why**: Visualize NAT Gateway throughput drops. + - **Pro Setup**: Pull CloudWatch metrics through an exporter such as `cloudwatch_exporter`. +- **ELK Stack** + - **Why**: Index Flow Logs for threat hunting. + +#### **B. Synthetic Testing** +- **CloudWatch Synthetics** + - **Why**: Simulate user traffic through TGW attachments. +- **Pingdom** + - **Why**: Monitor hybrid cloud (AWS → on-prem VPN). + +--- + +### **6. Hybrid & On-Prem Integration** +- **Megaport/AWS Direct Connect** + - **Key Command**: + ```bash + aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[?virtualGatewayId==`null`]' + ``` +- **Cisco CSR/VM-Series** + - **Why**: Site-to-cloud VPN with BGP failover. + +--- + +### **7. Must-Know Concepts (Not Tools, But Critical)** +1. **Ephemeral Ports (Linux default 32768-60999; NAT Gateways use 1024-65535)**: + - Stateless NACLs must allow return traffic on these ports; Security Groups are stateful and allow replies automatically. +2. **MTU 1500**: + - Path MTU Discovery fails when ICMP "fragmentation needed" messages are blocked along the path (for example by a NACL or firewall), so always test with: + ```bash + ping -s 1472 -M do 8.8.8.8 # (1472 + 28 header = 1500) + ``` +3. **BGP Timers**: + - Tune keepalive and hold timers on your router (e.g., `keepalive=10`, `holddown=30`) for faster Direct Connect failover, or enable BFD. + +--- + +### **Real-World Toolchain Example** +**Scenario**: "Why is my ALB dropping HTTPS traffic?" +1. **Flow Logs**: + ```sql + fields @timestamp, srcAddr, dstAddr, action | filter dstPort=443 and action="REJECT" + ``` +2. 
**`tcpdump` on EC2**: + ```bash + # Match packets on port 443 that have SYN or ACK set (flags are tested through tcp[tcpflags]) + sudo tcpdump -i eth0 'tcp port 443 and tcp[tcpflags] & (tcp-syn|tcp-ack) != 0' -XX + ``` +3. **Terraform Rollback**: + ```hcl + lifecycle { ignore_changes = [security_group_ids] } # Preserve temp rules; the attribute name depends on the resource type + ``` + +--- + +### **Cheat Sheet: Top Commands** +1. Find SGs with more than 10 ingress rule entries (candidates for a permissiveness review): + ```bash + aws ec2 describe-security-groups --query 'SecurityGroups[?length(IpPermissions)>`10`]' + ``` +2. Check NAT Gateway health: + ```bash + aws ec2 describe-nat-gateways --filter Name=state,Values=failed + ``` +3. Test Direct Connect BGP: + ```bash + aws directconnect describe-virtual-interfaces --query 'virtualInterfaces[*].[virtualInterfaceId,bgpPeers[0].bgpStatus]' + ``` + +--- + +### **Final Advice** +- **Master 20% of tools that handle 80% of issues** (Flow Logs, `tcpdump`, AWS CLI). +- **Automate repetitive checks** (e.g., nightly NACL audits with Lambda). +- **Bookmark the Unusual**: + - [AWS VPC Flow Logs Field Meanings](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records) + - [IANA Port Assignments](https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml) + +Want a **deep dive on a specific tool**? I can provide battle-tested config templates or troubleshooting playbooks. + +--- + Absolutely—**mastering `tcpdump` is invaluable** for cloud engineers, even in AWS/GCP/Azure environments. Here’s why, when to use it, and how it complements cloud-native tools: ---