# **Grafana Alloy: Zero to Hero Guide**

## **Part 1: Foundation & Core Concepts**

### **What is Grafana Alloy?**
Alloy is a **unified telemetry agent** that collects, processes, and forwards:
- **Logs** (to Loki)
- **Metrics** (to Prometheus)
- **Traces** (to Tempo) - not covered in video but supported

**Why use it?**
- Replaces: Promtail, node_exporter, cadvisor, Loki Docker plugin
- Single configuration file for everything
- Process/filter data BEFORE storage
- Component-based architecture (Lego blocks for monitoring)

---

## **Part 2: Installation (5 Minutes)**

### **Option A: Docker (Recommended for Testing)**
```yaml
# docker-compose.yml
services:
  alloy:
    image: grafana/alloy:latest
    container_name: alloy
    hostname: your-server-name # ← CRITICAL: Set your actual hostname
    command:
      - run
      - "--server.http.listen-addr=0.0.0.0:12345"
      - "--storage.path=/var/lib/alloy/data"
      - "/etc/alloy/config.alloy"   # config path is a positional argument
    ports:
      - "12345:12345" # Web UI
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy
      - alloy-data:/var/lib/alloy/data
      - /var/log:/var/log:ro # For host logs
      - /var/run/docker.sock:/var/run/docker.sock:ro # For Docker
      - /proc:/proc:ro # For metrics
      - /sys:/sys:ro # For metrics
    restart: unless-stopped

volumes:
  alloy-data:
```

### **Option B: Binary (Production)**
```bash
# Download the latest release (assets are zipped)
wget https://github.com/grafana/alloy/releases/latest/download/alloy-linux-amd64.zip
unzip alloy-linux-amd64.zip
chmod +x alloy-linux-amd64
sudo mv alloy-linux-amd64 /usr/local/bin/alloy

# Create systemd service
sudo nano /etc/systemd/system/alloy.service
```

**Service file:**
```ini
[Unit]
Description=Grafana Alloy
After=network.target

[Service]
Type=simple
User=alloy
ExecStart=/usr/local/bin/alloy run 
/etc/alloy/config.alloy
Restart=always

[Install]
WantedBy=multi-user.target
```

---

## **Part 3: Your First Configuration (Level 1)**

### **Basic Structure - Understanding Components**
```alloy
// config.alloy
// 1. TARGETS (Where data goes)
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}

// 2. SOURCES (Where data comes from)
// We'll add these in next steps
```

**Key Concept**: Alloy connects `sources` → `processors` → `targets`

---

## **Part 4: Collect Host Logs (Level 2)**

### **Replace Promtail - Simple Log Collection**
```alloy
// Add to config.alloy after targets

// File discovery
local.file_match "syslog" {
  path_targets = [{
    __address__ = "localhost",
    __path__    = "/var/log/syslog",
  }]
}

// Log reader
loki.source.file "syslog" {
  targets    = local.file_match.syslog.targets
  forward_to = [loki.write.default.receiver]
}

// Convert an existing Promtail config instead:
// alloy convert --source-format=promtail --output=config.alloy promtail.yaml
```

**Test it:**
```bash
docker-compose up -d
curl http://localhost:12345/-/healthy # Should return healthy
```

---

## **Part 5: Collect Host Metrics (Level 3)**

### **Replace node_exporter - System Metrics**
```alloy
// Prometheus must be started with this flag:
// --web.enable-remote-write-receiver

prometheus.exporter.unix "node_metrics" {
  // Automatically collects CPU, memory, disk, network
}

discovery.relabel "node_metrics" {
  targets = prometheus.exporter.unix.node_metrics.targets

  rule {
    source_labels = ["__address__"]
    target_label  = "instance"
    replacement   = constants.hostname // Uses system hostname
  }

  rule {
    target_label = "job"
    replacement  = constants.hostname + "-metrics" // Dynamic job name
  }
}

prometheus.scrape "node_metrics" {
  targets    = discovery.relabel.node_metrics.output
  forward_to 
= [prometheus.remote_write.default.receiver]
}
```

**View metrics in Grafana:**
1. Import dashboard ID `1860` (Node Exporter Full)
2. Filter by `job=your-hostname-metrics`

---

## **Part 6: Add Processing (Level 4)**

### **Relabeling - Add Custom Labels**
```alloy
// For logs
loki.relabel "add_os_label" {
  forward_to = [loki.write.default.receiver]

  rule {
    target_label = "os"
    replacement  = constants.os // auto-populates "linux"
  }

  rule {
    target_label = "environment"
    replacement  = "production"
  }
}

// Update the log source to use the relabeler
loki.source.file "syslog" {
  targets    = local.file_match.syslog.targets
  forward_to = [loki.relabel.add_os_label.receiver] // Changed!
}
```

### **Filtering - Drop Unwanted Logs**
`loki.relabel` only sees labels, not log lines. To filter on log *content*, use `loki.process` stages:
```alloy
loki.process "filter_logs" {
  forward_to = [loki.write.default.receiver]

  // Drop DEBUG lines
  stage.drop {
    expression = "(?i)debug"
  }

  // Keep only ERROR/WARN/FAIL lines: drop everything that does
  // NOT match the line filter (the selector is LogQL)
  stage.match {
    selector = "{job=~\".+\"} !~ `(?i)(error|warn|fail)`"
    action   = "drop"
  }
}
```

---

## **Part 7: Docker Monitoring (Level 5)**

### **Collect Container Logs & Metrics**
```alloy
// Docker logs (no plugin needed!) 
// Discover running containers first
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Map Docker metadata onto labels (container names start with "/";
// other __meta_docker_* labels are available too)
discovery.relabel "container_logs" {
  targets = discovery.docker.containers.targets

  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/(.*)"
    target_label  = "container_name"
  }
}

loki.source.docker "container_logs" {
  host          = "unix:///var/run/docker.sock"
  targets       = discovery.docker.containers.targets
  relabel_rules = discovery.relabel.container_logs.rules
  forward_to    = [loki.write.default.receiver]
}

// Docker metrics (embedded cAdvisor)
prometheus.exporter.cadvisor "container_metrics" {
  docker_host = "unix:///var/run/docker.sock"
}

discovery.relabel "docker_metrics" {
  targets = prometheus.exporter.cadvisor.container_metrics.targets

  rule {
    target_label = "job"
    replacement  = constants.hostname + "-docker"
  }
}

prometheus.scrape "docker_metrics" {
  targets    = discovery.relabel.docker_metrics.output
  forward_to = [prometheus.remote_write.default.receiver]
}
```

**No Docker Compose changes needed!** `discovery.docker` finds every running container automatically.

---

## **Part 8: Advanced Scenarios (Level 6)**

### **Multiple Log Sources**
```alloy
// System journal
loki.source.journal "journal" {
  forward_to = [loki.write.default.receiver]

  // Required when Alloy runs in a container
  path = "/var/log/journal"

  labels = {
    job = constants.hostname + "-journal",
  }
}

// Application logs
local.file_match "app_logs" {
  path_targets = [{
    __address__ = "localhost",
    __path__    = "/app/*.log",
    app         = "myapp", // Custom label
  }]
}

loki.source.file "app_logs" {
  targets    = local.file_match.app_logs.targets
  forward_to = [loki.write.default.receiver]
}
```

### **Multiple Output Destinations**
```alloy
// Development Loki
loki.write "dev" {
  endpoint {
    url = "http://loki-dev:3100/loki/api/v1/push"
  }

  external_labels = {
    environment = "development",
  }
}

// Production Loki
loki.write "prod" {
  endpoint {
    url = "http://loki-prod:3100/loki/api/v1/push"
  }

  external_labels = {
    environment = "production",
  }
}

// Route on a label: each chain keeps only its own traffic,
// so sources can safely forward to BOTH chains
loki.relabel "route_prod" {
  rule {
    source_labels = ["environment"]
    regex         = "prod"
    action        = "keep"
  }

  forward_to = [loki.write.prod.receiver]
}

loki.relabel "route_dev" {
  rule {
    source_labels = ["environment"]
    regex         = "dev"
    action        = "keep"
  }

  forward_to = [loki.write.dev.receiver]
}
```

---

## **Part 9: Best Practices**

### **1. Configuration Organization**
Split the config into multiple files; `alloy run` accepts a directory and loads every `*.alloy` file in it:
```alloy
// /etc/alloy/01-targets.alloy - Output destinations
loki.write "default" { /* ... */ }
prometheus.remote_write "default" { /* ... */ }

// /etc/alloy/02-system-metrics.alloy - Host metrics
prometheus.exporter.unix "node" { /* ... */ }

// /etc/alloy/03-system-logs.alloy - Host logs
local.file_match "logs" { /* ... */ }

// /etc/alloy/04-docker.alloy - Container monitoring
loki.source.docker "containers" { /* ... */ }

// Load them all:
//   alloy run /etc/alloy/

// Shared config can also be pulled from git as a module
// (the imported file defines reusable declare blocks):
import.git "configs" {
  repository     = "https://github.com/your-org/alloy-configs"
  path           = "modules/common.alloy" // one module file, not a glob
  pull_frequency = "5m"
}
```

### **2. Label Strategy**
```alloy
// Consistent labeling template
discovery.relabel "standard_labels" {
  targets = prometheus.exporter.unix.node.targets

  rule {
    target_label = "host"
    replacement  = constants.hostname
  }
  rule {
    target_label = "region"
    replacement  = "us-east-1"
  }
  rule {
    target_label = "team"
    replacement  = "platform"
  }
  rule {
    target_label = "job"
    replacement  = constants.hostname + "-node" // one job name per component
  }
}
```

### **3. Buffering & Retry**
```alloy
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"

    // Back off and retry while Loki is down
    min_backoff_period  = "100ms"
    max_backoff_period  = "5s"
    max_backoff_retries = 10
  }

  external_labels = {
    agent = "alloy",
  }
}
```

---

## **Part 10: Complete Production Example**

```alloy
// === PRODUCTION CONFIG ===
// File: /etc/alloy/config.alloy

// 1. 
OUTPUTS
loki.write "production" {
  endpoint {
    url = "http://loki-prod:3100/loki/api/v1/push"

    // Batching
    batch_wait = "1s"
    batch_size = 1048576 // 1MB
  }

  external_labels = {cluster = "prod", agent = "alloy"}

  max_streams = 10000
}

prometheus.remote_write "production" {
  endpoint {
    url = "http://prometheus-prod:9090/api/v1/write"

    // Queue config
    queue_config {
      capacity             = 2500
      max_shards           = 200
      min_shards           = 1
      max_samples_per_send = 500
    }
  }

  external_labels = {cluster = "prod"}
}

// 2. COMMON PROCESSING
loki.relabel "common_labels" {
  rule {
    target_label = "host"
    replacement  = constants.hostname
  }
  rule {
    target_label = "os"
    replacement  = constants.os
  }
  rule {
    target_label = "agent"
    replacement  = "alloy"
  }
  forward_to = [loki.write.production.receiver]
}

// 3. SYSTEM METRICS
prometheus.exporter.unix "default" { }

discovery.relabel "system_metrics" {
  targets = prometheus.exporter.unix.default.targets

  rule {
    target_label = "job"
    replacement  = constants.hostname + "-system"
  }
  rule {
    target_label = "instance"
    replacement  = constants.hostname
  }
}

prometheus.scrape "system" {
  targets    = discovery.relabel.system_metrics.output
  forward_to = [prometheus.remote_write.production.receiver]
}

// 4. SYSTEM LOGS
local.file_match "system_logs" {
  path_targets = [
    {__address__ = "localhost", __path__ = "/var/log/syslog"},
    {__address__ = "localhost", __path__ = "/var/log/auth.log"},
    {__address__ = "localhost", __path__ = "/var/log/kern.log"},
  ]
}

loki.source.file "system_logs" {
  targets    = local.file_match.system_logs.targets
  forward_to = [loki.relabel.common_labels.receiver]
}

// 5. JOURNAL
loki.source.journal "journal" {
  path       = "/var/log/journal" // Required in containers
  forward_to = [loki.relabel.common_labels.receiver]

  labels = {
    job = constants.hostname + "-journal",
  }
}

// 6. 
DOCKER
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.relabel.common_labels.receiver]

  labels = {
    job = constants.hostname + "-docker",
  }
}

prometheus.exporter.cadvisor "default" {
  docker_host = "unix:///var/run/docker.sock"
}

discovery.relabel "docker_metrics" {
  targets = prometheus.exporter.cadvisor.default.targets

  rule {
    target_label = "job"
    replacement  = constants.hostname + "-docker"
  }
}

prometheus.scrape "docker" {
  targets    = discovery.relabel.docker_metrics.output
  forward_to = [prometheus.remote_write.production.receiver]
}
```

---

## **Part 11: Troubleshooting Cheat Sheet**

### **Common Issues & Fixes:**

1. **"No metrics in Prometheus"**
   ```bash
   # Check Prometheus has remote write enabled
   ps aux | grep prometheus | grep enable-remote-write

   # Test the endpoint (an empty POST should get an HTTP error,
   # not "connection refused")
   curl -X POST http://prometheus:9090/api/v1/write
   ```

2. **"No logs in Loki"**
   ```bash
   # Check the component graph in the Alloy web UI
   http://localhost:12345/graph

   # Check component health
   curl http://localhost:12345/-/healthy

   # View Alloy logs
   docker logs alloy
   ```

3. **"Journal not working in container"**
   ```alloy
   // Add path to journal source
   loki.source.journal "journal" {
     path = "/var/log/journal" // ← THIS LINE
   }
   ```

4. **"Hostname wrong in metrics"**
   ```yaml
   # In docker-compose.yml
   alloy:
     hostname: your-actual-hostname # ← Set explicitly
   ```

5. 
**Validate config:**
   ```bash
   # alloy fmt parses the file and fails on syntax errors
   alloy fmt config.alloy
   ```

### **Debug Commands:**
```bash
# View component graph
open http://localhost:12345/graph

# Check metrics Alloy generates about itself
curl http://localhost:12345/metrics

# Reload the config without restarting
curl -X POST http://localhost:12345/-/reload
```

---

## **Part 12: Migration Checklist**

### **From Old Stack → Alloy**

| Old Component | Alloy Replacement | Action |
|---------------|-------------------|--------|
| Promtail | `loki.source.file` + `local.file_match` | Run `alloy convert` |
| node_exporter | `prometheus.exporter.unix` | Remove node_exporter service |
| cadvisor | `prometheus.exporter.cadvisor` | Remove cadvisor container |
| Loki Docker plugin | `loki.source.docker` | Remove plugin, revert to the default `json-file` log driver |
| Multiple config files | Single `config.alloy` | Consolidate into sections |

### **Step-by-Step Migration:**
1. **Stage 1**: Deploy Alloy alongside existing tools
2. **Stage 2**: Compare data between old/new in Grafana
3. **Stage 3**: Cut a small subset of hosts over to Alloy
4. 
**Stage 4**: Full cutover, decommission old tools

---

## **Quick Reference Card**

### **Essential Components:**
- `local.file_match` - Find log files
- `loki.source.file` - Read log files
- `loki.source.journal` - Read systemd journal
- `loki.source.docker` - Read container logs
- `discovery.docker` - Discover running containers
- `prometheus.exporter.unix` - System metrics
- `prometheus.exporter.cadvisor` - Container metrics
- `loki.relabel` - Process log labels
- `discovery.relabel` - Relabel scrape targets
- `prometheus.scrape` - Scrape metric targets
- `loki.write` - Send logs
- `prometheus.remote_write` - Send metrics

### **Magic Variables:**
- `constants.hostname` - System hostname
- `constants.os` - Operating system
- `constants.arch` - CPU architecture

### **Web UI Endpoints:**
- `:12345/` - Homepage
- `:12345/graph` - Component visualization
- `:12345/metrics` - Self-metrics
- `:12345/-/healthy` - Health check
- `:12345/-/reload` - Reload config (POST)

---

## **Next Steps After Mastery**

1. **Add tracing**: `otelcol.*` components for OpenTelemetry
2. **Multi-cluster**: Use `import.http` for centralized config
3. **Custom components**: Compose your own with `declare` blocks (modules)
4. **Kubernetes**: Use the Alloy Helm chart for dynamic discovery
5. **Alerting**: Alert on Alloy's own self-metrics (`:12345/metrics`) in Prometheus

## **Resources**
- [Official Docs](https://grafana.com/docs/alloy/latest/)
- [Alloy Scenarios (Examples)](https://github.com/grafana/alloy-scenarios)
- [Configuration Reference](https://grafana.com/docs/alloy/latest/reference/components/)

---

**Remember**: Start simple, use the web UI (`:12345/graph`) to visualize your pipeline, and incrementally add complexity. Alloy's power is in its composability - build your monitoring like Lego!
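
---

## **Bonus: Smoke-Test the Loki Push Endpoint**

To verify the Loki side of the pipeline end to end, you can hand-build the same kind of request body that `loki.write` sends to `/loki/api/v1/push` and POST a synthetic log line. The sketch below (Python; the label names are illustrative, nothing here is required by Loki) only constructs and prints the payload, so it runs offline — the `curl` line in the trailing comment does the actual send.

```python
import json
import time


def build_loki_push_payload(labels: dict, lines: list) -> dict:
    """Build a body for POST /loki/api/v1/push.

    Loki expects: {"streams": [{"stream": <label map>,
    "values": [["<unix ns timestamp as string>", "<line>"], ...]}]}
    """
    now_ns = str(time.time_ns())  # Loki wants nanoseconds as a string
    return {
        "streams": [
            {
                "stream": labels,
                "values": [[now_ns, line] for line in lines],
            }
        ]
    }


# Labels mirror what the configs above would attach to a stream
payload = build_loki_push_payload(
    {"job": "smoke-test", "agent": "alloy"},
    ["test log line from smoke test"],
)
print(json.dumps(payload, indent=2))

# Send it with (adjust the host to your Loki):
#   curl -H "Content-Type: application/json" \
#     -d "$(python3 this_script.py)" http://loki:3100/loki/api/v1/push
```

If the line then shows up under `{job="smoke-test"}` in Grafana Explore, Loki ingestion works and any remaining gap is on the Alloy side.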
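
---

## **Bonus: How `keep` Relabel Rules Route Streams**

The label-based routing pattern in Part 8 relies on Prometheus-style relabel semantics: with `action = "keep"`, the rule's regex is anchored — it must match the *entire* value of the source label — and non-matching streams are discarded by that chain. A small Python model of just that decision (this is not Alloy code; the label names are illustrative):

```python
import re


def passes_keep_rule(labels: dict, source_label: str, regex: str) -> bool:
    """Return True if a stream with these labels survives an
    action = "keep" relabel rule. Prometheus-style relabeling
    anchors the regex, so re.fullmatch models it correctly."""
    value = labels.get(source_label, "")
    return re.fullmatch(regex, value) is not None


streams = [
    {"environment": "prod", "job": "api"},
    {"environment": "dev", "job": "api"},
]

# The "route_prod" chain keeps only environment matching "prod"
prod_streams = [s for s in streams if passes_keep_rule(s, "environment", "prod")]
print(prod_streams)  # only the prod stream survives
```

Note the anchoring: `regex = "prod"` would *not* keep a stream labeled `environment="production"` — use `"prod.*"` if you want prefix matching.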