
Grafana Alloy: Zero to Hero Guide

Part 1: Foundation & Core Concepts

What is Grafana Alloy?

Alloy is a unified telemetry agent that collects, processes, and forwards:

  • Logs (to Loki)
  • Metrics (to Prometheus)
  • Traces (to Tempo) - not covered in video but supported

Why use it?

  • Replaces: Promtail, node_exporter, cadvisor, Loki Docker plugin
  • Single configuration file for everything
  • Process/filter data BEFORE storage
  • Component-based architecture (Lego blocks for monitoring)

Part 2: Installation (5 Minutes)

Option A: Docker Compose (Quick Start)

# docker-compose.yml
version: '3.8'
services:
  alloy:
    image: grafana/alloy:latest
    container_name: alloy
    hostname: your-server-name  # ← CRITICAL: Set your actual hostname
    command:
      - run
      - "--server.http.listen-addr=0.0.0.0:12345"
      - "--storage.path=/var/lib/alloy/data"
      - /etc/alloy/config.alloy  # Alloy takes the config path positionally
    ports:
      - "12345:12345"  # Web UI
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy
      - alloy-data:/var/lib/alloy/data
      - /var/log:/var/log:ro  # For host logs
      - /var/run/docker.sock:/var/run/docker.sock:ro  # For Docker
      - /proc:/proc:ro  # For metrics
      - /sys:/sys:ro    # For metrics
    restart: unless-stopped

volumes:
  alloy-data:

Option B: Binary (Production)

# Download the latest release (binaries ship zipped)
wget https://github.com/grafana/alloy/releases/latest/download/alloy-linux-amd64.zip
unzip alloy-linux-amd64.zip
chmod +x alloy-linux-amd64
sudo mv alloy-linux-amd64 /usr/local/bin/alloy

# Create systemd service
sudo nano /etc/systemd/system/alloy.service

Service file:

[Unit]
Description=Grafana Alloy
After=network.target

[Service]
Type=simple
User=alloy
ExecStart=/usr/local/bin/alloy run /etc/alloy/config.alloy
Restart=always

[Install]
WantedBy=multi-user.target

Part 3: Your First Configuration (Level 1)

Basic Structure - Understanding Components

// config.alloy
// 1. TARGETS (Where data goes)
loki.write "default" {
    endpoint {
        url = "http://loki:3100/loki/api/v1/push"
    }
}

prometheus.remote_write "default" {
    endpoint {
        url = "http://prometheus:9090/api/v1/write"
    }
}

// 2. SOURCES (Where data comes from)
// We'll add these in next steps

Key Concept: Alloy connects sources → processors → targets
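Components are wired together through their exports: each component exposes values, such as a receiver, that other components reference by name. A minimal end-to-end log pipeline (the "demo" labels are illustrative) looks like:

```alloy
// SOURCE: find the file
local.file_match "demo" {
    path_targets = [{ __path__ = "/var/log/syslog" }]
}

// SOURCE: read it, hand off to the processor
loki.source.file "demo" {
    targets    = local.file_match.demo.targets
    forward_to = [loki.relabel.demo.receiver]
}

// PROCESSOR: stamp a label, hand off to the target
loki.relabel "demo" {
    forward_to = [loki.write.default.receiver]

    rule {
        target_label = "pipeline"
        replacement  = "demo"
    }
}
```

Reverse the reading order and you get the data flow: file_match finds files, source.file reads them, relabel decorates them, write ships them.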


Part 4: Collect Host Logs (Level 2)

Replace Promtail - Simple Log Collection

// Add to config.alloy after targets

// File discovery
local.file_match "syslog" {
    path_targets = [{
        __address__ = "localhost",
        __path__    = "/var/log/syslog",
    }]
}

// Log reader
loki.source.file "syslog" {
    targets    = local.file_match.syslog.targets
    forward_to = [loki.write.default.receiver]
}

// Convert an existing Promtail config:
// alloy convert --source-format=promtail --output=config.alloy promtail.yaml

Test it:

docker-compose up -d
curl http://localhost:12345/-/healthy  # Should report Alloy is healthy

Part 5: Collect Host Metrics (Level 3)

Replace node_exporter - System Metrics

// Prometheus config needs this flag:
// --web.enable-remote-write-receiver

prometheus.exporter.unix "node_metrics" {
    // Automatically collects CPU, memory, disk, network
}

discovery.relabel "node_metrics" {
    targets = prometheus.exporter.unix.node_metrics.targets
    
    rule {
        source_labels = ["__address__"]
        target_label  = "instance"
        replacement   = constants.hostname  // Uses system hostname
    }
    
    rule {
        target_label = "job"
        replacement  = constants.hostname + "-metrics"  // Dynamic job name
    }
}

prometheus.scrape "node_metrics" {
    targets    = discovery.relabel.node_metrics.output
    forward_to = [prometheus.remote_write.default.receiver]
}

View metrics in Grafana:

  1. Import dashboard ID 1860 (Node Exporter Full)
  2. Filter by job=your-hostname-metrics

Part 6: Add Processing (Level 4)

Relabeling - Add Custom Labels

// For logs
loki.relabel "add_os_label" {
    forward_to = [loki.write.default.receiver]
    
    rule {
        target_label = "os"
        replacement  = constants.os  // auto-populates "linux"
    }
    
    rule {
        target_label = "environment"
        replacement  = "production"
    }
}

// Update log source to use relabeler
loki.source.file "syslog" {
    targets    = local.file_match.syslog.targets
    forward_to = [loki.relabel.add_os_label.receiver]  // Changed!
}

Filtering - Drop Unwanted Logs

// loki.relabel only sees labels, not the log line itself;
// line-content filtering is done with loki.process stages.
loki.process "filter_logs" {
    forward_to = [loki.write.default.receiver]

    // Drop DEBUG and TRACE lines (expression is matched
    // against the log line content)
    stage.drop {
        expression = "(?i)(debug|trace)"
    }
}
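For structured logs, loki.process can also parse lines and act on extracted fields. A sketch, assuming the application emits JSON lines with a level field (the component name is illustrative):

```alloy
loki.process "json_logs" {
    forward_to = [loki.write.default.receiver]

    // Extract the "level" field from each JSON line
    stage.json {
        expressions = { level = "level" }
    }

    // Promote the extracted value to a Loki label
    stage.labels {
        values = { level = "" }
    }

    // Drop entries whose extracted level is "debug"
    stage.drop {
        source = "level"
        value  = "debug"
    }
}
```

Promoting level to a label also lets you filter with LogQL later, e.g. {level="error"}.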

Part 7: Docker Monitoring (Level 5)

Collect Container Logs & Metrics

// Docker logs (no plugin needed!)
discovery.docker "containers" {
    host = "unix:///var/run/docker.sock"
}

discovery.relabel "container_logs" {
    targets = discovery.docker.containers.targets
    // Strip the leading "/" from the container name
    rule {
        source_labels = ["__meta_docker_container_name"]
        regex         = "/(.*)"
        target_label  = "container_name"
    }
}

loki.source.docker "container_logs" {
    host          = "unix:///var/run/docker.sock"
    targets       = discovery.docker.containers.targets
    relabel_rules = discovery.relabel.container_logs.rules
    forward_to    = [loki.write.default.receiver]
}

// Docker metrics (Alloy's embedded cAdvisor)
prometheus.exporter.cadvisor "container_metrics" {
    docker_host = "unix:///var/run/docker.sock"
}

discovery.relabel "docker_metrics" {
    targets = prometheus.exporter.cadvisor.container_metrics.targets

    rule {
        target_label = "job"
        replacement  = constants.hostname + "-docker"
    }
}

prometheus.scrape "docker_metrics" {
    targets    = discovery.relabel.docker_metrics.output
    forward_to = [prometheus.remote_write.default.receiver]
}

No Docker Compose changes needed! All containers are automatically discovered.
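If collecting from every container is too broad, discovery can be narrowed before the log or metric components ever see the targets. A sketch using an opt-in container label; the logging=alloy label and the component names are assumptions, not a convention Alloy requires:

```alloy
discovery.docker "all_containers" {
    host = "unix:///var/run/docker.sock"
}

discovery.relabel "opted_in" {
    targets = discovery.docker.all_containers.targets

    // Keep only containers started with the label logging=alloy
    rule {
        source_labels = ["__meta_docker_container_label_logging"]
        regex         = "alloy"
        action        = "keep"
    }
}
```

The filtered discovery.relabel.opted_in.output can then be fed to the log and metric components in place of the full target list.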


Part 8: Advanced Scenarios (Level 6)

Multiple Log Sources

// System journal
loki.source.journal "journal" {
    forward_to = [loki.write.default.receiver]
    
    // Required in containers
    path = "/var/log/journal"
    
    labels = {
        job = constants.hostname + "-journal",
    }
}

// Application logs
local.file_match "app_logs" {
    path_targets = [{
        __address__ = "localhost",
        __path__    = "/app/*.log",
        app         = "myapp",  // Custom label
    }]
}

loki.source.file "app_logs" {
    targets    = local.file_match.app_logs.targets
    forward_to = [loki.write.default.receiver]
}

Multiple Output Destinations

// Development Loki
loki.write "dev" {
    endpoint {
        url = "http://loki-dev:3100/loki/api/v1/push"
    }

    external_labels = {
        environment = "development",
    }
}

// Production Loki
loki.write "prod" {
    endpoint {
        url = "http://loki-prod:3100/loki/api/v1/push"
    }

    external_labels = {
        environment = "production",
    }
}

// Route based on label
// A component forwards every entry to ALL receivers listed in
// forward_to, so routing means giving each backend its own
// pipeline that drops what it should not keep:
loki.process "route_prod" {
    forward_to = [loki.write.prod.receiver]

    // Drop development entries
    stage.match {
        selector = "{environment=\"development\"}"
        action   = "drop"
    }
}

loki.process "route_dev" {
    forward_to = [loki.write.dev.receiver]

    // Drop production entries
    stage.match {
        selector = "{environment=\"production\"}"
        action   = "drop"
    }
}

Part 9: Best Practices

1. Configuration Organization

// 01-targets.alloy - Output destinations
loki.write "default" { /* ... */ }
prometheus.remote_write "default" { /* ... */ }

// 02-system-metrics.alloy - Host metrics
prometheus.exporter.unix "node" { /* ... */ }

// 03-system-logs.alloy - Host logs  
local.file_match "logs" { /* ... */ }

// 04-docker.alloy - Container monitoring
loki.source.docker "containers" { /* ... */ }

// Load all of them: point Alloy at the directory instead of a
// single file; every *.alloy file in it joins one pipeline
// alloy run /etc/alloy/

// Shared modules can also be pulled from Git. Note that import.git
// imports declared modules, not plain pipeline files; the repository
// and path below are illustrative.
import.git "configs" {
    repository     = "https://github.com/your-org/alloy-configs"
    path           = "modules/common.alloy"
    pull_frequency = "5m"
}

2. Label Strategy

// Consistent labeling template
discovery.relabel "standard_labels" {
    rule {
        target_label = "host"
        replacement  = constants.hostname
    }
    rule {
        target_label = "region"
        replacement  = "us-east-1"
    }
    rule {
        target_label = "team"
        replacement  = "platform"
    }
    rule {
        target_label = "job"
        // Append a per-pipeline suffix, e.g. "-system" or "-docker"
        replacement  = constants.hostname + "-system"
    }
}

3. Buffering & Retry

loki.write "default" {
    endpoint {
        url = "http://loki:3100/loki/api/v1/push"

        // Retry with backoff while Loki is down
        min_backoff_period  = "100ms"
        max_backoff_period  = "5s"
        max_backoff_retries = 10
    }

    external_labels = {
        agent = "alloy",
    }
}
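Metrics get similar resilience from prometheus.remote_write, which buffers samples in an on-disk write-ahead log under --storage.path while the endpoint is unreachable. Extending the earlier prometheus.remote_write "default" block (the truncation interval here is illustrative):

```alloy
prometheus.remote_write "default" {
    endpoint {
        url = "http://prometheus:9090/api/v1/write"
    }

    // Samples wait in the WAL until delivered, then old
    // segments are truncated on this schedule
    wal {
        truncate_frequency = "2h"
    }
}
```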

Part 10: Complete Production Example

// === PRODUCTION CONFIG ===
// File: /etc/alloy/config.alloy

// 1. OUTPUTS
loki.write "production" {
    endpoint {
        url = "http://loki-prod:3100/loki/api/v1/push"

        // Batching
        batch_wait = "1s"
        batch_size = "1MiB"
    }

    external_labels = {
        cluster = "prod",
        agent   = "alloy",
    }

    // Cap the number of active streams
    max_streams = 10000
}

prometheus.remote_write "production" {
    endpoint {
        url = "http://prometheus-prod:9090/api/v1/write"

        // Queue config
        queue_config {
            capacity             = 2500
            min_shards           = 1
            max_shards           = 200
            max_samples_per_send = 500
        }
    }

    external_labels = {
        cluster = "prod",
    }
}

// 2. COMMON PROCESSING
loki.relabel "common_labels" {
    rule {
        target_label = "host"
        replacement  = constants.hostname
    }
    rule {
        target_label = "os"
        replacement  = constants.os
    }
    rule {
        target_label = "agent"
        replacement  = "alloy"
    }
    forward_to = [loki.write.production.receiver]
}

// 3. SYSTEM METRICS
prometheus.exporter.unix "default" { }

discovery.relabel "system_metrics" {
    targets = prometheus.exporter.unix.default.targets
    
    rule {
        target_label = "job"
        replacement  = constants.hostname + "-system"
    }
    rule {
        target_label = "instance"
        replacement  = constants.hostname
    }
}

prometheus.scrape "system" {
    targets    = discovery.relabel.system_metrics.output
    forward_to = [prometheus.remote_write.production.receiver]
}

// 4. SYSTEM LOGS
local.file_match "system_logs" {
    path_targets = [
        {__path__ = "/var/log/syslog"},
        {__path__ = "/var/log/auth.log"},
        {__path__ = "/var/log/kern.log"},
    ]
}

loki.source.file "system_logs" {
    targets    = local.file_match.system_logs.targets
    forward_to = [loki.relabel.common_labels.receiver]
}

// 5. JOURNAL
loki.source.journal "journal" {
    path       = "/var/log/journal"  // Required in containers
    forward_to = [loki.relabel.common_labels.receiver]
    
    labels = {
        job = constants.hostname + "-journal",
    }
}

// 6. DOCKER
discovery.docker "containers" {
    host = "unix:///var/run/docker.sock"
}

loki.source.docker "containers" {
    host       = "unix:///var/run/docker.sock"
    targets    = discovery.docker.containers.targets
    forward_to = [loki.relabel.common_labels.receiver]

    labels = {
        job = constants.hostname + "-docker",
    }
}

prometheus.exporter.cadvisor "default" {
    docker_host = "unix:///var/run/docker.sock"
}

discovery.relabel "docker_metrics" {
    targets = prometheus.exporter.cadvisor.default.targets

    rule {
        target_label = "job"
        replacement  = constants.hostname + "-docker"
    }
}

prometheus.scrape "docker" {
    targets    = discovery.relabel.docker_metrics.output
    forward_to = [prometheus.remote_write.production.receiver]
}

Part 11: Troubleshooting Cheat Sheet

Common Issues & Fixes:

  1. "No metrics in Prometheus"

    # Check Prometheus has remote write enabled
    ps aux | grep prometheus | grep enable-remote-write
    
    # Test connection
    curl -X POST http://prometheus:9090/api/v1/write
    
  2. "No logs in Loki"

    # Check Alloy web UI
    http://localhost:12345/graph
    
    # Check component health
    curl http://localhost:12345/-/healthy
    
    # View Alloy logs
    docker logs alloy
    
  3. "Journal not working in container"

    // Add path to journal source
    loki.source.journal "journal" {
        path = "/var/log/journal"  // ← THIS LINE
    }
    
  4. "Hostname wrong in metrics"

    # In docker-compose.yml
    alloy:
      hostname: your-actual-hostname  # ← Set explicitly
    
  5. Validate config:

    alloy fmt config.alloy  # parses the file; fails on syntax errors
    

Debug Commands:

# View component graph
open http://localhost:12345/graph

# Check metrics Alloy generates about itself
curl http://localhost:12345/metrics

# Live tail logs
curl -N http://localhost:12345/api/v0/logs/tail

# Export config
curl http://localhost:12345/api/v0/config

Part 12: Migration Checklist

From Old Stack → Alloy

Old Component → Alloy Replacement → Action

  • Promtail → loki.source.file + local.file_match → run alloy convert
  • node_exporter → prometheus.exporter.unix → remove the node_exporter service
  • cadvisor → prometheus.exporter.cadvisor → remove the cadvisor container
  • Loki Docker plugin → loki.source.docker → remove the plugin, keep the json-file logging driver
  • Multiple config files → single config.alloy → consolidate into sections

Step-by-Step Migration:

  1. Stage 1: Deploy Alloy alongside existing tools
  2. Stage 2: Compare data between old/new in Grafana
  3. Stage 3: Route 10% traffic to Alloy
  4. Stage 4: Full cutover, decommission old tools
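Stage 2 is easier when everything shipped by Alloy carries a distinguishing label, so old and new data can sit side by side in the same Grafana panel. A sketch; the collector label name is an assumption:

```alloy
loki.write "default" {
    endpoint {
        url = "http://loki:3100/loki/api/v1/push"
    }

    // Query {collector="alloy"} against your existing Promtail
    // streams to compare the two pipelines
    external_labels = {
        collector = "alloy",
    }
}
```

Drop the label again after the cutover so it does not inflate stream cardinality forever.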

Quick Reference Card

Essential Components:

  • local.file_match - Find log files
  • loki.source.file - Read log files
  • loki.source.journal - Read systemd journal
  • loki.source.docker - Read container logs
  • prometheus.exporter.unix - System metrics
  • prometheus.exporter.cadvisor - Container metrics
  • loki.relabel - Process logs
  • discovery.relabel - Process metrics
  • prometheus.scrape - Send metrics
  • loki.write - Send logs
  • prometheus.remote_write - Send metrics

Magic Variables:

  • constants.hostname - System hostname
  • constants.os - Operating system
  • constants.arch - CPU architecture
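These constants can be used in any expression, for example to stamp host metadata onto every stream (the "labeled" component name is illustrative):

```alloy
loki.write "labeled" {
    endpoint {
        url = "http://loki:3100/loki/api/v1/push"
    }

    // Label every stream with where it came from
    external_labels = {
        host = constants.hostname,
        os   = constants.os,
    }
}
```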

Web UI Endpoints:

  • :12345/ - Homepage
  • :12345/graph - Component visualization
  • :12345/metrics - Self-metrics
  • :12345/-/healthy - Health check
  • :12345/api/v0/config - Current config

Next Steps After Mastery

  1. Add tracing: otelcol.* components for OpenTelemetry
  2. Multi-cluster: Use import.http for centralized config
  3. Custom components: Build your own with Go
  4. Kubernetes: Use Alloy Operator for dynamic discovery
  5. Alerting: Add prometheus.relabel for alert rules


Remember: Start simple, use the web UI (:12345/graph) to visualize your pipeline, and incrementally add complexity. Alloy's power is in its composability - build your monitoring like Lego!