# **The Ultimate AWK Guide: From Basics to Advanced Data Wrangling**

## **Table of Contents**

1. [AWK Fundamentals](#1-awk-fundamentals)
2. [Patterns & Actions](#2-patterns--actions)
3. [Built-in Variables](#3-built-in-variables)
4. [Arrays & Data Structures](#4-arrays--data-structures)
5. [Control Flow](#5-control-flow)
6. [Functions & Math](#6-functions--math)
7. [Advanced Text Processing](#7-advanced-text-processing)
8. [Real-World Recipes](#8-real-world-recipes)
9. [Performance & Limitations](#9-performance--limitations)

---

## **1. AWK Fundamentals**

### **Core Syntax**

```bash
awk [OPTIONS] 'PATTERN {ACTION}' file.txt
```

### **Basic Structure**

```awk
BEGIN   { }   # pre-processing (runs before any input)
PATTERN { }   # line processing (runs for each matching record)
END     { }   # post-processing (runs after all input)
```

### **Common Flags**

```bash
awk -F: '{print $1}' /etc/passwd   # Set field separator
awk -v var=value '...'             # Pass variables
awk -f script.awk file.txt         # Use script file
```

---

## **2. Patterns & Actions**

### **Pattern Types**

```awk
/error/   {print}                           # Regex match
$3 > 100  {print $1}                        # Field comparison
NR == 1   {print}                           # Line number
BEGINFILE {print "Processing:", FILENAME}   # Per-file (gawk)
```

### **Special Patterns**

```awk
BEGIN {FS=":"; OFS="\t"}           # Set input/output separators
END   {print "Total lines:", NR}   # Final processing
```

---

## **3. Built-in Variables**

| Variable | Description |
|----------|-------------|
| `NR` | Current record number |
| `NF` | Number of fields |
| `FS` | Field separator (default: whitespace) |
| `OFS` | Output field separator |
| `FILENAME` | Current file name |
| `FNR` | Record number per file |

### **Example**

```bash
awk '{print NR, NF, $0}' file.txt   # Show line stats
```

---

## **4. Arrays & Data Structures**

### **Associative Arrays**

```awk
{count[$1]++}                                    # Count occurrences
END {for (key in count) print key, count[key]}
```

### **Multi-Dimensional Arrays**

```awk
{array[$1,$2] = $3}   # Fake multi-dim (keys joined with SUBSEP)
```

### **Array Functions**

```awk
split(string, array, separator)
asort(array)    # Sort by value (gawk)
asorti(array)   # Sort by index (gawk)
```

---

## **5. Control Flow**

### **Conditionals**

```awk
{if ($3 > 100) print "High:", $1; else print "Low:", $1}
```

### **Loops**

```awk
for (i=1; i<=NF; i++) {print $i}    # Fields
for (key in array)    {print key}   # Array keys
```

### **Switch/Case**

```awk
{
    switch ($1) {                        # gawk extension
    case "foo":
        print "Found foo"; break
    case /^bar/:
        print "Starts with bar"; break
    default:
        print "Other"
    }
}
```

---

## **6. Functions & Math**

### **Built-in Functions**

```awk
length($0)              # String length
sub(/old/, "new", $1)   # In-field substitution
system("date")          # Run shell command
```

### **Math Operations**

```awk
{sum += $3; sumsq += ($3)^2}
END {print "Mean:", sum/NR, "Std Dev:", sqrt(sumsq/NR - (sum/NR)^2)}
```

### **User Functions**

```awk
function double(x) {return x*2}
{d = double($1); print d}
```

---

## **7. Advanced Text Processing**

### **Field Manipulation**

```awk
{$1 = toupper($1); $NF = $NF "%"}   # Modify fields
```

### **Multi-Line Records**

```bash
awk -v RS="" '{print $1}'   # Paragraph mode: blank lines separate records
```

### **CSV Processing**

```bash
gawk -v FPAT='([^,]+)|("[^"]+")' '{print $2}' data.csv   # FPAT is gawk-only
```
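
The `FPAT` one-liner above is gawk-specific. As a slightly longer sketch of the same idea, here is one way quoted fields with embedded commas can be unwrapped; the `data.csv` file and its sample layout are assumptions for illustration:

```bash
# Minimal sketch (gawk only). Assumed sample row in data.csv:  "Doe, Jane",42
gawk -v FPAT='([^,]+)|("[^"]+")' '{
    for (i = 1; i <= NF; i++) {
        gsub(/^"|"$/, "", $i)                      # strip surrounding quotes
        printf "%s%s", $i, (i < NF ? " | " : "\n") # print fields separated by " | "
    }
}' data.csv
```

For the assumed row this prints `Doe, Jane | 42`, keeping the embedded comma inside the field rather than splitting on it.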
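
Paragraph mode (shown under Multi-Line Records above) also pairs well with `FS="\n"`, which makes every line of a blank-line-separated record its own field. A minimal sketch, assuming a hypothetical `addresses.txt` whose entries are separated by blank lines:

```bash
# Blank-line-separated records; with FS="\n", each line within a record is a field
awk -v RS="" -v FS="\n" '{print "Record", NR, "has", NF, "lines; first line:", $1}' addresses.txt
```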
---

## **8. Real-World Recipes**

### **Log Analysis**

```bash
# Top 10 most frequent IPs in access.log
awk '{ip[$1]++} END {for (i in ip) print ip[i], i}' access.log | sort -nr | head
```

### **Data Transformation**

```awk
# Convert TSV to CSV
BEGIN {FS="\t"; OFS=","}
{$1=$1; print}   # $1=$1 forces the record to be rebuilt with the new OFS
```

### **Column Statistics**

```awk
# Compute column averages (row 1 is treated as a header)
NR>1 {for (i=1; i<=NF; i++) sum[i]+=$i}
END  {for (i=1; i<=NF; i++) print "Col", i, "avg:", sum[i]/(NR-1)}
```

### **JSON Generation**

```awk
# Emit a JSON array of objects; a comma precedes every record except the first
BEGIN {FS=","; print "["}
{printf "%s  {\"name\":\"%s\",\"value\":%s}", (NR > 1 ? ",\n" : ""), $1, $2}
END {print "\n]"}
```

---

## **9. Performance & Limitations**

### **Optimization Tips**

```bash
LC_ALL=C awk ...              # Bytewise locale: often a 2-3x speedup on ASCII data
mawk -f script.awk file.txt   # mawk is a faster alternative to gawk for large datasets
```

### **When Not to Use AWK**

- Binary data processing
- Complex nested data structures (use `jq` for JSON)
- Multi-gigabyte files where a single pass is too slow (consider `split` + `parallel`)

### **AWK vs Alternatives**

| Task | Best Tool |
|------|-----------|
| Columnar data | AWK |
| JSON/XML | `jq`/`xq` |
| Complex stats | R/Python |
| Multi-file joins | SQLite |

---

## **Pro Techniques**

### **Self-Contained Scripts**

```bash
#!/usr/bin/awk -f
BEGIN {print "Starting processing"}
/pattern/ {count++}
END {print "Found", count, "matches"}
```

### **Two-File Processing**

```awk
# Join two files on the first field
NR==FNR {a[$1]=$2; next}
$1 in a {print $0, a[$1]}
```

### **Bit Manipulation**

```awk
# and() and lshift() are gawk extensions
function is_set(x, bit) {return and(x, lshift(1, bit-1))}
```

---

## **Further Learning**

- **Books**: "Effective AWK Programming" (the GNU AWK manual)
- **Cheat Sheets**: [awkcheatsheet.com](https://awkcheatsheet.com)
- **Practice**: [exercism.org/tracks/awk](https://exercism.org/tracks/awk)

**Need an AWK solution?** Describe your data format and desired transformation!