The Ultimate AWK Guide: From Basics to Advanced Data Wrangling
Table of Contents
- AWK Fundamentals
- Patterns & Actions
- Built-in Variables
- Arrays & Data Structures
- Control Flow
- Functions & Math
- Advanced Text Processing
- Real-World Recipes
- Performance & Limitations
1. AWK Fundamentals
Core Syntax
awk [OPTIONS] 'PATTERN {ACTION}' file.txt
Basic Structure
BEGIN { /* pre-processing */ }
PATTERN { /* line processing */ }
END { /* post-processing */ }
Common Flags
awk -F: '{print $1}' /etc/passwd # Set field separator
awk -v var=value '...' # Pass variables
awk -f script.awk file.txt # Use script file
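Combining these flags, a minimal runnable example (input supplied inline rather than from /etc/passwd; the `tag` variable is invented for illustration):

```shell
# Set ':' as the field separator and pass a shell value in via -v
printf 'root:x:0\nalice:x:1000\n' \
  | awk -F: -v tag=user '{print tag, $1}'
# prints:
# user root
# user alice
```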
2. Patterns & Actions
Pattern Types
/error/ {print} # Regex match
$3 > 100 {print $1} # Field comparison
NR == 1 {print} # Line number
BEGINFILE {print "Processing:", FILENAME} # Per-file (gawk only)
Special Patterns
BEGIN {FS=":"; OFS="\t"} # Set input/output separators
END {print "Total lines:", NR} # Final processing
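A field-comparison pattern together with BEGIN/END, end to end (the sample rows and the 100 threshold are made up for illustration):

```shell
printf 'alpha 1 50\nbeta 2 150\n' | awk '
  BEGIN {print "scanning..."}
  $3 > 100 {print $1}
  END {print "Total lines:", NR}'
# prints:
# scanning...
# beta
# Total lines: 2
```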
3. Built-in Variables
| Variable | Description |
|---|---|
| NR | Current record number |
| NF | Number of fields |
| FS | Field separator (default: whitespace) |
| OFS | Output field separator |
| FILENAME | Current file name |
| FNR | Record number per file |
Example
awk '{print NR, NF, $0}' file.txt # Show line stats
4. Arrays & Data Structures
Associative Arrays
{count[$1]++} # Count occurrences
END {for (key in count) print key, count[key]}
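The counting idiom end to end (piped through sort because for-in iteration order is unspecified):

```shell
printf 'apple\nbanana\napple\n' \
  | awk '{count[$1]++} END {for (k in count) print k, count[k]}' \
  | sort
# prints:
# apple 2
# banana 1
```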
Multi-Dimensional Arrays
{array[$1,$2] = $3} # Fake multi-dim
Array Functions
split(string, array, separator)
asort(array) # Sort by value (gawk only)
asorti(array) # Sort by index (gawk only)
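split() in action; unlike asort()/asorti(), it is standard awk and runs everywhere:

```shell
# split() returns the number of pieces and fills the array (1-indexed)
echo 'a:b:c' | awk '{n = split($0, parts, ":"); print n, parts[2]}'
# prints: 3 b
```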
5. Control Flow
Conditionals
{if ($3 > 100) print "High:", $1
else print "Low:", $1}
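A runnable version of the conditional (sample rows invented for illustration):

```shell
printf 'disk 0 150\ncpu 0 50\n' \
  | awk '{if ($3 > 100) print "High:", $1; else print "Low:", $1}'
# prints:
# High: disk
# Low: cpu
```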
Loops
for (i=1; i<=NF; i++) {print $i} # Fields
for (key in array) {print key} # Array keys
Switch/Case (gawk only)
switch($1) {
case "foo": print "Found foo"; break
case /^bar/: print "Starts with bar"; break
default: print "Other"
}
6. Functions & Math
Built-in Functions
length($0) # String length
sub(/old/, "new", $1) # In-field substitution
system("date") # Run shell command
Math Operations
{sum += $3; sumsq += ($3)^2}
END {print "Mean:", sum/NR, "Std Dev:", sqrt(sumsq/NR - (sum/NR)^2)}
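Note that this running-sum formula gives the *population* standard deviation. With the sample values below (read from column 1 rather than $3, purely for illustration), the mean is 5 and the deviation is 2:

```shell
printf '2\n4\n4\n4\n5\n5\n7\n9\n' | awk '
  {sum += $1; sumsq += $1^2}
  END {m = sum/NR; print "Mean:", m, "Std Dev:", sqrt(sumsq/NR - m^2)}'
# prints: Mean: 5 Std Dev: 2
```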
User Functions
function double(x) {return x*2}
{d = double($1); print d}
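The user function as a one-liner:

```shell
printf '3\n7\n' | awk 'function double(x) {return x*2} {print double($1)}'
# prints:
# 6
# 14
```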
7. Advanced Text Processing
Field Manipulation
{$1 = toupper($1); $NF = $NF "%"} # Modify fields
Multi-Line Records
awk -v RS="" '{print $1}' # Paragraph mode
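Paragraph mode treats each blank-line-separated block as one record, so $1 becomes the first word of each block:

```shell
printf 'alpha beta\n\ngamma delta\n' | awk -v RS= '{print $1}'
# prints:
# alpha
# gamma
```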
CSV Processing
awk -v FPAT='([^,]+)|("[^"]+")' '{print $2}' data.csv # FPAT is a gawk variable, set with -v (there is no -FPAT flag)
8. Real-World Recipes
Log Analysis
# Top 10 frequent IPs in access.log
awk '{ip[$1]++} END {for (i in ip) print ip[i], i}' access.log | sort -nr | head
Data Transformation
# Convert TSV to CSV
BEGIN {FS="\t"; OFS=","} {$1=$1; print}
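The `$1=$1` assignment forces awk to rebuild the record with the new OFS; without it, `print` would emit the original tabs:

```shell
printf 'a\tb\tc\n' | awk 'BEGIN {FS="\t"; OFS=","} {$1=$1; print}'
# prints: a,b,c
```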
Column Statistics
# Compute column averages
NR>1 {for(i=1; i<=NF; i++) sum[i]+=$i}
END {for(i=1; i<=NF; i++) print "Col", i, "avg:", sum[i]/(NR-1)}
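With a header row and two data rows (made-up numbers), the averages recipe gives (note that NF in the END block is taken from the last record read):

```shell
printf 'x y\n1 10\n3 20\n' | awk '
  NR>1 {for(i=1; i<=NF; i++) sum[i]+=$i}
  END {for(i=1; i<=NF; i++) print "Col", i, "avg:", sum[i]/(NR-1)}'
# prints:
# Col 1 avg: 2
# Col 2 avg: 15
```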
JSON Generation
BEGIN {print "["; FS=","}
{printf "%s  {\"name\":\"%s\",\"value\":%s}", (NR>1 ? ",\n" : ""), $1, $2}
END {print ""; print "]"}
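A runnable sketch of this recipe: printing the comma *before* every record after the first (keyed on NR>1) guarantees the last record never gets a trailing comma, so the output is valid JSON:

```shell
printf 'foo,1\nbar,2\n' | awk '
  BEGIN {print "["; FS=","}
  {printf "%s  {\"name\":\"%s\",\"value\":%s}", (NR>1 ? ",\n" : ""), $1, $2}
  END {print ""; print "]"}'
# prints:
# [
#   {"name":"foo","value":1},
#   {"name":"bar","value":2}
# ]
```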
9. Performance & Limitations
Optimization Tips
LC_ALL=C awk ... # Bypasses locale handling; often a 2-3x speedup on ASCII-only data
mawk (faster alternative to gawk) # For large datasets
When Not to Use AWK
- Binary data processing
- Complex nested data structures (use jq for JSON)
- Multi-gigabyte files (consider split + parallel)
AWK vs Alternatives
| Task | Best Tool |
|---|---|
| Columnar data | AWK |
| JSON/XML | jq/xq |
| Complex stats | R/Python |
| Multi-file joins | SQLite |
Pro Techniques
Self-Contained Scripts
#!/usr/bin/awk -f
BEGIN {print "Starting processing"}
/pattern/ {count++}
END {print "Found", count, "matches"}
Two-File Processing
# Join two files on first field
NR==FNR {a[$1]=$2; next}
$1 in a {print $0, a[$1]}
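With two small files (hypothetical paths under /tmp), the join looks like:

```shell
printf '1 apple\n2 banana\n' > /tmp/lookup.txt   # key -> fruit
printf '1 red\n2 yellow\n'  > /tmp/colors.txt    # key -> color
# While reading file 1, NR==FNR: cache it; then enrich file 2's lines
awk 'NR==FNR {a[$1]=$2; next} $1 in a {print $0, a[$1]}' /tmp/lookup.txt /tmp/colors.txt
# prints:
# 1 red apple
# 2 yellow banana
```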
Bit Manipulation
function is_set(x,bit) {return and(x, lshift(1, bit-1))} # and()/lshift() are gawk built-ins
Further Learning
- Books: "Effective AWK Programming" (GNU AWK manual)
- Cheat Sheets: awkcheatsheet.com
- Practice: exercism.org/tracks/awk