246 lines
5.1 KiB
Markdown
246 lines
5.1 KiB
Markdown
# **The Ultimate AWK Guide: From Basics to Advanced Data Wrangling**
|
|
|
|
## **Table of Contents**
|
|
1. [AWK Fundamentals](#1-awk-fundamentals)
|
|
2. [Patterns & Actions](#2-patterns--actions)
|
|
3. [Built-in Variables](#3-built-in-variables)
|
|
4. [Arrays & Data Structures](#4-arrays--data-structures)
|
|
5. [Control Flow](#5-control-flow)
|
|
6. [Functions & Math](#6-functions--math)
|
|
7. [Advanced Text Processing](#7-advanced-text-processing)
|
|
8. [Real-World Recipes](#8-real-world-recipes)
|
|
9. [Performance & Limitations](#9-performance--limitations)
|
|
|
|
---
|
|
|
|
## **1. AWK Fundamentals**
|
|
|
|
### **Core Syntax**
|
|
```bash
|
|
awk [OPTIONS] 'PATTERN {ACTION}' file.txt
|
|
```
|
|
|
|
### **Basic Structure**
|
|
```awk
|
|
BEGIN { /* pre-processing */ }
|
|
PATTERN { /* line processing */ }
|
|
END { /* post-processing */ }
|
|
```
|
|
|
|
### **Common Flags**
|
|
```bash
|
|
awk -F: '{print $1}' /etc/passwd # Set field separator
|
|
awk -v var=value '...' # Pass variables
|
|
awk -f script.awk file.txt # Use script file
|
|
```
|
|
|
|
---
|
|
|
|
## **2. Patterns & Actions**
|
|
|
|
### **Pattern Types**
|
|
```awk
|
|
/error/ {print} # Regex match
|
|
$3 > 100 {print $1} # Field comparison
|
|
NR == 1 {print} # Line number
|
|
BEGINFILE {print "Processing:", FILENAME} # Per-file
|
|
```
|
|
|
|
### **Special Patterns**
|
|
```awk
|
|
BEGIN {FS=":"; OFS="\t"} # Set input/output separators
|
|
END {print "Total lines:", NR} # Final processing
|
|
```
|
|
|
|
---
|
|
|
|
## **3. Built-in Variables**
|
|
|
|
| Variable | Description |
|
|
|----------|-------------|
|
|
| `NR` | Current record number |
|
|
| `NF` | Number of fields |
|
|
| `FS` | Field separator (default: whitespace) |
|
|
| `OFS` | Output field separator |
|
|
| `FILENAME` | Current file name |
|
|
| `FNR` | Record number per file |
|
|
|
|
### **Example**
|
|
```awk
|
|
awk '{print NR, NF, $0}' file.txt # Show line stats
|
|
```
|
|
|
|
---
|
|
|
|
## **4. Arrays & Data Structures**
|
|
|
|
### **Associative Arrays**
|
|
```awk
|
|
{count[$1]++} # Count occurrences
|
|
END {for (key in count) print key, count[key]}
|
|
```
|
|
|
|
### **Multi-Dimensional Arrays**
|
|
```awk
|
|
{array[$1,$2] = $3} # Fake multi-dim
|
|
```
|
|
|
|
### **Array Functions**
|
|
```awk
|
|
split(string, array, separator)
|
|
asort(array) # Sort by value
|
|
asorti(array) # Sort by index
|
|
```
|
|
|
|
---
|
|
|
|
## **5. Control Flow**
|
|
|
|
### **Conditionals**
|
|
```awk
|
|
{if ($3 > 100) print "High:", $1
|
|
else print "Low:", $1}
|
|
```
|
|
|
|
### **Loops**
|
|
```awk
|
|
for (i=1; i<=NF; i++) {print $i} # Fields
|
|
for (key in array) {print key} # Array keys
|
|
```
|
|
|
|
### **Switch/Case**
|
|
```awk
|
|
switch($1) {
|
|
case "foo": print "Found foo"; break
|
|
case /^bar/: print "Starts with bar"; break
|
|
default: print "Other"
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## **6. Functions & Math**
|
|
|
|
### **Built-in Functions**
|
|
```awk
|
|
length($0) # String length
|
|
sub(/old/, "new", $1) # In-field substitution
|
|
system("date") # Run shell command
|
|
```
|
|
|
|
### **Math Operations**
|
|
```awk
|
|
{sum += $3; sumsq += ($3)^2}
|
|
END {print "Mean:", sum/NR, "Std Dev:", sqrt(sumsq/NR - (sum/NR)^2)}
|
|
```
|
|
|
|
### **User Functions**
|
|
```awk
|
|
function double(x) {return x*2}
|
|
{d = double($1); print d}
|
|
```
|
|
|
|
---
|
|
|
|
## **7. Advanced Text Processing**
|
|
|
|
### **Field Manipulation**
|
|
```awk
|
|
{$1 = toupper($1); $NF = $NF "%"} # Modify fields
|
|
```
|
|
|
|
### **Multi-Line Records**
|
|
```bash
|
|
awk -v RS="" '{print $1}' # Paragraph mode
|
|
```
|
|
|
|
### **CSV Processing**
|
|
```bash
|
|
awk -FPAT='([^,]+)|("[^"]+")' '{print $2}' data.csv
|
|
```
|
|
|
|
---
|
|
|
|
## **8. Real-World Recipes**
|
|
|
|
### **Log Analysis**
|
|
```awk
|
|
# Top 10 frequent IPs in access.log
|
|
awk '{ip[$1]++} END {for (i in ip) print ip[i], i}' access.log | sort -nr | head
|
|
```
|
|
|
|
### **Data Transformation**
|
|
```awk
|
|
# Convert TSV to CSV
|
|
BEGIN {FS="\t"; OFS=","} {$1=$1; print}
|
|
```
|
|
|
|
### **Column Statistics**
|
|
```awk
|
|
# Compute column averages
|
|
NR>1 {for(i=1; i<=NF; i++) sum[i]+=$i}
|
|
END {for(i=1; i<=NF; i++) print "Col", i, "avg:", sum[i]/(NR-1)}
|
|
```
|
|
|
|
### **JSON Generation**
|
|
```awk
|
|
BEGIN {print "["; FS=","}
|
|
{printf " {\"name\":\"%s\",\"value\":%s}%s\n", $1, $2, (NR==FNR?",":"")}
|
|
END {print "]"}
|
|
```
|
|
|
|
---
|
|
|
|
## **9. Performance & Limitations**
|
|
|
|
### **Optimization Tips**
|
|
```bash
|
|
LC_ALL=C awk ... # 2-3x speedup for ASCII
|
|
mawk (faster alternative to gawk) # For large datasets
|
|
```
|
|
|
|
### **When Not to Use AWK**
|
|
- Binary data processing
|
|
- Complex nested data structures (use `jq` for JSON)
|
|
- Multi-gigabyte files (consider `split + parallel`)
|
|
|
|
### **AWK vs Alternatives**
|
|
| Task | Best Tool |
|
|
|------|-----------|
|
|
| Columnar data | AWK |
|
|
| JSON/XML | `jq`/`xq` |
|
|
| Complex stats | R/Python |
|
|
| Multi-file joins | SQLite |
|
|
|
|
---
|
|
|
|
## **Pro Techniques**
|
|
|
|
### **Self-Contained Scripts**
|
|
```bash
|
|
#!/usr/bin/awk -f
|
|
BEGIN {print "Starting processing"}
|
|
/pattern/ {count++}
|
|
END {print "Found", count, "matches"}
|
|
```
|
|
|
|
### **Two-File Processing**
|
|
```awk
|
|
# Join two files on first field
|
|
NR==FNR {a[$1]=$2; next}
|
|
$1 in a {print $0, a[$1]}
|
|
```
|
|
|
|
### **Bit Manipulation**
|
|
```awk
|
|
function is_set(x,bit) {return and(x, lshift(1, bit-1))}
|
|
```
|
|
|
|
---
|
|
|
|
## **Further Learning**
|
|
- **Books**: "Effective AWK Programming" (GNU AWK manual)
|
|
- **Cheat Sheets**: [awkcheatsheet.com](https://awkcheatsheet.com)
|
|
- **Practice**: [exercism.org/tracks/awk](https://exercism.org/tracks/awk)
|
|
|
|
**Need an AWK solution?** Describe your data format and desired transformation! |