
The Ultimate AWK Guide: From Basics to Advanced Data Wrangling

Table of Contents

  1. AWK Fundamentals
  2. Patterns & Actions
  3. Built-in Variables
  4. Arrays & Data Structures
  5. Control Flow
  6. Functions & Math
  7. Advanced Text Processing
  8. Real-World Recipes
  9. Performance & Limitations

1. AWK Fundamentals

Core Syntax

awk [OPTIONS] 'PATTERN {ACTION}' file.txt

Basic Structure

BEGIN { /* pre-processing */ }
PATTERN { /* line processing */ }
END { /* post-processing */ }
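
The three stages compose into complete one-liners. As a minimal runnable sketch (the sample numbers are made up), this sums a column read from stdin:

```shell
# Sum the numbers on stdin: BEGIN initializes, the body
# accumulates per record, END reports the result.
printf '3\n4\n5\n' | awk '
BEGIN { total = 0 }        # pre-processing: initialize
      { total += $1 }      # line processing: accumulate
END   { print total }      # post-processing: report
'
# → 12
```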

Common Flags

awk -F: '{print $1}' /etc/passwd  # Set field separator
awk -v var=value '...'            # Pass variables
awk -f script.awk file.txt        # Use script file

2. Patterns & Actions

Pattern Types

/error/ {print}                  # Regex match
$3 > 100 {print $1}              # Field comparison
NR == 1 {print}                  # Line number
BEGINFILE {print "Processing:", FILENAME}  # Per-file (gawk only)
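
Multiple patterns can coexist in one program, and each record is tested against all of them. A small sketch against made-up log-like input:

```shell
# Regex and comparison patterns applied in the same program;
# a record can trigger more than one rule.
printf 'ok 10\nerror 200\nok 300\n' | awk '
/error/  { print "matched:", $0 }   # regex pattern
$2 > 100 { print "large:", $1 }     # comparison pattern
'
```

The second input line matches both rules, so it produces two output lines.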

Special Patterns

BEGIN {FS=":"; OFS="\t"}         # Set input/output separators
END {print "Total lines:", NR}    # Final processing

3. Built-in Variables

Variable  Description
--------  ------------------------------------------
NR        Current record number (across all inputs)
NF        Number of fields in the current record
FS        Input field separator (default: whitespace)
OFS       Output field separator (default: space)
FILENAME  Name of the current input file
FNR       Record number within the current file

Example

awk '{print NR, NF, $0}' file.txt  # Show line stats
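
The NR/FNR distinction is easiest to see across two inputs. A sketch using two throwaway files (the names f1.txt and f2.txt are arbitrary):

```shell
# NR counts records across all inputs; FNR resets per file.
printf 'a\nb\n' > f1.txt
printf 'c\n'    > f2.txt
awk '{ print FILENAME, FNR, NR }' f1.txt f2.txt
rm -f f1.txt f2.txt
# → f1.txt 1 1
# → f1.txt 2 2
# → f2.txt 1 3
```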

4. Arrays & Data Structures

Associative Arrays

{count[$1]++}                     # Count occurrences
END {for (key in count) print key, count[key]}
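
A runnable sketch of the counting idiom (the fruit names are arbitrary; output is piped through sort because for-in iteration order is unspecified):

```shell
# Count occurrences of the first field, then report.
printf 'apple\nbanana\napple\n' | awk '
{ count[$1]++ }
END { for (key in count) print key, count[key] }
' | sort
# → apple 2
# → banana 1
```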

Multi-Dimensional Arrays

{array[$1,$2] = $3}               # Fake multi-dim

Array Functions

split(string, array, separator)
asort(array)                      # Sort by value (gawk only)
asorti(array)                     # Sort by index (gawk only)
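
split() is the portable workhorse here: it fills an array from a string and returns the number of elements. A quick sketch on a colon-delimited string:

```shell
# split() breaks a string into an array and returns the count.
echo 'a:b:c' | awk '{
  n = split($0, parts, ":")
  print n, parts[1], parts[3]
}'
# → 3 a c
```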

5. Control Flow

Conditionals

{if ($3 > 100) print "High:", $1
 else print "Low:", $1}

Loops

for (i=1; i<=NF; i++) {print $i}  # Fields
for (key in array) {print key}     # Array keys
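
The field loop also runs backwards; a sketch that reverses the words on each line:

```shell
# Print fields in reverse order, separated by OFS,
# terminating the line with ORS after the last one.
echo 'one two three' | awk '{
  for (i = NF; i >= 1; i--) printf "%s%s", $i, (i > 1 ? OFS : ORS)
}'
# → three two one
```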

Switch/Case

switch ($1) {                     # gawk only
  case "foo": print "Found foo"; break
  case /^bar/: print "Starts with bar"; break
  default: print "Other"
}

6. Functions & Math

Built-in Functions

length($0)                        # String length
sub(/old/, "new", $1)             # In-field substitution
system("date")                    # Run shell command
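
gsub() is the global counterpart of sub(); both return the number of substitutions made, which is often useful on its own. A sketch:

```shell
# gsub() replaces every match in $0 (by default) and
# returns how many replacements were made.
echo 'foo foo foo' | awk '{
  n = gsub(/foo/, "bar")
  print n, $0
}'
# → 3 bar bar bar
```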

Math Operations

{sum += $3; sumsq += ($3)^2}
END {print "Mean:", sum/NR, "Std Dev:", sqrt(sumsq/NR - (sum/NR)^2)}
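
The same accumulator pattern run end to end on a small made-up sample. Note this computes the population standard deviation (dividing by NR, not NR-1):

```shell
# Mean and population standard deviation of column 1.
printf '2\n4\n4\n4\n5\n5\n7\n9\n' | awk '
{ sum += $1; sumsq += $1 * $1 }
END {
  mean = sum / NR
  print "Mean:", mean, "Std Dev:", sqrt(sumsq / NR - mean * mean)
}'
# → Mean: 5 Std Dev: 2
```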

User Functions

function double(x) {return x*2}
{d = double($1); print d}
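
A runnable sketch of the same user-defined function (the input value is arbitrary):

```shell
# User-defined functions take parameters and return values;
# they can appear before or after the rules that call them.
echo '21' | awk '
function double(x) { return x * 2 }
{ print double($1) }'
# → 42
```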

7. Advanced Text Processing

Field Manipulation

{$1 = toupper($1); $NF = $NF "%"}  # Modify fields
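
Assigning to any field causes awk to rebuild $0 with OFS, which the following sketch makes visible (the sample record is invented):

```shell
# Field assignments rewrite the record in place.
echo 'cpu 95' | awk '{ $1 = toupper($1); $NF = $NF "%"; print }'
# → CPU 95%
```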

Multi-Line Records

awk -v RS="" '{print $1}'          # Paragraph mode
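
A sketch of paragraph mode on a two-record, blank-line-separated input (names invented). With RS="", blank lines separate records and newlines inside a record act as field separators:

```shell
# Paragraph mode: each blank-line-delimited block is one record,
# so $1 is the first line of each block.
printf 'alice\n30\n\nbob\n25\n' | awk -v RS="" '{ print $1 }'
# → alice
# → bob
```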

CSV Processing

gawk -v FPAT='([^,]+)|("[^"]+")' '{print $2}' data.csv  # FPAT is gawk-only

8. Real-World Recipes

Log Analysis

# Top 10 frequent IPs in access.log
awk '{ip[$1]++} END {for (i in ip) print ip[i], i}' access.log | sort -nr | head

Data Transformation

# Convert TSV to CSV
BEGIN {FS="\t"; OFS=","} {$1=$1; print}
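
The otherwise-pointless `$1=$1` assignment is what forces awk to rebuild the record with the new OFS; without it, an unmodified record prints with its original separators. A runnable sketch:

```shell
# TSV in, CSV out: touching a field triggers the OFS rebuild.
printf 'a\tb\tc\n' | awk 'BEGIN { FS = "\t"; OFS = "," } { $1 = $1; print }'
# → a,b,c
```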

Column Statistics

# Compute column averages
NR>1 {for(i=1; i<=NF; i++) sum[i]+=$i}
END {for(i=1; i<=NF; i++) print "Col", i, "avg:", sum[i]/(NR-1)}
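
The recipe run against a tiny made-up table with a header row (NR>1 skips the header; NR-1 in END is the number of data rows):

```shell
# Per-column averages over the data rows.
printf 'x y\n1 10\n3 20\n' | awk '
NR > 1 { for (i = 1; i <= NF; i++) sum[i] += $i }
END { for (i = 1; i <= NF; i++) print "Col", i, "avg:", sum[i] / (NR - 1) }'
# → Col 1 avg: 2
# → Col 2 avg: 15
```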

JSON Generation

BEGIN {print "["; FS=","}
NR > 1 {print ","}                # comma before every record but the first
{printf "  {\"name\":\"%s\",\"value\":%s}", $1, $2}
END {print "\n]"}
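
A trailing comma after the last record would make the output invalid JSON, so this runnable sketch (sample rows invented) emits the separator *before* every record except the first:

```shell
# CSV rows to a JSON array, with no trailing comma.
printf 'a,1\nb,2\n' | awk '
BEGIN { print "["; FS = "," }
NR > 1 { print "," }
{ printf "  {\"name\":\"%s\",\"value\":%s}", $1, $2 }
END { print "\n]" }'
```

Note this naive sketch does not escape quotes or backslashes inside fields; for real JSON, jq is the safer tool.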

9. Performance & Limitations

Optimization Tips

LC_ALL=C awk ...                  # Bytewise C locale: often a substantial speedup on ASCII data
mawk -f script.awk file.txt       # mawk: a faster AWK implementation, useful for large datasets

When Not to Use AWK

  • Binary data processing
  • Complex nested data structures (use jq for JSON)
  • Multi-gigabyte files (consider split + parallel)

AWK vs Alternatives

Task              Best Tool
----------------  ---------
Columnar data     AWK
JSON/XML          jq/xq
Complex stats     R/Python
Multi-file joins  SQLite

Pro Techniques

Self-Contained Scripts

#!/usr/bin/awk -f
BEGIN {print "Starting processing"}
/pattern/ {count++}
END {print "Found", count, "matches"}

Two-File Processing

# Join two files on first field
NR==FNR {a[$1]=$2; next} 
$1 in a {print $0, a[$1]}
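
The join idiom run against two throwaway files (the names lookup.txt and data.txt are arbitrary). NR==FNR is true only while the first file is being read, so its rows build the lookup table; rows of the second file print only when their key was seen:

```shell
# Inner join on the first field of two files.
printf '1 alice\n2 bob\n' > lookup.txt
printf '1 100\n3 300\n'   > data.txt
awk 'NR == FNR { a[$1] = $2; next } $1 in a { print $0, a[$1] }' lookup.txt data.txt
rm -f lookup.txt data.txt
# → 1 100 alice
```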

Bit Manipulation

function is_set(x, bit) {return and(x, lshift(1, bit-1))}  # and()/lshift() are gawk-only

Further Learning

  • "The AWK Programming Language" by Aho, Kernighan, and Weinberger
  • The GNU Awk User's Guide (the gawk manual)
  • The POSIX awk specification (the portable subset of the language)