From 8ddf8dba7244b50e81f5f99e25a040cd38b39b9c Mon Sep 17 00:00:00 2001
From: medusa
Date: Wed, 25 Jun 2025 04:55:44 +0000
Subject: [PATCH] Add tech_docs/pattern_matching.md

---
 tech_docs/pattern_matching.md | 139 ++++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)
 create mode 100644 tech_docs/pattern_matching.md

diff --git a/tech_docs/pattern_matching.md b/tech_docs/pattern_matching.md
new file mode 100644
index 0000000..cd3cf08
--- /dev/null
+++ b/tech_docs/pattern_matching.md
@@ -0,0 +1,139 @@
+# Pattern Matching
+
+---
+
+### **Expanded Key Takeaways: Choosing the Right Tool for Pattern Matching**
+
+Regular expressions (regex) are powerful, but they’re not always the best tool for every text-processing task. Below is an **expanded breakdown** of when to use regex versus the alternatives, along with context and real-world examples.
+
+---
+
+## **1. Regex is Best for Medium-Complexity Text Patterns**
+**Context**:
+- Regex excels at flexible, rule-based matching (e.g., email validation, log filtering).
+- It balances expressiveness and readability for moderately complex cases.
+
+**When to Use**:
+✔ Extracting structured data (e.g., `\d{3}-\d{2}-\d{4}` for SSNs).
+✔ Finding variable patterns (e.g., `https?://[^\s]+` for URLs).
+✔ Replacing substrings following a rule (e.g., `s/\bcolour\b/color/g`).
+
+**Limitations**:
+❌ Becomes unreadable for very complex rules (e.g., nested brackets).
+❌ Poor at recursive patterns (e.g., matching nested HTML tags).
+
+**Example**:
+```python
+import re
+# Extract phone numbers in the format (XXX) XXX-XXXX
+text = "Call (123) 456-7890 or (987) 654-3210"
+phones = re.findall(r'\(\d{3}\) \d{3}-\d{4}', text)
+# Result: ['(123) 456-7890', '(987) 654-3210']
+```
+
+---
+
+## **2. For Simple Tasks, Built-in String Methods Are Cleaner**
+**Context**:
+- If the task is **exact matching** or **fixed-format parsing**, avoid the regex overhead.
+
+**When to Use**:
+✔ Checking prefixes/suffixes (`str.startswith()`, `str.endswith()`).
+✔ Exact substring search (the `in` operator, `str.find()`).
+✔ Splitting on fixed delimiters (`str.split(',')`).
+
+**Example**:
+```python
+# Check if a filename ends with .csv (simpler than regex)
+filename = "data_2024.csv"
+if filename.endswith(".csv"):
+    print("CSV file detected.")
+```
+
+---
+
+## **3. For Recursive/Nested Patterns, Use Grammars or Parsers**
+**Context**:
+- Regex **cannot** handle recursive structures (e.g., JSON, XML, math expressions).
+- **Formal grammars** (e.g., context-free grammars) or **parser combinators** are needed (see the sketch after the `lxml` example below).
+
+**When to Use**:
+✔ Parsing programming languages.
+✔ Extracting nested data (e.g., HTML/XML).
+✔ Validating structured documents.
+
+**Example (Using `lxml` for HTML)**:
+```python
+from lxml import html
+doc = html.fromstring("<div><p>Hello world</p></div>")
+text = doc.xpath("//p//text()")  # Gets ['Hello world']
+```
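+
+The HTML example works because `lxml` already implements the grammar. For recursive input with no ready-made parser, such as the nested math expressions mentioned above, a small hand-written grammar does what a single regex cannot. The sketch below is illustrative only: the `evaluate` helper is not from any library, it assumes nothing beyond the standard `re` module for tokenizing, and it implements the grammar `expr := term ('+' term)*`, `term := factor ('*' factor)*`, `factor := NUMBER | '(' expr ')'`.
+
+**Example (Hand-rolled recursive descent, illustrative sketch)**:
+```python
+import re
+
+def evaluate(expression):
+    """Evaluate +, * and parentheses with a tiny recursive-descent parser."""
+    # Illustrative sketch, not a library API: a regex handles the flat token
+    # layer, while the recursion lives in the nested parsing functions.
+    tokens = re.findall(r'\d+|[()+*]', expression)
+    pos = 0
+
+    def expr():    # expr := term ('+' term)*
+        nonlocal pos
+        value = term()
+        while pos < len(tokens) and tokens[pos] == '+':
+            pos += 1
+            value += term()
+        return value
+
+    def term():    # term := factor ('*' factor)*
+        nonlocal pos
+        value = factor()
+        while pos < len(tokens) and tokens[pos] == '*':
+            pos += 1
+            value *= factor()
+        return value
+
+    def factor():  # factor := NUMBER | '(' expr ')'
+        nonlocal pos
+        if tokens[pos] == '(':
+            pos += 1        # consume '('
+            value = expr()  # recurse: this is the part regex cannot express
+            pos += 1        # consume ')'
+            return value
+        value = int(tokens[pos])
+        pos += 1
+        return value
+
+    return expr()
+
+print(evaluate("(1 + (2 * 3))"))  # 7
+```
+
+Tools such as ANTLR or a PEG library derive the same recursive structure from a declarative grammar; the hand-rolled version simply makes the recursion, and the division of labor between tokenizer and parser, explicit.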
") +text = doc.xpath("//p//text()") # Gets "Hello world" +``` + +--- + +## **4. Automata Are Theoretical Foundations (Rarely Hand-Coded)** +**Context**: +- Finite State Machines (FSMs) underpin regex but are **not practical to write manually** for most tasks. +- Useful for **educational purposes** or **low-level optimizations** (e.g., lexers). + +**When to Use**: +✔ Teaching how regex works internally. +✔ Writing ultra-efficient tokenizers (e.g., in compiler design). + +**Example (Toy FSM for `ab*c`)**: +```python +def is_ab_star_c(s): + state = 0 + for char in s: + if state == 0 and char == 'a': + state = 1 + elif state == 1 and char == 'b': + continue + elif state == 1 and char == 'c': + state = 2 + else: + return False + return state == 2 +``` + +--- + +## **5. For High-Performance Tokenizing, Use Lex/Flex** +**Context**: +- **Lex/Flex** generate **optimized C code** for pattern matching. +- Used in compilers (e.g., `gcc`, `clang`) for speed. + +**When to Use**: +✔ Building custom programming languages. +✔ Processing large log files efficiently. + +**Example (Lex Rule for Words and Numbers)**: +```lex +%% +[a-zA-Z]+ { printf("WORD: %s\n", yytext); } +[0-9]+ { printf("NUMBER: %s\n", yytext); } +%% +``` + +--- + +## **Task-to-Tool Decision Table** +| **Task** | **Best Tool** | **Example** | +|-----------------------------------|-----------------------------|--------------------------------------| +| Exact substring match | `str.contains()`, `str.find()` | `"error 404".find("404")` | +| Prefix/suffix check | `str.startswith()`/`endswith()` | `filename.endswith(".csv")` | +| Medium-complexity patterns | **Regex** | `re.findall(r'\b[A-Z]\w+', text)` | +| Nested structures (HTML/XML) | **Parsers (lxml, BeautifulSoup)** | `xpath("//div//p/text()")` | +| Recursive patterns (e.g., math) | **Grammars (ANTLR, PEG)** | Parsing `(1 + (2 * 3))` | +| High-speed tokenizing (e.g., logs)| **Lex/Flex** | Lex rules for Apache log parsing | +| Educational/state logic | **Finite State Machines** | Implementing `ab*c` manually | + +--- + +### **Final Advice** +- **Use regex** for flexible, non-recursive text patterns. +- **Use string methods** for trivial checks (faster, more readable). +- **Use parsers** for nested/structured data (HTML, code). +- **Use Lex/Flex** for maximum performance in tokenizers. + +Would you like a case study comparing these tools on a real-world problem (e.g., log parsing)? \ No newline at end of file