From 8ddf8dba7244b50e81f5f99e25a040cd38b39b9c Mon Sep 17 00:00:00 2001
From: medusa
Date: Wed, 25 Jun 2025 04:55:44 +0000
Subject: [PATCH] Add tech_docs/pattern_matching.md

---
 tech_docs/pattern_matching.md | 139 ++++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)
 create mode 100644 tech_docs/pattern_matching.md

diff --git a/tech_docs/pattern_matching.md b/tech_docs/pattern_matching.md
new file mode 100644
index 0000000..cd3cf08
--- /dev/null
+++ b/tech_docs/pattern_matching.md
@@ -0,0 +1,139 @@
+# Pattern Matching
+
+---
+
+### **Expanded Key Takeaways: Choosing the Right Tool for Pattern Matching**
+
+Regular expressions (regex) are powerful, but they’re not always the best tool for every text-processing task. Below is an **expanded breakdown** of when to use regex versus the alternatives, along with context and real-world examples.
+
+---
+
+## **1. Regex is Best for Medium-Complexity Text Patterns**
+**Context**:
+- Regex excels at flexible, rule-based matching (e.g., email validation, log filtering).
+- It balances expressiveness and readability for moderately complex cases.
+
+**When to Use**:
+✔ Extracting structured data (e.g., `\d{3}-\d{2}-\d{4}` for SSNs).
+✔ Finding variable patterns (e.g., `https?://[^\s]+` for URLs).
+✔ Replacing substrings following a rule (e.g., `s/\bcolour\b/color/g`).
+
+**Limitations**:
+❌ Becomes unreadable for very complex rules (e.g., nested brackets).
+❌ Poor at recursive patterns (e.g., matching nested HTML tags).
+
+**Example**:
+```python
+import re
+# Extract phone numbers in the format (XXX) XXX-XXXX
+text = "Call (123) 456-7890 or (987) 654-3210"
+phones = re.findall(r'\(\d{3}\) \d{3}-\d{4}', text)
+# Result: ['(123) 456-7890', '(987) 654-3210']
+```
+
+---
+
+## **2. For Simple Tasks, Built-in String Methods Are Cleaner**
+**Context**:
+- If the task is **exact matching** or **fixed-format parsing**, avoid the regex overhead.
+
+**When to Use**:
+✔ Checking prefixes/suffixes (`str.startswith()`, `str.endswith()`).
+✔ Exact substring search (the `in` operator, `str.find()`).
+✔ Splitting on fixed delimiters (`str.split(',')`).
+
+**Example**:
+```python
+# Check if a filename ends with .csv (simpler than regex)
+filename = "data_2024.csv"
+if filename.endswith(".csv"):
+    print("CSV file detected.")
+```
+
+---
+
+## **3. For Recursive/Nested Patterns, Use Grammars or Parsers**
+**Context**:
+- Regex **cannot** handle recursive structures (e.g., JSON, XML, math expressions).
+- **Formal grammars** (e.g., context-free grammars) or **parser combinators** are needed (see the sketch after the `lxml` example below).
+
+**When to Use**:
+✔ Parsing programming languages.
+✔ Extracting nested data (e.g., HTML/XML).
+✔ Validating structured documents.
+
+**Example (Using `lxml` for HTML)**:
+```python
+from lxml import html
+doc = html.fromstring("<div><p>Hello world</p></div>")
+text = doc.xpath("//p//text()")  # Gets ['Hello world']
+```
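+
+The HTML example works because `lxml` already implements the grammar. For recursive input with no ready-made parser, such as the nested math expressions mentioned above, a small hand-written grammar does what a single regex cannot. The sketch below is illustrative only: the `evaluate` helper is not from any library, it assumes nothing beyond the standard `re` module for tokenizing, and it implements the grammar `expr := term ('+' term)*`, `term := factor ('*' factor)*`, `factor := NUMBER | '(' expr ')'`.
+
+**Example (Hand-rolled recursive descent, illustrative sketch)**:
+```python
+import re
+
+def evaluate(expression):
+    """Evaluate +, * and parentheses with a tiny recursive-descent parser."""
+    # Illustrative sketch, not a library API: a regex handles the flat token
+    # layer, while the recursion lives in the nested parsing functions.
+    tokens = re.findall(r'\d+|[()+*]', expression)
+    pos = 0
+
+    def expr():    # expr := term ('+' term)*
+        nonlocal pos
+        value = term()
+        while pos < len(tokens) and tokens[pos] == '+':
+            pos += 1
+            value += term()
+        return value
+
+    def term():    # term := factor ('*' factor)*
+        nonlocal pos
+        value = factor()
+        while pos < len(tokens) and tokens[pos] == '*':
+            pos += 1
+            value *= factor()
+        return value
+
+    def factor():  # factor := NUMBER | '(' expr ')'
+        nonlocal pos
+        if tokens[pos] == '(':
+            pos += 1        # consume '('
+            value = expr()  # recurse: this is the part regex cannot express
+            pos += 1        # consume ')'
+            return value
+        value = int(tokens[pos])
+        pos += 1
+        return value
+
+    return expr()
+
+print(evaluate("(1 + (2 * 3))"))  # 7
+```
+
+Tools such as ANTLR or a PEG library derive the same recursive structure from a declarative grammar; the hand-rolled version simply makes the recursion, and the division of labor between tokenizer and parser, explicit.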
") +text = doc.xpath("//p//text()") # Gets "Hello world" +``` + +--- + +## **4. Automata Are Theoretical Foundations (Rarely Hand-Coded)** +**Context**: +- Finite State Machines (FSMs) underpin regex but are **not practical to write manually** for most tasks. +- Useful for **educational purposes** or **low-level optimizations** (e.g., lexers). + +**When to Use**: +✔ Teaching how regex works internally. +✔ Writing ultra-efficient tokenizers (e.g., in compiler design). + +**Example (Toy FSM for `ab*c`)**: +```python +def is_ab_star_c(s): + state = 0 + for char in s: + if state == 0 and char == 'a': + state = 1 + elif state == 1 and char == 'b': + continue + elif state == 1 and char == 'c': + state = 2 + else: + return False + return state == 2 +``` + +--- + +## **5. For High-Performance Tokenizing, Use Lex/Flex** +**Context**: +- **Lex/Flex** generate **optimized C code** for pattern matching. +- Used in compilers (e.g., `gcc`, `clang`) for speed. + +**When to Use**: +✔ Building custom programming languages. +✔ Processing large log files efficiently. + +**Example (Lex Rule for Words and Numbers)**: +```lex +%% +[a-zA-Z]+ { printf("WORD: %s\n", yytext); } +[0-9]+ { printf("NUMBER: %s\n", yytext); } +%% +``` + +--- + +## **Task-to-Tool Decision Table** +| **Task** | **Best Tool** | **Example** | +|-----------------------------------|-----------------------------|--------------------------------------| +| Exact substring match | `str.contains()`, `str.find()` | `"error 404".find("404")` | +| Prefix/suffix check | `str.startswith()`/`endswith()` | `filename.endswith(".csv")` | +| Medium-complexity patterns | **Regex** | `re.findall(r'\b[A-Z]\w+', text)` | +| Nested structures (HTML/XML) | **Parsers (lxml, BeautifulSoup)** | `xpath("//div//p/text()")` | +| Recursive patterns (e.g., math) | **Grammars (ANTLR, PEG)** | Parsing `(1 + (2 * 3))` | +| High-speed tokenizing (e.g., logs)| **Lex/Flex** | Lex rules for Apache log parsing | +| Educational/state logic | **Finite State Machines** | Implementing `ab*c` manually | + +--- + +### **Final Advice** +- **Use regex** for flexible, non-recursive text patterns. +- **Use string methods** for trivial checks (faster, more readable). +- **Use parsers** for nested/structured data (HTML, code). +- **Use Lex/Flex** for maximum performance in tokenizers. + +Would you like a case study comparing these tools on a real-world problem (e.g., log parsing)? \ No newline at end of file