139 lines
4.9 KiB
Markdown
139 lines
4.9 KiB
Markdown
# Pattern Matching
|
||
|
||
---
|
||
|
||
### **Expanded Key Takeaways: Choosing the Right Tool for Pattern Matching**
|
||
|
||
Regular expressions (regex) are powerful, but they’re not always the best tool for every text-processing task. Below is an **expanded breakdown** of when to use regex versus alternatives, along with context and real-world examples.
|
||
|
||
---
|
||
|
||
## **1. Regex is Best for Medium-Complexity Text Patterns**
|
||
**Context**:
|
||
- Regex excels at flexible, rule-based matching (e.g., email validation, log filtering).
|
||
- It balances expressiveness and readability for moderately complex cases.
|
||
|
||
**When to Use**:
|
||
✔ Extracting structured data (e.g., `\d{3}-\d{2}-\d{4}` for SSNs).
|
||
✔ Finding variable patterns (e.g., `https?://[^\s]+` for URLs).
|
||
✔ Replacing substrings following a rule (e.g., `s/\bcolour\b/color/g`).
|
||
|
||
**Limitations**:
|
||
❌ Becomes unreadable for very complex rules (e.g., nested brackets).
|
||
❌ Poor at recursive patterns (e.g., matching nested HTML tags).
|
||
|
||
**Example**:
|
||
```python
|
||
import re
|
||
# Extract phone numbers in format (XXX) XXX-XXXX
|
||
text = "Call (123) 456-7890 or (987) 654-3210"
|
||
phones = re.findall(r'\(\d{3}\) \d{3}-\d{4}', text)
|
||
# Result: ['(123) 456-7890', '(987) 654-3210']
|
||
```
|
||
|
||
---
|
||
|
||
## **2. For Simple Tasks, Built-in String Methods Are Cleaner**
|
||
**Context**:
|
||
- If the task is **exact matching** or **fixed-format parsing**, avoid regex overhead.
|
||
|
||
**When to Use**:
|
||
✔ Checking prefixes/suffixes (`str.startswith()`, `str.endswith()`).
|
||
✔ Exact substring search (`str.find()`, `str.contains()`).
|
||
✔ Splitting on fixed delimiters (`str.split(',')`).
|
||
|
||
**Example**:
|
||
```python
|
||
# Check if a filename ends with .csv (simpler than regex)
|
||
filename = "data_2024.csv"
|
||
if filename.endswith(".csv"):
|
||
print("CSV file detected.")
|
||
```
|
||
|
||
---
|
||
|
||
## **3. For Recursive/Nested Patterns, Use Grammars or Parsers**
|
||
**Context**:
|
||
- Regex **cannot** handle recursive structures (e.g., JSON, XML, math expressions).
|
||
- **Formal grammars** (e.g., CFG) or **parser combinators** are needed.
|
||
|
||
**When to Use**:
|
||
✔ Parsing programming languages.
|
||
✔ Extracting nested data (e.g., HTML/XML).
|
||
✔ Validating structured documents.
|
||
|
||
**Example (Using `lxml` for HTML)**:
|
||
```python
|
||
from lxml import html
|
||
doc = html.fromstring("<div><p>Hello <b>world</b></p></div>")
|
||
text = doc.xpath("//p//text()") # Gets "Hello world"
|
||
```
|
||
|
||
---
|
||
|
||
## **4. Automata Are Theoretical Foundations (Rarely Hand-Coded)**
|
||
**Context**:
|
||
- Finite State Machines (FSMs) underpin regex but are **not practical to write manually** for most tasks.
|
||
- Useful for **educational purposes** or **low-level optimizations** (e.g., lexers).
|
||
|
||
**When to Use**:
|
||
✔ Teaching how regex works internally.
|
||
✔ Writing ultra-efficient tokenizers (e.g., in compiler design).
|
||
|
||
**Example (Toy FSM for `ab*c`)**:
|
||
```python
|
||
def is_ab_star_c(s):
|
||
state = 0
|
||
for char in s:
|
||
if state == 0 and char == 'a':
|
||
state = 1
|
||
elif state == 1 and char == 'b':
|
||
continue
|
||
elif state == 1 and char == 'c':
|
||
state = 2
|
||
else:
|
||
return False
|
||
return state == 2
|
||
```
|
||
|
||
---
|
||
|
||
## **5. For High-Performance Tokenizing, Use Lex/Flex**
|
||
**Context**:
|
||
- **Lex/Flex** generate **optimized C code** for pattern matching.
|
||
- Used in compilers (e.g., `gcc`, `clang`) for speed.
|
||
|
||
**When to Use**:
|
||
✔ Building custom programming languages.
|
||
✔ Processing large log files efficiently.
|
||
|
||
**Example (Lex Rule for Words and Numbers)**:
|
||
```lex
|
||
%%
|
||
[a-zA-Z]+ { printf("WORD: %s\n", yytext); }
|
||
[0-9]+ { printf("NUMBER: %s\n", yytext); }
|
||
%%
|
||
```
|
||
|
||
---
|
||
|
||
## **Task-to-Tool Decision Table**
|
||
| **Task** | **Best Tool** | **Example** |
|
||
|-----------------------------------|-----------------------------|--------------------------------------|
|
||
| Exact substring match | `str.contains()`, `str.find()` | `"error 404".find("404")` |
|
||
| Prefix/suffix check | `str.startswith()`/`endswith()` | `filename.endswith(".csv")` |
|
||
| Medium-complexity patterns | **Regex** | `re.findall(r'\b[A-Z]\w+', text)` |
|
||
| Nested structures (HTML/XML) | **Parsers (lxml, BeautifulSoup)** | `xpath("//div//p/text()")` |
|
||
| Recursive patterns (e.g., math) | **Grammars (ANTLR, PEG)** | Parsing `(1 + (2 * 3))` |
|
||
| High-speed tokenizing (e.g., logs)| **Lex/Flex** | Lex rules for Apache log parsing |
|
||
| Educational/state logic | **Finite State Machines** | Implementing `ab*c` manually |
|
||
|
||
---
|
||
|
||
### **Final Advice**
|
||
- **Use regex** for flexible, non-recursive text patterns.
|
||
- **Use string methods** for trivial checks (faster, more readable).
|
||
- **Use parsers** for nested/structured data (HTML, code).
|
||
- **Use Lex/Flex** for maximum performance in tokenizers.
|
||
|
||
Would you like a case study comparing these tools on a real-world problem (e.g., log parsing)? |