Add projects/grants_xml_pipeline.md

2025-08-01 06:43:29 -05:00
parent 6604c6dfb9
commit 184484d1a2


@@ -0,0 +1,95 @@
### **Ultra-Lean "Just Works" Pipeline**
Since you control **everything except**:
1. Schema/XML structure
2. Data content
3. Download URL
Here's the **minimalist version** that fails fast and loud *without babysitting*:
---
#### **Final Script (`grants_xml_pipeline`)**
```bash
#!/bin/bash
# Grants.gov XML → MongoDB (Zero-Validation)
# Assumes:
# - OS has bsdtar, xq (yq), mongosh, curl
# - $GRANTS_MONGO_URI is set
set -euo pipefail
# Hardcode Immutables
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
readonly MONGO_URI="$GRANTS_MONGO_URI"
readonly COLLECTION="opportunities_$(date +%Y%m)"
# Stream Pipeline (No Checks)
curl -LfsS "$URL" | \
bsdtar -xOf - '*.xml' | \
xq -c '.Opportunities.Opportunity[]' | \
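mongosh "$MONGO_URI" --eval "
  // Assumption: mongosh has no global readline(), so the pipe is read through
  // Node's fs module. This buffers the whole NDJSON stream in memory.
  const lines = require('fs').readFileSync('/dev/stdin', 'utf8').split('\n').filter(Boolean);
  try {
    for (let i = 0; i < lines.length; i += 1000) {
      // slice() also covers the final partial batch of fewer than 1000 docs.
      db.$COLLECTION.insertMany(lines.slice(i, i + 1000).map((l) => JSON.parse(l)));
    }
  } catch (e) {
    print('FATAL: parse or insert failed:', e);
    quit(1);
  }"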
```
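The last stage only has to load one JSON document per line, so one alternative worth considering is `mongoimport`, which reads newline-delimited JSON from standard input by default. The sketch below assumes the MongoDB Database Tools are installed (they are not in the script's dependency list) and that the connection URI names the target database. `load_ndjson` is a hypothetical helper name, not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper: load newline-delimited JSON from stdin into a collection.
# Assumes mongoimport (MongoDB Database Tools) is on PATH and that the
# database name is part of the URI.
load_ndjson() {
  local uri="$1" collection="$2"
  mongoimport --uri "$uri" --collection "$collection"
}

# Usage, replacing the mongosh stage of the pipeline:
#   ... | xq -c '.Opportunities.Opportunity[]' \
#       | load_ndjson "$GRANTS_MONGO_URI" "opportunities_$(date +%Y%m)"
```

This trades the inline JavaScript for one extra dependency. Whether that is worth it depends on how much you want to own the batching logic yourself.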
---
### **Why This Works for Your Case**
1. **No Dependency Checks**
- You control the OS → `bsdtar`, `xq`, etc. *will* be there.
- Let `command not found` errors fail naturally.
2. **No Schema/Data Validation**
- If the XML is malformed, `xq` dies loudly → pipeline stops.
3. **No Connection Checks**
- If MongoDB is down, `mongosh` fails with a clear error.
4. **Still Robust**
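- `set -euo pipefail` aborts the script as soon as any pipeline stage exits non-zero or an unset variable is referenced.
- `try/catch` in the `mongosh` script reports bad JSON and exits with status 1.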
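To see what `pipefail` contributes, the sketch below runs the same two-stage pipeline with and without it. `status_of` is a hypothetical helper that runs a snippet in a child shell and prints its exit status. It is for illustration only and is not part of the pipeline.

```bash
#!/bin/bash
# Hypothetical helper: run a snippet in a child bash and print its exit status.
status_of() {
  local rc=0
  bash -c "$1" >/dev/null 2>&1 || rc=$?
  echo "$rc"
}

# Without pipefail the pipeline reports the status of its last stage (cat),
# so the failure of the first stage is lost.
echo "default:  $(status_of 'false | cat')"

# With pipefail the pipeline reports the failing stage, which set -e then
# turns into an immediate abort of the script.
echo "pipefail: $(status_of 'set -o pipefail; false | cat')"
```

In the pipeline this is what lets a failed `curl` or `bsdtar` stop the run instead of being masked by a later stage that happens to exit cleanly.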
---
### **When It Will Break (And That's Okay)**
The script assumes:
1. **URL Format Never Changes**
- If `GrantsDBExtract{date}v2.zip` becomes `v3`, the download fails.
2. **XML Structure Is Stable**
- If the `.Opportunities.Opportunity[]` path no longer matches, `xq` exits with an error.
3. **MongoDB Accepts All Data**
- No schema validation → corrupt data = corrupt DB.
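The URL assumption can be made concrete. The sketch below builds the extract URL for a given date stamp using the same pattern the script hardcodes. `extract_url` is a hypothetical helper, and taking the date as an argument (rather than calling `date` inside) is only there to make it easy to try other days.

```bash
#!/bin/bash
# Hypothetical helper: build the extract URL for a YYYYMMDD date stamp,
# following the pattern hardcoded in the script.
extract_url() {
  printf 'https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract%sv2.zip\n' "$1"
}

extract_url "$(date +%Y%m%d)"   # today's URL, as the script computes it
extract_url 20250801            # a fixed date, for comparison
```

If nothing is published at the computed URL, `curl -f` exits with status 22 on the HTTP error, and `pipefail` carries that into the script's exit status.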
---
### **Tradeoffs Summary**
| **Approach** | **Lines of Code** | **Fails Fast?** | **Debugging Ease** |
|--------------------|------------------|----------------|-------------------|
| Your Original | ~80 | ✅ | ✅ (Clear logs) |
| **This Version** | **~30** | ✅ | ❌ (Raw errors) |
---
### **Final Answer**
If you **100% control the environment** and prefer **"fail loudly" over "validate everything"**, this is all you need.
**Keep the original if:**
- Others might run this in unknown environments.
- You want pretty logs for debugging.
**Use this if:**
- You're the only user, and crashes are acceptable.
- Simplicity > resilience.
Either way, your design is solid. 🚀