Add projects/grants_xml_pipeline.md
### **Ultra-Lean "Just Works" Pipeline**

Since you control **everything except**:

1. Schema/XML structure
2. Data content
3. Download URL

Here’s the **minimalist version** that fails fast and loud *without babysitting*:

---

#### **Final Script (`grants_xml_pipeline`)**

```bash
#!/bin/bash
# Grants.gov XML → MongoDB (Zero-Validation)
# Assumes:
# - OS has bsdtar, xq (from the yq package), mongosh, curl
# - $GRANTS_MONGO_URI is set
set -euo pipefail

# Hardcode Immutables
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
readonly MONGO_URI="$GRANTS_MONGO_URI"
readonly COLLECTION="opportunities_$(date +%Y%m)"

# Stream Pipeline (No Checks)
# Notes on the mongosh stage:
# - bash expands $COLLECTION inside the double-quoted script before mongosh runs it.
# - stdin is read via Node's fs module (fd 0); this assumes mongosh leaves stdin
#   unread when started with --eval.
curl -LfsS "$URL" | \
  bsdtar -xOf - '*.xml' | \
  xq -c '.Opportunities.Opportunity[]' | \
  mongosh "$MONGO_URI" --eval "
    const lines = require('fs').readFileSync(0, 'utf8').split('\n');
    const batch = [];
    const flush = () => {
      if (batch.length === 0) return;
      db.$COLLECTION.insertMany(batch);
      batch.length = 0;
    };
    for (const line of lines) {
      if (line.trim() === '') continue; // skip blank lines, e.g. after the last newline
      try {
        batch.push(JSON.parse(line));
      } catch (e) {
        print('FATAL: Invalid JSON/XML:', e);
        quit(1);
      }
      if (batch.length >= 1000) flush();
    }
    flush(); // insert the final partial batch
  "
```
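
The batch-and-flush pattern used in the mongosh stage can be sketched on its own in plain bash. This is a minimal illustration, not part of the pipeline: `batch_lines` is a hypothetical helper that counts lines instead of inserting documents. It shows why a batching loop needs one more flush at end of input, since otherwise the last partial batch is silently dropped. With 2500 input lines and a batch size of 1000, it should report flushes of 1000, 1000, and 500.

```bash
#!/bin/bash
set -euo pipefail

# batch_lines SIZE (hypothetical helper): read lines from stdin and report a
# flush each time SIZE lines have accumulated, plus one for any remainder.
batch_lines() {
  local size="$1" count=0 line
  # The "|| [ -n "$line" ]" clause keeps a final line that lacks a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    count=$((count + 1))
    if [ "$count" -ge "$size" ]; then
      echo "flush $count"   # stands in for insertMany(batch)
      count=0
    fi
  done
  # Without this final flush, the last partial batch would be dropped.
  if [ "$count" -gt 0 ]; then
    echo "flush $count"
  fi
}

seq 1 2500 | batch_lines 1000
```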

---

### **Why This Works for Your Case**

1. **No Dependency Checks**
   - You control the OS → `bsdtar`, `xq`, etc. *will* be there.
   - Let `command not found` errors fail naturally.

2. **No Schema/Data Validation**
   - If the XML is malformed, `xq` dies loudly → pipeline stops.

3. **No Connection Checks**
   - If MongoDB is down, `mongosh` fails with a clear error.

4. **Still Robust**
   - `set -euo pipefail` aborts the script on a failed command, an unset variable, or a failed pipeline stage.
   - `try/catch` in the mongosh script reports bad JSON and exits with status 1.
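
To make the `pipefail` point concrete, here is a small sketch that is separate from the pipeline itself. `pipeline_status` is a hypothetical helper that runs `false | cat` in a subshell and prints the pipeline's exit status. Without `pipefail`, bash reports only the last stage, so a failed `curl` at the front of the real pipeline would go unnoticed as long as the last stage exited cleanly.

```bash
#!/bin/bash

# pipeline_status MODE (hypothetical helper): run a pipeline whose first stage
# fails, with pipefail "on" or "off", and print the resulting exit status.
pipeline_status() {
  local status=0
  (
    if [ "$1" = "on" ]; then set -o pipefail; else set +o pipefail; fi
    false | cat
  ) || status=$?
  echo "$status"
}

echo "without pipefail: $(pipeline_status off)"   # status of the last stage (cat)
echo "with pipefail:    $(pipeline_status on)"    # status of the failed stage (false)
```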

---

### **When It Will Break (And That’s Okay)**

The script assumes:

✅ **URL Format Never Changes**

- If `GrantsDBExtract{date}v2.zip` becomes `v3`, the download fails: `curl -f` exits non-zero on the HTTP error.

✅ **XML Structure Is Stable**

- If `.Opportunities.Opportunity[]` changes, `xq` crashes.

✅ **MongoDB Accepts All Data**

- No schema validation → corrupt data = corrupt DB.
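
The first assumption is easier to see with the naming logic pulled out of the script. The sketch below uses two hypothetical helpers, `extract_url` and `collection_name`, that take the date as an argument instead of calling `date` inline. The `v2` suffix and the `opportunities_` prefix are copied from the script. If either naming scheme changes upstream, these strings are the only places that need editing.

```bash
#!/bin/bash
set -euo pipefail

# extract_url YYYYMMDD (hypothetical helper): build the download URL the
# script hardcodes, including the fixed "v2" suffix.
extract_url() {
  local day="$1"
  printf 'https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract%sv2.zip\n' "$day"
}

# collection_name YYYYMM (hypothetical helper): build the monthly collection name.
collection_name() {
  local month="$1"
  printf 'opportunities_%s\n' "$month"
}

# Same values the script computes for the current day and month.
extract_url "$(date +%Y%m%d)"
collection_name "$(date +%Y%m)"
```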

---

### **Tradeoffs Summary**

| **Approach**     | **Lines of Code** | **Fails Fast?** | **Debugging Ease** |
|------------------|-------------------|-----------------|--------------------|
| Your Original    | ~80               | ✅              | ✅ (Clear logs)    |
| **This Version** | **15**            | ✅              | ❌ (Raw errors)    |

---

### **Final Answer**

If you **100% control the environment** and prefer **"fail loudly" over "validate everything"**, this is all you need.

**Keep the original if:**

- Others might run this in unknown environments.
- You want pretty logs for debugging.

**Use this if:**

- You’re the only user, and crashes are acceptable.
- Simplicity > resilience.

Either way, your design is solid. 🚀