Add projects/grants_xml_pipeline.md
### **Ultra-Lean "Just Works" Pipeline**

Since you control **everything except**:

1. Schema/XML structure
2. Data content
3. Download URL

Here’s the **minimalist version** that fails fast and loud *without babysitting*:

---

#### **Final Script (`grants_xml_pipeline`)**

```bash
#!/bin/bash
# Grants.gov XML → MongoDB (Zero-Validation)
# Assumes:
# - OS has bsdtar, xq (from the Python yq package), mongosh, curl
# - $GRANTS_MONGO_URI is set
set -euo pipefail

# Hardcode Immutables
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
readonly MONGO_URI="$GRANTS_MONGO_URI"
readonly COLLECTION="opportunities_$(date +%Y%m)"

# Stream Pipeline (No Checks)
# Assumption: mongosh exposes Node's require(), so the piped stdin
# can be read through file descriptor 0.
curl -LfsS "$URL" | \
  bsdtar -xOf - '*.xml' | \
  xq -c '.Opportunities.Opportunity[]' | \
  mongosh "$MONGO_URI" --eval "
    const lines = require('fs').readFileSync(0, 'utf8').split('\n');
    const batch = [];
    try {
      for (const line of lines) {
        if (line.trim() === '') continue;  // skip blank lines
        batch.push(JSON.parse(line));
        if (batch.length >= 1000) {
          db.$COLLECTION.insertMany(batch);
          batch.length = 0;
        }
      }
      // Flush the final partial batch
      if (batch.length > 0) db.$COLLECTION.insertMany(batch);
    } catch (e) {
      print('FATAL: Invalid JSON or failed insert:', e);
      quit(1);
    }"
```
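
The batching pattern in the script (collect records, write every 1000, then write whatever is left at end of input) can be tried in isolation without MongoDB. A minimal sketch, assuming a hypothetical helper `batch_lines` that only reports flush sizes instead of inserting anything:

```bash
#!/bin/bash
# batch_lines SIZE: read lines from stdin and report each flush.
# Hypothetical stand-in for the insertMany() batching; nothing is written anywhere.
batch_lines() {
  size="$1"
  count=0
  while IFS= read -r line; do
    count=$((count + 1))
    if [ "$count" -ge "$size" ]; then
      echo "flush $count"
      count=0
    fi
  done
  # Flush the final partial batch, if any.
  if [ "$count" -gt 0 ]; then
    echo "flush $count"
  fi
}

# 2500 input lines with a batch size of 1000.
result=$(seq 1 2500 | batch_lines 1000)
echo "$result"
# prints:
# flush 1000
# flush 1000
# flush 500
```

The last `if` is the part that is easy to forget: without it, any records left over after the final full batch would never be written.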

---

### **Why This Works for Your Case**

1. **No Dependency Checks**
   - You control the OS → `bsdtar`, `xq`, etc. *will* be there.
   - Let `command not found` errors fail naturally.

2. **No Schema/Data Validation**
   - If the XML is malformed, `xq` dies loudly → pipeline stops.

3. **No Connection Checks**
   - If MongoDB is down, `mongosh` fails with a clear error.

4. **Still Robust**
   - `set -euo pipefail` makes the script exit non-zero when any pipeline stage fails (it cannot catch a command that reports a problem but still exits 0).
   - `try/catch` in the MongoDB JS turns a bad JSON line into an explicit fatal error and a non-zero exit.
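
To see why `pipefail` matters for a pipeline like this one, the following sketch compares the exit status of a pipeline whose first stage fails, with and without the option. Here `false` is only a stand-in for a failing stage such as `curl`, and `cat` for a later stage that succeeds:

```bash
#!/bin/bash
# Without pipefail, a pipeline's status is that of its LAST command only.
without=$(bash -c 'false | cat; echo $?')

# With pipefail, the status is that of the last command that failed.
with=$(bash -c 'set -o pipefail; false | cat; echo $?')

echo "without pipefail: $without"   # prints: without pipefail: 0
echo "with pipefail:    $with"      # prints: with pipefail:    1
```

Combined with `set -e`, that non-zero status is what stops the script when an early stage such as the download fails.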

---

### **When It Will Break (And That’s Okay)**

The script assumes:

✅ **URL Format Never Changes**
- If `GrantsDBExtract{date}v2.zip` becomes `v3`, the download fails.

✅ **XML Structure Is Stable**
- If `.Opportunities.Opportunity[]` changes, `xq` exits with an error.

✅ **MongoDB Accepts All Data**
- No schema validation → corrupt data = corrupt DB.
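
The naming assumptions above can be made visible by pulling them into small functions. This is only a sketch: `build_url` and `collection_for` are hypothetical helpers, and the dates passed in are illustrative values, not real extract dates.

```bash
#!/bin/bash
# Hypothetical helpers that isolate the naming assumptions.
build_url() {
  day="$1"            # YYYYMMDD
  version="${2:-v2}"  # suffix the script currently hardcodes as v2
  printf 'https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract%s%s.zip\n' "$day" "$version"
}

collection_for() {
  month="$1"          # YYYYMM
  printf 'opportunities_%s\n' "$month"
}

url=$(build_url 20240115)
coll=$(collection_for 202401)
echo "$url"
echo "$coll"
```

If the suffix ever changed, only the default in `build_url` would need updating, for example `build_url 20240115 v3`.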

---

### **Tradeoffs Summary**

| **Approach** | **Lines of Code** | **Fails Fast?** | **Debugging Ease** |
|--------------------|------------------|----------------|-------------------|
| Your Original | ~80 | ✅ | ✅ (Clear logs) |
| **This Version** | **~30** | ✅ | ❌ (Raw errors) |

---

### **Final Answer**

If you **100% control the environment** and prefer **"fail loudly" over "validate everything"**, this is all you need.

**Keep the original if:**
- Others might run this in unknown environments.
- You want pretty logs for debugging.

**Use this if:**
- You’re the only user, and crashes are acceptable.
- Simplicity > resilience.

Either way, your design is solid. 🚀