### **Ultra-Lean "Just Works" Pipeline**
Since you control **everything except**:
1. Schema/XML structure
2. Data content
3. Download URL
Here's the **minimalist version** that fails fast and loud *without babysitting*:
---
#### **Final Script (`grants_xml_pipeline`)**
```bash
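#!/bin/bash
# Grants.gov XML → MongoDB (Zero-Validation)
# Assumes:
# - OS has curl, bsdtar, xq (from the yq package), mongosh
# - $GRANTS_MONGO_URI is set
set -euo pipefail

# Hardcode Immutables
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
readonly MONGO_URI="$GRANTS_MONGO_URI"
readonly COLLECTION="opportunities_$(date +%Y%m)"

# Pipeline (No Checks): download → unzip → one JSON document per line → insert
curl -LfsS "$URL" | \
bsdtar -xOf - '*.xml' | \
xq -c '.Opportunities.Opportunity[]' | \
mongosh "$MONGO_URI" --eval "
  // Assumption: mongosh leaves stdin untouched under --eval, so Node's fs can read fd 0.
  // Note: this buffers the whole JSON-lines stream in memory before inserting.
  const lines = require('fs').readFileSync(0, 'utf8').split('\n');
  let batch = [];
  for (const line of lines) {
    if (line.trim() === '') continue;   // skip the empty line after the final newline
    try {
      batch.push(JSON.parse(line));
    } catch (e) {
      print('FATAL: Invalid JSON/XML:', e);
      quit(1);
    }
    if (batch.length >= 1000) {
      db.$COLLECTION.insertMany(batch);
      batch = [];
    }
  }
  // Flush the final partial batch (fewer than 1000 documents)
  if (batch.length > 0) db.$COLLECTION.insertMany(batch);
"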
```
---
### **Why This Works for Your Case**
1. **No Dependency Checks**
- You control the OS → `bsdtar`, `xq`, etc. *will* be there.
- Let `command not found` errors fail naturally.
2. **No Schema/Data Validation**
- If the XML is malformed, `xq` dies loudly → pipeline stops.
3. **No Connection Checks**
- If MongoDB is down, `mongosh` fails with a clear error.
4. **Still Robust**
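- `set -euo pipefail` makes the script exit non-zero when any pipeline stage fails or an unset variable is referenced.
- `try/catch` in the `mongosh` JS turns a bad JSON line into an explicit `FATAL` message and exit code 1.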
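
To make the `pipefail` point concrete, here is a minimal sketch that runs the same three-stage pipeline with and without the option. `pipeline_status` is a hypothetical helper used only for this illustration; `false`, `cat`, and `wc` stand in for `curl`, `bsdtar`, and `xq`.

```bash
#!/bin/bash
# Demonstrates what `set -o pipefail` changes for a multi-stage pipeline.
# pipeline_status is a hypothetical helper, not part of the pipeline script.

# Run a pipeline whose FIRST stage fails and print the pipeline's exit status.
pipeline_status() {
  # `false` exits 1; `cat` and `wc -l` both succeed.
  false | cat | wc -l > /dev/null
  echo "$?"
}

# Without pipefail, the status is that of the last stage (wc), so the failure is hidden.
set +o pipefail
without=$(pipeline_status)

# With pipefail, the status is that of the rightmost failing stage (false).
set -o pipefail
with=$(pipeline_status)

echo "without pipefail: $without"
echo "with pipefail:    $with"
```

Without `pipefail`, a failed download would be masked by whatever stage runs last, which is why the option matters more here than `-e` alone.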
---
### **When It Will Break (And That's Okay)**
The script assumes:
**URL Format Never Changes**
- If `GrantsDBExtract{date}v2.zip` becomes `v3`, `curl -f` gets an HTTP error and the pipeline stops.
**XML Structure Is Stable**
- If `.Opportunities.Opportunity[]` no longer matches the document, the filter in `xq` exits with an error.
**MongoDB Accepts All Data**
- No schema validation → corrupt data = corrupt DB.
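
The URL assumption can be isolated in a few lines, so that a change from `v2` to `v3` touches one place. This is only a sketch: `extract_url` and `collection_name` are hypothetical helpers that are not in the script above, and the naming pattern is simply the one the script hardcodes.

```bash
#!/bin/bash
# Sketch: build the date-stamped extract URL and the monthly collection name.
# extract_url and collection_name are hypothetical helpers for illustration.

BASE_URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts"

# $1 = date as YYYYMMDD, $2 = version suffix (defaults to v2, as in the script)
extract_url() {
  local day="$1" version="${2:-v2}"
  echo "${BASE_URL}/GrantsDBExtract${day}${version}.zip"
}

# $1 = date as YYYYMMDD; the collection is keyed by year and month only
collection_name() {
  local day="$1"
  echo "opportunities_${day:0:6}"
}

today=$(date +%Y%m%d)
echo "URL:        $(extract_url "$today")"
echo "Collection: $(collection_name "$today")"
```

Calling `extract_url "$today" v3` would be the whole migration if the suffix ever changes. Whether that is worth the extra lines depends on how often you expect the format to move.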
---
### **Tradeoffs Summary**
| **Approach**     | **Lines of Code** | **Fails Fast?** | **Debugging Ease** |
|------------------|-------------------|-----------------|--------------------|
| Your Original    | ~80               | ✅              | ✅ (Clear logs)    |
| **This Version** | **~30**           | ✅              | ❌ (Raw errors)    |
---
### **Final Answer**
If you **100% control the environment** and prefer **"fail loudly" over "validate everything"**, this is all you need.
**Keep the original if:**
- Others might run this in unknown environments.
- You want pretty logs for debugging.
**Use this if:**
- You're the only user, and crashes are acceptable.
- Simplicity > resilience.
Either way, your design is solid. 🚀