Ultra-Lean "Just Works" Pipeline
Since you control everything except:
- Schema/XML structure
- Data content
- Download URL
Here’s the minimalist version that fails fast and loud without babysitting:
Final Script (grants_xml_pipeline)
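#!/bin/bash
# Grants.gov XML → MongoDB (Zero-Validation)
# Assumes:
# - OS has bsdtar, xq (yq), mongosh, curl
# - $GRANTS_MONGO_URI is set
set -euo pipefail

# Hardcode Immutables
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
readonly MONGO_URI="$GRANTS_MONGO_URI"
readonly COLLECTION="opportunities_$(date +%Y%m)"

# Stream Pipeline (No Checks)
# Note: $COLLECTION is expanded by bash before mongosh sees the script.
curl -LfsS "$URL" |
  bsdtar -xOf - '*.xml' |
  xq -c '.Opportunities.Opportunity[]' |
  mongosh "$MONGO_URI" --eval "
    try {
      // Assumption: mongosh exposes Node's require(), so stdin (fd 0) can be
      // read through fs. This buffers all of xq's output before inserting.
      const lines = require('fs').readFileSync(0, 'utf8').split('\n');
      const coll = db.getCollection('$COLLECTION');
      let batch = [];
      for (const line of lines) {
        if (line === '') continue;      // skip the trailing empty line
        batch.push(JSON.parse(line));   // throws on invalid JSON
        if (batch.length >= 1000) {
          coll.insertMany(batch);
          batch = [];
        }
      }
      // Flush the final partial batch.
      if (batch.length > 0) coll.insertMany(batch);
    } catch (e) {
      print('FATAL: import failed:', e);
      quit(1);
    }
  "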
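The batching step is the easiest part of the script to get subtly wrong: a loop that flushes every N documents also has to flush whatever is left when the input ends. Here is a minimal bash sketch of that pattern. The function name `batch_lines` is hypothetical, and it only prints what it would flush instead of inserting anything.

```bash
#!/bin/bash
set -euo pipefail

# batch_lines SIZE (hypothetical helper, illustration only)
# Reads lines from stdin and reports one "flush" per batch of SIZE lines,
# plus a final flush for any leftover lines at end of input.
batch_lines() {
  local size="$1" count=0 batches=0 line
  # The `|| [[ -n "$line" ]]` keeps a last line that lacks a trailing newline.
  while IFS= read -r line || [[ -n "$line" ]]; do
    count=$((count + 1))
    if (( count >= size )); then
      batches=$((batches + 1))
      echo "flush batch $batches ($count docs)"
      count=0
    fi
  done
  # Without this final flush, the last partial batch would be silently dropped.
  if (( count > 0 )); then
    batches=$((batches + 1))
    echo "flush batch $batches ($count docs)"
  fi
}
```

For example, `printf '%s\n' a b c d e | batch_lines 2` reports two full batches followed by a final batch of one.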
Why This Works for Your Case
- No Dependency Checks
  - You control the OS → `bsdtar`, `xq`, etc. will be there.
  - Let `command not found` errors surface on their own.
- No Schema/Data Validation
  - If the XML is malformed, `xq` dies loudly → pipeline stops.
- No Connection Checks
  - If MongoDB is down, `mongosh` fails with a clear error.
- Still Robust
  - `set -euo pipefail` makes the script exit non-zero when any stage of the pipeline fails. It does not undo batches that were already inserted.
  - `try/catch` in the MongoDB JS reports bad JSON and exits.
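To make the `pipefail` point concrete, here is a small self-contained demo. The `flaky_producer` function is a hypothetical stand-in for a download that dies midway. It shows that `pipefail` changes the reported exit status, and that output emitted before the failure still reaches the downstream command.

```bash
#!/bin/bash
# Demo only: no `set -e` here, so each exit status can be inspected.

# Hypothetical producer: emits one line, then fails.
flaky_producer() { echo "partial-output"; return 1; }

# Without pipefail, the pipeline's status is that of the last command (cat).
set +o pipefail
flaky_producer | cat >/dev/null
status_without=$?

# With pipefail, the producer's failure becomes the pipeline's status,
# but cat has still received everything written before the failure.
set -o pipefail
received=$(flaky_producer | cat)
status_with=$?

echo "without pipefail: $status_without"
echo "with pipefail:    $status_with (downstream still received: $received)"
```

In the pipeline above this means a failed download stops the script with a non-zero status, while any documents that reached `mongosh` before the failure may already be in the collection.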
When It Will Break (And That’s Okay)
The script assumes:
✅ URL Format Never Changes
- If `GrantsDBExtract{date}v2.zip` becomes `v3`, it fails.
✅ XML Structure Is Stable
- If `.Opportunities.Opportunity[]` changes, `xq` crashes.
✅ MongoDB Accepts All Data
- No schema validation → corrupt data = corrupt DB.
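Because the URL assumption lives in a single string, one option is to isolate it in a tiny helper so that a format change is a one-line fix. This is a sketch under that assumption; the function name `extract_url` is hypothetical, and the URL pattern is copied from the script as-is.

```bash
#!/bin/bash
set -euo pipefail

# extract_url YYYYMMDD (hypothetical helper)
# Builds the extract URL for a given date stamp, using the same pattern
# that the script hardcodes.
extract_url() {
  local stamp="$1"
  # Reject anything that is not exactly eight digits.
  if [[ ! "$stamp" =~ ^[0-9]{8}$ ]]; then
    echo "bad date stamp: $stamp" >&2
    return 1
  fi
  printf 'https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract%sv2.zip\n' "$stamp"
}
```

A quick manual probe such as `curl -fsSI "$(extract_url "$(date +%Y%m%d)")"` sends a HEAD request and fails on an HTTP error, which is one way to tell whether today's file exists without downloading it.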
Tradeoffs Summary
| Approach | Lines of Code | Fails Fast? | Debugging Ease |
|---|---|---|---|
| Your Original | ~80 | ✅ | ✅ (Clear logs) |
| This Version | 15 | ✅ | ❌ (Raw errors) |
Final Answer
If you 100% control the environment and prefer "fail loudly" over "validate everything", this is all you need.
Keep the original if:
- Others might run this in unknown environments.
- You want pretty logs for debugging.
Use this if:
- You’re the only user, and crashes are acceptable.
- Simplicity > resilience.
Either way, your design is solid. 🚀