### **Ultra-Lean "Just Works" Pipeline**
Since you control **everything except**:
1. Schema/XML structure
2. Data content
3. Download URL

Here’s the **minimalist version** that fails fast and loud *without babysitting*:

---

#### **Final Script (`grants_xml_pipeline`)**
```bash
#!/bin/bash
# Grants.gov XML → MongoDB (Zero-Validation)
# Assumes:
#   - OS has bsdtar, xq (yq), mongosh, curl
#   - $GRANTS_MONGO_URI is set
set -euo pipefail

# Hardcode Immutables
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
readonly MONGO_URI="$GRANTS_MONGO_URI"
readonly COLLECTION="opportunities_$(date +%Y%m)"

# Stream Pipeline (No Checks)
# The collection name is interpolated by bash before mongosh sees the script.
curl -LfsS "$URL" | \
  bsdtar -xOf - '*.xml' | \
  xq -c '.Opportunities.Opportunity[]' | \
  mongosh "$MONGO_URI" --eval "
    const batch = [];
    while (true) {
      // Assumption: readline() returns null/undefined once stdin is exhausted.
      const line = readline();
      if (line == null) break;
      try {
        batch.push(JSON.parse(line));
      } catch (e) {
        print('FATAL: Invalid JSON/XML:', e);
        quit(1);
      }
      if (batch.length >= 1000) {
        db.$COLLECTION.insertMany(batch);
        batch.length = 0;
      }
    }
    // Flush the final partial batch so the last documents are not dropped.
    if (batch.length > 0) db.$COLLECTION.insertMany(batch);
  "
```
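
Because the script never checks `$GRANTS_MONGO_URI` explicitly, it relies on `set -u` to abort when the variable is unset. The sketch below shows that behaviour in isolation. `uri_check_status` is a hypothetical helper, not part of the pipeline, and the `mongodb://localhost:27017/grants` URI is only a placeholder.

```shell
#!/bin/bash
# Hypothetical helper (not part of the pipeline): run the script's
# MONGO_URI assignment in a child bash and print that child's exit status.
uri_check_status() {
  local status=0
  bash -c 'set -euo pipefail; readonly MONGO_URI="$GRANTS_MONGO_URI"' 2>/dev/null || status=$?
  echo "$status"
}

# Variable unset: the child aborts on the unbound variable (non-zero status).
( unset GRANTS_MONGO_URI; uri_check_status )

# Variable set (placeholder URI): the assignment succeeds and this prints 0.
( export GRANTS_MONGO_URI="mongodb://localhost:27017/grants"; uri_check_status )
```

In the real script the same failure happens on the `readonly MONGO_URI=...` line, before any download starts.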

---

### **Why This Works for Your Case**
1. **No Dependency Checks**
   - You control the OS → `bsdtar`, `xq`, etc. *will* be there.
   - Let `command not found` errors fail naturally.

2. **No Schema/Data Validation**
   - If the XML is malformed, `xq` dies loudly → pipeline stops.

3. **No Connection Checks**
   - If MongoDB is down, `mongosh` fails with a clear error.

4. **Still Robust**
   - `set -euo pipefail` makes the script exit non-zero when any pipeline stage fails; batches inserted before the failure are not rolled back.
   - `try/catch` in the MongoDB JS aborts on bad JSON instead of inserting it.
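
What `pipefail` contributes can be seen in isolation. This is a minimal sketch, assuming bash: `pipeline_status` is a hypothetical helper, and `false | cat` stands in for a pipeline whose early stage (such as `curl`) fails while the last stage succeeds.

```shell
#!/bin/bash
# Hypothetical helper: print the exit status of `false | cat` with
# pipefail switched on or off. `false` plays the failing early stage.
pipeline_status() {
  (
    if [ "$1" = "on" ]; then set -o pipefail; else set +o pipefail; fi
    status=0
    false | cat || status=$?
    echo "$status"
  )
}

pipeline_status off   # prints 0: only the last stage's status counts
pipeline_status on    # prints 1: the failing stage's status is propagated
```

Note that the last stage still runs on whatever input it received, which is why a mid-stream failure can leave earlier batches in MongoDB.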

---

### **When It Will Break (And That’s Okay)**
The script assumes:

✅ **URL Format Never Changes**
- If `GrantsDBExtract{date}v2.zip` becomes `v3`, it fails.

✅ **XML Structure Is Stable**
- If `.Opportunities.Opportunity[]` changes, `xq` crashes.

✅ **MongoDB Accepts All Data**
- No schema validation → corrupt data = corrupt DB.
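
To make the first assumption concrete, the sketch below isolates the URL template used by the script. `build_extract_url` is a hypothetical helper that takes the date as an argument, so the result does not depend on the day it is run.

```shell
#!/bin/bash
# Hypothetical helper: same URL template as the script, with the date
# (YYYYMMDD) passed in instead of computed from `date`.
build_extract_url() {
  local yyyymmdd="$1"
  echo "https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract${yyyymmdd}v2.zip"
}

build_extract_url 20240115
# prints https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract20240115v2.zip
```

If the `v2` suffix or the date format changes upstream, this template points at a file that does not exist, and `curl -f` exits non-zero on the HTTP error response.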

---

### **Tradeoffs Summary**
| **Approach**     | **Lines of Code** | **Fails Fast?** | **Debugging Ease** |
|------------------|-------------------|-----------------|--------------------|
| Your Original    | ~80               | ✅              | ✅ (Clear logs)    |
| **This Version** | **~30**           | ✅              | ❌ (Raw errors)    |

---

### **Final Answer**
If you **100% control the environment** and prefer **"fail loudly" over "validate everything"**, this is all you need.

**Keep the original if:**
- Others might run this in unknown environments.
- You want pretty logs for debugging.

**Use this if:**
- You’re the only user, and crashes are acceptable.
- Simplicity > resilience.

Either way, your design is solid. 🚀