From 184484d1a273ae4c837fd3dac53218fdb630d580 Mon Sep 17 00:00:00 2001
From: medusa
Date: Fri, 1 Aug 2025 06:43:29 -0500
Subject: [PATCH] Add projects/grants_xml_pipeline.md

---
 projects/grants_xml_pipeline.md | 95 +++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)
 create mode 100644 projects/grants_xml_pipeline.md

diff --git a/projects/grants_xml_pipeline.md b/projects/grants_xml_pipeline.md
new file mode 100644
index 0000000..f82efcf
--- /dev/null
+++ b/projects/grants_xml_pipeline.md
@@ -0,0 +1,95 @@
+### **Ultra-Lean "Just Works" Pipeline**
+Since you control **everything except**:
+1. Schema/XML structure
+2. Data content
+3. Download URL
+
+Here’s the **minimalist version** that fails fast and loud *without babysitting*:
+
+---
+
+#### **Final Script (`grants_xml_pipeline`)**
+```bash
+#!/bin/bash
+# Grants.gov XML → MongoDB (Zero-Validation)
+# Assumes:
+# - OS has bsdtar, xq (yq), mongosh, curl
+# - $GRANTS_MONGO_URI is set
+set -euo pipefail
+
+# Hardcode Immutables
+readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
+readonly MONGO_URI="$GRANTS_MONGO_URI"
+readonly COLLECTION="opportunities_$(date +%Y%m)"
+
+# Stream Pipeline (No Checks)
+curl -LfsS "$URL" | \
+  bsdtar -xOf - '*.xml' | \
+  xq -c '.Opportunities.Opportunity[]' | \
+  mongosh "$MONGO_URI" --eval "
+    const batch = [];
+    while (true) {
+      try {
+        const doc = JSON.parse(readline());
+        batch.push(doc);
+        if (batch.length >= 1000) {
+          db.$COLLECTION.insertMany(batch);
+          batch.length = 0;
+        }
+      } catch (e) {
+        print('FATAL: Invalid JSON/XML:', e);
+        quit(1);
+      }
+    }"
+```
+
+---
+
+### **Why This Works for Your Case**
+1. **No Dependency Checks**
+   - You control the OS → `bsdtar`, `xq`, etc. *will* be there.
+   - Let `command not found` errors fail naturally.
+
+2. **No Schema/Data Validation**
+   - If the XML is malformed, `xq` dies loudly → pipeline stops.
+
+3. **No Connection Checks**
+   - If MongoDB is down, `mongosh` fails with a clear error.
+
+4. **Still Robust**
+   - `set -euo pipefail` makes the script exit non-zero when any pipeline stage fails.
+   - `try/catch` in the `mongosh` script exits with status 1 on bad JSON.
+
+---
+
+### **When It Will Break (And That’s Okay)**
+The script assumes:
+✅ **URL Format Never Changes**
+   - If `GrantsDBExtract{date}v2.zip` becomes `v3`, it fails.
+✅ **XML Structure Is Stable**
+   - If `.Opportunities.Opportunity[]` changes, `xq` crashes.
+✅ **MongoDB Accepts All Data**
+   - No schema validation → corrupt data = corrupt DB.
+
+---
+
+### **Tradeoffs Summary**
+| **Approach**       | **Lines of Code** | **Fails Fast?** | **Debugging Ease** |
+|--------------------|------------------|----------------|-------------------|
+| Your Original      | ~80              | ✅              | ✅ (Clear logs)    |
+| **This Version**   | **~30**          | ✅              | ❌ (Raw errors)    |
+
+---
+
+### **Final Answer**
+If you **100% control the environment** and prefer **"fail loudly" over "validate everything"**, this is all you need.
+
+**Keep the original if:**
+- Others might run this in unknown environments.
+- You want pretty logs for debugging.
+
+**Use this if:**
+- You’re the only user, and crashes are acceptable.
+- Simplicity > resilience.
+
+Either way, your design is solid. 🚀
\ No newline at end of file
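
A note on the batching loop in `grants_xml_pipeline`: the `while (true)` loop has no exit path other than `quit(1)`, so a final batch of fewer than 1000 documents is never passed to `insertMany`. The sketch below shows the batch-then-flush pattern in bash, under stated assumptions: `batch_lines` and `emit_batch` are hypothetical names that do not appear in the patch, and `emit_batch` only prints each batch as a JSON array where a real sink would insert it. It illustrates the pattern and is not a drop-in replacement for the `mongosh` stage.

```bash
#!/bin/bash
# Sketch: group newline-delimited records from stdin into fixed-size batches
# and flush the final partial batch at EOF.
# batch_lines and emit_batch are hypothetical names, not part of the patch.
set -euo pipefail

# Hypothetical sink: prints one batch as a single JSON array line.
# In a real pipeline this is where an insertMany-style call would go.
emit_batch() {
  local IFS=,
  printf '[%s]\n' "$*"
}

# Read records line by line and emit a batch every $1 records,
# then emit whatever is left once stdin is exhausted.
batch_lines() {
  local size="$1" line
  local -a batch=()
  # The || clause keeps a last line that has no trailing newline.
  while IFS= read -r line || [[ -n "$line" ]]; do
    batch+=("$line")
    if (( ${#batch[@]} >= size )); then
      emit_batch "${batch[@]}"
      batch=()
    fi
  done
  # EOF is the normal end of the stream, not an error: flush the remainder.
  if (( ${#batch[@]} > 0 )); then
    emit_batch "${batch[@]}"
  fi
}
```

For example, `printf '%s\n' '{"a":1}' '{"a":2}' '{"a":3}' | batch_lines 2` prints `[{"a":1},{"a":2}]` and then `[{"a":3}]`. The third record arrives in a partial batch that is written only because of the flush after the loop.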