6.7 KiB
Grants.gov XML Pipeline: Robust Daily Ingest with BI Metadata
(Balancing "just works" with observability)
Key Requirements
- Daily automated import (stream XML → MongoDB)
- Basic health checks (fail fast + Discord alerts)
- Embedded business intelligence (daily diffs, stats)
- Zero-tempfile streaming (handle 76MB files efficiently)
Final Script (grants_daily_ingest.sh)
#!/bin/bash
set -euo pipefail
shopt -s lastpipe
# --- Config ---
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." # For alerts
MONGO_URI="$GRANTS_MONGO_URI" # From env
BASE_URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts"
TODAY=$(date +%Y%m%d)
COLLECTION="grants_$(date +%Y%m)"
# --- Discord Alert Function ---
notify_discord() {
local color=$1 message=$2
curl -sfS -X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d "{\"embeds\":[{\"color\":$color,\"description\":\"$message\"}]}" || true
}
# --- Pipeline ---
{
# Download and stream XML
curl -LfsS "${BASE_URL}/GrantsDBExtract${TODAY}v2.zip" | \
bsdtar -xOf - '*.xml' | \
# Transform with embedded BI metadata
xq -c --arg today "$TODAY" '
.Opportunities.Opportunity[] |
._bi_metadata = {
ingest_date: $today,
daily_stats: {
funding_types: (
group_by(.FundingInstrumentType) |
map({type: .[0].FundingInstrumentType, count: length})
),
categories: (
group_by(.OpportunityCategory) |
map({category: .[0].OpportunityCategory, count: length})
)
}
}
' | \
# Batch import to MongoDB
mongosh "$MONGO_URI" --eval "
const BATCH_SIZE = 1000;
let batch = [];
while (true) {
const doc = JSON.parse(readline());
batch.push(doc);
if (batch.length >= BATCH_SIZE) {
db.$COLLECTION.insertMany(batch);
batch = [];
}
}
"
} || {
# On failure: Discord alert + exit
notify_discord 16711680 "🚨 Grants ingest failed for $TODAY! $(date)"
exit 1
}
# Success alert with stats
DOC_COUNT=$(mongosh "$MONGO_URI" --quiet --eval "db.$COLLECTION.countDocuments({'_bi_metadata.ingest_date': '$TODAY'})")
notify_discord 65280 "✅ Success! Ingested $DOC_COUNT grants for $TODAY"
Key Features
-
Streaming Architecture
curl → bsdtar → xq → mongoshin one pipe (no temp files)- Handles 76MB files with constant memory
-
Business Intelligence
- Embeds daily stats in each doc:
"_bi_metadata": { "ingest_date": "20250801", "daily_stats": { "funding_types": [{"type": "G", "count": 142}, ...], "categories": [{"category": "ACA", "count": 56}, ...] } }
- Embeds daily stats in each doc:
-
Discord Alerts
- Color-coded messages:
- 🔴 Red on failure (with timestamp)
- 🟢 Green on success (with doc count)
- Color-coded messages:
-
Validation via Failure
- No explicit checks → let
curl/xq/mongoshfail naturally set -euo pipefailensures any error stops the script
- No explicit checks → let
-
MongoDB Optimization
- Batched inserts (1000 docs/transaction)
- Collection per month (
grants_202508)
Deployment
-
Cron Job (runs daily at 5 AM):
0 5 * * * GRANTS_MONGO_URI="mongodb://..." /path/to/grants_daily_ingest.sh -
Sample Discord Alert
![Discord message showing success: "✅ Ingested 1,429 grants for 20250801"]
Tradeoffs
- No Schema Validation: Relies on Grants.gov maintaining consistent XML
- No Dependency Checks: Assumes OS has
bsdtar,xq, etc. - No Retry Logic: Fails fast on first error
For your fully controlled environment, this balances simplicity with observability. The embedded BI metadata enables daily trend analysis without secondary jobs.
Want me to tweak any aspect (e.g., add diff tracking vs. yesterday)?
Ultra-Lean "Just Works" Pipeline
Since you control everything except:
- Schema/XML structure
- Data content
- Download URL
Here’s the minimalist version that fails fast and loud without babysitting:
Final Script (grants_xml_pipeline)
#!/bin/bash
# Grants.gov XML → MongoDB (Zero-Validation)
# Assumes:
# - OS has bsdtar, xq (yq), mongosh, curl
# - $GRANTS_MONGO_URI is set
set -euo pipefail
# Hardcode Immutables
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
readonly MONGO_URI="$GRANTS_MONGO_URI"
readonly COLLECTION="opportunities_$(date +%Y%m)"
# Stream Pipeline (No Checks)
curl -LfsS "$URL" | \
bsdtar -xOf - '*.xml' | \
xq -c '.Opportunities.Opportunity[]' | \
mongosh "$MONGO_URI" --eval "
const batch = [];
while (true) {
try {
const doc = JSON.parse(readline());
batch.push(doc);
if (batch.length >= 1000) {
db.$COLLECTION.insertMany(batch);
batch.length = 0;
}
} catch (e) {
print('FATAL: Invalid JSON/XML:', e);
quit(1);
}
}"
Why This Works for Your Case
-
No Dependency Checks
- You control the OS →
bsdtar,xq, etc. will be there. - Let
command not founderrors fail naturally.
- You control the OS →
-
No Schema/Data Validation
- If the XML is malformed,
xqdies loudly → pipeline stops.
- If the XML is malformed,
-
No Connection Checks
- If MongoDB is down,
mongoshfails with a clear error.
- If MongoDB is down,
-
Still Robust
set -euo pipefailcatches all errors.try/catchin MongoDB JS handles bad JSON.
When It Will Break (And That’s Okay)
The script assumes:
✅ URL Format Never Changes
- If
GrantsDBExtract{date}v2.zipbecomesv3, it fails.
✅ XML Structure Is Stable - If
.Opportunities.Opportunity[]changes,xqcrashes.
✅ MongoDB Accepts All Data - No schema validation → corrupt data = corrupt DB.
Tradeoffs Summary
| Approach | Lines of Code | Fails Fast? | Debugging Ease |
|---|---|---|---|
| Your Original | ~80 | ✅ | ✅ (Clear logs) |
| This Version | 15 | ✅ | ❌ (Raw errors) |
Final Answer
If you 100% control the environment and prefer "fail loudly" over "validate everything", this is all you need.
Keep the original if:
- Others might run this in unknown environments.
- You want pretty logs for debugging.
Use this if:
- You’re the only user, and crashes are acceptable.
- Simplicity > resilience.
Either way, your design is solid. 🚀