232 lines
6.7 KiB
Markdown
232 lines
6.7 KiB
Markdown
### **Grants.gov XML Pipeline: Robust Daily Ingest with BI Metadata**
|
||
*(Balancing "just works" with observability)*
|
||
|
||
#### **Key Requirements**
|
||
1. **Daily automated import** (stream XML → MongoDB)
|
||
2. **Basic health checks** (fail fast + Discord alerts)
|
||
3. **Embedded business intelligence** (daily diffs, stats)
|
||
4. **Zero-tempfile streaming** (handle 76MB files efficiently)
|
||
|
||
---
|
||
|
||
### **Final Script** (`grants_daily_ingest.sh`)
|
||
```bash
|
||
#!/bin/bash
|
||
set -euo pipefail
|
||
shopt -s lastpipe
|
||
|
||
# --- Config ---
|
||
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." # For alerts
|
||
MONGO_URI="$GRANTS_MONGO_URI" # From env
|
||
BASE_URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts"
|
||
TODAY=$(date +%Y%m%d)
|
||
COLLECTION="grants_$(date +%Y%m)"
|
||
|
||
# --- Discord Alert Function ---
|
||
notify_discord() {
|
||
local color=$1 message=$2
|
||
curl -sfS -X POST "$DISCORD_WEBHOOK" \
|
||
-H "Content-Type: application/json" \
|
||
-d "{\"embeds\":[{\"color\":$color,\"description\":\"$message\"}]}" || true
|
||
}
|
||
|
||
# --- Pipeline ---
|
||
{
|
||
# Download and stream XML
|
||
curl -LfsS "${BASE_URL}/GrantsDBExtract${TODAY}v2.zip" | \
|
||
bsdtar -xOf - '*.xml' | \
|
||
|
||
# Transform with embedded BI metadata
|
||
xq -c --arg today "$TODAY" '
|
||
.Opportunities.Opportunity[] |
|
||
._bi_metadata = {
|
||
ingest_date: $today,
|
||
daily_stats: {
|
||
funding_types: (
|
||
group_by(.FundingInstrumentType) |
|
||
map({type: .[0].FundingInstrumentType, count: length})
|
||
),
|
||
categories: (
|
||
group_by(.OpportunityCategory) |
|
||
map({category: .[0].OpportunityCategory, count: length})
|
||
)
|
||
}
|
||
}
|
||
' | \
|
||
|
||
# Batch import to MongoDB
|
||
mongosh "$MONGO_URI" --eval "
|
||
const BATCH_SIZE = 1000;
|
||
let batch = [];
|
||
while (true) {
|
||
const doc = JSON.parse(readline());
|
||
batch.push(doc);
|
||
if (batch.length >= BATCH_SIZE) {
|
||
db.$COLLECTION.insertMany(batch);
|
||
batch = [];
|
||
}
|
||
}
|
||
"
|
||
|
||
} || {
|
||
# On failure: Discord alert + exit
|
||
notify_discord 16711680 "🚨 Grants ingest failed for $TODAY! $(date)"
|
||
exit 1
|
||
}
|
||
|
||
# Success alert with stats
|
||
DOC_COUNT=$(mongosh "$MONGO_URI" --quiet --eval "db.$COLLECTION.countDocuments({'_bi_metadata.ingest_date': '$TODAY'})")
|
||
notify_discord 65280 "✅ Success! Ingested $DOC_COUNT grants for $TODAY"
|
||
```
|
||
|
||
---
|
||
|
||
### **Key Features**
|
||
1. **Streaming Architecture**
|
||
- `curl → bsdtar → xq → mongosh` in one pipe (no temp files)
|
||
- Handles 76MB files with constant memory
|
||
|
||
2. **Business Intelligence**
|
||
- Embeds daily stats in each doc:
|
||
```json
|
||
"_bi_metadata": {
|
||
"ingest_date": "20250801",
|
||
"daily_stats": {
|
||
"funding_types": [{"type": "G", "count": 142}, ...],
|
||
"categories": [{"category": "ACA", "count": 56}, ...]
|
||
}
|
||
}
|
||
```
|
||
|
||
3. **Discord Alerts**
|
||
- Color-coded messages:
|
||
- 🔴 **Red** on failure (with timestamp)
|
||
- 🟢 **Green** on success (with doc count)
|
||
|
||
4. **Validation via Failure**
|
||
- No explicit checks → let `curl`/`xq`/`mongosh` fail naturally
|
||
- `set -euo pipefail` ensures any error stops the script
|
||
|
||
5. **MongoDB Optimization**
|
||
- Batched inserts (1000 docs/transaction)
|
||
- Collection per month (`grants_202508`)
|
||
|
||
---
|
||
|
||
### **Deployment**
|
||
1. **Cron Job** (runs daily at 5 AM):
|
||
```bash
|
||
0 5 * * * GRANTS_MONGO_URI="mongodb://..." /path/to/grants_daily_ingest.sh
|
||
```
|
||
|
||
2. **Sample Discord Alert**
|
||
![Discord message showing success: "✅ Ingested 1,429 grants for 20250801"]
|
||
|
||
---
|
||
|
||
### **Tradeoffs**
|
||
- **No Schema Validation**: Relies on Grants.gov maintaining consistent XML
|
||
- **No Dependency Checks**: Assumes OS has `bsdtar`, `xq`, etc.
|
||
- **No Retry Logic**: Fails fast on first error
|
||
|
||
For your fully controlled environment, this balances **simplicity** with **observability**. The embedded BI metadata enables daily trend analysis without secondary jobs.
|
||
|
||
Want me to tweak any aspect (e.g., add diff tracking vs. yesterday)?
|
||
|
||
---
|
||
|
||
### **Ultra-Lean "Just Works" Pipeline**
|
||
Since you control **everything except**:
|
||
1. Schema/XML structure
|
||
2. Data content
|
||
3. Download URL
|
||
|
||
Here’s the **minimalist version** that fails fast and loud *without babysitting*:
|
||
|
||
---
|
||
|
||
#### **Final Script (`grants_xml_pipeline`)**
|
||
```bash
|
||
#!/bin/bash
|
||
# Grants.gov XML → MongoDB (Zero-Validation)
|
||
# Assumes:
|
||
# - OS has bsdtar, xq (yq), mongosh, curl
|
||
# - $GRANTS_MONGO_URI is set
|
||
set -euo pipefail
|
||
|
||
# Hardcode Immutables
|
||
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
|
||
readonly MONGO_URI="$GRANTS_MONGO_URI"
|
||
readonly COLLECTION="opportunities_$(date +%Y%m)"
|
||
|
||
# Stream Pipeline (No Checks)
|
||
curl -LfsS "$URL" | \
|
||
bsdtar -xOf - '*.xml' | \
|
||
xq -c '.Opportunities.Opportunity[]' | \
|
||
mongosh "$MONGO_URI" --eval "
|
||
const batch = [];
|
||
while (true) {
|
||
try {
|
||
const doc = JSON.parse(readline());
|
||
batch.push(doc);
|
||
if (batch.length >= 1000) {
|
||
db.$COLLECTION.insertMany(batch);
|
||
batch.length = 0;
|
||
}
|
||
} catch (e) {
|
||
print('FATAL: Invalid JSON/XML:', e);
|
||
quit(1);
|
||
}
|
||
}"
|
||
```
|
||
|
||
---
|
||
|
||
### **Why This Works for Your Case**
|
||
1. **No Dependency Checks**
|
||
- You control the OS → `bsdtar`, `xq`, etc. *will* be there.
|
||
- Let `command not found` errors fail naturally.
|
||
|
||
2. **No Schema/Data Validation**
|
||
- If the XML is malformed, `xq` dies loudly → pipeline stops.
|
||
|
||
3. **No Connection Checks**
|
||
- If MongoDB is down, `mongosh` fails with a clear error.
|
||
|
||
4. **Still Robust**
|
||
- `set -euo pipefail` catches all errors.
|
||
- `try/catch` in MongoDB JS handles bad JSON.
|
||
|
||
---
|
||
|
||
### **When It Will Break (And That’s Okay)**
|
||
The script assumes:
|
||
✅ **URL Format Never Changes**
|
||
- If `GrantsDBExtract{date}v2.zip` becomes `v3`, it fails.
|
||
✅ **XML Structure Is Stable**
|
||
- If `.Opportunities.Opportunity[]` changes, `xq` crashes.
|
||
✅ **MongoDB Accepts All Data**
|
||
- No schema validation → corrupt data = corrupt DB.
|
||
|
||
---
|
||
|
||
### **Tradeoffs Summary**
|
||
| **Approach** | **Lines of Code** | **Fails Fast?** | **Debugging Ease** |
|
||
|--------------------|------------------|----------------|-------------------|
|
||
| Your Original | ~80 | ✅ | ✅ (Clear logs) |
|
||
| **This Version** | **15** | ✅ | ❌ (Raw errors) |
|
||
|
||
---
|
||
|
||
### **Final Answer**
|
||
If you **100% control the environment** and prefer **"fail loudly" over "validate everything"**, this is all you need.
|
||
|
||
**Keep the original if:**
|
||
- Others might run this in unknown environments.
|
||
- You want pretty logs for debugging.
|
||
|
||
**Use this if:**
|
||
- You’re the only user, and crashes are acceptable.
|
||
- Simplicity > resilience.
|
||
|
||
Either way, your design is solid. 🚀 |