Ultra-Lean "Just Works" Pipeline

Since you control everything except:

  1. Schema/XML structure
  2. Data content
  3. Download URL

Here's the minimalist version that fails fast and loud without babysitting:


Final Script (grants_xml_pipeline)

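#!/bin/bash
# Grants.gov XML → MongoDB (zero validation)
# Assumes:
# - bsdtar, xq (from the yq package), mongosh, and curl are on PATH
# - $GRANTS_MONGO_URI is set
set -euo pipefail

# Hardcoded immutables
readonly URL="https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract$(date +%Y%m%d)v2.zip"
readonly MONGO_URI="$GRANTS_MONGO_URI"
readonly COLLECTION="opportunities_$(date +%Y%m)"

# Pipeline (no checks): download → unzip → one JSON document per line → insert.
# $COLLECTION is expanded by bash before mongosh sees the script text.
curl -LfsS "$URL" |
  bsdtar -xOf - '*.xml' |
  xq -c '.Opportunities.Opportunity[]' |
  mongosh "$MONGO_URI" --quiet --eval "
    // Assumption: require() is available in mongosh and /dev/stdin is readable.
    // This reads all of stdin at once, so the insert stage is not incremental.
    try {
      const lines = require('fs').readFileSync('/dev/stdin', 'utf8').split('\n');
      let batch = [];
      for (const line of lines) {
        if (line === '') continue;          // skip the trailing empty line
        batch.push(JSON.parse(line));       // throws on invalid JSON
        if (batch.length >= 1000) {
          db.$COLLECTION.insertMany(batch);
          batch = [];
        }
      }
      // Flush the final partial batch so the last documents are not dropped.
      if (batch.length > 0) db.$COLLECTION.insertMany(batch);
    } catch (e) {
      print('FATAL: invalid JSON or failed insert: ' + e);
      quit(1);
    }"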
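
To make the two date-derived names concrete, here is a small sketch of what the URL and COLLECTION expressions resolve to for a given day. The helpers extract_url and collection_name are hypothetical and exist only for this illustration (the script inlines the expressions), and the date 20240115 is an arbitrary example.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: builds the extract URL for a day given as YYYYMMDD.
extract_url() {
  local day="$1"
  echo "https://prod-grants-gov-chatbot.s3.amazonaws.com/extracts/GrantsDBExtract${day}v2.zip"
}

# Hypothetical helper: builds the monthly collection name from the same day.
collection_name() {
  local day="$1"
  echo "opportunities_${day:0:6}"   # keep only YYYYMM
}

extract_url 20240115
collection_name 20240115
```

Every run within the same month writes to the same collection, while each day requests a different zip file.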

Why This Works for Your Case

  1. No Dependency Checks

    • You control the OS → bsdtar, xq, etc. will be there.
    • Let "command not found" errors surface on their own.
  2. No Schema/Data Validation

    • If the XML is malformed, xq dies loudly → pipeline stops.
  3. No Connection Checks

    • If MongoDB is down, mongosh fails with a clear error.
  4. Still Robust

    • set -euo pipefail aborts on a failed command, an unset variable, or a failing stage anywhere in the pipeline.
    • try/catch in the mongosh script turns bad JSON into a non-zero exit.
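
The effect of pipefail on a pipeline like the one above can be seen in a small, self-contained demo. This is only a sketch: pipeline_status is a hypothetical helper name, and `false | cat` stands in for a failing download followed by a stage that succeeds.

```shell
#!/bin/bash

# Hypothetical helper: prints the exit status of `false | cat`
# with pipefail switched on ("pipefail") or off (anything else).
pipeline_status() {
  local mode="$1"
  local status=0
  (
    if [ "$mode" = "pipefail" ]; then
      set -o pipefail
    else
      set +o pipefail
    fi
    false | cat   # first stage fails, last stage succeeds
  ) || status=$?
  echo "$status"
}

echo "without pipefail: $(pipeline_status default)"
echo "with pipefail:    $(pipeline_status pipefail)"
```

Without pipefail the pipeline reports the status of its last stage only, so a failed curl would go unnoticed as long as the final stage exits cleanly.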

When It Will Break (And That's Okay)

The script assumes:

  1. URL Format Never Changes

    • If GrantsDBExtract{date}v2.zip becomes v3, the download fails.
  2. XML Structure Is Stable

    • If .Opportunities.Opportunity[] changes, xq crashes.
  3. MongoDB Accepts All Data

    • No schema validation → corrupt data = corrupt DB.

Tradeoffs Summary

Approach       | Lines of Code | Fails Fast? | Debugging Ease
Your Original  | ~80           |             | Clear logs
This Version   | 15            |             | Raw errors

Final Answer

If you 100% control the environment and prefer "fail loudly" over "validate everything", this is all you need.

Keep the original if:

  • Others might run this in unknown environments.
  • You want pretty logs for debugging.

Use this if:

  • You're the only user, and crashes are acceptable.
  • Simplicity > resilience.

Either way, your design is solid. 🚀