DataShield MCP Dataset Library

Project overview

What this app does, how it works, and who it’s for.

What is this?

DataShield MCP Dataset Library is a small web app + worker that builds (and maintains) a Postgres-backed catalog of public datasets.

It’s designed to be the lifeblood of an agent workflow:

  • Crawl public portals
  • Extract stable links + structured metadata
  • Store that metadata in human-accessible relational tables
  • Provide APIs (and an MCP server) so agents can search and consume datasets later

Key outcomes

  • A growing library of public datasets with:
    • landing URL + canonical URL
    • title / description / tags
    • publisher / license (when available)
    • resources (download/API links)
    • provenance (when discovered, last checked, last changed)
    • analytics payloads you import (raw JSON + extracted summary)
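The record shape implied by the list above could be sketched in TypeScript (field names are illustrative assumptions, not the actual registry.* column names):

```typescript
// Illustrative sketch of a catalog record; field names are assumptions,
// not the actual registry.* column names.
interface DatasetResource {
  url: string;                // download or API link
  format?: string;            // e.g. "CSV", "JSON"
}

interface DatasetRecord {
  landingUrl: string;
  canonicalUrl: string;
  title: string;
  description?: string;
  tags: string[];
  publisher?: string;         // when available
  license?: string;           // when available
  resources: DatasetResource[];
  discoveredAt: Date;         // provenance
  lastCheckedAt: Date;
  lastChangedAt?: Date;
  analytics?: {
    raw: unknown;             // imported raw JSON payload
    summary?: string;         // extracted summary
  };
}

// Minimal example record.
const example: DatasetRecord = {
  landingUrl: "https://data.example.gov/d/abcd-1234",
  canonicalUrl: "https://data.example.gov/d/abcd-1234",
  title: "Street Trees",
  tags: ["environment", "trees"],
  resources: [
    { url: "https://data.example.gov/api/views/abcd-1234/rows.csv", format: "CSV" },
  ],
  discoveredAt: new Date("2024-01-01"),
  lastCheckedAt: new Date("2024-06-01"),
};
```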

Architecture

  • Next.js web app
  • Public pages: /, /datasets, /datasets/:id, /help
    • Management UI: /app/*
  • Postgres
    • registry.* schema holds datasets, sources, crawls, logs, and settings
  • Worker (npm run worker)
    • Uses pg-boss to run scheduled and manual crawl jobs
    • Applies provider-specific discovery and ingestion
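A crawl job handler in the worker might look roughly like this. This is a sketch under assumptions: the queue name, payload shape, and `crawlSource` helper are all hypothetical, not the actual worker code.

```typescript
// Hypothetical payload for a crawl job; the real worker's shape may differ.
interface CrawlJobPayload {
  projectId: string;   // crawler project (provider + schedule + rate limits)
  sourceId: string;    // source attached to the project
  manual: boolean;     // scheduled vs. manually triggered run
}

interface CrawlResult {
  created: number;
  updated: number;
  errors: string[];
}

// Placeholder for provider-specific discovery and ingestion.
async function crawlSource(payload: CrawlJobPayload): Promise<CrawlResult> {
  // A real implementation would fetch the source, apply the
  // provider-specific extractor, and upsert into registry.* tables.
  return { created: 0, updated: 0, errors: [] };
}

// pg-boss wiring (not executed here; shown for shape only, and the
// handler signature varies between pg-boss versions):
//
//   const boss = new PgBoss(process.env.DATABASE_URL!);
//   await boss.start();
//   await boss.work("crawl", async (job) => crawlSource(job.data));
//   await boss.schedule("crawl", "0 * * * *", payload);
```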

Mental model

  • Crawler projects (provider + schedule + rate limits)
    • Example: “Socrata - Test (NY)”
  • Sources (URLs or file paths) attached to a project
    • Can be a portal root, a dataset landing page, a DCAT data.json, a sitemap, or a file:// path
  • Runs capture what happened during a crawl
    • Created/updated datasets, errors, run items
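The three concepts above can be expressed as a small type sketch (the names and fields are illustrative assumptions, not the actual schema):

```typescript
// Illustrative types for the crawler data model; names are assumptions.
type Provider = "socrata" | "dcat" | "sitemap" | "file";

interface CrawlerProject {
  id: string;
  name: string;                // e.g. "Socrata - Test (NY)"
  provider: Provider;
  schedule?: string;           // cron expression; unset = manual runs only
  maxRequestsPerMinute: number;
}

type SourceKind = "portal_root" | "landing_page" | "dcat_json" | "sitemap" | "file_path";

interface Source {
  id: string;
  projectId: string;           // attached to a project
  kind: SourceKind;
  location: string;            // URL or file:// path
}

interface Run {
  id: string;
  projectId: string;
  startedAt: Date;
  created: number;             // datasets created
  updated: number;             // datasets updated
  errors: string[];
}

// Example: a project with one source.
const project: CrawlerProject = {
  id: "proj-1",
  name: "Socrata - Test (NY)",
  provider: "socrata",
  maxRequestsPerMinute: 30,
};

const source: Source = {
  id: "src-1",
  projectId: project.id,
  kind: "portal_root",
  location: "https://data.ny.gov",
};
```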

Primary workflows

  • Test quickly using seeded projects and sources
  • Enable schedules only after rate limits are tuned
  • Add new sources continuously
  • Import analytics profiles from your primary app
  • Let agents search + pick datasets later
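As a sketch of that last step, agent-side search over the catalog can be approximated in memory like this (the matching rules and field names are assumptions; the real API and MCP tool will differ):

```typescript
// In-memory stand-in for the catalog search an agent would perform
// via the API or MCP server. Fields and matching rules are assumptions.
interface CatalogEntry {
  title: string;
  description: string;
  tags: string[];
  landingUrl: string;
}

// Case-insensitive match against title, description, or an exact tag.
function searchCatalog(entries: CatalogEntry[], query: string): CatalogEntry[] {
  const q = query.toLowerCase();
  return entries.filter(
    (e) =>
      e.title.toLowerCase().includes(q) ||
      e.description.toLowerCase().includes(q) ||
      e.tags.some((t) => t.toLowerCase() === q)
  );
}

const catalog: CatalogEntry[] = [
  {
    title: "Street Trees",
    description: "Tree census",
    tags: ["environment"],
    landingUrl: "https://data.example.gov/d/a1",
  },
  {
    title: "Taxi Trips",
    description: "Trip records",
    tags: ["transport"],
    landingUrl: "https://data.example.gov/d/b2",
  },
];

// searchCatalog(catalog, "trees") matches "Street Trees" by title.
```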