Project overview
What this app does, how it works, and who it’s for.
What is this?
DataShield MCP Dataset Library is a small web app + worker that builds (and maintains) a Postgres-backed catalog of public datasets.
It’s designed to be the “lifeblood” of an agent workflow:
- Crawl public portals
- Extract stable links + structured metadata
- Store that metadata in human-accessible relational tables
- Provide APIs (and an MCP server) so agents can search and consume datasets later
Key outcomes
- A growing library of public datasets with:
  - landing URL + canonical URL
  - title / description / tags
  - publisher / license (when available)
  - resources (download/API links)
  - provenance (when discovered, last checked, last changed)
  - analytics payloads you import (raw JSON + extracted summary)
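The fields above might map to a record type along these lines. This is a minimal sketch; the names are illustrative, not the actual `registry` columns:

```typescript
// Illustrative shape of one catalog entry; field names are assumptions,
// not the real registry.* schema.
interface DatasetResource {
  url: string;     // download or API link
  format?: string; // e.g. "CSV", "JSON"
}

interface DatasetRecord {
  landingUrl: string;
  canonicalUrl: string;
  title: string;
  description: string;
  tags: string[];
  publisher?: string; // when available
  license?: string;   // when available
  resources: DatasetResource[];
  // Provenance
  discoveredAt: Date;
  lastCheckedAt: Date;
  lastChangedAt: Date;
  // Imported analytics profile
  analyticsRaw?: unknown;    // raw JSON payload
  analyticsSummary?: string; // extracted summary
}

const example: DatasetRecord = {
  landingUrl: "https://data.example.gov/d/abcd-1234",
  canonicalUrl: "https://data.example.gov/dataset/water-quality",
  title: "Water Quality Samples",
  description: "Monthly water quality measurements.",
  tags: ["water", "environment"],
  resources: [
    { url: "https://data.example.gov/d/abcd-1234/rows.csv", format: "CSV" },
  ],
  discoveredAt: new Date(),
  lastCheckedAt: new Date(),
  lastChangedAt: new Date(),
};
```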
Architecture
- Next.js web app
  - Public pages: `/`, `/datasets`, `/datasets/:id`, `/help`
  - Management UI: `/app/*`
- Postgres
  - `registry.*` schema = datasets, sources, crawls, logs, settings
- Worker (`npm run worker`)
  - Uses `pg-boss` to run scheduled and manual crawl jobs
  - Applies provider-specific discovery and ingestion
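Inside the worker, `pg-boss` invokes a handler per crawl job. Independent of the queue, the handler's core loop can be sketched like this; the function and field names are assumptions, and the provider hooks are hypothetical stand-ins for the real discovery/ingestion code:

```typescript
// Sketch of a crawl handler's core loop; all names are illustrative.
interface CrawlJob {
  projectId: string;
  sourceUrl: string;
}

interface CrawlRun {
  created: number;
  updated: number;
  errors: string[];
}

// Hypothetical provider hook: returns dataset landing URLs found at a source.
type DiscoverFn = (sourceUrl: string) => Promise<string[]>;
// Hypothetical upsert hook: writes metadata, reports what happened.
type UpsertFn = (datasetUrl: string) => Promise<"created" | "updated">;

async function runCrawl(
  job: CrawlJob,
  discover: DiscoverFn,
  upsert: UpsertFn
): Promise<CrawlRun> {
  const run: CrawlRun = { created: 0, updated: 0, errors: [] };
  let urls: string[] = [];
  try {
    urls = await discover(job.sourceUrl);
  } catch (err) {
    run.errors.push(`discovery failed: ${String(err)}`);
    return run;
  }
  for (const url of urls) {
    try {
      const outcome = await upsert(url);
      if (outcome === "created") run.created++;
      else run.updated++;
    } catch (err) {
      // One bad dataset should not abort the whole run.
      run.errors.push(`${url}: ${String(err)}`);
    }
  }
  return run;
}
```

Recording counts and errors per run, rather than failing fast, matches the "runs capture what happened" model described below.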
Mental model
- Crawler projects (provider + schedule + rate limits)
  - Example: “Socrata - Test (NY)”
- Sources (URLs or file paths) attached to a project
  - Can be a portal root, a dataset landing page, a DCAT `data.json`, a sitemap, or a `file://` path
- Runs capture what happened during a crawl
  - Created/updated datasets, errors, run items
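One way to picture the project → sources → runs hierarchy as types (a sketch with illustrative names, not the stored schema):

```typescript
// Illustrative nesting of the mental model: a project owns sources,
// and each source accumulates run summaries.
interface RunSummary {
  startedAt: Date;
  createdDatasets: number;
  updatedDatasets: number;
  errorCount: number;
}

interface Source {
  // Portal root, dataset landing page, DCAT data.json, sitemap, or file:// path
  location: string;
  runs: RunSummary[];
}

interface CrawlerProject {
  name: string;               // e.g. "Socrata - Test (NY)"
  provider: string;           // e.g. "socrata"
  schedule?: string;          // set only once rate limits are tuned
  rateLimitPerMinute: number; // assumed rate-limit unit
  sources: Source[];
}

const project: CrawlerProject = {
  name: "Socrata - Test (NY)",
  provider: "socrata",
  rateLimitPerMinute: 30,
  sources: [{ location: "https://data.ny.gov/data.json", runs: [] }],
};
```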
Primary workflows
- Test quickly using seeded projects and sources
- Enable schedules only after rate limits are tuned
- Add new sources continuously
- Import analytics profiles from your primary app
- Let agents search + pick datasets later
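The last workflow, agents searching and picking datasets, goes through the APIs/MCP server in practice; as a minimal in-memory sketch of the matching step (names here are illustrative, and the real service would query Postgres):

```typescript
interface CatalogEntry {
  title: string;
  tags: string[];
}

// Naive title-substring / exact-tag match; a stand-in for the real
// Postgres-backed search behind the API and MCP server.
function searchCatalog(entries: CatalogEntry[], query: string): CatalogEntry[] {
  const q = query.toLowerCase();
  return entries.filter(
    (e) =>
      e.title.toLowerCase().includes(q) ||
      e.tags.some((t) => t.toLowerCase() === q)
  );
}

const catalog: CatalogEntry[] = [
  { title: "Water Quality Samples", tags: ["water", "environment"] },
  { title: "Street Tree Census", tags: ["trees", "environment"] },
];

// searchCatalog(catalog, "water") matches only the first entry.
```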