Project overview

AI-native data governance and intelligence platform — what it is, what it does, and how it fits together.

What is DataShield Library?

DataShield Library is an AI-native data governance and intelligence platform. It catalogs and profiles tabular and document datasets, governs how they can be queried, resolves entities into golden records, and exposes the entire surface as Model Context Protocol (MCP) tools so any AI agent can discover, query, and reason about your data without it ever leaving your premises.

It is delivered as a single-instance deployment (web, MCP server, worker, and large-job worker, all backed by one PostgreSQL database) and uses a customer-supplied LLM endpoint (Ollama or any OpenAI-compatible AI Engine) for embeddings, classification, reranking, and change classification.
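
As an illustration of what "OpenAI-compatible" means in practice, the sketch below builds (but does not send) an embeddings request in Python. It assumes the endpoint exposes the conventional /v1/embeddings route; the base URL and model name are placeholders chosen for the example, not values taken from DataShield Library's configuration.

```python
import json
from urllib import request


def build_embedding_request(base_url: str, model: str, texts: list[str]) -> request.Request:
    """Build a POST for an OpenAI-compatible /v1/embeddings route (not sent here)."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumptions: a local Ollama instance on its default port and an
# illustrative embedding model name. Substitute your own endpoint and model.
req = build_embedding_request(
    "http://localhost:11434",
    "nomic-embed-text",
    ["quarterly revenue by region"],
)
# Sending it would be: request.urlopen(req). An OpenAI-compatible response
# carries one vector per input under data[i]["embedding"].
```

Because the request only depends on a base URL and a model name, pointing the platform at a different engine is a configuration change rather than a code change.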

This is not a SaaS catalog. It is the data spine for a self-hosted, agent-driven data platform.

Six service areas

1. Data Catalog & Discovery

Crawl public portals (DCAT, Socrata, CKAN, sitemaps) and private connections (S3, B2, GCS, Azure Blob, FTP, SFTP, IMAP, SMB, filesystem). Extract canonical URLs, titles, descriptions, tags, publishers, licenses, resources, provenance. Search via SQL, MCP dataset_search, or hybrid vector + lexical retrieval. Browse by Topics and Domains.

2. Data Quality & Analysis

Conduit's 20-section profiling pipeline computes column statistics, distributions, semantic PII detection, uniqueness, null ratios, type inference, sampling configs, format hints, and quality flags. Every loaded dataset gets a full analysis run; every changeset triggers a fresh analysis on the changed rows.
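
To give a feel for a few of these metrics, here is a minimal Python sketch that computes null ratio, uniqueness, and a naive inferred type for one column. It is not Conduit's implementation; the function names and the rule of treating empty strings as nulls are assumptions made for the example.

```python
from collections import Counter


def infer_type(values):
    """Return the narrowest of integer/float/text that fits every value."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "unknown"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "text"


def profile_column(raw):
    """Profile one column given as a list of strings (None or "" means null)."""
    non_null = [v for v in raw if v is not None and v != ""]
    total = len(raw)
    return {
        "null_ratio": (total - len(non_null)) / total if total else 0.0,
        "uniqueness": len(set(non_null)) / len(non_null) if non_null else 0.0,
        "inferred_type": infer_type(non_null),
        "top_values": Counter(non_null).most_common(3),
    }


profile = profile_column(["10", "20", "20", None, ""])
```

For this sample, two of five values are null and two of the three remaining values are distinct, so the sketch reports a null ratio of 0.4 and an integer column.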

3. Data Governance & Compliance

Per-dataset masking framework (HIPAA, GDPR, PCI, CCPA, custom). View-based masking via _masked_source, physical token-vault masking via _token_vault. Generalization strategies (k-anonymity, suppression, top/bottom coding) configurable per field. Compliance frameworks applied automatically based on detected PII types. Every changeset, chunk, and derivative is governed.
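
The generalization strategies named above can be sketched in a few lines of Python. This is an illustration of the techniques only: the helper names, the age and ZIP fields, and the bucket sizes are hypothetical, and the platform's own masking runs through _masked_source and _token_vault rather than code like this.

```python
from collections import Counter


def top_bottom_code(value, low, high):
    """Clamp a numeric value into [low, high] so extreme values stop being identifying."""
    return max(low, min(high, value))


def suppress(_value, mask="***"):
    """Replace a value entirely."""
    return mask


def generalize_zip(zip_code, keep=3):
    """Keep the leading digits of a postal code and blank out the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True when every combination of quasi-identifier values occurs in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Hypothetical sample rows.
rows = [
    {"age": 34, "zip": "94110", "name": "Ana"},
    {"age": 36, "zip": "94117", "name": "Ben"},
    {"age": 97, "zip": "94121", "name": "Cy"},
    {"age": 91, "zip": "94103", "name": "Di"},
]

masked = [
    {
        # Bucket ages into decades, then top-code everything at 90.
        "age": top_bottom_code(r["age"] // 10 * 10, 0, 90),
        "zip": generalize_zip(r["zip"]),
        "name": suppress(r["name"]),
    }
    for r in rows
]
```

In the raw rows every (age, zip) pair is unique, so the data is not 2-anonymous. After masking, the rows fall into two groups of two, which satisfies k = 2.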

4. Master Data Management (MDM)

Entity resolution across datasets via token, phonetic, and edit-distance scorers. Golden record creation and maintenance. Stewardship review queue. Whenever an underlying dataset's CDC log records changes, the MDM changeset consumer sets stale_at / reeval_requested_at on the affected golden entities, keeping the golden record graph fresh without full re-resolution.
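
The three scorer families can be illustrated with textbook stand-ins: Jaccard overlap for tokens, Soundex for phonetics, and Levenshtein distance for edits. The Python sketch below is an assumption-laden example, not the platform's resolver; in particular the choice of these specific algorithms and the blend weights are hypothetical.

```python
# Standard Soundex consonant groups.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                           "4": "L", "5": "MN", "6": "R"}.items()
    for letter in letters
}


def token_jaccard(a, b):
    """Token scorer: shared words divided by all words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def soundex(name):
    """Phonetic scorer input: four-character American Soundex code."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code, prev = letters[0], SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit-distance scorer normalised to [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def match_score(a, b, weights=(0.4, 0.3, 0.3)):
    """Blend the three scorers; the weights are hypothetical."""
    w_token, w_phonetic, w_edit = weights
    phonetic = 1.0 if soundex(a) == soundex(b) else 0.0
    return w_token * token_jaccard(a, b) + w_phonetic * phonetic + w_edit * edit_similarity(a, b)


score = match_score("Jon Smith", "John Smith")
```

The scorers disagree usefully here: "Jon Smith" and "John Smith" share only one of three tokens, but they sound alike and differ by a single edit, so the blended score stays high enough to be a plausible candidate for the stewardship queue.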

5. Document Intelligence (RAG)

Document collection ingest, chunking, embedding, and retrieval. Hybrid search combines pgvector cosine similarity with PostgreSQL BM25 full-text via Reciprocal Rank Fusion, with optional AI-Engine reranking. Documents flow through the same governance and analysis pipelines as tabular data — masking, PII detection, lineage, and CDC apply equally to chunks.
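
Reciprocal Rank Fusion itself is simple enough to show in full. The Python sketch below merges two ranked lists of chunk IDs; the constant k = 60 is the value commonly used in the literature and is an assumption here, as are the chunk IDs, since the text does not state how the platform is configured.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers, best hit first.
vector_hits = ["c7", "c2", "c9"]    # cosine-similarity ranking
lexical_hits = ["c2", "c4", "c7"]   # full-text ranking

fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
# fused == ["c2", "c7", "c4", "c9"]
```

RRF uses only ranks, never raw scores, which is why it can combine cosine similarities and full-text scores without normalising them onto a common scale. Chunks that both retrievers agree on (c2, c7) rise to the top.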

6. Change Data Capture & Differential Refresh ★

Every dataset can opt into automatic row-level differential refresh (the SPIKE-005 epic, completed in v0.19.5–v0.19.8). On every swap-table refresh, the platform:

1. Detects the primary key (5-tier cascade).
2. Checks for column renames via Jaro-Winkler matching.
3. Computes a row-level diff, auto-routed by size: PostgreSQL for up to 1M rows, DuckDB beyond that.
4. Stores the changeset as a governed derivative file with full lineage.
5. Appends to a per-dataset cdc_log (or writes a single MAJOR_REVISION summary row when more than 50% of the dataset changed).
6. Enqueues analysis on the changeset.
7. Re-chunks and re-embeds only the changed documents in the RAG index.
8. Flags affected MDM golden entities for stewardship review.
9. Runs the LLM change classifier to produce a human-readable narrative + risk assessment.

Every step is best-effort — failures log a warning and never block the refresh.
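
The core of this pipeline, a keyed row diff with a major-revision cutoff and best-effort follow-up steps, can be sketched in Python as follows. This is a simplified in-memory illustration, not the PostgreSQL or DuckDB implementation. The function names and record shapes are hypothetical, and measuring the 50% threshold against the previous row count is an assumption.

```python
import logging

MAJOR_REVISION_THRESHOLD = 0.5  # "more than 50% changed", per the text


def diff_rows(old_rows, new_rows, pk):
    """Compare two snapshots keyed by primary key and list inserts, deletes, updates."""
    old = {row[pk]: row for row in old_rows}
    new = {row[pk]: row for row in new_rows}
    changes = [{"op": "insert", "pk": k, "after": new[k]}
               for k in sorted(new.keys() - old.keys())]
    changes += [{"op": "delete", "pk": k, "before": old[k]}
                for k in sorted(old.keys() - new.keys())]
    changes += [{"op": "update", "pk": k, "before": old[k], "after": new[k]}
                for k in sorted(old.keys() & new.keys()) if old[k] != new[k]]
    return changes


def to_cdc_entries(changes, previous_row_count):
    """Collapse the changeset into one summary entry when most of the data changed."""
    # Assumption: the ratio is changed rows over the previous row count.
    ratio = len(changes) / previous_row_count if previous_row_count else 1.0
    if ratio > MAJOR_REVISION_THRESHOLD:
        return [{"op": "MAJOR_REVISION", "changed": len(changes), "ratio": ratio}]
    return changes


def run_best_effort(steps):
    """Run (name, callable) steps in order; log failures and keep going."""
    failed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            logging.warning("auto-diff step %s failed: %s", name, exc)
            failed.append(name)
    return failed


# Hypothetical snapshots before and after a refresh.
old = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"},
       {"id": 3, "city": "Lima"}, {"id": 4, "city": "Cairo"}]
new = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Roma"},
       {"id": 3, "city": "Lima"}, {"id": 4, "city": "Cairo"}]

changes = diff_rows(old, new, "id")
entries = to_cdc_entries(changes, previous_row_count=len(old))
```

Here one of four rows changed, so the log receives a single row-level update entry. Had the refresh replaced most rows, the same call would return one MAJOR_REVISION summary instead. Downstream steps such as re-embedding or MDM flagging would be passed to run_best_effort so that a failure in one is logged without stopping the others.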

Key differentiators

MCP-native — 119 MCP tools cover catalog, search, profiling, masking, governance, RAG, MDM, CDC, and admin. Any agent that speaks Model Context Protocol can connect.
Single-instance deployment — no Docker, no Kubernetes, no microservices. PostgreSQL + Node.js + four PM2 processes (web, MCP, worker, large-job worker). Restores from a single pg_dump.
Customer-supplied LLM — all AI calls go through a configurable AI Engine endpoint (Ollama by default). Inference traffic goes only to the endpoint you configure, so with an on-premises engine no data leaves your deployment. The platform makes no outbound calls to OpenAI, Anthropic, or any other third party for model inference.
Full governance on everything — changesets, chunks, derivatives, golden records, RAG embeddings, and CDC logs are all first-class governed artifacts. Masking, PII detection, lineage, and audit apply uniformly.
Differential everything — refresh, RAG re-embedding, MDM re-evaluation, and analysis all key off the same row-level diff. A 50-document update in a 10M-document corpus re-chunks and re-embeds only those 50 documents, not all 10M.
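
To make the MCP-native point concrete, the Python sketch below builds the JSON-RPC 2.0 message an agent would send to invoke a tool. The tools/call method and its name/arguments parameters come from the Model Context Protocol. The tool name dataset_search appears earlier in this overview, but the argument names query and limit are hypothetical; the actual parameters are listed in the MCP tool catalog.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need unique ids per session


def mcp_tool_call(tool, arguments):
    """Build an MCP tools/call request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Assumption: "query" and "limit" are illustrative argument names only.
message = mcp_tool_call("dataset_search", {"query": "building permits", "limit": 5})
print(json.dumps(message, indent=2))
```

In practice an MCP client library handles this framing, along with the initialization handshake and transport, so an agent only supplies the tool name and arguments.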

Architecture overview

                          ┌──────────────────┐
                          │  AI Agent / UI   │
                          │ (Claude, Portal) │
                          └────────┬─────────┘
                                   │ MCP / HTTP
                                   ▼
                ┌──────────────────┴──────────────────┐
                │  pdl-mcp (3100)  │  pdl-web (3002)  │
                │   119 MCP tools  │  Next.js 16 UI   │
                └──────────────────┬──────────────────┘
                                   │
                                   ▼
                  ┌────────────────┴────────────────┐
                  │       PostgreSQL 15+ + pgvector │
                  │  registry │ conduit │ topics    │
                  │  mdm │ pgboss │ ds_<id> × N     │
                  └────┬────────────────────────┬───┘
                       │                        │
                       ▼                        ▼
       ┌───────────────────────┐    ┌─────────────────────┐
       │ pdl-worker (queue)    │    │ pdl-worker-large    │
       │  • crawl              │    │  • large ingest     │
       │  • analysis           │    │  • bulk loads       │
       │  • auto-diff (9-step) │    │  • DuckDB diff      │
       │  • RAG indexing       │    │                     │
       │  • MDM resolver       │    │                     │
       │  • watch poller       │    │                     │
       └───────────┬───────────┘    └─────────┬───────────┘
                   │                          │
                   └──────────┬───────────────┘
                              │
                              ▼
              ┌───────────────┴───────────────┐
              │  AI Engine (Ollama, customer  │
              │  endpoint) — embeddings,      │
              │  classification, reranking,   │
              │  change classifier            │
              └───────────────────────────────┘

For full details see docs/codebase/ARCHITECTURE.md.

Getting started

Quickstart: see docs/quickstart.md — get the stack running locally and load your first dataset.
MCP tool catalog: see docs/codebase/MCP_TOOLS.md — all 119 tools with parameters and examples.
DevOps runbook: see docs/devops/runbook.md — restart, recover, rotate keys, troubleshoot.
Differential refresh capability summary: see docs/spikes/LIB-SPIKE-005-capability-summary.md — the 8-bullet summary of what the diff engine enables.
Capability matrix for agents: see docs/mcp-capability-matrix.md — workflows, personas, and tool combinations.