DataShield MCP Dataset Library

Capabilities — One-Page Cheat Sheet

30-second scan of every capability in the DataShield Library, with the MCP tool name inline. The "don't build what you already have" reference.

Version: v0.19.9 | Audience: anyone scanning to remember what the platform already does before building something new.

Every capability is one line. The MCP tool name is at the end. For details on arguments, call help_page("features"). For full module/function inventory, call help_page("codebase/FEATURE_INVENTORY").

134 MCP tools — 119 core + 15 git.


1. Data Catalog & Discovery

  • Search datasets by keyword, provider, tags, governance, quality, source system → datasets_search
  • Search columns/fields across every dataset by name or semantic type → fields_search
  • Find similar datasets to a free-text query (embeddings + filters) → find_similar_datasets
  • Get full dataset metadata — schema, lineage, snapshots, masking, RAG status → dataset_get
  • Export per-dataset catalog (schema, lineage, governance, README, manifest) → dataset_export_catalog
  • Detect source system (Salesforce, SAP, Monday, etc.) from column names → fingerprint_detect
  • Register/manage custom source fingerprints → fingerprint_add · fingerprint_list · fingerprint_remove

2. Data Ingestion & ETL

  • Ingest from URL (Socrata, CKAN, S3 presigned, any HTTP) → dataset_ingest_url
  • Ingest raw bytes (CSV/JSON/TSV passed inline) → dataset_ingest
  • Ingest a batch of URLs (up to 50 in one call) → dataset_ingest_batch
  • Track an in-flight ingest by ingest/batch/dataset ID → ingest_status
  • Get presigned PUT URLs (bulk) to upload via S3-compatible storage → dataset_get_upload_urls
  • Watch remote storage for new files (S3, Azure Blob, GCS, SFTP, FTP, SMB, IMAP, FS) → watch_source_create · watch_source_list · watch_source_poll
  • Preflight a URL for size/format/cost before downloading → analysis_preflight_check
  • 5-gate ingest pipeline runs automatically on every source: format → size → sniff → dedup → preflight

3. Data Analysis & Profiling

  • Quick AI classification (topic tags from columns + samples) → classify_dataset
  • Bulk classify N unclassified datasets → classify_batch
  • Submit a full 20-section analysis job → analysis_queue_submit
  • Track / cancel / retry queued analyses → analysis_queue_status · analysis_queue_cancel · analysis_queue_retry
  • Read the latest analysis for a dataset/file/run (full or summary) → get_analysis
  • Store an analysis result programmatically → store_analysis
  • 20 sections include: columns, semanticTypes, piiFlags, piiCompliance, dataQuality, patterns, distributions, correlations, joinCandidates, keyColumns, anomalies, freshness, lineage, fileManifest, storage, rowSample, documentIntelligence, maskingReadiness, embeddings

4. Data Loading & Query

  • Load a dataset into the query engine (idempotent) → dataset_load
  • Run SQL with cost guardrails + automatic PG/DuckDB routing → dataset_query
  • Sample rows without writing SQL → dataset_sample
  • Get column statistics (null %, distinct, min/max/top values) → dataset_stats
  • List all tables for a dataset (source, masking view, snapshots, derivatives) → dataset_tables
  • Cost guardrails (GREEN/YELLOW/ORANGE/RED) reject runaway queries automatically

5. Data Governance, Compliance, Masking

  • Set masking mode (view or physical) on a dataset → set_masking_mode
  • Look up an original value from a vault token → vault_lookup
  • Get vault stats (token count, distinct, last write) → vault_stats
  • Apply a generalization strategy (dob→year, zip→zip3, age band, ssn last4, etc.) → generalization_apply
  • Preview a generalization before committing → generalization_preview
  • Bulk-apply generalizations across many columns → generalization_bulk_apply
  • Remove / list / stats for generalizations → generalization_remove · generalization_list · generalization_stats
  • Compliance frameworks (HIPAA, GDPR, PCI, SOX, CCPA, CJIS) — read findings via get_analysis (piiCompliance section)
  • Reload masking view after framework change → dataset_reload_masking
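
The generalization strategies named above (dob→year, zip→zip3, age band, ssn last4) can be pictured as small pure functions. This is a minimal sketch for orientation only: the function names, the 10-year band width, and the output formats are assumptions, not what generalization_apply actually emits.

```python
from datetime import date


def dob_to_year(dob: date) -> int:
    """Keep only the birth year."""
    return dob.year


def zip_to_zip3(zip_code: str) -> str:
    """Keep the first three characters of a ZIP code."""
    return zip_code.strip()[:3]


def age_band(age: int, width: int = 10) -> str:
    """Replace an exact age with a fixed-width band (width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def ssn_last4(ssn: str) -> str:
    """Keep only the last four digits; the mask format is an assumption."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return "***-**-" + digits[-4:]
```

Use generalization_preview to see what the platform itself would produce before committing.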

6. Schema Management & Alignment

  • Align columns across 2+ sources (Jaro-Winkler + semantic + dtype + value overlap) → schema_align_columns
  • Join two datasets with alias-based collision handling → dataset_create_join
  • Union 2+ datasets with optional column rename mapping → dataset_create_union
  • See dataset schema (with column-level types and semantic types) → dataset_get (include_fields)
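
For reference, the Jaro-Winkler signal used in column alignment is a standard string similarity. The sketch below is the textbook definition with the usual prefix scale of 0.1 and a 4-character prefix cap; how schema_align_columns weights it against the semantic, dtype, and value-overlap signals is not specified here.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched equal character within the match window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)
```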

7. Snapshots & Versioning

  • Create a snapshot (immutable copy of current source) → dataset_snapshot_create
  • List snapshots with retention summary → dataset_snapshot_list
  • Compare two snapshots (or against live) with add/update/delete buckets → dataset_snapshot_compare
  • Restore a snapshot to become the source table → dataset_snapshot_restore
  • Delete a snapshot by name → dataset_snapshot_delete
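
The add/update/delete buckets returned by a snapshot comparison can be understood as a keyed diff. A minimal in-memory sketch, assuming rows are dicts and a single key column identifies each row; the real tool works on tables and its output shape may differ.

```python
def compare_snapshots(old_rows, new_rows, key):
    """Bucket rows into add / update / delete by comparing on a key column."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    return {
        "add": [r for k, r in new_by_key.items() if k not in old_by_key],
        "update": [
            r for k, r in new_by_key.items()
            if k in old_by_key and r != old_by_key[k]
        ],
        "delete": [r for k, r in old_by_key.items() if k not in new_by_key],
    }
```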

8. Differential Refresh & CDC (SPIKE-005)

  • Diff is automatic on every refresh when dataset_settings.diff_config.enabled = true (currently SQL-only setting)
  • 5-tier PK detection cascade — declared → unique index → high-cardinality → composite → synthetic hash
  • Engine routing — PG up to pg_max_rows, DuckDB beyond
  • Schema evolution — new columns / widened types / dropped columns detected and applied automatically
  • Major-revision threshold caps detailed CDC logging on bulk overwrites (default 0.5)
  • Query the change log by time, operation, row key → dataset_changes
  • Inspect snapshot diffs on demand → dataset_snapshot_compare
  • Downstream consumers run automatically: RAG incremental re-embed, MDM stale/reeval flag, LLM change classification, timeline event emit
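
The PK detection cascade above tries progressively weaker ways of identifying a row. The sketch below illustrates the idea on in-memory rows only. It is heavily simplified and rests on assumptions: the unique-index tier is skipped (there are no indexes in memory), "high-cardinality" is read as a single fully unique non-null column, "composite" is limited to column pairs, and the synthetic key is a SHA-256 over a canonical rendering of the row. All names are hypothetical.

```python
import hashlib
from itertools import combinations


def is_unique(rows, cols):
    """True if the column tuple is non-null and distinct across all rows."""
    seen = set()
    for row in rows:
        key = tuple(row.get(c) for c in cols)
        if any(v is None for v in key) or key in seen:
            return False
        seen.add(key)
    return True


def detect_row_key(rows, declared=None):
    """Return (tier, columns) for the first tier that yields a unique key."""
    columns = list(rows[0].keys()) if rows else []
    if declared and is_unique(rows, declared):
        return ("declared", list(declared))
    for col in columns:
        if is_unique(rows, [col]):
            return ("high-cardinality", [col])
    for pair in combinations(columns, 2):
        if is_unique(rows, pair):
            return ("composite", list(pair))
    return ("synthetic hash", [])


def synthetic_key(row):
    """Last resort: hash a canonical rendering of the whole row."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```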

9. Document Intelligence & RAG

  • Build a RAG index with explicit column mapping → rag_index
  • Force a full reindex → rag_reindex
  • Hybrid document search (vector + BM25 + RRF) → document_search
  • Text is extracted automatically from PDF, DOCX, HTML, EML during ingest (no manual call needed)
  • Manually create a derivative file (extract, flatten, filter, transform, analysis_export) → file_create_derivative
  • List file versions / lineage → file_list_versions · file_repo_summary
  • Embeddings: nomic-embed-text (768-d) in pgvector with HNSW + GIN tsvector
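
The RRF step in hybrid search merges the vector and BM25 result lists by rank rather than by raw score. A minimal sketch of reciprocal rank fusion follows; k = 60 is the constant commonly used in the literature, and the value document_search uses is not stated here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (d1, d3) rise above those found by only one retriever.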

10. Master Data Management

  • List golden records in a topic → golden_records_list
  • Get one golden record by entity ID → golden_record_get
  • Resolve a stewardship case (merge / separate / defer) → stewardship_decide
  • Create a domain (top-level governance scope) → domain_create
  • List domains → domain_list
  • Create a topic under a domain → topic_create
  • Attach a resource (dataset/report/snapshot/file) to a topic → topic_add_resource
  • Add discussion thread to a topic → topic_add_discussion
  • Add knowledge fact (JSON content + fact_type) → topic_add_knowledge
  • Get budgeted topic context pack for an LLM → topic_get_context
  • Pipeline: 3-stage blocking (deterministic → phonetic → LSH/MinHash) → Fellegi-Sunter → Union-Find → stewardship queue
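
The Union-Find stage of the pipeline turns pairwise match decisions into entity clusters: if A matches B and B matches C, all three land in one group. A minimal sketch, assuming the upstream scoring step has already produced a list of matched record-ID pairs; the class and function names are hypothetical.

```python
class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster(matched_pairs):
    """Group record IDs into clusters from pairwise matches."""
    uf = UnionFind()
    for a, b in matched_pairs:
        uf.union(a, b)
    groups = {}
    for record in list(uf.parent):
        groups.setdefault(uf.find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())
```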

11. Archives & Cold Storage

  • Archive a snapshot or live table to S3-compatible storage → dataset_archive
  • List archives for a dataset → dataset_archive_list
  • Delete an archive (with optional connection override) → dataset_archive_delete
  • Restore an archive as source or as a new snapshot → dataset_restore
  • Storage backends: AWS S3, Backblaze B2, Cloudflare R2, Wasabi, DigitalOcean Spaces, MinIO
  • Format: Parquet + manifest with SHA256 checksums; verified on restore
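
Checksum verification on restore amounts to recomputing each file's SHA256 and comparing it with the manifest. A minimal sketch, assuming a manifest that maps file names to hex digests; the actual manifest layout is not documented here and the names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_manifest(manifest: dict, files: dict) -> list:
    """Return the names of files that are missing or fail their checksum."""
    failures = []
    for name, expected in manifest.items():
        data = files.get(name)
        if data is None or sha256_hex(data) != expected:
            failures.append(name)
    return failures
```

An empty result means every file listed in the manifest is present and intact.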

12. Outbound Sync

  • Sync a dataset to remote storage with token-templated paths and per-file sidecars → dataset_sync
  • Sidecars available: dataset, schema, lineage, quality, governance, import_sql, readme, manifest
  • Formats: parquet, csv, jsonl · Compression: gzip, zstd, snappy, lz4
  • Mask before export with apply_masking: true

13. REST API Integration

  • Save a REST endpoint with auth-vault binding → rest_create_endpoint
  • Execute an endpoint (auto-save response as a file optional) → rest_execute
  • List endpoints / collections → rest_list_endpoints
  • Diff two execution responses → rest_diff_responses
  • Auth types: none, basic, api_key, bearer, oauth2_client_credentials — credentials live in encrypted connections
  • Connection management: connection_create · connection_list · connection_test

14. Crawling & Portal Discovery

  • Auto-discover portals (data.gov, Socrata global, known portals) → discover_portals
  • List crawler projects → crawler_list_projects
  • Add a source URL to a project → project_add_source
  • Configure project (schedule, concurrency, rate limit) → project_update
  • Run a project on demand → crawler_run_project
  • Refresh a single dataset by URL → crawler_refresh_dataset
  • Provider support: Socrata, CKAN, ArcGIS, OpenDataSoft, DCAT, filesystem, generic fallback

15. Platform Administration & Tiers

  • Check health (quick or full doctor with AI diagnostics) → health_check
  • Get workspace context → workspace_info
  • Browse the settings registry → settings_docs
  • Create an API key for an account → tokens_create_api_key
  • List / revoke tokens → tokens_list · tokens_revoke
  • Verify a token → auth_verify_token
  • List / create / revoke sessions → sessions_list · sessions_create · sessions_revoke
  • List tier configs (and keys) → tiers_get_all
  • See per-key usage → account_usage
  • Override per-key limits → account_update
  • Update a tier config (admin) → tier_update
  • Admin dispatch (get_account / set_tier / get_usage / list_tiers) → admin_manage
  • Smoke-test admin keys per tier → admin_test_keys
  • Manage account records (subjects_*) → subjects_create · subjects_list · subjects_update
  • Update dataset settings (freeze/lock, schedule, sampling, semantic overrides, derivatives) → dataset_settings
  • Refresh dataset summary → dataset_refresh_summary

16. AI Engine, Agent UX, Reports, Workflows, JSON Tools

  • AI Engine powers classification, embeddings, change narratives, reranking (nomic-embed-text 768-d)
  • Greeting + intent planning → __hello (returns workflow archetype + steps when given an intent)
  • Browse pre-built workflows by archetype or domain → workflow_list
  • Get one workflow definition → workflow_get
  • Start a UX session with format/client hints → ux_session_start
  • List / save / get reports with lineage tracking → report_list · report_save · report_get
  • Process JSON / flatten to CSV → process_json · flatten_to_csv · process_config
  • Help docs (this file, features.md, codebase index, providers) → help_page
  • Record a timeline event for audit + telemetry → events_ingest

At a Glance

| Category | Count | Examples |
|---|---|---|
| Catalog & Discovery | 7 tools | datasets_search, find_similar_datasets, fields_search |
| Ingestion | 9 tools | dataset_ingest_url, dataset_ingest_batch, watch_source_create |
| Analysis | 7 tools | analysis_queue_submit, get_analysis, classify_dataset |
| Query & Loading | 5 tools | dataset_load, dataset_query, dataset_sample |
| Governance & Masking | 11 tools | set_masking_mode, vault_lookup, generalization_apply |
| Schema & Alignment | 3 tools | schema_align_columns, dataset_create_join, dataset_create_union |
| Snapshots & CDC | 7 tools | dataset_snapshot_create, dataset_changes, dataset_snapshot_compare |
| Documents & RAG | 5 tools | rag_index, document_search, file_create_derivative |
| MDM & Topics | 11 tools | golden_records_list, stewardship_decide, topic_get_context |
| Archives & Sync | 7 tools | dataset_archive, dataset_restore, dataset_sync |
| REST & Connections | 7 tools | rest_create_endpoint, connection_create, connection_test |
| Crawling & Fingerprints | 10 tools | discover_portals, crawler_run_project, fingerprint_detect |
| Admin, Tokens, Tiers | ~20 tools | tokens_create_api_key, admin_manage, health_check |
| UX, Reports, Workflows, Help | 10 tools | ux_session_start, report_save, workflow_list, help_page |


Pair this with help_page("features") for the Q&A walkthrough and help_page("codebase/FEATURE_INVENTORY") for the full module-level inventory. Verified against scripts/mcp-server.ts for v0.19.9.
