Capabilities — One-Page Cheat Sheet

30-second scan of every capability in the DataShield Library, with the MCP tool name inline. The "don't build what you already have" reference.

Version: v0.19.9 | Audience: anyone scanning to remember what the platform already does before building something new.

Every capability is one line. The MCP tool name is at the end. For details on arguments, call help_page("features"). For full module/function inventory, call help_page("codebase/FEATURE_INVENTORY").

134 MCP tools — 119 core + 15 git.

1. Data Catalog & Discovery

Search datasets by keyword, provider, tags, governance, quality, source system → datasets_search
Search columns/fields across every dataset by name or semantic type → fields_search
Find datasets similar to a free-text query (embeddings + filters) → find_similar_datasets
Get full dataset metadata — schema, lineage, snapshots, masking, RAG status → dataset_get
Export per-dataset catalog (schema, lineage, governance, README, manifest) → dataset_export_catalog
Detect source system (Salesforce, SAP, Monday, etc.) from column names → fingerprint_detect
Register/manage custom source fingerprints → fingerprint_add · fingerprint_list · fingerprint_remove

2. Data Ingestion & ETL

Ingest from URL (Socrata, CKAN, S3 presigned, any HTTP) → dataset_ingest_url
Ingest raw bytes (CSV/JSON/TSV passed inline) → dataset_ingest
Ingest a batch of URLs (up to 50 in one call) → dataset_ingest_batch
Track an in-flight ingest by ingest/batch/dataset ID → ingest_status
Get presigned PUT URLs (bulk) to upload via S3-compatible storage → dataset_get_upload_urls
Watch remote storage for new files (S3, Azure Blob, GCS, SFTP, FTP, SMB, IMAP, FS) → watch_source_create · watch_source_list · watch_source_poll
Preflight a URL for size/format/cost before downloading → analysis_preflight_check
5-gate ingest pipeline runs automatically on every source: format → size → sniff → dedup → preflight
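
The gate order above can be sketched as a short-circuiting chain. This is only an illustration: the source dict fields, the thresholds, and the checks inside each gate are assumptions, not the platform's actual rules. Only the gate names and their order come from the line above.

```python
from typing import Callable, Optional

# A gate returns a rejection reason, or None to let the source through.
Gate = Callable[[dict], Optional[str]]

SEEN_HASHES: set = set()  # hypothetical dedup store


def gate_format(src: dict) -> Optional[str]:
    return None if src["format"] in {"csv", "json", "tsv", "parquet"} else "unsupported format"


def gate_size(src: dict) -> Optional[str]:
    # 500 MB is an assumed limit, chosen only for the example.
    return None if src["size_bytes"] <= 500_000_000 else "too large"


def gate_sniff(src: dict) -> Optional[str]:
    return None if src["sample"].strip() else "empty sample"


def gate_dedup(src: dict) -> Optional[str]:
    return "duplicate content" if src["sha256"] in SEEN_HASHES else None


def gate_preflight(src: dict) -> Optional[str]:
    return None if src["estimated_cost"] <= 1.0 else "cost over budget"


# Order matches the cheat sheet: format, size, sniff, dedup, preflight.
GATES = [
    ("format", gate_format),
    ("size", gate_size),
    ("sniff", gate_sniff),
    ("dedup", gate_dedup),
    ("preflight", gate_preflight),
]


def run_gates(src: dict) -> tuple:
    """Run gates in order and stop at the first rejection."""
    for name, gate in GATES:
        reason = gate(src)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, "accepted"
```

Ordering the gates this way means a rejected source fails at the first gate it cannot pass, so later checks never run on it.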

3. Data Analysis & Profiling

Quick AI classification (topic tags from columns + samples) → classify_dataset
Bulk classify N un-classified datasets → classify_batch
Submit a full 20-section analysis job → analysis_queue_submit
Track / cancel / retry queued analyses → analysis_queue_status · analysis_queue_cancel · analysis_queue_retry
Read the latest analysis for a dataset/file/run (full or summary) → get_analysis
Store an analysis result programmatically → store_analysis
20 sections include: columns, semanticTypes, piiFlags, piiCompliance, dataQuality, patterns, distributions, correlations, joinCandidates, keyColumns, anomalies, freshness, lineage, fileManifest, storage, rowSample, documentIntelligence, maskingReadiness, embeddings

4. Data Loading & Query

Load a dataset into the query engine (idempotent) → dataset_load
Run SQL with cost guardrails + automatic PG/DuckDB routing → dataset_query
Sample rows without writing SQL → dataset_sample
Get column statistics (null %, distinct, min/max/top values) → dataset_stats
List all tables for a dataset (source, masking view, snapshots, derivatives) → dataset_tables
Cost guardrails (GREEN/YELLOW/ORANGE/RED) reject runaway queries automatically
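
A minimal sketch of a four-level guardrail, assuming the level is derived from an estimated row count and that only RED is rejected outright. Both the thresholds and the rejection rule are assumptions made for illustration; the actual inputs and cut-offs used by dataset_query are not stated here.

```python
# Hypothetical upper bounds (estimated rows scanned) for each level.
THRESHOLDS = [
    (100_000, "GREEN"),
    (10_000_000, "YELLOW"),
    (1_000_000_000, "ORANGE"),
]


def classify_cost(estimated_rows: int) -> str:
    """Map an estimated scan size to a guardrail level."""
    for upper_bound, level in THRESHOLDS:
        if estimated_rows < upper_bound:
            return level
    return "RED"


def admit(estimated_rows: int) -> bool:
    """Assumed policy: everything except RED is allowed to run."""
    return classify_cost(estimated_rows) != "RED"
```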

5. Data Governance, Compliance, Masking

Set masking mode (view or physical) on a dataset → set_masking_mode
Look up an original value from a vault token → vault_lookup
Get vault stats (token count, distinct, last write) → vault_stats
Apply a generalization strategy (dob→year, zip→zip3, age band, ssn last4, etc.) → generalization_apply
Preview a generalization before committing → generalization_preview
Bulk-apply generalizations across many columns → generalization_bulk_apply
Remove, list, or get stats for generalizations → generalization_remove · generalization_list · generalization_stats
Compliance frameworks (HIPAA, GDPR, PCI, SOX, CCPA, CJIS) — read findings via get_analysis (piiCompliance section)
Reload masking view after framework change → dataset_reload_masking
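
The generalization strategies named above (dob→year, zip→zip3, age band, ssn last4) can be pictured as small pure functions. The function names, the band width, and the output formats below are assumptions for illustration, not the behaviour of generalization_apply.

```python
from datetime import date


def dob_to_year(dob: date) -> int:
    """Keep only the birth year."""
    return dob.year


def zip_to_zip3(zip_code: str) -> str:
    """Keep the first three digits of a ZIP code."""
    return zip_code[:3]


def age_band(age: int, width: int = 10) -> str:
    """Replace an exact age with a band such as 30-39 (width is assumed)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def ssn_last4(ssn: str) -> str:
    """Mask everything but the last four digits (mask format is assumed)."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return "***-**-" + digits[-4:]
```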

6. Schema Management & Alignment

Align columns across 2+ sources (Jaro-Winkler + semantic + dtype + value overlap) → schema_align_columns
Join two datasets with alias-based collision handling → dataset_create_join
Union 2+ datasets with optional column rename mapping → dataset_create_union
See dataset schema (with column-level types and semantic types) → dataset_get (include_fields)
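
One way to picture multi-signal column alignment is a weighted blend of the four signals listed for schema_align_columns. This sketch substitutes difflib's SequenceMatcher ratio for Jaro-Winkler to stay within the standard library, and the weights, the column dict layout, and the Jaccard overlap measure are all assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    # Stand-in for Jaro-Winkler: SequenceMatcher ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(a_values, b_values) -> float:
    """Jaccard overlap of the distinct values in two columns."""
    a, b = set(a_values), set(b_values)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_score(col_a: dict, col_b: dict,
                w_name: float = 0.4, w_semantic: float = 0.2,
                w_dtype: float = 0.1, w_overlap: float = 0.3) -> float:
    """Blend name, semantic type, dtype, and value overlap into one score."""
    semantic = 1.0 if col_a["semantic"] == col_b["semantic"] else 0.0
    dtype = 1.0 if col_a["dtype"] == col_b["dtype"] else 0.0
    return (w_name * name_similarity(col_a["name"], col_b["name"])
            + w_semantic * semantic
            + w_dtype * dtype
            + w_overlap * value_overlap(col_a["values"], col_b["values"]))
```

Under these assumed weights, a pair such as customer_id and cust_id with shared values scores well above an unrelated pair such as customer_id and created_at.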

7. Snapshots & Versioning

Create a snapshot (immutable copy of current source) → dataset_snapshot_create
List snapshots with retention summary → dataset_snapshot_list
Compare two snapshots (or against live) with add/update/delete buckets → dataset_snapshot_compare
Restore a snapshot to become the source table → dataset_snapshot_restore
Delete a snapshot by name → dataset_snapshot_delete
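
The add/update/delete buckets produced by a snapshot comparison can be illustrated with a keyed diff. This is a sketch under the assumption that every row carries a single key column; it is not the implementation behind dataset_snapshot_compare.

```python
def compare_snapshots(old_rows: list, new_rows: list, key: str) -> dict:
    """Bucket rows into added, updated, and deleted by primary key."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        "added": [new[k] for k in new if k not in old],
        "updated": [new[k] for k in new if k in old and new[k] != old[k]],
        "deleted": [old[k] for k in old if k not in new],
    }
```

For example, comparing `[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]` against `[{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]` puts id 3 in added, id 2 in updated, and id 1 in deleted.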

8. Differential Refresh & CDC (SPIKE-005)

Diff is automatic on every refresh when dataset_settings.diff_config.enabled = true (currently a SQL-only setting)
5-tier PK detection cascade — declared → unique index → high-cardinality → composite → synthetic hash
Engine routing — PG up to pg_max_rows, DuckDB beyond
Schema evolution — new columns / widened types / dropped columns detected and applied automatically
Major-revision threshold caps detailed CDC logging on bulk overwrites (default 0.5)
Query the change log by time, operation, row key → dataset_changes
Inspect snapshot diffs on demand → dataset_snapshot_compare
Downstream consumers run automatically: RAG incremental re-embed, MDM stale/reeval flag, LLM change classification, timeline event emit
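
The 5-tier PK detection cascade listed above can be sketched as a sequence of fallbacks. The tier order comes from the cheat sheet; everything else is an assumption: here "high-cardinality" is simplified to "a single column whose values are all distinct", "composite" only tries column pairs, and the synthetic key is a SHA-256 of the serialized row.

```python
import hashlib
import json
from itertools import combinations


def _is_unique(rows: list, cols) -> bool:
    """True if the given columns jointly identify every row."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def row_hash(row: dict) -> str:
    """Synthetic key: hash of the row with keys in a stable order."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_pk(rows: list, declared=None, unique_indexes=()) -> tuple:
    """Return (tier, key columns), trying each tier in cascade order."""
    if declared:
        return "declared", tuple(declared)
    if unique_indexes:
        return "unique_index", tuple(unique_indexes[0])
    cols = list(rows[0].keys()) if rows else []
    for col in cols:
        if _is_unique(rows, [col]):
            return "high_cardinality", (col,)
    for pair in combinations(cols, 2):
        if _is_unique(rows, pair):
            return "composite", pair
    # Nothing identifies rows; fall back to a per-row hash (see row_hash).
    return "synthetic_hash", ("_row_hash",)
```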

9. Document Intelligence & RAG

Build a RAG index with explicit column mapping → rag_index
Force a full reindex → rag_reindex
Hybrid document search (vector + BM25 + RRF) → document_search
Automatically extract text from PDF, DOCX, HTML, and EML during ingest (no manual call needed)
Manually create a derivative file (extract, flatten, filter, transform, analysis_export) → file_create_derivative
List file versions / lineage → file_list_versions · file_repo_summary
Embeddings: nomic-embed-text (768-d) in pgvector with HNSW + GIN tsvector
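
Reciprocal Rank Fusion (RRF), the step that merges the vector and BM25 result lists in hybrid search, is simple enough to sketch. The constant k = 60 is the commonly used default, not a value confirmed for document_search, and the alphabetical tie-break is an assumption.

```python
from collections import defaultdict


def rrf(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of document IDs: score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["d1", "d2", "d3"]
bm25_hits = ["d2", "d4", "d1"]
fused = rrf([vector_hits, bm25_hits])  # ["d2", "d1", "d4", "d3"]
```

Documents that appear in both lists (d2 and d1 here) rise above documents found by only one retriever, which is the point of fusing by rank rather than by raw score.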

10. Master Data Management

List golden records in a topic → golden_records_list
Get one golden record by entity ID → golden_record_get
Resolve a stewardship case (merge / separate / defer) → stewardship_decide
Create a domain (top-level governance scope) → domain_create
List domains → domain_list
Create a topic under a domain → topic_create
Attach a resource (dataset/report/snapshot/file) to a topic → topic_add_resource
Add a discussion thread to a topic → topic_add_discussion
Add a knowledge fact (JSON content + fact_type) → topic_add_knowledge
Get a budgeted topic context pack for an LLM → topic_get_context
Pipeline: 3-stage blocking (deterministic → phonetic → LSH/MinHash) → Fellegi-Sunter → Union-Find → stewardship queue
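
The Union-Find stage of the pipeline above turns pairwise match decisions into entity clusters. A minimal sketch, assuming the scoring stage has already produced a list of matched record-ID pairs; the record IDs and function names are illustrative only.

```python
class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent as we walk up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster(record_ids: list, matched_pairs: list) -> list:
    """Group records so that every matched pair lands in the same cluster."""
    uf = UnionFind()
    for record_id in record_ids:
        uf.find(record_id)  # register singletons too
    for a, b in matched_pairs:
        uf.union(a, b)
    groups = {}
    for record_id in record_ids:
        groups.setdefault(uf.find(record_id), []).append(record_id)
    return sorted(sorted(group) for group in groups.values())
```

Matches are transitive here: if r1 matches r2 and r2 matches r3, all three end up in one cluster even though r1 and r3 were never compared directly.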

11. Archives & Cold Storage

Archive a snapshot or live table to S3-compatible storage → dataset_archive
List archives for a dataset → dataset_archive_list
Delete an archive (with optional connection override) → dataset_archive_delete
Restore an archive as source or as a new snapshot → dataset_restore
Storage backends: AWS S3, Backblaze B2, Cloudflare R2, Wasabi, DigitalOcean Spaces, MinIO
Format: Parquet + manifest with SHA256 checksums; verified on restore
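
Checksum verification on restore can be pictured as recomputing each file's SHA-256 and comparing it with the manifest. The manifest shape used here (file name mapped to hex digest) and the in-memory byte strings are assumptions for illustration; the actual manifest layout is not described in this sheet.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict) -> dict:
    """Map each file name to the SHA-256 of its contents."""
    return {name: sha256_hex(data) for name, data in files.items()}


def verify(files: dict, manifest: dict) -> list:
    """Return the names of files that are missing or fail their checksum."""
    return sorted(
        name
        for name, digest in manifest.items()
        if name not in files or sha256_hex(files[name]) != digest
    )
```

An empty list from `verify` means every file named in the manifest is present and intact.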

12. Outbound Sync

Sync a dataset to remote storage with token-templated paths and per-file sidecars → dataset_sync
Sidecars available: dataset, schema, lineage, quality, governance, import_sql, readme, manifest
Formats: parquet, csv, jsonl · Compression: gzip, zstd, snappy, lz4
Mask before export with apply_masking: true

13. REST API Integration

Save a REST endpoint with auth-vault binding → rest_create_endpoint
Execute an endpoint (auto-save response as a file optional) → rest_execute
List endpoints / collections → rest_list_endpoints
Diff two execution responses → rest_diff_responses
Auth types: none, basic, api_key, bearer, oauth2_client_credentials — credentials live in encrypted connections
Connection management: connection_create · connection_list · connection_test

14. Crawling & Portal Discovery

Auto-discover portals (data.gov, Socrata global, known portals) → discover_portals
List crawler projects → crawler_list_projects
Add a source URL to a project → project_add_source
Configure project (schedule, concurrency, rate limit) → project_update
Run a project on demand → crawler_run_project
Refresh a single dataset by URL → crawler_refresh_dataset
Provider support: Socrata, CKAN, ArcGIS, OpenDataSoft, DCAT, filesystem, generic fallback

15. Platform Administration & Tiers

Check health (quick or full doctor with AI diagnostics) → health_check
Get workspace context → workspace_info
Browse the settings registry → settings_docs
Create an API key for an account → tokens_create_api_key
List / revoke tokens → tokens_list · tokens_revoke
Verify a token → auth_verify_token
List / create / revoke sessions → sessions_list · sessions_create · sessions_revoke
List tier configs (and keys) → tiers_get_all
See per-key usage → account_usage
Override per-key limits → account_update
Update a tier config (admin) → tier_update
Admin dispatch (get_account / set_tier / get_usage / list_tiers) → admin_manage
Smoke-test admin keys per tier → admin_test_keys
Manage account records (subjects_*) → subjects_create · subjects_list · subjects_update
Update dataset settings (freeze/lock, schedule, sampling, semantic overrides, derivatives) → dataset_settings
Refresh dataset summary → dataset_refresh_summary

16. AI Engine, Agent UX, Reports, Workflows, JSON Tools

AI Engine powers classification, embeddings, change narratives, reranking (nomic-embed-text 768-d)
Greeting + intent planning → __hello (returns workflow archetype + steps when given an intent)
Browse pre-built workflows by archetype or domain → workflow_list
Get one workflow definition → workflow_get
Start a UX session with format/client hints → ux_session_start
List / save / get reports with lineage tracking → report_list · report_save · report_get
Process JSON / flatten to CSV → process_json · flatten_to_csv · process_config
Help docs (this file, features.md, codebase index, providers) → help_page
Record a timeline event for audit + telemetry → events_ingest

At a Glance

| Category | Count | Examples |
|---|---|---|
| Catalog & Discovery | 7 tools | datasets_search, find_similar_datasets, fields_search |
| Ingestion | 9 tools | dataset_ingest_url, dataset_ingest_batch, watch_source_create |
| Analysis | 7 tools | analysis_queue_submit, get_analysis, classify_dataset |
| Query & Loading | 5 tools | dataset_load, dataset_query, dataset_sample |
| Governance & Masking | 11 tools | set_masking_mode, vault_lookup, generalization_apply |
| Schema & Alignment | 3 tools | schema_align_columns, dataset_create_join, dataset_create_union |
| Snapshots & CDC | 7 tools | dataset_snapshot_create, dataset_changes, dataset_snapshot_compare |
| Documents & RAG | 5 tools | rag_index, document_search, file_create_derivative |
| MDM & Topics | 11 tools | golden_records_list, stewardship_decide, topic_get_context |
| Archives & Sync | 7 tools | dataset_archive, dataset_restore, dataset_sync |
| REST & Connections | 7 tools | rest_create_endpoint, connection_create, connection_test |
| Crawling & Fingerprints | 10 tools | discover_portals, crawler_run_project, fingerprint_detect |
| Admin, Tokens, Tiers | ~20 tools | tokens_create_api_key, admin_manage, health_check |
| UX, Reports, Workflows, Help | 10 tools | ux_session_start, report_save, workflow_list, help_page |

Pair this with help_page("features") for the Q&A walkthrough and help_page("codebase/FEATURE_INVENTORY") for the full module-level inventory. Verified against scripts/mcp-server.ts for v0.19.9.