Capabilities — One-Page Cheat Sheet
30-second scan of every capability in the DataShield Library, with the MCP tool name inline. The "don't build what you already have" reference.
Version: v0.19.9 | Audience: anyone scanning to remember what the platform already does before building something new.
Every capability is one line, with the MCP tool name at the end. For details on arguments, call help_page("features"). For the full module/function inventory, call help_page("codebase/FEATURE_INVENTORY").
134 MCP tools: 119 core + 15 git.
1. Data Catalog & Discovery
- Search datasets by keyword, provider, tags, governance, quality, source system → datasets_search
- Search columns/fields across every dataset by name or semantic type → fields_search
- Find similar datasets to a free-text query (embeddings + filters) → find_similar_datasets
- Get full dataset metadata (schema, lineage, snapshots, masking, RAG status) → dataset_get
- Export per-dataset catalog (schema, lineage, governance, README, manifest) → dataset_export_catalog
- Detect source system (Salesforce, SAP, Monday, etc.) from column names → fingerprint_detect
- Register/manage custom source fingerprints → fingerprint_add · fingerprint_list · fingerprint_remove
2. Data Ingestion & ETL
- Ingest from URL (Socrata, CKAN, S3 presigned, any HTTP) → dataset_ingest_url
- Ingest raw bytes (CSV/JSON/TSV passed inline) → dataset_ingest
- Ingest a batch of URLs (up to 50 in one call) → dataset_ingest_batch
- Track an in-flight ingest by ingest/batch/dataset ID → ingest_status
- Get presigned PUT URLs (bulk) to upload via S3-compatible storage → dataset_get_upload_urls
- Watch a remote storage for new files (S3, Azure Blob, GCS, SFTP, FTP, SMB, IMAP, FS) → watch_source_create · watch_source_list · watch_source_poll
- Preflight a URL for size/format/cost before downloading → analysis_preflight_check
- 5-gate ingest pipeline runs automatically on every source: format → size → sniff → dedup → preflight
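The gate ordering in the last bullet can be pictured as a short-circuiting chain: each gate either passes the source on or rejects it with a reason. The sketch below is illustrative only. The `Source` type, the individual checks, and the limits (`ALLOWED_EXTENSIONS`, `MAX_BYTES`) are assumptions made for the example, not the platform's actual gate logic.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Source:
    """Hypothetical stand-in for an incoming file."""
    name: str
    payload: bytes


# A gate returns None to pass, or a rejection reason.
Gate = Callable[[Source], Optional[str]]

ALLOWED_EXTENSIONS = (".csv", ".json", ".tsv")  # assumed, not from the docs
MAX_BYTES = 1_000_000                           # assumed, not from the docs
_seen_hashes = set()                            # content hashes already accepted by dedup


def format_gate(src: Source) -> Optional[str]:
    ok = src.name.lower().endswith(ALLOWED_EXTENSIONS)
    return None if ok else "unsupported format"


def size_gate(src: Source) -> Optional[str]:
    return None if len(src.payload) <= MAX_BYTES else "too large"


def sniff_gate(src: Source) -> Optional[str]:
    # Crude content sniff: a NUL byte in the first 1 KiB suggests binary data.
    return "looks binary" if b"\x00" in src.payload[:1024] else None


def dedup_gate(src: Source) -> Optional[str]:
    digest = hashlib.sha256(src.payload).hexdigest()
    if digest in _seen_hashes:
        return "duplicate content"
    _seen_hashes.add(digest)
    return None


def preflight_gate(src: Source) -> Optional[str]:
    return None if src.payload.strip() else "empty payload"


GATES: List[Tuple[str, Gate]] = [
    ("format", format_gate),
    ("size", size_gate),
    ("sniff", sniff_gate),
    ("dedup", dedup_gate),
    ("preflight", preflight_gate),
]


def run_gates(src: Source) -> Tuple[bool, str]:
    """Run gates in order and stop at the first rejection."""
    for name, gate in GATES:
        reason = gate(src)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, "accepted"
```

Running cheap checks (extension, size) before expensive ones (hashing, preflight) means most bad inputs are rejected without reading the whole payload.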
3. Data Analysis & Profiling
- Quick AI classification (topic tags from columns + samples) → classify_dataset
- Bulk classify N un-classified datasets → classify_batch
- Submit a full 20-section analysis job → analysis_queue_submit
- Track / cancel / retry queued analyses → analysis_queue_status · analysis_queue_cancel · analysis_queue_retry
- Read the latest analysis for a dataset/file/run (full or summary) → get_analysis
- Store an analysis result programmatically → store_analysis
- 20 sections include: columns, semanticTypes, piiFlags, piiCompliance, dataQuality, patterns, distributions, correlations, joinCandidates, keyColumns, anomalies, freshness, lineage, fileManifest, storage, rowSample, documentIntelligence, maskingReadiness, embeddings
4. Data Loading & Query
- Load a dataset into the query engine (idempotent) → dataset_load
- Run SQL with cost guardrails + automatic PG/DuckDB routing → dataset_query
- Sample rows without writing SQL → dataset_sample
- Get column statistics (null %, distinct, min/max/top values) → dataset_stats
- List all tables for a dataset (source, masking view, snapshots, derivatives) → dataset_tables
- Cost guardrails (GREEN/YELLOW/ORANGE/RED) reject runaway queries automatically
5. Data Governance, Compliance, Masking
- Set masking mode (view or physical) on a dataset → set_masking_mode
- Look up an original value from a vault token → vault_lookup
- Get vault stats (token count, distinct, last write) → vault_stats
- Apply a generalization strategy (dob→year, zip→zip3, age band, ssn last4, etc.) → generalization_apply
- Preview a generalization before committing → generalization_preview
- Bulk-apply generalizations across many columns → generalization_bulk_apply
- Remove / list / stats for generalizations → generalization_remove · generalization_list · generalization_stats
- Compliance frameworks (HIPAA, GDPR, PCI, SOX, CCPA, CJIS): read findings via get_analysis (piiCompliance section)
- Reload masking view after framework change → dataset_reload_masking
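To make the generalization strategies named above concrete, here is a minimal sketch of what each transform might do to a single value. The function names, the 10-year band width, and the SSN mask format are assumptions for illustration; the actual strategies applied by generalization_apply may differ.

```python
from datetime import date


def dob_to_year(dob: str) -> str:
    """Keep only the birth year of an ISO date (YYYY-MM-DD)."""
    return str(date.fromisoformat(dob).year)


def zip_to_zip3(zip_code: str) -> str:
    """Keep the first three characters of a ZIP code."""
    return zip_code.strip()[:3]


def age_band(age: int, width: int = 10) -> str:
    """Map an exact age to a band such as 30-39 (band width is assumed)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def ssn_last4(ssn: str) -> str:
    """Mask all but the last four digits of an SSN."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return "***-**-" + digits[-4:]


# Hypothetical strategy registry keyed by a strategy name.
STRATEGIES = {
    "dob_year": dob_to_year,
    "zip3": zip_to_zip3,
    "age_band": age_band,
    "ssn_last4": ssn_last4,
}


def generalize(strategy: str, value):
    """Apply one named strategy to one value."""
    return STRATEGIES[strategy](value)
```

Each transform trades precision for privacy: the output still supports grouping and aggregation but no longer identifies an individual on its own.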
6. Schema Management & Alignment
- Align columns across 2+ sources (Jaro-Winkler + semantic + dtype + value overlap) → schema_align_columns
- Join two datasets with alias-based collision handling → dataset_create_join
- Union 2+ datasets with optional column rename mapping → dataset_create_union
- See dataset schema (with column-level types and semantic types) → dataset_get (include_fields)
7. Snapshots & Versioning
- Create a snapshot (immutable copy of current source) → dataset_snapshot_create
- List snapshots with retention summary → dataset_snapshot_list
- Compare two snapshots (or against live) with add/update/delete buckets → dataset_snapshot_compare
- Restore a snapshot to become the source table → dataset_snapshot_restore
- Delete a snapshot by name → dataset_snapshot_delete
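The add/update/delete buckets mentioned for snapshot comparison can be illustrated with a keyed diff over two row sets. This is a sketch under simple assumptions (rows as dictionaries, a single known key column), not the behaviour or output shape of dataset_snapshot_compare itself.

```python
def compare_snapshots(old_rows, new_rows, key):
    """Bucket row-level differences between two snapshots by primary key."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}

    added = [new[k] for k in new if k not in old]
    deleted = [old[k] for k in old if k not in new]
    updated = [new[k] for k in new if k in old and new[k] != old[k]]

    return {"added": added, "updated": updated, "deleted": deleted}


before = [
    {"id": 1, "v": "a"},
    {"id": 2, "v": "b"},
    {"id": 3, "v": "c"},
]
after = [
    {"id": 2, "v": "b"},
    {"id": 3, "v": "C"},
    {"id": 4, "v": "d"},
]
diff = compare_snapshots(before, after, key="id")
```

Here row 1 lands in `deleted`, row 3 in `updated`, and row 4 in `added`, while the unchanged row 2 appears in no bucket.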
8. Differential Refresh & CDC (SPIKE-005)
- Diff is automatic on every refresh when dataset_settings.diff_config.enabled = true (currently SQL-only setting)
- 5-tier PK detection cascade: declared → unique index → high-cardinality → composite → synthetic hash
- Engine routing: PG up to pg_max_rows, DuckDB beyond
- Schema evolution: new columns / widened types / dropped columns detected and applied automatically
- Major-revision threshold caps detailed CDC logging on bulk overwrites (default 0.5)
- Query the change log by time, operation, row key → dataset_changes
- Inspect snapshot diffs on demand → dataset_snapshot_compare
- Downstream consumers run automatically: RAG incremental re-embed, MDM stale/reeval flag, LLM change classification, timeline event emit
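The PK detection cascade listed above can be read as "try the most trustworthy signal first, fall back to weaker ones". The sketch below follows the five tiers in order. How each tier is actually evaluated is not documented here, so the specifics are assumptions: "high-cardinality" is approximated as a single column with all-distinct non-null values, "composite" as the first unique pair of columns, and the synthetic hash as a SHA-256 over the canonicalized row. All function and parameter names are hypothetical.

```python
import hashlib
from itertools import combinations


def _is_unique(rows, cols):
    """True if the column tuple is non-null and distinct on every row."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in cols)
        if any(v is None for v in value) or value in seen:
            return False
        seen.add(value)
    return True


def detect_primary_key(rows, declared=None, unique_indexes=()):
    """Return (tier, key_columns) using a five-tier fallback."""
    columns = list(rows[0].keys()) if rows else []

    if declared:                                  # tier 1: declared key
        return "declared", tuple(declared)
    if unique_indexes:                            # tier 2: existing unique index
        return "unique_index", tuple(unique_indexes[0])
    for col in columns:                           # tier 3: single distinct column
        if _is_unique(rows, (col,)):
            return "high_cardinality", (col,)
    for pair in combinations(columns, 2):         # tier 4: two-column composite
        if _is_unique(rows, pair):
            return "composite", pair
    return "synthetic_hash", ("_row_hash",)       # tier 5: hash of the whole row


def synthetic_key(row):
    """Stable hash over a row's sorted column/value pairs."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A synthetic hash is the last resort because any change to a row changes its key, so an update is indistinguishable from a delete plus an insert.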
9. Document Intelligence & RAG
- Build a RAG index with explicit column mapping → rag_index
- Force a full reindex → rag_reindex
- Hybrid document search (vector + BM25 + RRF) → document_search
- Auto-extract text from PDF, DOCX, HTML, EML during ingest (no manual call needed)
- Manually create a derivative file (extract, flatten, filter, transform, analysis_export) → file_create_derivative
- List file versions / lineage → file_list_versions · file_repo_summary
- Embeddings: nomic-embed-text (768-d) in pgvector with HNSW + GIN tsvector
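The "RRF" in the hybrid search bullet refers to reciprocal rank fusion, which merges ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below shows the general technique; the constant k = 60 is a commonly used default, and whether document_search uses that value or this tie-breaking rule is an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # e.g. nearest neighbours by embedding
bm25_hits = ["b", "d", "a"]     # e.g. top keyword matches
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because only ranks are used, the vector and BM25 scores never need to be normalized against each other. Documents that appear in both lists (`a` and `b` here) rise above those found by only one retriever.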
10. Master Data Management
- List golden records in a topic → golden_records_list
- Get one golden record by entity ID → golden_record_get
- Resolve a stewardship case (merge / separate / defer) → stewardship_decide
- Create a domain (top-level governance scope) → domain_create
- List domains → domain_list
- Create a topic under a domain → topic_create
- Attach a resource (dataset/report/snapshot/file) to a topic → topic_add_resource
- Add discussion thread to a topic → topic_add_discussion
- Add knowledge fact (JSON content + fact_type) → topic_add_knowledge
- Get budgeted topic context pack for an LLM → topic_get_context
- Pipeline: 3-stage blocking (deterministic → phonetic → LSH/MinHash) → Fellegi-Sunter → Union-Find → stewardship queue
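In the pipeline above, the Union-Find step turns pairwise match decisions into entity clusters: if A matches B and B matches C, all three belong to one golden record candidate. A minimal sketch of that step follows, assuming the matched pairs have already been produced by the scoring stage. The class and function names are illustrative, not the platform's internals.

```python
class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Point x at its grandparent to flatten the tree as we walk.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_matches(record_ids, matched_pairs):
    """Group records into clusters given pairwise match decisions."""
    uf = UnionFind()
    for record_id in record_ids:
        uf.find(record_id)          # register singletons
    for a, b in matched_pairs:
        uf.union(a, b)

    clusters = {}
    for record_id in record_ids:
        clusters.setdefault(uf.find(record_id), []).append(record_id)
    return sorted(sorted(members) for members in clusters.values())


records = ["r1", "r2", "r3", "r4", "r5", "r6"]
pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
entities = cluster_matches(records, pairs)
```

Note that `r1` and `r3` end up together even though they were never directly compared. That transitive closure is why borderline pairs are usually routed to a stewardship queue instead of being merged automatically.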
11. Archives & Cold Storage
- Archive a snapshot or live table to S3-compatible storage → dataset_archive
- List archives for a dataset → dataset_archive_list
- Delete an archive (with optional connection override) → dataset_archive_delete
- Restore an archive as source or as a new snapshot → dataset_restore
- Storage backends: AWS S3, Backblaze B2, Cloudflare R2, Wasabi, DigitalOcean Spaces, MinIO
- Format: Parquet + manifest with SHA256 checksums; verified on restore
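Checksum verification on restore amounts to recomputing each file's SHA-256 and comparing it with the value recorded at archive time. The sketch below shows that idea with a simple manifest layout. The manifest structure (`{"files": {name: digest}}`) and the `manifest.json` filename are assumptions for illustration, not the archive format itself.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Record a checksum for every file in the directory (hypothetical layout)."""
    directory = Path(directory)
    files = {
        path.name: sha256_of(path)
        for path in sorted(directory.iterdir())
        if path.is_file() and path.name != "manifest.json"
    }
    return {"files": files}


def verify_manifest(directory, manifest):
    """Return the names of files that are missing or whose checksum differs."""
    mismatches = []
    for name, expected in manifest["files"].items():
        path = Path(directory) / name
        if not path.is_file() or sha256_of(path) != expected:
            mismatches.append(name)
    return mismatches
```

An empty list from `verify_manifest` means every archived file matches its recorded checksum. Any name in the list signals corruption or a missing file, and the restore should be treated as failed.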
12. Outbound Sync
- Sync a dataset to remote storage with token-templated paths and per-file sidecars → dataset_sync
- Sidecars available: dataset, schema, lineage, quality, governance, import_sql, readme, manifest
- Formats: parquet, csv, jsonl · Compression: gzip, zstd, snappy, lz4
- Mask before export with apply_masking: true
13. REST API Integration
- Save a REST endpoint with auth-vault binding → rest_create_endpoint
- Execute an endpoint (auto-save response as a file optional) → rest_execute
- List endpoints / collections → rest_list_endpoints
- Diff two execution responses → rest_diff_responses
- Auth types: none, basic, api_key, bearer, oauth2_client_credentials; credentials live in encrypted connections
- Connection management: connection_create · connection_list · connection_test
14. Crawling & Portal Discovery
- Auto-discover portals (data.gov, Socrata global, known portals) → discover_portals
- List crawler projects → crawler_list_projects
- Add a source URL to a project → project_add_source
- Configure project (schedule, concurrency, rate limit) → project_update
- Run a project on demand → crawler_run_project
- Refresh a single dataset by URL → crawler_refresh_dataset
- Provider support: Socrata, CKAN, ArcGIS, OpenDataSoft, DCAT, filesystem, generic fallback
15. Platform Administration & Tiers
- Check health (quick or full doctor with AI diagnostics) → health_check
- Get workspace context → workspace_info
- Browse the settings registry → settings_docs
- Create an API key for an account → tokens_create_api_key
- List / revoke tokens → tokens_list · tokens_revoke
- Verify a token → auth_verify_token
- List / create / revoke sessions → sessions_list · sessions_create · sessions_revoke
- List tier configs (and keys) → tiers_get_all
- See per-key usage → account_usage
- Override per-key limits → account_update
- Update a tier config (admin) → tier_update
- Admin dispatch (get_account / set_tier / get_usage / list_tiers) → admin_manage
- Smoke-test admin keys per tier → admin_test_keys
- Manage account records (subjects_*) → subjects_create · subjects_list · subjects_update
- Update dataset settings (freeze/lock, schedule, sampling, semantic overrides, derivatives) → dataset_settings
- Refresh dataset summary → dataset_refresh_summary
16. AI Engine, Agent UX, Reports, Workflows, JSON Tools
- AI Engine powers classification, embeddings, change narratives, reranking (nomic-embed-text 768-d)
- Greeting + intent planning → __hello (returns workflow archetype + steps when given an intent)
- Browse pre-built workflows by archetype or domain → workflow_list
- Get one workflow definition → workflow_get
- Start a UX session with format/client hints → ux_session_start
- List / save / get reports with lineage tracking → report_list · report_save · report_get
- Process JSON / flatten to CSV → process_json · flatten_to_csv · process_config
- Help docs (this file, features.md, codebase index, providers) → help_page
- Record a timeline event for audit + telemetry → events_ingest
At a Glance
| Category | Count | Examples |
|---|---|---|
| Catalog & Discovery | 7 tools | datasets_search, find_similar_datasets, fields_search |
| Ingestion | 9 tools | dataset_ingest_url, dataset_ingest_batch, watch_source_create |
| Analysis | 7 tools | analysis_queue_submit, get_analysis, classify_dataset |
| Query & Loading | 5 tools | dataset_load, dataset_query, dataset_sample |
| Governance & Masking | 11 tools | set_masking_mode, vault_lookup, generalization_apply |
| Schema & Alignment | 3 tools | schema_align_columns, dataset_create_join, dataset_create_union |
| Snapshots & CDC | 7 tools | dataset_snapshot_create, dataset_changes, dataset_snapshot_compare |
| Documents & RAG | 5 tools | rag_index, document_search, file_create_derivative |
| MDM & Topics | 11 tools | golden_records_list, stewardship_decide, topic_get_context |
| Archives & Sync | 7 tools | dataset_archive, dataset_restore, dataset_sync |
| REST & Connections | 7 tools | rest_create_endpoint, connection_create, connection_test |
| Crawling & Fingerprints | 10 tools | discover_portals, crawler_run_project, fingerprint_detect |
| Admin, Tokens, Tiers | ~20 tools | tokens_create_api_key, admin_manage, health_check |
| UX, Reports, Workflows, Help | 10 tools | ux_session_start, report_save, workflow_list, help_page |
Pair this with help_page("features") for the Q&A walkthrough and help_page("codebase/FEATURE_INVENTORY") for the full module-level inventory. Verified against scripts/mcp-server.ts for v0.19.9.