DataShield MCP Dataset Library

Features — Agent Quick Reference

Q&A guide for AI agents calling help_page("features"). Answers "how do I..." questions with the exact MCP tools, arguments, and order of operations.

Version: v0.19.9 | Audience: AI agents calling help_page("features"), and humans skimming capabilities.

Companion docs: help_page("CAPABILITIES") (one-page cheat sheet) · help_page("codebase/FEATURE_INVENTORY") · help_page("codebase/FUNCTION_REFERENCE")

MCP tool count: 134 (119 core tools + 15 git tools)

Parameter style: Tools use snake_case parameter names. Examples below show the exact arguments accepted by each tool's Zod schema in scripts/mcp-server.ts.

This guide is structured as task-oriented Q&A. For each "How do I...?" question you'll see:

  1. Tool to call — exact MCP tool name and required arguments
  2. What you get back — key result fields
  3. Notes — gotchas, related tools, deep dives

Quick index

| I want to... | Jump to |
|---|---|
| Find and browse datasets | Discovery |
| Load data from a URL / file / REST API | Ingestion |
| Profile or analyze a dataset | Analysis |
| Run SQL against a dataset | Query |
| Mask PII columns | Masking |
| Classify a dataset (AI tags) | Classification |
| Detect what changed since the last refresh | Diff & CDC |
| Create / restore snapshots | Snapshots |
| Archive a dataset to cold storage | Archive |
| Search documents semantically (RAG) | RAG |
| Match / deduplicate entities (MDM) | MDM |
| Join or union two datasets | Schema ops |
| Align schemas across sources | Schema alignment |
| Configure a REST API connection | REST |
| Watch a remote source for changes | Watch sources |
| Set up a portal crawler | Crawl |
| Generalize values (dob → year, etc.) | Generalization |
| Sync a dataset out to S3 / blob storage | Sync |
| Create / manage API keys | Auth |
| Check platform health | Health |
| Understand my tier and usage | Tiers |
| Track a long-running analysis job | Queue |
| Browse pre-built workflows | Workflows |


Discovery & Catalog

How do I find a dataset?

datasets_search { query?, provider?, visibility?, tag?, limit?, offset?, sort_by?, has_pii?, compliance_framework?, ... }
  • Inputs: free-text query, optional filters (provider, tag (singular, exact match), compliance_framework, min_rows, max_size, quality_rating, has_pii, has_phi, source_system, table_type).
  • Returns: paginated dataset list with title, provider, row counts, governance flags, quality score.
  • Sort options: relevance (default with query), created_desc (default without query), plus size/rows/columns/quality/title sorts.
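
A minimal sketch of how an agent might build paginated datasets_search arguments from limit and offset. The ToolCall wrapper and the searchPageCall helper are hypothetical names used for illustration; only the tool name and parameter names come from the signature above.

```typescript
// Hypothetical wrapper: a tool name plus its arguments, as an agent would send them.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Build the datasets_search call for a zero-based page number.
// pageSize = 25 is an arbitrary choice for this sketch, not a documented default.
function searchPageCall(
  query: string,
  page: number,
  pageSize = 25,
  filters: Record<string, unknown> = {},
): ToolCall {
  return {
    name: "datasets_search",
    arguments: { ...filters, query, limit: pageSize, offset: page * pageSize },
  };
}

// Third page of PII-bearing datasets tagged "hr" (tag is singular, exact match).
const thirdPage = searchPageCall("payroll", 2, 25, { has_pii: true, tag: "hr" });
console.log(JSON.stringify(thirdPage.arguments));
```

Keeping the filters in one object makes it easy to reuse the same filter set while only the offset advances.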

How do I get details on a specific dataset?

dataset_get { id, compact?, include_fields? }
  • Returns full metadata: schema, tier, source URL, refresh history, masking mode, snapshots, derivatives, lineage, RAG status.
  • Set compact: true to skip statistics; include_fields: false saves ~2,000 tokens for wide datasets.
  • For larger diagnostics: dataset_stats, dataset_tables, dataset_changes (CDC log).

How do I search columns/fields across the catalog?

fields_search { fieldQuery, semanticType?, limit?, offset? }
  • Searches column names, descriptions, and detected semantic types across every dataset.
  • Useful for "find every dataset with an email column" or "show me all columns tagged as SSN".

How do I find similar datasets?

find_similar_datasets { query, limit?, sort_by?, provider?, has_pii?, ... }
  • Note: takes a free-text query string, not a dataset ID. Combines AI Engine embeddings with the rich filter set from datasets_search.
  • For "similar to dataset X", use dataset_get { id: X } to read its title/tags, then pass those into find_similar_datasets.

How do I export a dataset's catalog metadata?

dataset_export_catalog { dataset_id, output_path?, include_data?, include_metadata? }
  • Writes a per-dataset catalog directory: schema JSON, lineage, governance, quality, README, manifest. With include_data: true, it also symlinks the Parquet file into catalog/data/.

Ingestion & ETL

How do I ingest a dataset from a URL?

dataset_ingest_url { url, title?, description?, tags?, dataset_id?, auto_analyze?, auto_load_db?, skip_preflight? }
  • url: any HTTP(S) URL, Socrata endpoint, data portal link, or presigned S3 URL.
  • dataset_id: optional. If omitted, a new dataset is created. If provided, the URL becomes the refresh source for that dataset.
  • auto_analyze / auto_load_db: default true — runs the full Conduit analysis and loads into the query DB after staging.
  • skip_preflight: for endpoints that don't support HEAD requests.
  • Returns: { ingest_id, dataset_id, status, rowCount, sizeBytes, stagedAt }.
  • Pipeline: the ingest runs through 5 gates — format detection → size guard → content sniff → duplicate/hash check → preflight. Failures return a machine-readable code.
  • Diff refresh: there is no refreshMode parameter. Differential refresh runs automatically when dataset_settings.diff_config.enabled = true for the dataset (see Diff & CDC).

How do I upload a file from local disk?

The library accepts file content directly via dataset_ingest. For large files you can use presigned URLs through dataset_get_upload_urls (which writes to a configured connection, not back to the library).

dataset_ingest { content, filename, title?, description?, tags?, dataset_id?, auto_analyze?, auto_load_db? }
  • content: raw CSV/JSON/TSV bytes as a UTF-8 string.
  • filename: must include the extension (e.g. employees.csv) — the format is inferred from this.
  • dataset_id: optional, for refreshing an existing dataset.

For presigned S3 uploads to a connected storage backend (bulk — 1 to 50 files per call):

dataset_get_upload_urls { connection_id, files: [{ filename, content_type? }, ...], ttl_seconds? }
  • connection_id: an S3-compatible connection_create entry. (This tool does not take a dataset_id.)
  • files: array of 1–50 objects; each yields a PUT URL and a matching GET URL.
  • Returns signed PUT/GET URLs valid for ttl_seconds (default 3600, max 86400).
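
Because each call accepts at most 50 files, a larger upload has to be split across several calls. The sketch below shows one way to do that on the client side; buildUploadUrlCalls is a hypothetical helper, and rejecting non-positive TTLs is an assumption (the source states only the default and the maximum).

```typescript
const MAX_FILES_PER_CALL = 50;   // from the 1-50 files limit above
const DEFAULT_TTL_SECONDS = 3600;
const MAX_TTL_SECONDS = 86400;

interface UploadFile {
  filename: string;
  content_type?: string;
}

interface UploadUrlArgs {
  connection_id: string;
  files: UploadFile[];
  ttl_seconds: number;
}

// Split a file list into dataset_get_upload_urls argument objects of at most 50 files each.
function buildUploadUrlCalls(
  connectionId: string,
  files: UploadFile[],
  ttlSeconds = DEFAULT_TTL_SECONDS,
): UploadUrlArgs[] {
  if (files.length === 0) {
    throw new Error("at least one file is required");
  }
  // Assumption: a TTL must be positive; only the upper bound is documented.
  if (ttlSeconds <= 0 || ttlSeconds > MAX_TTL_SECONDS) {
    throw new Error("ttl_seconds must be between 1 and " + MAX_TTL_SECONDS);
  }
  const calls: UploadUrlArgs[] = [];
  for (let i = 0; i < files.length; i += MAX_FILES_PER_CALL) {
    calls.push({
      connection_id: connectionId,
      files: files.slice(i, i + MAX_FILES_PER_CALL),
      ttl_seconds: ttlSeconds,
    });
  }
  return calls;
}
```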

How do I ingest multiple URLs at once?

dataset_ingest_batch { urls: [{ url, title?, tags? }, ...] }
  • Field name is urls (max 50 entries). Each URL becomes its own dataset (or refresh).
  • Track progress with ingest_status { batch_id }.

How do I check status of an in-flight ingest?

ingest_status { ingest_id?, batch_id?, dataset_id? }
  • Provide one of the three IDs. Returns queue/progress state plus error details.

How do I watch remote storage for new files?

watch_source_create { connection_id, watch_path, glob_pattern?, recursive?, poll_interval_seconds?, auto_analyze? }
watch_source_list   { enabled_only? }
watch_source_poll   {}
  • connection_id: a previously created storage connection_create entry (S3, blob, GCS, SFTP, FTP, IMAP, SMB, filesystem).
  • watch_path: base path to monitor (e.g. /data/exports).
  • glob_pattern: file filter (default *).
  • poll_interval_seconds: between 60 and 86400 (default 300).
  • auto_analyze: default true — new files run through the full ingest pipeline.
  • watch_source_poll triggers a one-shot poll across all enabled watches.

How do I create and test a connection?

connection_create { name, connection_type, config, provider_name?, base_url?, scopes?, tags? }
connection_list   { provider_name?, connection_type?, tag? }
connection_test   { connection_id }
  • connection_type: oauth2, api_key, basic, bearer, custom (for credential storage), or storage-specific types used by dataset_sync and watch_source_create.
  • config: credentials object — encrypted at rest. The execution engine injects auth at call time; raw secrets never appear in tool arguments or logs.

Analysis & Profiling

How do I analyze a dataset?

Quick AI classification only:

classify_dataset { datasetId, autoApprove? }

Full 20-section analysis (queued):

analysis_queue_submit { dataset_id, file_id?, job_type?, download_url?, priority? }
analysis_queue_status { job_id?, status_filter?, limit? }
get_analysis          { dataset_id?, file_id?, run_id?, summary_only? }
  • job_type: "full_analysis" (default — full pipeline against an existing file), "refresh_analyze" (download from download_url then analyze), "reprocess" (re-run on existing file).
  • priority: 1–10 (default 5).
  • get_analysis returns the canonical 20-section AnalysisContract v1.1 (overview · columns · semanticTypes · piiFlags · piiCompliance · dataQuality · patterns · distributions · correlations · joinCandidates · keyColumns · anomalies · freshness · lineage · fileManifest · storage · rowSample · documentIntelligence · maskingReadiness · embeddings).
  • Set summary_only: true for the compact summary instead of all 20 sections.

How do I cancel or retry an analysis job?

analysis_queue_cancel { job_id }
analysis_queue_retry  { job_id }

How do I classify a dataset into topic tags?

classify_dataset { datasetId, autoApprove? }
  • Uses the AI Engine to assign topic tags based on column names + sample rows. Set autoApprove: true to write tags directly without human review.
  • (There is no strategy parameter — the classifier picks its own depth based on dataset size.)

How do I classify many datasets in bulk?

classify_batch { limit? }
  • Processes up to limit (default 10, max 50) un-classified datasets in a single call. There is no datasetIds[] parameter — the tool picks them off the queue.

How do I preflight-check a URL before ingesting?

analysis_preflight_check { url, dataset_id? }
  • Probes the URL via HEAD/range requests, returns: estimated row count, estimated cost tier (GREEN/YELLOW/ORANGE/RED), warnings, can-run verdict.
  • Use this to avoid wasting tier quota on a problematic source.

Query & Data Loading

How do I query a dataset with SQL?

dataset_load  { dataset_id, file_id?, workspace_id?, framework?, masking_mode?, include_derivatives? }
dataset_query { dataset_id, sql, table?, limit?, offset?, force? }
  • dataset_load is idempotent — call it before dataset_query to ensure the dataset is materialized into PostgreSQL.
  • framework: optional compliance framework override (hipaa | sox | cjis | pci | gdpr | ccpa) — applies the appropriate masking ruleset.
  • masking_mode: "view" (default) or "physical" — see Masking.
  • dataset_query.sql: SELECT only. DDL and unscoped DELETE are rejected by validateSql.
  • table: which physical table to query — defaults to "source". Other options include sheet_* (per-sheet for Excel) and deriv_* (derivatives).
  • force: admin-only bypass for cost guardrails.
  • Returns: { rows, rowCount, columns, costTier, elapsedMs, engine }.
  • Routing: Simple SELECTs run in PostgreSQL. Heavy analytical queries route to DuckDB via queryRouter.decideQueryEngine.
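
The load-then-query order can be sketched as a small planning function that returns the two calls in sequence. ToolCall, looksLikeSelect, and planQuery are hypothetical names; the client-side SELECT check is only a rough guard, and the server's validateSql remains the authority on what is accepted.

```typescript
// Hypothetical wrapper: a tool name plus its arguments.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Rough client-side guard. It only checks the leading keyword.
function looksLikeSelect(sql: string): boolean {
  return /^\s*select\b/i.test(sql);
}

// Return the calls in the order they should be issued:
// dataset_load first (idempotent), then dataset_query.
function planQuery(datasetId: string, sql: string, limit = 100): ToolCall[] {
  if (!looksLikeSelect(sql)) {
    throw new Error("dataset_query accepts SELECT statements only");
  }
  return [
    { name: "dataset_load", arguments: { dataset_id: datasetId, masking_mode: "view" } },
    { name: "dataset_query", arguments: { dataset_id: datasetId, sql, table: "source", limit } },
  ];
}

const plan = planQuery("ds_123", "SELECT state, COUNT(*) FROM source GROUP BY state");
console.log(plan.map((call) => call.name).join(" -> "));
```

The dataset ID "ds_123" and the limit of 100 are placeholder values for the example.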

How do I sample rows from a dataset?

dataset_sample { dataset_id, table?, rows? }
  • rows defaults to 10 (max 100). Cheaper than running a SELECT * LIMIT n.

How do I get statistics for a dataset?

dataset_stats { dataset_id, table? }
  • Returns per-column null %, distinct count, min/max/avg, top values — read from the cached analysis store when available.

How do I list tables in a dataset?

dataset_tables { dataset_id }
  • Returns the canonical source table, masking view (if enabled), per-sheet tables (Excel), snapshot tables, and derivatives.

Data Governance, Compliance, Masking

How do I mask PII?

set_masking_mode { dataset_id, mode }
  • mode options (only two):
    • view — a masking view replaces the base table for queries. Reversible: toggle off any time. Cheap.
    • physical — values are tokenized in-place and originals stored in the encrypted vault (AES-256-GCM). Queries return tokens; use vault_lookup to reverse.
  • (There is no none or tokenized enum value; "no masking" is the absence of a configured mode, and physical masking already preserves format.)

How do I look up an original value from a token?

vault_lookup { dataset_id, token }
  • Both arguments are required. Requires the vault:read scope on your API key. Returns the decrypted original value.
  • Stats only: vault_stats { dataset_id } returns { tokenCount, distinctCount, lastWritten } without exposing values.

How do I apply a generalization strategy?

generalization_list      { category?, include_portal? }
generalization_preview   { dataset_id, column, strategy, options?, sample_size? }
generalization_apply     { dataset_id, column, strategy, options?, apply_mode? }
generalization_remove    { dataset_id, column }
generalization_stats     { dataset_id }
generalization_bulk_apply { dataset_id, configs: [{ column, strategy, options?, apply_mode? }] }
  • Built-in strategies: dob_generalize, zip_generalize, age_band, phone_partial, ssn_partial, email_domain — list with generalization_list.
  • apply_mode: "view" (default — transform at query time, reversible) or "physical" (transform at copy time).
  • Custom rules can be loaded from the portal integration (Team/Enterprise tier) when include_portal: true is set on generalization_list.

How do I check compliance posture for a dataset?

After analysis completes, read the piiCompliance section from the latest analysis:

get_analysis { dataset_id }
  • Look at the piiCompliance section: { HIPAA: {...}, GDPR: {...}, PCI: {...}, SOX: {...}, CCPA: {...}, CJIS: {...} }.
  • Each framework lists matched columns, rule IDs, and severity (info / warn / critical).

Schema Management & Alignment

How do I align columns across sources?

schema_align_columns { sources: [{ dataset_id, alias? }, ...] }
  • Pass 2 or more sources. The first source becomes the target schema; the rest are aligned to it.
  • Uses a 4-signal scorer: name similarity (Jaro-Winkler) + semantic type + data type compatibility + sample value overlap.
  • Returns a per-column alignment with confidence scores. Feed the result into dataset_create_union for multi-source merges.

How do I join or union two datasets?

Join (exactly 2 sources):

dataset_create_join { sources: [{ dataset_id, alias, columns? }, { dataset_id, alias, columns? }] }
  • Both dataset_id and alias are required per source. columns is an optional whitelist (default: all columns).
  • The aliases are used to disambiguate column-name collisions.

Union (2 or more sources):

dataset_create_union { sources: [{ dataset_id, alias, column_mapping? }, ...] }
  • column_mapping: optional { "SOURCE_COL": "target_col" } rename map per source — used to align mismatched column names without a separate alignment step.
  • Each row in the unioned output carries _source_alias for lineage.
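
To make the effect of column_mapping and _source_alias concrete, here is an in-memory illustration of the documented behavior. It is not the server implementation; unionRows and the UnionSource shape (with inline rows) are hypothetical and exist only for this sketch.

```typescript
type Row = Record<string, unknown>;

// Hypothetical in-memory stand-in for one union source.
interface UnionSource {
  alias: string;
  rows: Row[];
  column_mapping?: Record<string, string>; // { "SOURCE_COL": "target_col" }
}

// Rename columns per source, then tag every output row with its source alias.
function unionRows(sources: UnionSource[]): Row[] {
  if (sources.length < 2) {
    throw new Error("a union needs 2 or more sources");
  }
  const out: Row[] = [];
  for (const source of sources) {
    const mapping = source.column_mapping ?? {};
    for (const row of source.rows) {
      const renamed: Row = { _source_alias: source.alias };
      for (const column of Object.keys(row)) {
        // Columns without a mapping entry keep their original name.
        renamed[mapping[column] ?? column] = row[column];
      }
      out.push(renamed);
    }
  }
  return out;
}

const merged = unionRows([
  { alias: "crm", rows: [{ email: "a@example.com" }] },
  { alias: "erp", rows: [{ EMAIL_ADDR: "b@example.com" }], column_mapping: { EMAIL_ADDR: "email" } },
]);
console.log(JSON.stringify(merged));
```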

How do I see the schema of a dataset?

dataset_get   { id }       # includes the field/schema list
dataset_tables { dataset_id } # includes column defs per physical table

Snapshots & Versioning

How do I create a snapshot?

dataset_snapshot_create { dataset_id, name? }
  • Creates an immutable copy of the current source table. Snapshots are physical (CTAS) tables under a _snapshot_<timestamp> suffix.
  • Optionally pass a name suffix; otherwise the timestamp suffix is auto-generated.
  • (There is no retentionDays parameter — retention is governed by tier defaults and the dataset_settings schedule.)

How do I list, compare, restore, or delete snapshots?

dataset_snapshot_list    { dataset_id, include_retention_summary? }
dataset_snapshot_compare { dataset_id, snapshot_a, snapshot_b, key_column?, sample_diffs? }
dataset_snapshot_restore { dataset_id, snapshot_name }
dataset_snapshot_delete  { dataset_id, snapshot_name }
  • snapshot_a / snapshot_b: snapshot names (e.g. _snapshot_20240315120000), or the literal string "source" for the current live data.
  • key_column: optional PK override for compare. If omitted, the platform auto-detects from the latest analysis + dataset_settings.diff_config.primary_key_columns.
  • sample_diffs: max sample rows per bucket (added/deleted/modified), default 5.
  • restore is destructive — the current source table is replaced by the snapshot.
  • delete requires both dataset_id and the snapshot_name.
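
When an agent needs to reason about auto-generated snapshot names, a formatter like the following can help. The 14-digit yyyyMMddHHmmss layout is inferred from the example name above, and the use of UTC is an assumption; snapshotName is a hypothetical helper, not a platform function.

```typescript
// Left-pad a number with zeros to the given width.
function zeroPad(value: number, width: number): string {
  let text = String(value);
  while (text.length < width) text = "0" + text;
  return text;
}

// Format a Date as "_snapshot_yyyyMMddHHmmss".
// Assumption: timestamps are UTC; the source does not state the time zone.
function snapshotName(at: Date): string {
  return (
    "_snapshot_" +
    zeroPad(at.getUTCFullYear(), 4) +
    zeroPad(at.getUTCMonth() + 1, 2) + // getUTCMonth is zero-based
    zeroPad(at.getUTCDate(), 2) +
    zeroPad(at.getUTCHours(), 2) +
    zeroPad(at.getUTCMinutes(), 2) +
    zeroPad(at.getUTCSeconds(), 2)
  );
}

// Arguments for comparing a snapshot against the live table via the literal "source".
const compareArgs = {
  dataset_id: "ds_123", // placeholder ID
  snapshot_a: snapshotName(new Date(Date.UTC(2024, 2, 15, 12, 0, 0))),
  snapshot_b: "source",
  sample_diffs: 5,
};
console.log(compareArgs.snapshot_a);
```

In practice it is safer to read real names from dataset_snapshot_list than to construct them.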

Differential Refresh & CDC (SPIKE-005)

How do I detect what changed in a dataset?

Automatic (recommended): when dataset_settings.diff_config.enabled = true for a dataset, every refresh (via dataset_ingest_url or scheduled crawl) automatically computes a diff:

  1. Detects primary keys via the 5-tier cascade (declared → unique index → high-cardinality column → composite → synthetic row hash).
  2. Picks an engine (PG for row counts under pg_max_rows, DuckDB for larger).
  3. Computes add/change/delete sets.
  4. Applies schema evolution (new columns, widened types, dropped columns).
  5. Writes a changeset derivative file.
  6. Appends to the CDC log (registry.dataset_changes).
  7. Emits timeline events (diff.computed, diff.no_changes, diff.schema_evolved).
  8. Runs downstream consumers (RAG incremental, MDM re-eval, LLM narrative).
  9. Marks refresh complete with a run ID.
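
Step 3 above (computing the add/change/delete sets) can be illustrated with a small keyed comparison. This is a sketch of the general technique over in-memory rows, not the platform's differ; computeChangeset and sameRow are hypothetical names, and values are compared shallowly.

```typescript
type Row = Record<string, unknown>;

interface Changeset {
  added: Row[];
  changed: { before: Row; after: Row }[];
  deleted: Row[];
}

// Shallow equality over column values (sufficient for primitive cells).
function sameRow(a: Row, b: Row): boolean {
  const keys = Object.keys(a);
  if (keys.length !== Object.keys(b).length) return false;
  return keys.every((key) => key in b && a[key] === b[key]);
}

// Compare two versions of a table by a single primary-key column.
function computeChangeset(oldRows: Row[], newRows: Row[], keyColumn: string): Changeset {
  const oldByKey = new Map<string, Row>();
  for (const row of oldRows) oldByKey.set(String(row[keyColumn]), row);

  const result: Changeset = { added: [], changed: [], deleted: [] };
  const seen = new Set<string>();
  for (const after of newRows) {
    const key = String(after[keyColumn]);
    seen.add(key);
    const before = oldByKey.get(key);
    if (before === undefined) {
      result.added.push(after);
    } else if (!sameRow(before, after)) {
      result.changed.push({ before, after });
    }
  }
  // Anything in the old version that never appeared in the new one was deleted.
  oldByKey.forEach((before, key) => {
    if (!seen.has(key)) result.deleted.push(before);
  });
  return result;
}
```

Keeping the before and after rows together in the changed bucket mirrors the idea of recording old values alongside updates.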

Inspect the change log manually:

dataset_changes { dataset_id, since?, until?, operation?, row_key?, limit?, offset? }
  • since / until: ISO datetimes filtering by changed_at.
  • operation: "INSERT", "UPDATE", or "DELETE" (uppercase enum values).
  • row_key: trace a specific row's history by its primary-key value (as a string).
  • limit: 1..1000, default 100.

How do I see the diff between two snapshots on demand?

dataset_snapshot_compare { dataset_id, snapshot_a, snapshot_b, key_column?, sample_diffs? }
  • Returns add/update/delete bucket counts plus sample rows from each bucket. Pass "source" for either snapshot to compare against the current live table.

How do I configure differential refresh for a dataset?

Important — no MCP tool exposes diff_config writes. The dataset_settings MCP tool does not accept a diff_config field. Configuration is set via SQL/portal admin against registry.dataset_settings.diff_config (a JSONB column). The actual fields, defined in lib/conduit/diffConfig.ts, are:

| Field | Type | Default | Purpose |
|---|---|---|---|
| enabled | boolean | false | Master switch. When false, refreshes overwrite without diffing. |
| engine | "auto" \| "pg" \| "duckdb" | "auto" | Diff engine. auto picks PG under pg_max_rows, else DuckDB. |
| pg_max_rows | number | 1_000_000 | Row threshold above which auto engine routes to DuckDB. |
| primary_key_columns | string[] \| null | null | Manual PK override. When null, the 5-tier detection cascade runs. |
| major_revision_threshold | number (0..1) | 0.5 | Fraction of rows changed that triggers "major revision" classification (caps detailed CDC logging on bulk overwrites). |
| auto_diff_on_refresh | boolean | true | Whether refresh hooks invoke the differ when enabled = true. |
| store_old_values | boolean | true | Include old_values JSONB on UPDATE rows in dataset_changes. |

To set these today, run a SQL update (or use the portal admin UI):

UPDATE registry.dataset_settings
SET diff_config = jsonb_set(diff_config, '{enabled}', 'true')
WHERE dataset_id = '<uuid>';

(Adding an MCP tool to expose diff_config writes is a known gap — see docs/roadmap.md.)
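
For agents that read diff_config and need to predict how a refresh will behave, the documented fields and defaults can be modelled as below. This is a sketch written from the field table in this section, not a copy of lib/conduit/diffConfig.ts; pickEngine and isMajorRevision are hypothetical helpers, and the behavior exactly at each threshold is an assumption.

```typescript
interface DiffConfig {
  enabled: boolean;
  engine: "auto" | "pg" | "duckdb";
  pg_max_rows: number;
  primary_key_columns: string[] | null;
  major_revision_threshold: number; // fraction in 0..1
  auto_diff_on_refresh: boolean;
  store_old_values: boolean;
}

// Defaults as listed in the field table.
const DIFF_CONFIG_DEFAULTS: DiffConfig = {
  enabled: false,
  engine: "auto",
  pg_max_rows: 1_000_000,
  primary_key_columns: null,
  major_revision_threshold: 0.5,
  auto_diff_on_refresh: true,
  store_old_values: true,
};

// "auto" picks PG under pg_max_rows, else DuckDB.
// Assumption: a row count exactly equal to pg_max_rows goes to DuckDB.
function pickEngine(config: DiffConfig, rowCount: number): "pg" | "duckdb" {
  if (config.engine !== "auto") return config.engine;
  return rowCount < config.pg_max_rows ? "pg" : "duckdb";
}

// Assumption: reaching the threshold exactly counts as a major revision.
function isMajorRevision(config: DiffConfig, changedRows: number, totalRows: number): boolean {
  if (totalRows === 0) return false;
  return changedRows / totalRows >= config.major_revision_threshold;
}

console.log(pickEngine(DIFF_CONFIG_DEFAULTS, 250_000));
```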


Document Intelligence & RAG

How do I search documents inside a dataset?

rag_index      { dataset_id, chunk_size?, chunk_overlap?, column_mapping }
document_search { dataset_id, query, top_k? }
  • rag_index.column_mapping (required, added in v0.19.4): a ragColumnMappingSchema object that names the title/body/timestamp columns to chunk. Use it to handle datasets with multiple text columns.
  • chunk_size: 128–2048 tokens (default 512). chunk_overlap: 0–256 (default 50).
  • document_search.top_k: 1–20 (default 5). The result set is hybrid (vector + BM25 + RRF) when the index supports it; there is no mode parameter.
  • Embeddings are 768-d nomic-embed-text vectors stored in pgvector (HNSW), plus a GIN tsvector for BM25.
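
The hybrid result set combines a vector ranking and a BM25 ranking with reciprocal rank fusion (RRF). The sketch below shows the standard RRF formula, where each list contributes 1 / (k + rank) per item. The constant k = 60 is the value commonly used in the literature and is an assumption here; the source does not state the server's constant, and reciprocalRankFusion is a hypothetical name.

```typescript
interface FusedHit {
  id: string;
  score: number;
}

// Fuse several ranked lists of chunk IDs (best first) into one ranking.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: FusedHit[] = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  // Highest fused score first.
  return fused.sort((a, b) => b.score - a.score);
}

const vectorHits = ["c1", "c2", "c3"]; // placeholder chunk IDs
const bm25Hits = ["c2", "c4", "c1"];
console.log(reciprocalRankFusion([vectorHits, bm25Hits]).map((hit) => hit.id).join(","));
// prints "c2,c1,c4,c3"
```

Chunks that appear in both lists rise to the top even when neither list ranks them first, which is the main appeal of rank-based fusion over mixing raw scores.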

How do I refresh the RAG index after new data arrives?

rag_reindex { dataset_id, chunk_size?, chunk_overlap?, column_mapping }
  • Same schema as rag_index. Forces a full rebuild.
  • Cheaper alternative: enable diff_config.enabled for the dataset. The diff pipeline runs ragIncremental, which only re-embeds added/changed chunks and tombstones deleted ones.

How do I extract text from PDFs, DOCX, HTML, EML files?

Document extraction happens automatically during ingest for recognized MIME types — the resulting text becomes a derivative file (document_extract.*), is indexed by RAG if column_mapping covers it, and is scanned for PII via documentPiiScan.

To trigger extraction manually:

file_create_derivative { repo_id, parent_file_id, filename, content_type?, data, derivation_type, derivation_config, analysis? }
  • derivation_type: extract, flatten_csv, filter, transform, or analysis_export.
  • data: base64-encoded file content (the new derivative's bytes).
  • derivation_config: strategy-specific JSON (e.g. { jsonPath, flattenDepth } for flatten_csv).

Master Data Management

How do I resolve entities (find golden records)?

golden_records_list { topic_id, entity_type?, limit?, offset? }
golden_record_get   { entity_id }
stewardship_decide  { stewardship_id, decision: "merge" | "separate" | "defer", resolution_notes?, resolved_by? }
  • golden_records_list is scoped to a topic_id, not a dataset_id — golden records live under topics within domains.
  • stewardship_decide.decision: the enum is "merge" | "separate" | "defer" (use "separate", not "keep_separate").
  • Under the hood: the MDM pipeline runs 3-stage blocking (deterministic key → phonetic → LSH/MinHash), a Fellegi-Sunter scorer, Union-Find clustering, then surfaces uncertain clusters to the stewardship queue.
  • The 4-phase normalizer (text → semantic → domain → key) runs before blocking so "St." vs "Street", "J. Smith" vs "John Smith", etc. collapse.
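
The Union-Find clustering step can be sketched as follows: given pairs of records that the scorer accepted as matches, group the records into connected clusters. This illustrates the general technique only; clusterMatches is a hypothetical name, and records are identified by array index for simplicity.

```typescript
// Group record indexes 0..recordCount-1 into clusters from accepted match pairs.
function clusterMatches(recordCount: number, matches: [number, number][]): number[][] {
  const parent: number[] = [];
  for (let i = 0; i < recordCount; i++) parent.push(i);

  // Find the representative of x, compressing the path on the way back.
  const find = (x: number): number => {
    let root = x;
    while (parent[root] !== root) root = parent[root] as number;
    let node = x;
    while (node !== root) {
      const next = parent[node] as number;
      parent[node] = root;
      node = next;
    }
    return root;
  };

  // Union: merge the two clusters that contain a and b.
  for (const [a, b] of matches) {
    const rootA = find(a);
    const rootB = find(b);
    if (rootA !== rootB) parent[rootB] = rootA;
  }

  // Collect members under their representative.
  const clusters = new Map<number, number[]>();
  for (let i = 0; i < recordCount; i++) {
    const root = find(i);
    const members = clusters.get(root) ?? [];
    members.push(i);
    clusters.set(root, members);
  }
  const out: number[][] = [];
  clusters.forEach((members) => out.push(members));
  return out;
}

console.log(JSON.stringify(clusterMatches(5, [[0, 1], [1, 2], [3, 4]])));
// prints [[0,1,2],[3,4]]
```

Matches are transitive under this scheme: records 0 and 2 land in the same cluster even though they were never compared directly.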

How do I create a domain / topic / knowledge record?

domain_create        { name, owner?, compliance_framework?, access_policy? }
domain_list          {}
topic_create         { domain_id, title, description?, created_by? }
topic_add_resource   { topic_id, resource_type: "dataset"|"report"|"snapshot"|"file", resource_id, resource_role?, notes?, linked_by? }
topic_add_discussion { topic_id, content, author_type: "human"|"agent"|"system", author_id?, parent_id? }
topic_add_knowledge  { topic_id, fact_type?, content, source_links?, confidence?, created_by? }
topic_get_context    { topic_id, max_tokens?, include_discussions?, include_knowledge? }
  • topic_create.title (not name) is the topic display label.
  • topic_add_discussion.content + author_type are required; parent_id enables threading.
  • topic_add_knowledge.content is a JSON object — use fact_type to label it (entity_profile, decision, match_strategy, observation, etc.).
  • topic_get_context returns a budgeted context pack (default 8000 tokens) for an LLM working in the topic's scope.

How does MDM react to dataset changes?

When the diff pipeline runs, the changeset consumer:

  • Flags affected golden records as stale (queued for re-scoring).
  • Flags uncertain matches as reeval (queued for stewardship review).
  • Tombstones source links for deleted rows (the links are marked as deleted rather than removed).

Archives & Cold Storage

How do I archive a dataset to cold storage?

dataset_archive { dataset_id, snapshot_name?, drop_after?, connection_id? }
  • snapshot_name: snapshot to archive. If omitted, the current source table is archived.
  • drop_after: default true — drops the local snapshot after a successful upload to reclaim disk space.
  • connection_id: S3-compatible connection (AWS S3, Backblaze B2, Cloudflare R2, Wasabi, DigitalOcean Spaces, MinIO). If omitted, falls back to the ARCHIVE_S3_BUCKET env var.
  • Writes data + a manifest with SHA256 checksums to the connection's bucket/prefix.

How do I list, delete, or restore archives?

dataset_archive_list   { dataset_id }
dataset_archive_delete { dataset_id, archive_id, connection_id? }
dataset_restore        { dataset_id, archive_id, as_snapshot?, connection_id? }
  • dataset_archive_list.dataset_id is required.
  • as_snapshot: default false — replaces the source table. Set true to restore as a new snapshot instead.
  • Restore verifies SHA256 checksums before writing.

Sync (Outbound Export)

How do I sync a dataset out to remote storage?

dataset_sync { dataset_id, connection_id, destination_path?, path_template?, data_format?, compression?, apply_masking?, row_limit?, columns?, include_metadata?, sidecars? }
  • Pushes the dataset (data + sidecar metadata) to a configured storage backend.
  • destination_path: remote prefix (default /_uploads/{slug}/).
  • path_template: path with token substitution: {id}, {slug}, {title}, {provider}, {date}, {timestamp}.
  • data_format: auto | parquet | csv | jsonl. compression: auto | none | gzip | zstd | snappy | lz4.
  • apply_masking: sync the masked view instead of raw rows.
  • sidecars: per-file toggles for dataset, schema, lineage, quality, governance, import_sql, readme, manifest.
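
Token substitution in path_template can be previewed client-side with a sketch like this one. The six token names come from the list above; renderPathTemplate is a hypothetical helper, the formats used for {date} and {timestamp} in the example are assumptions, and leaving unknown tokens untouched is a choice made for this sketch.

```typescript
interface TemplateValues {
  id: string;
  slug: string;
  title: string;
  provider: string;
  date: string;
  timestamp: string;
}

// Replace each documented {token} with its value; anything else is left as-is.
function renderPathTemplate(template: string, values: TemplateValues): string {
  return template.replace(
    /\{(id|slug|title|provider|date|timestamp)\}/g,
    (_match: string, token: string) => values[token as keyof TemplateValues],
  );
}

const preview = renderPathTemplate("/exports/{provider}/{slug}/{date}/", {
  id: "ds_123",                  // placeholder values throughout
  slug: "payroll-2024",
  title: "Payroll 2024",
  provider: "socrata",
  date: "2024-03-15",            // assumed format
  timestamp: "20240315120000",   // assumed format
});
console.log(preview); // prints "/exports/socrata/payroll-2024/2024-03-15/"
```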

REST API Integration

How do I call a REST API and keep results fresh?

rest_create_endpoint { name, url, method?, collection_id?, headers?, body_template?, auth_type?, auth_connection_id? }
rest_execute        { endpoint_id, auto_save?, dataset_id?, overrides? }
rest_list_endpoints { collection_id? }
rest_diff_responses { execution_id_1, execution_id_2 }
  • auth_connection_id references a connection_create entry that stores OAuth/API key/JWT credentials encrypted with AES-256-GCM. The execution engine injects auth at call time — raw credentials are never exposed in tool arguments or logs.
  • rest_execute: runs the request. With auto_save: true, the response body is stored as a file (linked to dataset_id if provided).
  • rest_diff_responses takes two execution_id values (not endpoint IDs) and returns a structured diff.

Crawling & Portal Discovery

How do I crawl a data portal?

discover_portals      { dataGov?, socrataGlobal?, knownPortals?, enableProjects?, socrataMinDatasets? }
crawler_list_projects { limit?, offset?, enabledOnly? }
project_add_source    { projectId, sourceUrl, enabled? }
project_update        { projectId, enabled?, scheduleCron?, concurrency?, rateLimitRps?, maxRequestsPerRun? }
crawler_run_project   { projectId }
crawler_refresh_dataset { url, dataset_id?, repo_id? }
  • discover_portals: seeds known portal projects. Toggles control which discovery engines run (dataGov, socrataGlobal, knownPortals). Set enableProjects: true to start crawling immediately.
  • project_add_source.sourceUrl (not source) — accepts portal roots, dataset links, DCAT feeds, or file:// paths.
  • crawler_refresh_dataset takes a URL (and optionally an existing dataset_id to link to), not just a dataset ID.
  • Provider support: socrata, ckan, arcgis, opendatasoft, dcat, filesystem, plus a generic fallback. Provider auto-detection lives in lib/server/providers/detector.ts.

How do I fingerprint a data source?

fingerprint_detect { columns, sampleValues?, metadata? }
fingerprint_list   { category? }
fingerprint_add    { system, label, category, columnExacts, columnPatterns?, valuePatterns? }
fingerprint_remove { system }
  • fingerprint_detect takes the column names of a dataset (and optional sample values), not a URL — it scores them against the registered fingerprint catalog and returns the best-matching source system.
  • fingerprint_add registers a custom system; fingerprint_remove soft-deactivates by system ID.

JSON & File Tools

How do I flatten a nested JSON payload?

process_json    { data, jsonPath?, maxRows?, flattenDepth? }
flatten_to_csv  { data, jsonPath?, flattenDepth?, delimiter? }
process_config  { action: "get" | "update", config? }
  • process_json: processes nested JSON into rows (object-flattening up to flattenDepth).
  • flatten_to_csv: same flatten logic but emits CSV directly.
  • process_config: view or update server-side flatten/processing defaults.
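
Object flattening up to a depth limit can be sketched as below. This is an illustration of the idea, not the server's flatten logic: the dot separator for nested keys, the meaning of flattenDepth as "number of object levels expanded", and the flattenRecord name are all assumptions.

```typescript
type Json = string | number | boolean | null | Json[] | JsonObject;
interface JsonObject {
  [key: string]: Json;
}

function isPlainObject(value: Json): value is JsonObject {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Expand nested objects into dotted keys until flattenDepth levels have been expanded.
// Objects below the limit, and all arrays, are kept as values.
function flattenRecord(
  value: Json,
  flattenDepth: number,
  prefix = "",
  depth = 0,
  out: JsonObject = {},
): JsonObject {
  if (isPlainObject(value) && depth < flattenDepth) {
    for (const key of Object.keys(value)) {
      const child = value[key] as Json;
      flattenRecord(child, flattenDepth, prefix === "" ? key : prefix + "." + key, depth + 1, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

const flat = flattenRecord({ id: 1, owner: { name: "Ada", address: { city: "Paris" } } }, 2);
console.log(Object.keys(flat).join(","));
// prints "id,owner.name,owner.address"
```

With a depth of 2 the address object stays nested; raising the depth to 3 would produce an owner.address.city column instead.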

How do I track files across derivative versions?

file_list_versions { repo_id?, file_id?, dataset_id? }
file_repo_summary  { repo_id }
store_analysis     { file_id?, dataset_file_id?, result, engine_name?, engine_version? }
  • file_list_versions walks the lineage chain for a file (or every file in a repo).
  • store_analysis writes a Conduit analysis result back to the registry, linked to a file.

Platform Administration

How do I create an API key?

tokens_create_api_key { account_id, name, scopes?, expires_at?, category?, tier?, description? }
tokens_list           { type?, account_id?, include_revoked? }
tokens_revoke         { token_id }
  • account_id is required — admins create keys on behalf of an account.
  • scopes: array of scope strings, e.g. ["read:datasets", "vault:read", "admin:*"]. Default: ["public:read"].
  • category: test | mcp_tool | upload | general (default general).
  • tier: free | education | developer | team | enterprise (default free).
  • tokens_revoke uses the parameter name token_id (not keyId).

How do I check if a token is valid?

auth_verify_token { token }
  • Returns { ok, account, scopes, tier, expires_at, ... }. Read-only.

How do I list / create / revoke sessions?

sessions_list   { account_id, include_revoked? }
sessions_create { account_id, user_agent?, ip_address? }
sessions_revoke { session_id }
  • sessions_list.account_id is required.
  • sessions_create does not take a TTL — session lifetime is determined by server config.

How do I see my tier and usage?

tiers_get_all { keyLimit?, keyOffset?, tiersOnly? }
account_usage { keyId }
account_update { keyId, overrides: { dailyLimit?, monthlyLimit?, tier?, scopes? } }
  • account_usage is per-key — pass the keyId you want usage for.
  • tiers_get_all lists tier configs and (optionally) keys.

How do I manage tiers (admin only)?

tier_update    { tier, patch: { dailyLimit?, totalLimit?, monthlyLimit?, priceCents?, ... } }
admin_manage   { action: "get_account" | "set_tier" | "get_usage" | "list_tiers", email?, account_id?, key_id?, tier?, max_datasets?, reason? }
admin_test_keys { tier?, format? }
  • tier_update uses an enum tier name (free | education | developer | team | enterprise) plus a patch object.
  • admin_manage is the dispatcher for several admin actions: pass the arguments that match the chosen action.

How do I check platform health?

health_check { mode? }
  • mode: "quick" (default — fast probes) or "full" (full doctor report with AI diagnostics).
  • Returns version, DB / queue status, disk space, provider availability, and migration state.

How do I track a long-running analysis job?

analysis_queue_submit { dataset_id, file_id?, job_type?, download_url?, priority? }   # returns { job_id }
analysis_queue_status { job_id?, status_filter?, limit? }
analysis_queue_cancel { job_id }
analysis_queue_retry  { job_id }
  • The worker processes the job via pg-boss. Status values: pending, running, completed, failed, cancelled, retrying.
  • Progress is reported per section of the 20-section AnalysisContract.
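
A polling agent needs to decide what to do for each status value. The mapping below is one possible policy, offered as a sketch: the status names come from the list above, while nextTool and the choice to retry failed jobs automatically are assumptions for illustration.

```typescript
type JobStatus = "pending" | "running" | "completed" | "failed" | "cancelled" | "retrying";

// Return the tool to call next for a job in the given status, or null to stop.
function nextTool(status: JobStatus): string | null {
  switch (status) {
    case "pending":
    case "running":
    case "retrying":
      return "analysis_queue_status"; // still in flight: poll again later
    case "completed":
      return "get_analysis"; // results are ready to read
    case "failed":
      return "analysis_queue_retry"; // assumed policy: retry failed jobs
    case "cancelled":
      return null; // nothing further to do
  }
}

// Terminal statuses are the ones that no longer need status polling.
function isTerminal(status: JobStatus): boolean {
  return nextTool(status) !== "analysis_queue_status";
}

console.log(nextTool("running"), isTerminal("completed"));
```

A real loop would also cap the number of polls and wait between them; both are left out to keep the decision logic visible.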

Workspace & Settings

How do I update dataset-level settings?

dataset_settings { dataset_id, is_frozen?, is_locked?, schedule_enabled?, schedule_cron?, auto_analyze?, auto_load_db?, process_derivatives?, workspace_id?, notes?, sampling_config?, semantic_overrides?, derivative_file_id?, derivative_override_frozen?, derivative_schedule_cron? }
  • Parameters are flat (not nested under set: {...}).
  • is_frozen / is_locked: freeze data activity / lock all access (maintenance mode).
  • schedule_enabled / schedule_cron: override the project schedule for this dataset.
  • auto_analyze / auto_load_db / process_derivatives: toggles for the post-refresh hooks.
  • sampling_config: { strategy, sampleSize?, maxRowsHardLimit?, maxFileSizeMbForFull?, fallbackStrategy? } — strategies are full | first_n | last_n | random | bookend | stratified.
  • semantic_overrides: manual semantic-type overrides for specific columns.
  • notes: free-text dataset notes.
  • Not exposed via this tool: diff_config, masking_mode, RAG settings, retention. Masking is set via set_masking_mode. Diff config currently requires SQL (see Diff & CDC).

How do I refresh the cached summary or rebuild the masking view?

dataset_refresh_summary { dataset_id, table? }
dataset_reload_masking  { dataset_id, framework, table? }
  • dataset_reload_masking.framework is required — pass the compliance framework (hipaa | sox | cjis | pci | gdpr | ccpa).

How do I see workspace context?

workspace_info { workspace_id? }
  • Returns: workspace ID, tier, feature flags, user role, enabled providers. Defaults to the caller's default workspace.

How do I look up a configuration setting?

settings_docs { search?, category?, mcp_only? }
  • Browse the canonical settings registry. Categories: Core | Security | MCP | AI | Storage | Crawler | Billing | System.

Accounts (subjects_*)

Naming note: the tools prefixed subjects_ operate on account records (the platform's user accounts), not on GDPR data subjects. The naming is historical.

subjects_create { email, name?, password?, provider? }
subjects_list   { email?, status?, limit? }
subjects_update { account_id, status?, name? }
  • provider: local | google | github | test (default local).
  • status: active | pending | suspended | all (filter for subjects_list).

UX & Reporting

How do I start a UX session?

ux_session_start { format?, client_hint?, demo_mode? }
  • format: auto | html | markdown | plain. With auto, the renderer is chosen based on client_hint.
  • client_hint: claude_chat | claude_code | web | api.
  • demo_mode: enables a guided walkthrough.

How do I save and retrieve reports?

report_list { status?, dataset_id?, limit?, offset? }
report_save { title, html_content, description?, visibility?, generation_prompt?, dataset_ids?, file_ids?, tags?, status?, auto_regenerate?, report_id? }
report_get  { id?, slug? }
  • title + html_content are required for report_save. Pass an existing report_id to update.
  • visibility: private | link | public (default link).
  • status: draft | published.
  • dataset_ids / file_ids: lineage tracking — used to surface the report when its sources refresh, and to enable auto_regenerate.
  • report_get accepts either id or slug, not both.
  • All reports must comply with Report Scroll/Sizing Standard-001 before saving.
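
The argument rules above can be checked client-side before calling the tools. This is a sketch with hypothetical helper names; treating empty strings as missing is an assumption, and only a subset of report_save fields is modeled.

```typescript
type Visibility = "private" | "link" | "public";

interface ReportSaveArgs {
  title: string;
  html_content: string;
  visibility?: Visibility;
  status?: "draft" | "published";
  dataset_ids?: string[];
  report_id?: string; // pass an existing ID to update instead of create
}

// Validate report_save arguments and make the default visibility explicit.
function reportSaveArgs(args: ReportSaveArgs): ReportSaveArgs {
  if (args.title.trim() === "" || args.html_content.trim() === "") {
    throw new Error("report_save requires title and html_content");
  }
  return { ...args, visibility: args.visibility ?? "link" };
}

// report_get takes exactly one of id or slug.
function reportGetArgs(sel: { id?: string; slug?: string }): { id: string } | { slug: string } {
  const hasId = sel.id !== undefined;
  const hasSlug = sel.slug !== undefined;
  if (hasId === hasSlug) {
    throw new Error("report_get needs exactly one of id or slug");
  }
  return hasId ? { id: sel.id as string } : { slug: sel.slug as string };
}
```

Rejecting an id-plus-slug pair locally saves a round trip, on the assumption that the server would reject it as well.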

Events & Timeline

How do I record a timeline event?

events_ingest { datasetId, eventType, actor?, source?, metadata? }
  • eventType: event identifier (e.g. dataset.read, dataset.recommend, dataset.enrich).
  • metadata: optional context object — surfaces in the dataset's timeline and audit log.
  • Note: events_ingest is one of the few tools that use camelCase parameter names (datasetId, not dataset_id).
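
Because most tools take snake_case arguments, an agent may build events_ingest arguments in the wrong style by habit. The sketch below converts top-level keys to camelCase; the helper name is hypothetical, and leaving nested metadata keys untouched is a design assumption.

```typescript
// Convert top-level snake_case keys to camelCase (dataset_id -> datasetId).
// Nested objects such as metadata are copied as-is.
function toCamelCaseKeys(input: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(input)) {
    const camel = key.replace(/_([a-z])/g, (_match: string, letter: string) => letter.toUpperCase());
    out[camel] = value;
  }
  return out;
}

// Example: arguments written in the style used by most other tools.
const eventArgs = toCamelCaseKeys({
  dataset_id: "ds-1",
  event_type: "dataset.read",
  metadata: { row_count: 10 },
});
```

Here eventArgs has datasetId and eventType at the top level, while the metadata object keeps its row_count key unchanged.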

Workflows

How do I browse pre-built workflows?

workflow_list { archetype?, domain? }
workflow_get  { workflow_id }
  • archetype: have_data | what_exists | build_something | transform | govern | explore | bootstrap.
  • domain: functional domain code (Acq, Ana, Qry, Tfm, Prv, Ver, Gov, Rpt, IAM, Plt, Jsn).
  • workflow_get.workflow_id: e.g. "W01", "W32".

Common multi-step workflows

Workflow: Ingest → analyze → classify → mask → query

1. dataset_ingest_url   { url }                                       -> { dataset_id }
2. analysis_queue_submit { dataset_id, job_type: "full_analysis" }    -> { job_id }
3. analysis_queue_status { job_id }                                    (poll until completed)
4. classify_dataset     { datasetId, autoApprove: true }              -> { tags }
5. set_masking_mode     { dataset_id, mode: "view" }                  (if PII detected)
6. dataset_load         { dataset_id }
7. dataset_query        { dataset_id, sql: "SELECT ..." }
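
The seven steps above could be chained as follows. This is a sketch under stated assumptions: callTool is a hypothetical stand-in for your MCP client, analysis_queue_status is assumed to return a top-level status field, and PII detection is left to a caller-supplied predicate because this guide does not specify how it is signaled.

```typescript
// Hypothetical MCP client call; replace with your client's real invocation.
type CallTool = (tool: string, args: Record<string, unknown>) => Promise<Record<string, unknown>>;

async function ingestToQuery(
  callTool: CallTool,
  url: string,
  sql: string,
  hasPii: (tags: unknown) => boolean, // assumption: caller decides what counts as PII
  pollMs = 5000,
  maxPolls = 120,
): Promise<Record<string, unknown>> {
  const { dataset_id } = await callTool("dataset_ingest_url", { url });
  const { job_id } = await callTool("analysis_queue_submit", { dataset_id, job_type: "full_analysis" });

  // Poll until the analysis job completes (assumed response field: status).
  let done = false;
  for (let attempt = 0; attempt < maxPolls && !done; attempt++) {
    const { status } = await callTool("analysis_queue_status", { job_id });
    if (status === "failed" || status === "cancelled") {
      throw new Error(`analysis job ended as ${String(status)}`);
    }
    done = status === "completed";
    if (!done) await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
  if (!done) throw new Error("analysis job did not complete in time");

  // classify_dataset takes camelCase arguments, as shown in step 4.
  const { tags } = await callTool("classify_dataset", { datasetId: dataset_id, autoApprove: true });
  if (hasPii(tags)) {
    await callTool("set_masking_mode", { dataset_id, mode: "view" });
  }
  await callTool("dataset_load", { dataset_id });
  return callTool("dataset_query", { dataset_id, sql });
}
```

Masking is applied before dataset_load and dataset_query, so no query runs against unmasked data once PII has been flagged.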

Workflow: Set up diff-refresh pipeline

1. (one-time SQL) UPDATE registry.dataset_settings
   SET diff_config = jsonb_set(diff_config, '{enabled}', 'true')
   WHERE dataset_id = '<uuid>';

2. dataset_ingest_url { url, dataset_id }      # first run — seeds the baseline
3. dataset_ingest_url { url, dataset_id }      # subsequent run — diff is computed automatically
4. dataset_changes    { dataset_id, since: "2026-01-01T00:00:00Z" }   # see what changed

Workflow: Build a knowledge base from documents

1. dataset_ingest_batch { urls: [ { url: "...pdf" }, { url: "...docx" }, ... ] }
2. (wait for ingest)                             # each document is auto-extracted to text
3. rag_index             { dataset_id, column_mapping: { ... } }
4. document_search       { dataset_id, query: "what does the report say about X?" }

Workflow: Match entities across two customer datasets

1. dataset_load          { dataset_id: A }
2. dataset_load          { dataset_id: B }
3. schema_align_columns  { sources: [{ dataset_id: A, alias: "a" }, { dataset_id: B, alias: "b" }] }
4. dataset_create_union  { sources: [{ dataset_id: A, alias: "a", column_mapping: {...} },
                                       { dataset_id: B, alias: "b", column_mapping: {...} }] }
5. (the MDM pipeline runs automatically on the unioned dataset)
6. golden_records_list   { topic_id }                                 # lists golden records under a topic
7. stewardship_decide    { stewardship_id, decision: "merge" }       # for uncertain matches

Workflow: Archive old datasets to S3

1. connection_create  { name: "cold-storage", connection_type: "api_key", config: { ... }, provider_name: "s3" }
2. connection_test    { connection_id }
3. dataset_archive    { dataset_id, connection_id, drop_after: true }
4. (later) dataset_restore { dataset_id, archive_id, as_snapshot: true, connection_id }

Deep dives

For more detail on any area, call:

  • help_page("CAPABILITIES") — one-page capability cheat sheet (every feature in one line).
  • help_page("codebase/FEATURE_INVENTORY") — complete inventory of every module, function, MCP tool, and setting, organized by feature domain.
  • help_page("codebase/FUNCTION_REFERENCE") — alphabetical index of every exported function (Ctrl+F friendly).
  • help_page("codebase/INDEX") — codebase architecture overview.
  • help_page("codebase/MCP_TOOLS") — per-tool reference (inputs, outputs, cost).
  • help_page("codebase/SCHEMAS") — PG schema map.
  • help_page("codebase/API_ENDPOINTS") — HTTP API endpoints (/api/*).
  • help_page("providers/socrata"), help_page("providers/ckan"), help_page("providers/arcgis"), help_page("providers/opendatasoft"), help_page("providers/dcat"), help_page("providers/filesystem") — per-provider guides.
  • help_page("quickstart") — installation and first-run.
  • help_page("devops/troubleshooting") — common errors and fixes.

Conventions agents should follow

  1. Always call health_check first when a session starts. If it returns ok: false, surface the error before attempting destructive operations.
  2. Always call workspace_info on first interaction to learn the tier and available features.
  3. Never assume a dataset is loaded. Call dataset_load (idempotent) before dataset_query.
  4. Use analysis_preflight_check { url } before downloading large sources to avoid wasting tier quota.
  5. Diff refresh requires diff_config.enabled = true in registry.dataset_settings (currently SQL-only). Once set, every refresh diffs automatically — there is no refreshMode parameter.
  6. When PII is detected, set a masking mode before running queries that might expose data. Modes are view or physical.
  7. Quote function names and dataset IDs in backticks in responses to users — it keeps outputs copy-paste friendly.
  8. Use ux_session_start for multi-step journeys so the platform can guide next steps.
  9. Respect cost tiers. RED = refuse, ORANGE = confirm with user, YELLOW = proceed but warn, GREEN = proceed silently.
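
Convention 9 can be expressed as a small lookup. A sketch, assuming the agent already knows a tool's cost tier (for example from help_page("codebase/MCP_TOOLS")); the function name and return shape are hypothetical.

```typescript
type CostTier = "RED" | "ORANGE" | "YELLOW" | "GREEN";

// Decide whether a tool call may run and whether the user should be warned.
function costGate(tier: CostTier, userConfirmed: boolean): { run: boolean; warn: boolean } {
  switch (tier) {
    case "RED":
      return { run: false, warn: false }; // refuse outright
    case "ORANGE":
      return { run: userConfirmed, warn: false }; // run only after the user confirms
    case "YELLOW":
      return { run: true, warn: true }; // proceed, but warn
    case "GREEN":
      return { run: true, warn: false }; // proceed silently
  }
}
```

For an ORANGE tool, the agent would first ask the user, then call costGate("ORANGE", true) once confirmation arrives.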

This guide is verified against scripts/mcp-server.ts and lib/conduit/diffConfig.ts for v0.19.9. Every parameter name above matches the actual Zod input schema. Last accuracy pass: v0.19.9c.
