Features — Agent Quick Reference
Q&A guide for AI agents calling help_page("features"). Answers "how do I..." questions with the exact MCP tools, arguments, and order of operations.
Version: v0.19.9 | Audience: AI agents calling help_page("features"), and humans skimming capabilities.
Companion docs: help_page("CAPABILITIES") (one-page cheat sheet) · help_page("codebase/FEATURE_INVENTORY") · help_page("codebase/FUNCTION_REFERENCE")
MCP tool count: 134 (119 core tools + 15 git tools)
Parameter style: Tools use snake_case parameter names; the few that use camelCase are noted inline. Examples below show the exact arguments accepted by each tool's Zod schema in scripts/mcp-server.ts.
This guide is structured as task-oriented Q&A. For each "How do I...?" question you'll see:
- Tool to call — exact MCP tool name and required arguments
- What you get back — key result fields
- Notes — gotchas, related tools, deep dives
Quick index
| I want to... | Jump to |
|---|---|
| Find and browse datasets | Discovery |
| Load data from a URL / file / REST API | Ingestion |
| Profile or analyze a dataset | Analysis |
| Run SQL against a dataset | Query |
| Mask PII columns | Masking |
| Classify a dataset (AI tags) | Classification |
| Detect what changed since the last refresh | Diff & CDC |
| Create / restore snapshots | Snapshots |
| Archive a dataset to cold storage | Archive |
| Search documents semantically (RAG) | RAG |
| Match / deduplicate entities (MDM) | MDM |
| Join or union two datasets | Schema ops |
| Align schemas across sources | Schema alignment |
| Configure a REST API connection | REST |
| Watch a remote source for changes | Watch sources |
| Set up a portal crawler | Crawl |
| Generalize values (dob → year, etc.) | Generalization |
| Sync a dataset out to S3 / blob storage | Sync |
| Create / manage API keys | Auth |
| Check platform health | Health |
| Understand my tier and usage | Tiers |
| Track a long-running analysis job | Queue |
| Browse pre-built workflows | Workflows |
Discovery & Catalog
How do I find a dataset?
datasets_search { query?, provider?, visibility?, tag?, limit?, offset?, sort_by?, has_pii?, compliance_framework?, ... }
- Inputs: free-text query, optional filters (provider, tag (singular, exact match), compliance_framework, min_rows, max_size, quality_rating, has_pii, has_phi, source_system, table_type).
- Returns: paginated dataset list with title, provider, row counts, governance flags, quality score.
- Sort options: relevance (default with query), created_desc (default without query), plus size/rows/columns/quality/title sorts.
How do I get details on a specific dataset?
dataset_get { id, compact?, include_fields? }
- Returns full metadata: schema, tier, source URL, refresh history, masking mode, snapshots, derivatives, lineage, RAG status.
- Set compact: true to skip statistics; include_fields: false saves ~2,000 tokens for wide datasets.
- For larger diagnostics: dataset_stats, dataset_tables, dataset_changes (CDC log).
How do I search columns/fields across the catalog?
fields_search { fieldQuery, semanticType?, limit?, offset? }
- Searches column names, descriptions, and detected semantic types across every dataset.
- Useful for "find every dataset with an email column" or "show me all columns tagged as SSN".
How do I find similar datasets?
find_similar_datasets { query, limit?, sort_by?, provider?, has_pii?, ... }
- Note: takes a free-text query string, not a dataset ID. Combines AI Engine embeddings with the rich filter set from datasets_search.
- For "similar to dataset X", use dataset_get { id: X } to read its title/tags, then pass those into find_similar_datasets.
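The "similar to dataset X" recipe can be sketched as a small helper that turns a dataset's title and tags into the free-text query. This is a minimal sketch, not part of the platform: the DatasetMeta shape, the buildSimilarityQuery name, and the maxTags cap are assumptions made for illustration.

```typescript
// Hypothetical subset of the metadata returned by dataset_get; field names are assumptions.
interface DatasetMeta {
  title: string;
  tags?: string[];
}

// Build the free-text `query` argument for find_similar_datasets from title + tags.
function buildSimilarityQuery(meta: DatasetMeta, maxTags: number = 5): string {
  const candidates: string[] = [meta.title].concat((meta.tags || []).slice(0, maxTags));
  const seen = new Set<string>();
  const parts: string[] = [];
  for (const candidate of candidates) {
    const text = candidate.trim();
    const key = text.toLowerCase();
    // Skip empty fragments and case-insensitive duplicates of earlier fragments.
    if (text.length > 0 && !seen.has(key)) {
      seen.add(key);
      parts.push(text);
    }
  }
  return parts.join(" ");
}
```

The resulting string would be passed as find_similar_datasets { query }, optionally with the same filters accepted by datasets_search.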
How do I export a dataset's catalog metadata?
dataset_export_catalog { dataset_id, output_path?, include_data?, include_metadata? }
- Writes a per-dataset catalog directory: schema JSON, lineage, governance, quality, README, manifest. With include_data: true, it also symlinks the Parquet file into catalog/data/.
Ingestion & ETL
How do I ingest a dataset from a URL?
dataset_ingest_url { url, title?, description?, tags?, dataset_id?, auto_analyze?, auto_load_db?, skip_preflight? }
- url: any HTTP(S) URL, Socrata endpoint, data portal link, or presigned S3 URL.
- dataset_id: optional. If omitted, a new dataset is created. If provided, the URL becomes the refresh source for that dataset.
- auto_analyze / auto_load_db: default true; runs the full Conduit analysis and loads into the query DB after staging.
- skip_preflight: for endpoints that don't support HEAD requests.
- Returns: { ingest_id, dataset_id, status, rowCount, sizeBytes, stagedAt }.
- Pipeline: the ingest runs through 5 gates: format detection → size guard → content sniff → duplicate/hash check → preflight. Failures return a machine-readable code.
- Diff refresh: there is no refreshMode parameter. Differential refresh runs automatically when dataset_settings.diff_config.enabled = true for the dataset (see Diff & CDC).
How do I upload a file from local disk?
The library accepts file content directly via dataset_ingest. For large files you can use presigned URLs through dataset_get_upload_urls (which writes to a configured connection, not back to the library).
dataset_ingest { content, filename, title?, description?, tags?, dataset_id?, auto_analyze?, auto_load_db? }
- content: raw CSV/JSON/TSV bytes as a UTF-8 string.
- filename: must include the extension (e.g. employees.csv); the format is inferred from this.
- dataset_id: optional, for refreshing an existing dataset.
For presigned S3 uploads to a connected storage backend (bulk — 1 to 50 files per call):
dataset_get_upload_urls { connection_id, files: [{ filename, content_type? }, ...], ttl_seconds? }
- connection_id: an S3-compatible connection_create entry. (This tool does not take a dataset_id.)
- files: array of 1–50 objects; each yields a PUT URL and a matching GET URL.
- Returns signed PUT/GET URLs valid for ttl_seconds (default 3600, max 86400).
How do I ingest multiple URLs at once?
dataset_ingest_batch { urls: [{ url, title?, tags? }, ...] }
- Field name is urls (max 50 entries). Each URL becomes its own dataset (or refresh).
- Track progress with ingest_status { batch_id }.
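Because urls is capped at 50 entries, a longer list has to be split across several dataset_ingest_batch calls. The sketch below shows one way to build those payloads; the BatchUrl type and the toIngestBatches helper are hypothetical names, not platform APIs.

```typescript
// Documented cap on the `urls` array of dataset_ingest_batch.
const MAX_URLS_PER_BATCH = 50;

// Entry shape taken from the dataset_ingest_batch signature above.
interface BatchUrl {
  url: string;
  title?: string;
  tags?: string[];
}

// Split a long URL list into payloads that each respect the 50-entry cap.
function toIngestBatches(urls: BatchUrl[], size: number = MAX_URLS_PER_BATCH): { urls: BatchUrl[] }[] {
  if (!Number.isInteger(size) || size < 1 || size > MAX_URLS_PER_BATCH) {
    throw new RangeError("size must be an integer between 1 and " + MAX_URLS_PER_BATCH);
  }
  const batches: { urls: BatchUrl[] }[] = [];
  for (let start = 0; start < urls.length; start += size) {
    batches.push({ urls: urls.slice(start, start + size) });
  }
  return batches;
}
```

Each returned object can be sent as the arguments of one dataset_ingest_batch call, and each call is then tracked separately through ingest_status.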
How do I check status of an in-flight ingest?
ingest_status { ingest_id?, batch_id?, dataset_id? }
- Provide one of the three IDs. Returns queue/progress state plus error details.
How do I watch a remote storage for new files?
watch_source_create { connection_id, watch_path, glob_pattern?, recursive?, poll_interval_seconds?, auto_analyze? }
watch_source_list { enabled_only? }
watch_source_poll {}
- connection_id: a previously created storage connection_create entry (S3, blob, GCS, SFTP, FTP, IMAP, SMB, filesystem).
- watch_path: base path to monitor (e.g. /data/exports).
- glob_pattern: file filter (default *).
- poll_interval_seconds: between 60 and 86400 (default 300).
- auto_analyze: default true; new files run through the full ingest pipeline.
- watch_source_poll triggers a one-shot poll across all enabled watches.
How do I create and test a connection?
connection_create { name, connection_type, config, provider_name?, base_url?, scopes?, tags? }
connection_list { provider_name?, connection_type?, tag? }
connection_test { connection_id }
- connection_type: oauth2, api_key, basic, bearer, custom (for credential storage), or storage-specific types used by dataset_sync and watch_source_create.
- config: credentials object, encrypted at rest. The execution engine injects auth at call time; raw secrets never appear in tool arguments or logs.
Analysis & Profiling
How do I analyze a dataset?
Quick AI classification only:
classify_dataset { datasetId, autoApprove? }
Full 20-section analysis (queued):
analysis_queue_submit { dataset_id, file_id?, job_type?, download_url?, priority? }
analysis_queue_status { job_id?, status_filter?, limit? }
get_analysis { dataset_id?, file_id?, run_id?, summary_only? }
- job_type: "full_analysis" (default; full pipeline against an existing file), "refresh_analyze" (download from download_url, then analyze), "reprocess" (re-run on existing file).
- priority: 1–10 (default 5).
- get_analysis returns the canonical 20-section AnalysisContract v1.1 (overview · columns · semanticTypes · piiFlags · piiCompliance · dataQuality · patterns · distributions · correlations · joinCandidates · keyColumns · anomalies · freshness · lineage · fileManifest · storage · rowSample · documentIntelligence · maskingReadiness · embeddings).
- Set summary_only: true for the compact summary instead of all 20 sections.
How do I cancel or retry an analysis job?
analysis_queue_cancel { job_id }
analysis_queue_retry { job_id }
How do I classify a dataset into topic tags?
classify_dataset { datasetId, autoApprove? }
- Uses the AI Engine to assign topic tags based on column names + sample rows. Set autoApprove: true to write tags directly without human review.
- (There is no strategy parameter; the classifier picks its own depth based on dataset size.)
How do I classify many datasets in bulk?
classify_batch { limit? }
- Processes up to limit (default 10, max 50) un-classified datasets in a single call. There is no datasetIds[] parameter; the tool picks them off the queue.
How do I preflight-check a URL before ingesting?
analysis_preflight_check { url, dataset_id? }
- Probes the URL via HEAD/range requests, returns: estimated row count, estimated cost tier (GREEN/YELLOW/ORANGE/RED), warnings, can-run verdict.
- Use this to avoid wasting tier quota on a problematic source.
Query & Data Loading
How do I query a dataset with SQL?
dataset_load { dataset_id, file_id?, workspace_id?, framework?, masking_mode?, include_derivatives? }
dataset_query { dataset_id, sql, table?, limit?, offset?, force? }
- dataset_load is idempotent; call it before dataset_query to ensure the dataset is materialized into PostgreSQL.
- framework: optional compliance framework override (hipaa | sox | cjis | pci | gdpr | ccpa); applies the appropriate masking ruleset.
- masking_mode: "view" (default) or "physical"; see Masking.
- dataset_query.sql: SELECT only. DDL and unscoped DELETE are rejected by validateSql.
- table: which physical table to query; defaults to "source". Other options include sheet_* (per-sheet for Excel) and deriv_* (derivatives).
- force: admin-only bypass for cost guardrails.
- Returns: { rows, rowCount, columns, costTier, elapsedMs, engine }.
- Routing: Simple SELECTs run in PostgreSQL. Heavy analytical queries route to DuckDB via queryRouter.decideQueryEngine.
How do I sample rows from a dataset?
dataset_sample { dataset_id, table?, rows? }
- rows defaults to 10 (max 100). Cheaper than running a SELECT * LIMIT n.
How do I get statistics for a dataset?
dataset_stats { dataset_id, table? }
- Returns per-column null %, distinct count, min/max/avg, top values — read from the cached analysis store when available.
How do I list tables in a dataset?
dataset_tables { dataset_id }
- Returns the canonical source table, masking view (if enabled), per-sheet tables (Excel), snapshot tables, and derivatives.
Data Governance, Compliance, Masking
How do I mask PII?
set_masking_mode { dataset_id, mode }
- mode options (only two):
  - view: a masking view replaces the base table for queries. Reversible: toggle off any time. Cheap.
  - physical: values are tokenized in-place and originals stored in the encrypted vault (AES-256-GCM). Queries return tokens; use vault_lookup to reverse.
- (There is no none or tokenized enum value; "no masking" is the absence of a configured mode, and physical masking already preserves format.)
How do I look up an original value from a token?
vault_lookup { dataset_id, token }
- Both arguments are required. Requires the vault:read scope on your API key (written in the same colon form as the scope examples under tokens_create_api_key). Returns the decrypted original value.
- Stats only: vault_stats { dataset_id } → { tokenCount, distinctCount, lastWritten } without exposing values.
How do I apply a generalization strategy?
generalization_list { category?, include_portal? }
generalization_preview { dataset_id, column, strategy, options?, sample_size? }
generalization_apply { dataset_id, column, strategy, options?, apply_mode? }
generalization_remove { dataset_id, column }
generalization_stats { dataset_id }
generalization_bulk_apply { dataset_id, configs: [{ column, strategy, options?, apply_mode? }] }
- Built-in strategies: dob_generalize, zip_generalize, age_band, phone_partial, ssn_partial, email_domain. List them with generalization_list.
- apply_mode: "view" (default; transform at query time, reversible) or "physical" (transform at copy time).
- Custom rules can be loaded from the portal integration (Team/Enterprise tier) when include_portal: true is set on generalization_list.
How do I check compliance posture for a dataset?
After analysis completes, read the piiCompliance section from the latest analysis:
get_analysis { dataset_id }
- Look at the piiCompliance section: { HIPAA: {...}, GDPR: {...}, PCI: {...}, SOX: {...}, CCPA: {...}, CJIS: {...} }.
- Each framework lists matched columns, rule IDs, and severity (info / warn / critical).
Schema Management & Alignment
How do I align columns across sources?
schema_align_columns { sources: [{ dataset_id, alias? }, ...] }
- Pass 2 or more sources. The first source becomes the target schema; the rest are aligned to it.
- Uses a 4-signal scorer: name similarity (Jaro-Winkler) + semantic type + data type compatibility + sample value overlap.
- Returns a per-column alignment with confidence scores. Feed the result into dataset_create_union for multi-source merges.
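To make the idea of a 4-signal confidence score concrete, here is a minimal sketch that combines four per-column signals into one number. The equal weights, the field names, and the weighted-average formula are all assumptions for illustration; the source does not state how the real scorer weighs or combines its signals.

```typescript
// Each signal is expected in [0, 1]. Field names are hypothetical labels for the four signals.
interface AlignmentSignals {
  nameSimilarity: number;        // e.g. a Jaro-Winkler score
  semanticTypeMatch: number;     // 1 when semantic types agree, else 0
  dataTypeCompatibility: number; // 1 when types are compatible, else 0
  sampleValueOverlap: number;    // share of sample values in common
}

// Assumed equal weighting; the platform's actual weights are not documented here.
const EQUAL_WEIGHTS: AlignmentSignals = {
  nameSimilarity: 0.25,
  semanticTypeMatch: 0.25,
  dataTypeCompatibility: 0.25,
  sampleValueOverlap: 0.25,
};

// Weighted average of the four signals, clamped to [0, 1] per signal.
function alignmentConfidence(signals: AlignmentSignals, weights: AlignmentSignals = EQUAL_WEIGHTS): number {
  const keys: (keyof AlignmentSignals)[] = [
    "nameSimilarity",
    "semanticTypeMatch",
    "dataTypeCompatibility",
    "sampleValueOverlap",
  ];
  let weighted = 0;
  let totalWeight = 0;
  for (const key of keys) {
    const value = Math.min(1, Math.max(0, signals[key]));
    weighted += value * weights[key];
    totalWeight += weights[key];
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```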
How do I join or union two datasets?
Join (exactly 2 sources):
dataset_create_join { sources: [{ dataset_id, alias, columns? }, { dataset_id, alias, columns? }] }
- Both dataset_id and alias are required per source. columns is an optional whitelist (default: all columns).
- The aliases are used to disambiguate column-name collisions.
Union (2 or more sources):
dataset_create_union { sources: [{ dataset_id, alias, column_mapping? }, ...] }
- column_mapping: optional { "SOURCE_COL": "target_col" } rename map per source, used to align mismatched column names without a separate alignment step.
- Each row in the unioned output carries _source_alias for lineage.
How do I see the schema of a dataset?
dataset_get { id } # includes the field/schema list
dataset_tables { dataset_id } # includes column defs per physical table
Snapshots & Versioning
How do I create a snapshot?
dataset_snapshot_create { dataset_id, name? }
- Creates an immutable copy of the current source table. Snapshots are physical (CTAS) tables named with a _snapshot_<timestamp> suffix.
- Optionally pass a name suffix; otherwise the timestamp suffix is auto-generated.
- (There is no retentionDays parameter; retention is governed by tier defaults and the dataset_settings schedule.)
How do I list, compare, restore, or delete snapshots?
dataset_snapshot_list { dataset_id, include_retention_summary? }
dataset_snapshot_compare { dataset_id, snapshot_a, snapshot_b, key_column?, sample_diffs? }
dataset_snapshot_restore { dataset_id, snapshot_name }
dataset_snapshot_delete { dataset_id, snapshot_name }
- snapshot_a / snapshot_b: snapshot names (e.g. _snapshot_20240315120000), or the literal string "source" for the current live data.
- key_column: optional PK override for compare. If omitted, the platform auto-detects from the latest analysis + dataset_settings.diff_config.primary_key_columns.
- sample_diffs: max sample rows per bucket (added/deleted/modified), default 5.
- restore is destructive: the current source table is replaced by the snapshot.
- delete requires both dataset_id and the snapshot_name.
Differential Refresh & CDC (SPIKE-005)
How do I detect what changed in a dataset?
Automatic (recommended): when dataset_settings.diff_config.enabled = true for a dataset, every refresh (via dataset_ingest_url or scheduled crawl) automatically computes a diff:
- Detects primary keys via the 5-tier cascade (declared → unique index → high-cardinality column → composite → synthetic row hash).
- Picks an engine (PG for row counts under pg_max_rows, DuckDB for larger).
- Computes add/change/delete sets.
- Applies schema evolution (new columns, widened types, dropped columns).
- Writes a changeset derivative file.
- Appends to the CDC log (registry.dataset_changes).
- Emits timeline events (diff.computed, diff.no_changes, diff.schema_evolved).
- Runs downstream consumers (RAG incremental, MDM re-eval, LLM narrative).
- Marks refresh complete with a run ID.
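The "computes add/change/delete sets" step can be illustrated with a keyed comparison of two row sets. This is a simplified in-memory sketch, not the platform's differ: the Row and Changeset types and the diffByKey name are hypothetical, and it assumes a single key column whose values are unique.

```typescript
type Row = { [column: string]: string | number | boolean | null };

interface Changeset {
  added: Row[];
  changed: { before: Row; after: Row }[];
  deleted: Row[];
}

// True when both rows have the same columns with strictly equal values.
function sameRow(a: Row, b: Row): boolean {
  const columns = Object.keys(a);
  if (columns.length !== Object.keys(b).length) return false;
  return columns.every(
    (column) => Object.prototype.hasOwnProperty.call(b, column) && a[column] === b[column]
  );
}

// Compare old and new rows by primary key and bucket them into add/change/delete sets.
function diffByKey(oldRows: Row[], newRows: Row[], keyColumn: string): Changeset {
  const oldByKey = new Map<string, Row>();
  for (const row of oldRows) oldByKey.set(String(row[keyColumn]), row);

  const result: Changeset = { added: [], changed: [], deleted: [] };
  const seenKeys = new Set<string>();
  for (const row of newRows) {
    const key = String(row[keyColumn]);
    seenKeys.add(key);
    const before = oldByKey.get(key);
    if (before === undefined) result.added.push(row);
    else if (!sameRow(before, row)) result.changed.push({ before: before, after: row });
  }
  // Any old key that never appeared in the new data was deleted.
  oldByKey.forEach((row, key) => {
    if (!seenKeys.has(key)) result.deleted.push(row);
  });
  return result;
}
```

The three buckets correspond to the INSERT, UPDATE, and DELETE operations recorded in the CDC log.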
Inspect the change log manually:
dataset_changes { dataset_id, since?, until?, operation?, row_key?, limit?, offset? }
- since / until: ISO datetimes filtering by changed_at.
- operation: "INSERT", "UPDATE", or "DELETE" (uppercase enum values).
- row_key: trace a specific row's history by its primary-key value (as a string).
- limit: 1..1000, default 100.
How do I see the diff between two snapshots on demand?
dataset_snapshot_compare { dataset_id, snapshot_a, snapshot_b, key_column?, sample_diffs? }
- Returns add/update/delete bucket counts plus sample rows from each bucket. Pass "source" for either snapshot to compare against the current live table.
How do I configure differential refresh for a dataset?
Important: no MCP tool exposes diff_config writes. The dataset_settings MCP tool does not accept a diff_config field. Configuration is set via SQL/portal admin against registry.dataset_settings.diff_config (a JSONB column). The actual fields, defined in lib/conduit/diffConfig.ts, are:
| Field | Type | Default | Purpose |
|---|---|---|---|
| enabled | boolean | false | Master switch. When false, refreshes overwrite without diffing. |
| engine | "auto" \| "pg" \| "duckdb" | "auto" | Diff engine. auto picks PG under pg_max_rows, else DuckDB. |
| pg_max_rows | number | 1_000_000 | Row threshold above which auto engine routes to DuckDB. |
| primary_key_columns | string[] \| null | null | Manual PK override. When null, the 5-tier detection cascade runs. |
| major_revision_threshold | number (0..1) | 0.5 | Fraction of rows changed that triggers "major revision" classification (caps detailed CDC logging on bulk overwrites). |
| auto_diff_on_refresh | boolean | true | Whether refresh hooks invoke the differ when enabled = true. |
| store_old_values | boolean | true | Include old_values JSONB on UPDATE rows in dataset_changes. |
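The table can be read as a typed config object. The sketch below mirrors the documented fields and defaults and shows how the engine and major_revision_threshold settings might be applied. It is an illustration, not the contents of lib/conduit/diffConfig.ts: the helper names are hypothetical, and the behavior exactly at the thresholds (row count equal to pg_max_rows, changed fraction equal to the threshold) is an assumption.

```typescript
interface DiffConfig {
  enabled: boolean;
  engine: "auto" | "pg" | "duckdb";
  pg_max_rows: number;
  primary_key_columns: string[] | null;
  major_revision_threshold: number; // fraction in 0..1
  auto_diff_on_refresh: boolean;
  store_old_values: boolean;
}

// Defaults copied from the table above.
const DEFAULT_DIFF_CONFIG: DiffConfig = {
  enabled: false,
  engine: "auto",
  pg_max_rows: 1000000,
  primary_key_columns: null,
  major_revision_threshold: 0.5,
  auto_diff_on_refresh: true,
  store_old_values: true,
};

// "auto" picks PG under pg_max_rows, else DuckDB; an explicit engine always wins.
function pickDiffEngine(rowCount: number, config: DiffConfig): "pg" | "duckdb" {
  if (config.engine !== "auto") return config.engine;
  return rowCount < config.pg_max_rows ? "pg" : "duckdb";
}

// Assumption: a refresh counts as a major revision once the changed fraction reaches the threshold.
function isMajorRevision(changedRows: number, totalRows: number, config: DiffConfig): boolean {
  if (totalRows <= 0) return false;
  return changedRows / totalRows >= config.major_revision_threshold;
}
```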
To set these today, run a SQL update (or use the portal admin UI):
UPDATE registry.dataset_settings
SET diff_config = jsonb_set(diff_config, '{enabled}', 'true')
WHERE dataset_id = '<uuid>';
(Adding an MCP tool to expose diff_config writes is a known gap — see docs/roadmap.md.)
Document Intelligence & RAG
How do I search documents inside a dataset?
rag_index { dataset_id, chunk_size?, chunk_overlap?, column_mapping }
document_search { dataset_id, query, top_k? }
- rag_index.column_mapping (required, added in v0.19.4): a ragColumnMappingSchema object that names the title/body/timestamp columns to chunk. Use it to handle datasets with multiple text columns.
- chunk_size: 128–2048 tokens (default 512).
- chunk_overlap: 0–256 (default 50).
- document_search.top_k: 1–20 (default 5). The result set is hybrid (vector + BM25 + RRF) when the index supports it; there is no mode parameter.
- Embeddings are 768-d nomic-embed-text vectors stored in pgvector (HNSW), plus a GIN tsvector for BM25.
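For readers unfamiliar with RRF (reciprocal rank fusion), the sketch below shows the general technique of merging a vector ranking and a BM25 ranking into one list. It is a generic illustration rather than the platform's implementation: the function name is hypothetical, and the smoothing constant k = 60 is the value commonly used in the literature, not a documented platform setting.

```typescript
// Fuse several best-first rankings of chunk IDs into one list using reciprocal rank fusion.
// Each ID scores the sum of 1 / (k + rank) over every ranking in which it appears.
function reciprocalRankFusion(rankings: string[][], k: number = 60, topK: number = 5): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  const ids: string[] = [];
  scores.forEach((_score, id) => ids.push(id));
  // Highest fused score first.
  ids.sort((a, b) => (scores.get(b) as number) - (scores.get(a) as number));
  return ids.slice(0, topK);
}
```

A chunk that ranks well in both lists outscores one that tops only a single list, which is why hybrid results tend to favor passages that match both semantically and lexically.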
How do I refresh the RAG index after new data arrives?
rag_reindex { dataset_id, chunk_size?, chunk_overlap?, column_mapping }
- Same schema as rag_index. Forces a full rebuild.
- Cheaper alternative: enable diff_config.enabled for the dataset. The diff pipeline runs ragIncremental, which only re-embeds added/changed chunks and tombstones deleted ones.
How do I extract text from PDFs, DOCX, HTML, EML files?
Document extraction happens automatically during ingest for recognized MIME types — the resulting text becomes a derivative file (document_extract.*), is indexed by RAG if column_mapping covers it, and is scanned for PII via documentPiiScan.
To trigger extraction manually:
file_create_derivative { repo_id, parent_file_id, filename, content_type?, data, derivation_type, derivation_config, analysis? }
- derivation_type: extract, flatten_csv, filter, transform, or analysis_export.
- data: base64-encoded file content (the new derivative's bytes).
- derivation_config: strategy-specific JSON (e.g. { jsonPath, flattenDepth } for flatten_csv).
Master Data Management
How do I resolve entities (find golden records)?
golden_records_list { topic_id, entity_type?, limit?, offset? }
golden_record_get { entity_id }
stewardship_decide { stewardship_id, decision: "merge" | "separate" | "defer", resolution_notes?, resolved_by? }
- golden_records_list is scoped to a topic_id, not a dataset_id; golden records live under topics within domains.
- stewardship_decide.decision: the enum is "merge" | "separate" | "defer" (use "separate", not "keep_separate").
- Under the hood: the MDM pipeline runs 3-stage blocking (deterministic key → phonetic → LSH/MinHash), a Fellegi-Sunter scorer, Union-Find clustering, then surfaces uncertain clusters to the stewardship queue.
- The 4-phase normalizer (text → semantic → domain → key) runs before blocking so "St." vs "Street", "J. Smith" vs "John Smith", etc. collapse.
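The Union-Find clustering step can be pictured as follows: once the scorer has accepted a set of matching record pairs, transitively connected records are grouped into one candidate entity. This is a generic textbook sketch, not the MDM pipeline's code; the class and function names are hypothetical, and records are identified by their array index for simplicity.

```typescript
// Minimal disjoint-set (Union-Find) structure with path halving.
class UnionFind {
  private parent: number[];

  constructor(size: number) {
    this.parent = [];
    for (let i = 0; i < size; i++) this.parent.push(i);
  }

  // Find the representative of x, shortening the path as we walk up.
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]];
      x = this.parent[x];
    }
    return x;
  }

  // Merge the sets containing a and b.
  union(a: number, b: number): void {
    const rootA = this.find(a);
    const rootB = this.find(b);
    if (rootA !== rootB) this.parent[rootB] = rootA;
  }
}

// Group record indexes into clusters from a list of accepted match pairs.
function clusterMatches(recordCount: number, matchedPairs: [number, number][]): number[][] {
  const sets = new UnionFind(recordCount);
  for (const pair of matchedPairs) sets.union(pair[0], pair[1]);

  const groups = new Map<number, number[]>();
  for (let i = 0; i < recordCount; i++) {
    const root = sets.find(i);
    const members = groups.get(root);
    if (members) members.push(i);
    else groups.set(root, [i]);
  }
  const clusters: number[][] = [];
  groups.forEach((members) => clusters.push(members));
  return clusters;
}
```

Records 0–1 and 1–2 matching is enough to place 0, 1, and 2 in the same cluster even if 0 and 2 were never compared directly.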
How do I create a domain / topic / knowledge record?
domain_create { name, owner?, compliance_framework?, access_policy? }
domain_list {}
topic_create { domain_id, title, description?, created_by? }
topic_add_resource { topic_id, resource_type: "dataset"|"report"|"snapshot"|"file", resource_id, resource_role?, notes?, linked_by? }
topic_add_discussion { topic_id, content, author_type: "human"|"agent"|"system", author_id?, parent_id? }
topic_add_knowledge { topic_id, fact_type?, content, source_links?, confidence?, created_by? }
topic_get_context { topic_id, max_tokens?, include_discussions?, include_knowledge? }
- topic_create.title (not name) is the topic display label.
- topic_add_discussion.content + author_type are required; parent_id enables threading.
- topic_add_knowledge.content is a JSON object; use fact_type to label it (entity_profile, decision, match_strategy, observation, etc.).
- topic_get_context returns a budgeted context pack (default 8000 tokens) for an LLM working in the topic's scope.
How does MDM react to dataset changes?
When the diff pipeline runs, the changeset consumer:
- Flags affected golden records as stale (queued for re-scoring).
- Flags uncertain matches as reeval (queued for stewardship review).
- Tombstones source links for deleted rows (not removed).
Archives & Cold Storage
How do I archive a dataset to cold storage?
dataset_archive { dataset_id, snapshot_name?, drop_after?, connection_id? }
- snapshot_name: snapshot to archive. If omitted, the current source table is archived.
- drop_after: default true; drops the local snapshot after a successful upload to reclaim disk space.
- connection_id: S3-compatible connection (AWS S3, Backblaze B2, Cloudflare R2, Wasabi, DigitalOcean Spaces, MinIO). If omitted, falls back to the ARCHIVE_S3_BUCKET env var.
- Writes data + a manifest with SHA256 checksums to the connection's bucket/prefix.
How do I list, delete, or restore archives?
dataset_archive_list { dataset_id }
dataset_archive_delete { dataset_id, archive_id, connection_id? }
dataset_restore { dataset_id, archive_id, as_snapshot?, connection_id? }
- dataset_archive_list.dataset_id is required.
- as_snapshot: default false, which replaces the source table. Set true to restore as a new snapshot instead.
- Restore verifies SHA256 checksums before writing.
Sync (Outbound Export)
How do I sync a dataset out to remote storage?
dataset_sync { dataset_id, connection_id, destination_path?, path_template?, data_format?, compression?, apply_masking?, row_limit?, columns?, include_metadata?, sidecars? }
- Pushes the dataset (data + sidecar metadata) to a configured storage backend.
- destination_path: remote prefix (default /_uploads/{slug}/).
- path_template: path with token substitution: {id}, {slug}, {title}, {provider}, {date}, {timestamp}.
- data_format: auto | parquet | csv | jsonl.
- compression: auto | none | gzip | zstd | snappy | lz4.
- apply_masking: sync the masked view instead of raw rows.
- sidecars: per-file toggles for dataset, schema, lineage, quality, governance, import_sql, readme, manifest.
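Token substitution in path_template can be previewed client-side before calling dataset_sync. The sketch below is an illustration only: renderPathTemplate is a hypothetical helper, the token values are made up, and leaving unknown tokens untouched is an assumption about how unrecognized placeholders should be handled.

```typescript
// Replace {token} placeholders with values; unknown tokens are left as-is (assumed behavior).
function renderPathTemplate(template: string, tokens: { [token: string]: string }): string {
  return template.replace(/\{(\w+)\}/g, (placeholder: string, tokenName: string) => {
    return Object.prototype.hasOwnProperty.call(tokens, tokenName) ? tokens[tokenName] : placeholder;
  });
}

// Example values for the six documented tokens (illustrative data only).
const exampleTokens = {
  id: "42",
  slug: "nyc-taxi",
  title: "NYC Taxi Trips",
  provider: "socrata",
  date: "2024-03-15",
  timestamp: "20240315120000",
};

const previewPath = renderPathTemplate("/exports/{provider}/{slug}/{date}/", exampleTokens);
console.log(previewPath);
```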
REST API Integration
How do I call a REST API and keep results fresh?
rest_create_endpoint { name, url, method?, collection_id?, headers?, body_template?, auth_type?, auth_connection_id? }
rest_execute { endpoint_id, auto_save?, dataset_id?, overrides? }
rest_list_endpoints { collection_id? }
rest_diff_responses { execution_id_1, execution_id_2 }
- auth_connection_id references a connection_create entry that stores OAuth/API key/JWT credentials encrypted with AES-256-GCM. The execution engine injects auth at call time; raw credentials are never exposed in tool arguments or logs.
- rest_execute: runs the request. With auto_save: true, the response body is stored as a file (linked to dataset_id if provided).
- rest_diff_responses takes two execution_id values (not endpoint IDs) and returns a structured diff.
Crawling & Portal Discovery
How do I crawl a data portal?
discover_portals { dataGov?, socrataGlobal?, knownPortals?, enableProjects?, socrataMinDatasets? }
crawler_list_projects { limit?, offset?, enabledOnly? }
project_add_source { projectId, sourceUrl, enabled? }
project_update { projectId, enabled?, scheduleCron?, concurrency?, rateLimitRps?, maxRequestsPerRun? }
crawler_run_project { projectId }
crawler_refresh_dataset { url, dataset_id?, repo_id? }
- discover_portals: seeds known portal projects. Toggles control which discovery engines run (dataGov, socrataGlobal, knownPortals). Set enableProjects: true to start crawling immediately.
- project_add_source.sourceUrl (not source): accepts portal roots, dataset links, DCAT feeds, or file:// paths.
- crawler_refresh_dataset takes a URL (and optionally an existing dataset_id to link to), not just a dataset ID.
- Provider support: socrata, ckan, arcgis, opendatasoft, dcat, filesystem, plus a generic fallback. Provider auto-detection lives in lib/server/providers/detector.ts.
How do I fingerprint a data source?
fingerprint_detect { columns, sampleValues?, metadata? }
fingerprint_list { category? }
fingerprint_add { system, label, category, columnExacts, columnPatterns?, valuePatterns? }
fingerprint_remove { system }
- fingerprint_detect takes the column names of a dataset (and optional sample values), not a URL; it scores them against the registered fingerprint catalog and returns the best-matching source system.
- fingerprint_add registers a custom system; fingerprint_remove soft-deactivates by system ID.
JSON & File Tools
How do I flatten a nested JSON payload?
process_json { data, jsonPath?, maxRows?, flattenDepth? }
flatten_to_csv { data, jsonPath?, flattenDepth?, delimiter? }
process_config { action: "get" | "update", config? }
- process_json: processes nested JSON into rows (object-flattening up to flattenDepth).
- flatten_to_csv: same flatten logic but emits CSV directly.
- process_config: view or update server-side flatten/processing defaults.
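To show what depth-limited object flattening means in practice, here is a minimal sketch. It is not the server's implementation: the flattenObject name, the dot-separated column naming, and the choice to leave arrays and too-deep objects as nested values are assumptions.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };
type FlatRow = { [column: string]: JsonValue };

// Flatten nested objects into dotted column names, descending at most flattenDepth levels.
function flattenObject(value: { [key: string]: JsonValue }, flattenDepth: number, prefix: string = ""): FlatRow {
  const row: FlatRow = {};
  for (const key of Object.keys(value)) {
    const child = value[key];
    const column = prefix ? prefix + "." + key : key;
    const isPlainObject = child !== null && typeof child === "object" && !Array.isArray(child);
    if (isPlainObject && flattenDepth > 0) {
      // Recurse one level deeper and merge the nested columns into this row.
      const nested = flattenObject(child as { [key: string]: JsonValue }, flattenDepth - 1, column);
      for (const nestedColumn of Object.keys(nested)) row[nestedColumn] = nested[nestedColumn];
    } else {
      // Scalars, arrays, and objects beyond the depth budget are kept as-is.
      row[column] = child;
    }
  }
  return row;
}
```

With flattenDepth 1, { owner: { name, address: { city } } } yields the columns owner.name and owner.address (still an object); with flattenDepth 2, the second column becomes owner.address.city.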
How do I track files across derivative versions?
file_list_versions { repo_id?, file_id?, dataset_id? }
file_repo_summary { repo_id }
store_analysis { file_id?, dataset_file_id?, result, engine_name?, engine_version? }
- file_list_versions walks the lineage chain for a file (or every file in a repo).
- store_analysis writes a Conduit analysis result back to the registry, linked to a file.
Platform Administration
How do I create an API key?
tokens_create_api_key { account_id, name, scopes?, expires_at?, category?, tier?, description? }
tokens_list { type?, account_id?, include_revoked? }
tokens_revoke { token_id }
- account_id is required; admins create keys on behalf of an account.
- scopes: array of scope strings, e.g. ["read:datasets", "vault:read", "admin:*"]. Default: ["public:read"].
- category: test | mcp_tool | upload | general (default general).
- tier: free | education | developer | team | enterprise (default free).
- tokens_revoke uses the parameter name token_id (not keyId).
How do I check if a token is valid?
auth_verify_token { token }
- Returns { ok, account, scopes, tier, expires_at, ... }. Read-only.
How do I list / create / revoke sessions?
sessions_list { account_id, include_revoked? }
sessions_create { account_id, user_agent?, ip_address? }
sessions_revoke { session_id }
- sessions_list.account_id is required.
- sessions_create does not take a TTL; session lifetime is determined by server config.
How do I see my tier and usage?
tiers_get_all { keyLimit?, keyOffset?, tiersOnly? }
account_usage { keyId }
account_update { keyId, overrides: { dailyLimit?, monthlyLimit?, tier?, scopes? } }
- account_usage is per-key; pass the keyId you want usage for.
- tiers_get_all lists tier configs and (optionally) keys.
How do I manage tiers (admin only)?
tier_update { tier, patch: { dailyLimit?, totalLimit?, monthlyLimit?, priceCents?, ... } }
admin_manage { action: "get_account" | "set_tier" | "get_usage" | "list_tiers", email?, account_id?, key_id?, tier?, max_datasets?, reason? }
admin_test_keys { tier?, format? }
- tier_update uses an enum tier name (free | education | developer | team | enterprise) plus a patch object.
- admin_manage is the dispatcher for several admin actions; pass the matching arguments for each action.
How do I check platform health?
health_check { mode? }
- mode: "quick" (default; fast probes) or "full" (full doctor report with AI diagnostics).
- Returns version, DB / queue status, disk space, provider availability, and migration state.
How do I track a long-running analysis job?
analysis_queue_submit { dataset_id, file_id?, job_type?, download_url?, priority? } # returns { job_id }
analysis_queue_status { job_id?, status_filter?, limit? }
analysis_queue_cancel { job_id }
analysis_queue_retry { job_id }
- The worker processes the job via pg-boss. Status values: pending, running, completed, failed, cancelled, retrying.
- Progress is reported per section of the 20-section AnalysisContract.
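A polling loop over analysis_queue_status might look like the sketch below. Several things here are assumptions: callTool stands in for whatever MCP client function you use, the response is assumed to expose a status field, and treating completed, failed, and cancelled as terminal (with retrying still in flight) is a guess based on the status names.

```typescript
type JobStatus = "pending" | "running" | "completed" | "failed" | "cancelled" | "retrying";

// Assumed terminal states; "retrying" means the job is still in flight.
function isTerminalStatus(jobStatus: JobStatus): boolean {
  return jobStatus === "completed" || jobStatus === "failed" || jobStatus === "cancelled";
}

// Hypothetical MCP client signature; replace with your client's real call function.
type ToolCaller = (tool: string, args: { [key: string]: unknown }) => Promise<{ status: JobStatus }>;

// Poll analysis_queue_status until the job reaches a terminal state or the poll budget runs out.
async function waitForJob(
  callTool: ToolCaller,
  jobId: string,
  intervalMs: number = 5000,
  maxPolls: number = 120
): Promise<JobStatus> {
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const result = await callTool("analysis_queue_status", { job_id: jobId });
    if (isTerminalStatus(result.status)) return result.status;
    // Sleep between polls to avoid hammering the queue.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Job " + jobId + " did not finish within " + maxPolls + " polls");
}
```

On a failed result, analysis_queue_retry { job_id } re-queues the job; on completed, get_analysis returns the finished contract.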
Workspace & Settings
How do I update dataset-level settings?
dataset_settings { dataset_id, is_frozen?, is_locked?, schedule_enabled?, schedule_cron?, auto_analyze?, auto_load_db?, process_derivatives?, workspace_id?, notes?, sampling_config?, semantic_overrides?, derivative_file_id?, derivative_override_frozen?, derivative_schedule_cron? }
- Parameters are flat (not nested under set: {...}).
- is_frozen / is_locked: freeze data activity / lock all access (maintenance mode).
- schedule_enabled / schedule_cron: override the project schedule for this dataset.
- auto_analyze / auto_load_db / process_derivatives: toggles for the post-refresh hooks.
- sampling_config: { strategy, sampleSize?, maxRowsHardLimit?, maxFileSizeMbForFull?, fallbackStrategy? }; strategies are full | first_n | last_n | random | bookend | stratified.
- semantic_overrides: manual semantic-type overrides for specific columns.
- notes: free-text dataset notes.
- Not exposed via this tool: diff_config, masking_mode, RAG settings, retention. Masking is set via set_masking_mode. Diff config currently requires SQL (see Diff & CDC).
How do I refresh the cached summary or rebuild the masking view?
dataset_refresh_summary { dataset_id, table? }
dataset_reload_masking { dataset_id, framework, table? }
- dataset_reload_masking.framework is required; pass the compliance framework (hipaa | sox | cjis | pci | gdpr | ccpa).
How do I see workspace context?
workspace_info { workspace_id? }
- Returns: workspace ID, tier, feature flags, user role, enabled providers. Defaults to the caller's default workspace.
How do I look up a configuration setting?
settings_docs { search?, category?, mcp_only? }
- Browse the canonical settings registry. Categories: Core | Security | MCP | AI | Storage | Crawler | Billing | System.
Accounts (subjects_*)
Naming note: the tools prefixed subjects_ operate on account records (the platform's user accounts), not on GDPR data subjects. The naming is historical.
subjects_create { email, name?, password?, provider? }
subjects_list { email?, status?, limit? }
subjects_update { account_id, status?, name? }
- provider: local | google | github | test (default local).
- status: active | pending | suspended | all (filter for subjects_list).
UX & Reporting
How do I start a UX session?
ux_session_start { format?, client_hint?, demo_mode? }
- format: auto | html | markdown | plain. The auto setting detects the renderer based on client_hint.
- client_hint: claude_chat | claude_code | web | api.
- demo_mode: enables a guided walkthrough.
How do I save and retrieve reports?
report_list { status?, dataset_id?, limit?, offset? }
report_save { title, html_content, description?, visibility?, generation_prompt?, dataset_ids?, file_ids?, tags?, status?, auto_regenerate?, report_id? }
report_get { id?, slug? }
- title + html_content are required for report_save. Pass an existing report_id to update.
- visibility: private | link | public (default link).
- status: draft | published.
- dataset_ids / file_ids: lineage tracking, used to surface the report when its sources refresh and to enable auto_regenerate.
- report_get accepts either id or slug, not both.
- All reports must comply with Report Scroll/Sizing Standard-001 before saving.
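The two rules above that are easiest to get wrong (title and html_content are both required, and report_get takes id or slug but not both) can be checked before the call is sent. The sketch below is one way to do that. The builder functions are hypothetical helpers, and rejecting blank strings is an assumption about what "required" means here.

```typescript
type Visibility = "private" | "link" | "public";
type ReportStatus = "draft" | "published";

// Subset of the report_save arguments listed above.
interface ReportSaveArgs {
  title: string;
  html_content: string;
  visibility?: Visibility; // omitted means the documented default (link) applies
  status?: ReportStatus;
  dataset_ids?: string[];
  report_id?: string; // pass an existing ID to update instead of create
}

// report_get accepts exactly one of `id` or `slug`.
type ReportGetArgs = { id: string } | { slug: string };

// Hypothetical helper: rejects blank required fields (blank-check is an assumption).
function buildReportSaveArgs(args: ReportSaveArgs): ReportSaveArgs {
  if (args.title.trim() === "" || args.html_content.trim() === "") {
    throw new Error("report_save: title and html_content are required");
  }
  return args;
}

// Hypothetical helper: enforces "either id or slug, not both".
function buildReportGetArgs(ref: { id?: string; slug?: string }): ReportGetArgs {
  const hasId = ref.id !== undefined;
  const hasSlug = ref.slug !== undefined;
  if (hasId === hasSlug) {
    throw new Error("report_get: pass exactly one of id or slug");
  }
  return hasId ? { id: ref.id as string } : { slug: ref.slug as string };
}

// Example: a draft report linked to one source dataset (placeholder ID).
const draft = buildReportSaveArgs({
  title: "Weekly quality report",
  html_content: "<h1>Quality</h1>",
  status: "draft",
  dataset_ids: ["<dataset-uuid>"],
});
```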
Events & Timeline
How do I record a timeline event?
events_ingest { datasetId, eventType, actor?, source?, metadata? }
- eventType: event identifier (e.g. dataset.read, dataset.recommend, dataset.enrich).
- metadata: optional context object; it surfaces in the dataset's timeline and audit log.
- Note: this is one of the few tools using camelCase parameter names (datasetId, not dataset_id).
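Because most tools in this guide take snake_case arguments, an agent that builds payloads generically may send dataset_id to events_ingest by mistake. A minimal sketch of an adapter follows. The parameter names are from the signature above; the string types for actor and source, and the toEventsIngestArgs helper itself, are assumptions.

```typescript
// camelCase argument names as documented for events_ingest.
interface EventsIngestArgs {
  datasetId: string;
  eventType: string;
  actor?: string; // type assumed
  source?: string; // type assumed
  metadata?: Record<string, unknown>;
}

// Hypothetical helper: maps snake_case input onto the camelCase names this tool expects.
function toEventsIngestArgs(input: {
  dataset_id: string;
  event_type: string;
  metadata?: Record<string, unknown>;
}): EventsIngestArgs {
  const args: EventsIngestArgs = {
    datasetId: input.dataset_id,
    eventType: input.event_type,
  };
  if (input.metadata !== undefined) {
    args.metadata = input.metadata;
  }
  return args;
}

// Example: record a read event with a small context object.
const readEvent = toEventsIngestArgs({
  dataset_id: "<dataset-uuid>",
  event_type: "dataset.read",
  metadata: { reason: "user asked for a preview" },
});
```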
Workflows
How do I browse pre-built workflows?
workflow_list { archetype?, domain? }
workflow_get { workflow_id }
- archetype: have_data | what_exists | build_something | transform | govern | explore | bootstrap.
- domain: functional domain code (Acq, Ana, Qry, Tfm, Prv, Ver, Gov, Rpt, IAM, Plt, Jsn).
- workflow_get.workflow_id: e.g. "W01", "W32".
Common multi-step workflows
Workflow: Ingest → analyze → classify → mask → query
1. dataset_ingest_url { url } -> { dataset_id }
2. analysis_queue_submit { dataset_id, job_type: "full_analysis" } -> { job_id }
3. analysis_queue_status { job_id } (poll until completed)
4. classify_dataset { datasetId, autoApprove: true } -> { tags }
5. set_masking_mode { dataset_id, mode: "view" } (if PII detected)
6. dataset_load { dataset_id }
7. dataset_query { dataset_id, sql: "SELECT ..." }
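Steps 1 to 3 of the workflow above (ingest, submit, poll) can be sketched as follows. This is an illustration under stated assumptions, not the platform's client: CallTool stands in for whatever function your MCP client uses to send a tool call, and the status field name and the "failed" value are guesses. Only "completed" is taken from the workflow text.

```typescript
// Hypothetical client: any function that sends one tool call and resolves with its result.
type CallTool = (tool: string, args: Record<string, unknown>) => Promise<Record<string, unknown>>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls analysis_queue_status until the job reports "completed".
// Assumes the result carries a `status` field; "failed" is an assumed terminal value.
async function waitForJob(
  callTool: CallTool,
  jobId: string,
  intervalMs = 2000,
  maxAttempts = 30
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await callTool("analysis_queue_status", { job_id: jobId });
    if (result.status === "completed") return;
    if (result.status === "failed") throw new Error(`job ${jobId} failed`);
    await sleep(intervalMs);
  }
  throw new Error(`job ${jobId} did not complete after ${maxAttempts} polls`);
}

// Steps 1-3: ingest the URL, submit a full analysis, wait for it to finish.
// Resolves with the dataset ID so later steps (classify, mask, load, query) can use it.
async function ingestAndAnalyze(
  callTool: CallTool,
  url: string,
  intervalMs = 2000
): Promise<string> {
  const ingest = await callTool("dataset_ingest_url", { url });
  const datasetId = String(ingest.dataset_id);
  const job = await callTool("analysis_queue_submit", {
    dataset_id: datasetId,
    job_type: "full_analysis",
  });
  await waitForJob(callTool, String(job.job_id), intervalMs);
  return datasetId;
}
```

Capping the number of polls keeps a stuck job from blocking the session indefinitely; the interval and cap shown are arbitrary defaults.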
Workflow: Set up diff-refresh pipeline
1. (one-time SQL) UPDATE registry.dataset_settings
                  SET diff_config = jsonb_set(diff_config, '{enabled}', 'true')
                  WHERE dataset_id = '<uuid>';
2. dataset_ingest_url { url, dataset_id } # first run — seeds the baseline
3. dataset_ingest_url { url, dataset_id } # subsequent run — diff is computed automatically
4. dataset_changes { dataset_id, since: "2026-01-01T00:00:00Z" } # see what changed
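For steps 3 and 4, the since value has to be produced from whatever timestamp the caller recorded at the previous refresh. The sketch below formats a Date in the same shape as the example above and builds the two calls in order. ToolCall, toSinceTimestamp, and buildDiffRefreshCalls are hypothetical names, and whether the tool also accepts timestamps with milliseconds is not stated in this guide.

```typescript
// Hypothetical shape for a planned tool call.
interface ToolCall {
  tool: string;
  args: Record<string, string>;
}

// Formats a Date as an ISO 8601 UTC timestamp without milliseconds,
// e.g. "2026-01-01T00:00:00Z", matching the example in the workflow above.
function toSinceTimestamp(date: Date): string {
  return date.toISOString().replace(/\.\d{3}Z$/, "Z");
}

// Builds a refresh followed by a change lookup.
// `lastRefresh` is the time of the previous ingest, tracked by the caller.
function buildDiffRefreshCalls(url: string, datasetId: string, lastRefresh: Date): ToolCall[] {
  return [
    { tool: "dataset_ingest_url", args: { url, dataset_id: datasetId } },
    { tool: "dataset_changes", args: { dataset_id: datasetId, since: toSinceTimestamp(lastRefresh) } },
  ];
}

// Example with placeholder values.
const refreshPlan = buildDiffRefreshCalls(
  "https://example.com/data.csv",
  "<dataset-uuid>",
  new Date(Date.UTC(2026, 0, 1))
);
```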
Workflow: Build a knowledge base from documents
1. dataset_ingest_batch { urls: [ { url: "...pdf" }, { url: "...docx" }, ... ] }
2. (wait for ingest) each document auto-extracts to text
3. rag_index { dataset_id, column_mapping: { ... } }
4. document_search { dataset_id, query: "what does the report say about X?" }
Workflow: Match entities across two customer datasets
1. dataset_load { dataset_id: A }
2. dataset_load { dataset_id: B }
3. schema_align_columns { sources: [{ dataset_id: A, alias: "a" }, { dataset_id: B, alias: "b" }] }
4. dataset_create_union { sources: [{ dataset_id: A, alias: "a", column_mapping: {...} },
                                    { dataset_id: B, alias: "b", column_mapping: {...} }] }
5. (the MDM pipeline runs automatically on the unioned dataset)
6. (under a topic) golden_records_list { topic_id }
7. stewardship_decide { stewardship_id, decision: "merge" } # for uncertain matches
Workflow: Archive old datasets to S3
1. connection_create { name: "cold-storage", connection_type: "api_key", config: { ... }, provider_name: "s3" }
2. connection_test { connection_id }
3. dataset_archive { dataset_id, connection_id, drop_after: true }
4. (later) dataset_restore { dataset_id, archive_id, as_snapshot: true, connection_id }
Deep dives
For more detail on any area, call:
- help_page("CAPABILITIES"): one-page capability cheat sheet (every feature in one line).
- help_page("codebase/FEATURE_INVENTORY"): complete inventory of every module, function, MCP tool, and setting, organized by feature domain.
- help_page("codebase/FUNCTION_REFERENCE"): alphabetical index of every exported function (Ctrl+F friendly).
- help_page("codebase/INDEX"): codebase architecture overview.
- help_page("codebase/MCP_TOOLS"): per-tool reference (inputs, outputs, cost).
- help_page("codebase/SCHEMAS"): PG schema map.
- help_page("codebase/API_ENDPOINTS"): HTTP API endpoints (/api/*).
- help_page("providers/socrata"), help_page("providers/ckan"), help_page("providers/arcgis"), help_page("providers/opendatasoft"), help_page("providers/dcat"), help_page("providers/filesystem"): per-provider guides.
- help_page("quickstart"): installation and first-run.
- help_page("devops/troubleshooting"): common errors and fixes.
Conventions agents should follow
- Always check health_check first when a session starts. If ok: false, surface the error before attempting destructive operations.
- Always call workspace_info on first interaction to learn the tier and available features.
- Never assume a dataset is loaded. Call dataset_load (idempotent) before dataset_query.
- Use analysis_preflight_check { url } before downloading large sources to avoid wasting tier quota.
- Diff refresh requires diff_config.enabled = true in registry.dataset_settings (currently SQL-only). Once set, every refresh diffs automatically; there is no refreshMode parameter.
- When PII is detected, set a masking mode before running queries that might expose data. Modes are view or physical.
- Quote function names and dataset IDs in backticks in responses to users. It keeps outputs copy-paste friendly.
- Use ux_session_start for multi-step journeys so the platform can guide next steps.
- Respect cost tiers. RED = refuse, ORANGE = confirm with user, YELLOW = proceed but warn, GREEN = proceed silently.
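The cost-tier rule in the last convention maps naturally onto a small lookup. The sketch below encodes it as written. How an agent learns a given tool's tier is outside this section (the per-tool reference lists cost), and the CostTier, TierAction, and mayProceed names are hypothetical.

```typescript
type CostTier = "RED" | "ORANGE" | "YELLOW" | "GREEN";
type TierAction = "refuse" | "confirm" | "warn" | "proceed";

// Tier-to-action mapping as stated in the conventions list.
const TIER_ACTIONS: Record<CostTier, TierAction> = {
  RED: "refuse",
  ORANGE: "confirm",
  YELLOW: "warn",
  GREEN: "proceed",
};

// Decides whether a call may run. `userConfirmed` only matters for ORANGE.
function mayProceed(tier: CostTier, userConfirmed: boolean): boolean {
  const action = TIER_ACTIONS[tier];
  if (action === "refuse") return false;
  if (action === "confirm") return userConfirmed;
  // "warn" and "proceed" both run; the caller is expected to show a warning for "warn".
  return true;
}
```

Keeping the mapping in one table means a change to the policy touches a single place rather than every call site.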
This guide is verified against scripts/mcp-server.ts and lib/conduit/diffConfig.ts for v0.19.9. Every parameter name above matches the actual Zod input schema. Last accuracy pass: v0.19.9c.