DataShield MCP Dataset Library

Other providers & sitemap discovery

Use robots.txt + sitemap.xml heuristics to discover dataset pages on unknown sites.

When to use

If a portal is not Socrata/ArcGIS/CKAN/OpenDataSoft/DCAT, use the OTHER provider.

What it does

Given a source URL:

  • If the URL looks like a sitemap (it ends in .xml or contains "sitemap"), the provider parses it directly.
  • Otherwise it tries:
    • robots.txt sitemap declarations
    • /sitemap.xml and /sitemap_index.xml

Then it filters URLs for likely dataset-related pages.
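The discovery order above can be sketched roughly as follows. This is a minimal illustration, not the provider's actual implementation: the function names, the DATASET_HINTS keyword list, and the choice to match hints against the URL path are all assumptions, since the real filter criteria are not documented here. The functions take already-fetched text so the logic can be read and run without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

# Hypothetical keyword list; the real filter's criteria are not documented.
DATASET_HINTS = ("dataset", "data", "catalog", "download")


def looks_like_sitemap(url: str) -> bool:
    """True if the URL path ends in .xml or the URL contains 'sitemap'."""
    path = urlparse(url).path.lower()
    return path.endswith(".xml") or "sitemap" in url.lower()


def sitemaps_from_robots(robots_text: str) -> list[str]:
    """Collect the values of 'Sitemap:' lines from robots.txt content."""
    found = []
    for line in robots_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(source_url: str, robots_text: str = "") -> list[str]:
    """Return sitemap URLs to try for a source, in the order described above."""
    if looks_like_sitemap(source_url):
        return [source_url]
    candidates = sitemaps_from_robots(robots_text)
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        url = urljoin(source_url, path)
        if url not in candidates:
            candidates.append(url)
    return candidates


def parse_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value, ignoring the XML namespace prefix."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def filter_dataset_urls(urls, hints=DATASET_HINTS) -> list[str]:
    """Keep URLs whose path contains any of the (assumed) dataset hints."""
    return [u for u in urls if any(h in urlparse(u).path.lower() for h in hints)]
```

In a real run, the robots.txt and sitemap bodies would be fetched over HTTP before being passed to these functions; that step is left out to keep the sketch self-contained.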

Limits

To avoid runaway crawls, discovery returns a capped list.
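One way such a cap might work is sketched below. The MAX_RESULTS value and the de-duplication step are assumptions for illustration only; the actual limit and behavior are not stated here.

```python
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical cap; the real limit is not documented in this section.
MAX_RESULTS = 500


def unique(urls: Iterable[str]) -> Iterator[str]:
    """Yield each URL once, preserving first-seen order."""
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


def capped(urls: Iterable[str], limit: int = MAX_RESULTS) -> list[str]:
    """Stop consuming the input as soon as `limit` unique URLs are collected."""
    return list(islice(unique(urls), limit))
```

Because islice stops pulling from the input once the limit is reached, a very large sitemap is never read past the cap.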

If you need more:

  • Add multiple sources, each pointing at a different sitemap chunk
  • Split into multiple projects
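If the site publishes a sitemap index, its child sitemaps are natural candidates for separate sources. The sketch below lists them from an already-fetched index; child_sitemaps is a hypothetical helper, not part of the product.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def child_sitemaps(index_xml: str) -> list[str]:
    """Return the <loc> of each <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    children = []
    for el in root.iter():
        if local_name(el.tag) != "sitemap":
            continue
        for sub in el:
            if local_name(sub.tag) == "loc" and sub.text:
                children.append(sub.text.strip())
    return children
```

Each returned URL could then be added as its own source, so that every chunk gets its own capped result list.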