Other providers & sitemap discovery
Use robots.txt + sitemap.xml heuristics to discover dataset pages on unknown sites.
When to use
If a portal is not Socrata/ArcGIS/CKAN/OpenDataSoft/DCAT, use the OTHER provider.
What it does
Given a source URL:
- If it looks like a sitemap (`*.xml` or contains `sitemap`), it parses it.
- Otherwise it tries:
  - `robots.txt` sitemap declarations
  - `/sitemap.xml` and `/sitemap_index.xml`
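
The lookup order above can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: the function names `looks_like_sitemap`, `sitemaps_from_robots`, and `candidate_sitemaps` are hypothetical, and the caller is assumed to have already fetched the `robots.txt` body.

```python
from urllib.parse import urljoin, urlparse


def looks_like_sitemap(url: str) -> bool:
    """Heuristic from the text: path ends in .xml or the URL mentions 'sitemap'."""
    path = urlparse(url).path.lower()
    return path.endswith(".xml") or "sitemap" in url.lower()


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the values of 'Sitemap:' lines from a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own colons stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(source_url: str, robots_txt: str = "") -> list[str]:
    """Return sitemap URLs to try, in order (hypothetical helper)."""
    if looks_like_sitemap(source_url):
        return [source_url]
    candidates = sitemaps_from_robots(robots_txt)
    for path in ("/sitemap.xml", "/sitemap_index.xml"):
        candidates.append(urljoin(source_url, path))
    # Drop duplicates while keeping the first occurrence of each URL.
    return list(dict.fromkeys(candidates))
```

Keeping the network fetch outside these helpers makes the ordering easy to test with a literal `robots.txt` string.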
Then it filters URLs for likely dataset-related pages.
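
As an illustration of the filtering step, here is a small sketch that pulls `<loc>` entries out of a sitemap and keeps only URLs whose path contains a dataset-like keyword. The keyword list `DATASET_HINTS`, the function names, and the default `limit` of 500 are all assumptions made for this example; the provider's real keywords and cap may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Assumed keywords; the actual heuristics are not documented here.
DATASET_HINTS = ("dataset", "data", "catalog", "download")


def extract_locs(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(sitemap_xml)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def filter_dataset_urls(urls, hints=DATASET_HINTS, limit=500):
    """Keep URLs whose path mentions a hint, stopping at `limit` results."""
    kept = []
    for url in urls:
        path = urlparse(url).path.lower()
        if any(hint in path for hint in hints):
            kept.append(url)
            if len(kept) >= limit:
                break
    return kept
```

Matching on the path rather than the full URL avoids false positives from host names such as `data.example.org`, at the cost of missing portals that only signal datasets in the domain.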
Limits
To avoid runaway crawls, discovery returns a capped list.
If you need more:
- Add multiple sources (different sitemap chunks)
- Split into multiple projects
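
One way to find sitemap chunks to add as separate sources is to read the site's sitemap index, which in the standard sitemaps format lists child sitemaps inside `<sitemap><loc>` elements. The helper below is a hypothetical sketch that assumes the index XML has already been downloaded; `child_sitemaps` is not part of the tool.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def child_sitemaps(index_xml: str) -> list[str]:
    """List child sitemap URLs from a sitemap index; empty if not an index."""
    root = ET.fromstring(index_xml)
    if local_name(root.tag) != "sitemapindex":
        return []
    children = []
    for sitemap in root:
        if local_name(sitemap.tag) != "sitemap":
            continue
        for el in sitemap:
            if local_name(el.tag) == "loc" and el.text:
                children.append(el.text.strip())
    return children
```

Each returned URL can then be entered as its own source, so that every chunk gets its own capped discovery run.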