Crawler
The crawl endpoint lets you trigger a crawl of a website and store the contents in a collection. You can provide a start URL or a sitemap URL.
Crawl Website
This route triggers a crawl of a website and adds the webpage contents to a collection.
Path Params
- Name: collection_id
- Type: string
- Description: The ID of the collection to add the webpage contents to.
Required attributes
- Name: sites
- Type: array
- Description: Array of the sites to crawl. Each entry is an object with the shape:

start_url: string
settings?: { "type": "sitemap" }
metadata?: object
The settings object is optional. Currently you only need to include it if you are scraping using a sitemap. If you are scraping from a sitemap, set start_url to the sitemap URL. Otherwise, we will crawl the site and discover the URLs. Discovery takes a bit of time right now; we are working on speeding this up.
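For example, a sitemap crawl would point start_url at the sitemap itself and add the settings object. This is a sketch assuming the same endpoint and auth header as the Request example; the sitemap URL and SECRET_KEY are placeholders.

```shell
# Sitemap crawl: start_url is the sitemap URL, and settings marks
# the site as sitemap-based. SECRET_KEY and the sitemap URL below
# are placeholders.
curl -X POST 'https://api.scoutos.com/v1/collections/:collection_id/crawl' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer SECRET_KEY' \
  --data-raw '{"sites": [{"start_url": "https://scoutos.com/sitemap.xml", "settings": {"type": "sitemap"}}]}'
```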
You can also add a metadata object to a site object. This metadata will be stored with the site in the collection, allowing you to filter by it later when querying the collection.
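As a sketch, attaching metadata to a site might look like the following. The metadata keys here ("category", "lang") are made up for illustration; any JSON object should work.

```shell
# Attach arbitrary metadata to a site. The keys below are
# illustrative placeholders you could filter on when querying
# the collection later.
curl -X POST 'https://api.scoutos.com/v1/collections/:collection_id/crawl' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer SECRET_KEY' \
  --data-raw '{"sites": [{"start_url": "https://scoutos.com", "metadata": {"category": "marketing", "lang": "en"}}]}'
```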
Request
curl -X POST 'https://api.scoutos.com/v1/collections/:collection_id/crawl' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer SECRET_KEY' \
--data-raw '{"sites": [{"start_url": "https://scoutos.com"}]}'
Response
{
"ok": true
}