Crawler

The crawl endpoint lets you trigger a crawl of a website and store the contents in a collection. You can provide a start URL or a sitemap.


POST https://api.scoutos.com/v1/collections/:collection_id/crawl

Crawl Website

This route triggers a crawl of a website and adds the webpage contents to a collection.

Path Params

  • collection_id
    Type: string
    Description: The ID of the collection to add the webpage contents to.

Required attributes

  • sites
    Type: array
    Description: Array of the sites to crawl. It takes an array of objects with the shape:

    start_url: string
    settings?: { "type": "sitemap" }
    metadata?: object
    

    settings is optional; currently you only need to include it if you are scraping from a sitemap. If you are scraping from a sitemap, set start_url to the sitemap URL. Otherwise, we will crawl the site and discover the URLs. Discovery takes a bit of time right now; we are working on speeding this up.

    You can also add a metadata object to a site object. This metadata will be added to the site object in the collection, which allows you to filter by it later when querying the collection.
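For example, a sitemap crawl with metadata attached might look like the following sketch. The sitemap URL and the metadata keys and values here are illustrative, not real endpoints or required fields:

```shell
# Crawl from a sitemap: start_url points at the sitemap itself,
# settings marks it as a sitemap crawl, and metadata tags the
# resulting documents for later filtering.
curl -X POST 'https://api.scoutos.com/v1/collections/:collection_id/crawl' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer SECRET_KEY' \
--data-raw '{"sites": [{"start_url": "https://scoutos.com/sitemap.xml", "settings": {"type": "sitemap"}, "metadata": {"source": "marketing-site"}}]}'
```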

Request

POST https://api.scoutos.com/v1/collections/:collection_id/crawl
curl -X POST 'https://api.scoutos.com/v1/collections/:collection_id/crawl' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer SECRET_KEY' \
--data-raw '{"sites": [{"start_url": "https://scoutos.com"}]}'

Response

{
  "ok": true
}