Scraping a Website

Comprehensive guide to web scraping options

Scout provides the ability to setup custom web scrapes sources to populate the tables in your Scout collections with documents. Before getting started created a web scraping source, ensure you’ve created a collection.

There are three different web scraping options: you can enter a URL for scraping a single web page, multiple pages via a sitemap, or an entire site (all URLs) using our crawler.

Types of Web Scrapes

Single Web Page

Enter the URL of the single web page you wish to scrape. This will target and extract content from the specific page.

Website Crawl (Multiple Pages)

Initiate a full crawl of a website, scraping multiple linked pages. This option scrapes content from all available pages across the domain.

Sitemap

Provide a sitemap URL. The sitemap will be used to locate and scrape specific pages from the website, as indicated by the sitemap structure.

Configuring Web Scrapes

There are several inputs and settings to give you full control and customization over the scraping process:

Settings

URL
stringRequired

The full URL of the page, domain, or sitemap you want to scrape.

Scraper
selectDefaults to HttpRequired

Select the scraper technology. Available options are:

  • Http - Faster, but may not mix well with heavily client-side rendered sites.
  • Playwright - Heavier and slower, but renders dynamic content, which works well with heavily client-side rendered sites.
Text Extractor
selectDefaults to ReadabilityRequired

Defines how text is extracted from the page. Options include:

  • Readability - Scout’s smart scraping logic to include relevant components.
  • Trafilatura - A Python package and command-line tool used to gather text on the web.
Allow
string

Comma-separated list of URL patterns to include. Only URLs matching these patterns will be crawled.

Deny
string

Comma-separated list of URL patterns to exclude. URLs matching these patterns will be skipped.

Exclude Patterns
string

Comma-separated list of regex patterns to exclude. URLs matching any of these patterns will be skipped. For example: “/private/, login.html$’ will skip URLs containing ‘/private/’ and ending with ‘login.html’

Strip
string

Comma-separated list of html tags to strip from the content.

Strip URLs
boolDefaults to true

Removes query parameters, fragments, or trailing slashes, to simplify or standardize urls.

Allowed Domains
string

Comma-separated list of domains to include. Only URLs from these domains will be crawled. Defaults to the domain of the starting URL.

Max Depth
intDefaults to 5Required

Sets the maximum depth for the crawl. This limits how deep the scraper will follow links from the original page.

Max Page Count
intDefaults to 3000Required

Defines the maximum number of pages to scrape. Use this to prevent overly large scrapes.

Mapping

Frequency

URL
stringRequired

The full URL of the page, domain, or sitemap you want to scrape.

Advanced Options

Monitoring Web Scrapes

You can monitor the progress of your web scrapes in real-time from the dashboard. Once you click the “Run” button to start a web scrape, you’ll land on a page showing you the progress and status of the process:

Results

Once scraping is complete, the content will be stored as documents in a collection, with each web page being a separate document.

Built with