Comprehensive guide to web scraping options

There are three web scraping options: scrape a single web page by entering its URL, crawl an entire site (all linked URLs) with our crawler, or scrape a specific set of pages listed in a sitemap.

Types of Web Scrapes

Single Web Page

Enter the URL of the single web page you wish to scrape. This will target and extract content from the specific page.

Website Crawl (Multiple Pages)

Initiate a full crawl of a website. The crawler follows links from the starting page and scrapes content from all available pages across the domain.

Sitemap

Provide a sitemap URL. The sitemap will be used to locate and scrape specific pages from the website, as indicated by the sitemap structure.
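
As a rough illustration of how the sitemap option works, the sketch below fetches a standard sitemap.xml and collects the page URLs listed in its <loc> entries. The sitemap URL is a placeholder, and this is not the platform's internal implementation.

```python
import requests
import xml.etree.ElementTree as ET

# Standard namespace used by sitemaps.org sitemap files
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def pages_from_sitemap(sitemap_url: str) -> list[str]:
    """Fetch a sitemap.xml and return the page URLs it lists."""
    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # Each <url><loc>...</loc></url> entry points at one page to scrape.
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]

urls = pages_from_sitemap("https://example.com/sitemap.xml")  # placeholder URL
print(f"{len(urls)} pages listed in the sitemap")
```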

Customizing Web Scrapes

There are several inputs and settings to give you full control and customization over the scraping process:

URL
string (Required)

The full URL of the page, domain, or sitemap you want to scrape.

Scraper
select (Required)

Select the scraper technology. Available options are:

  • HTTP - Faster, but may not work well with heavily client-side rendered sites.
  • Playwright - Heavier and slower, but renders dynamic content, making it well suited to heavily client-side rendered sites.
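
To make the trade-off concrete, here is a minimal sketch of the two approaches (illustrative only, not the platform's internal scrapers): a plain HTTP request returns the HTML exactly as the server sends it, while Playwright drives a headless browser and returns the DOM after client-side JavaScript has run.

```python
import requests
from playwright.sync_api import sync_playwright

url = "https://example.com"  # placeholder URL

# HTTP: fast, but returns only the HTML the server sends.
# Content injected by client-side JavaScript will be missing.
static_html = requests.get(url, timeout=30).text

# Playwright: slower and heavier, but runs a real browser,
# so the returned DOM includes client-side rendered content.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    rendered_html = page.content()
    browser.close()
```
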
Text Extractor
select (Required)

Defines how text is extracted from the page. Options include:

  • Custom - Your own custom scraping logic.
  • Readability - Scout’s smart extraction logic that keeps only the relevant components of the page.
  • Trafilatura - A Python package and command-line tool used to gather text on the web.
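
As an illustration of the Trafilatura option, this is roughly what the open-source package does on its own (the article URL is a placeholder):

```python
import trafilatura

# Download the raw HTML, then extract the main text,
# dropping boilerplate such as navigation, ads, and footers.
downloaded = trafilatura.fetch_url("https://example.com/article")  # placeholder URL
if downloaded:
    text = trafilatura.extract(downloaded)
    print(text)
```
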
Include Selectors
string

CSS selectors for the HTML elements you want to include in the scrape. For example, specific tags such as body, h1, or p.

Exclude Selectors
string

CSS selectors for the HTML elements you want to exclude from the scraping process. For example, elements like .navbar, #header, footer.
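
To show what the two selector fields do in practice, here is a minimal sketch using BeautifulSoup (an illustration of the idea, not the platform's internal logic): exclude selectors drop matching elements first, then include selectors pick out the elements whose text is kept.

```python
from bs4 import BeautifulSoup

html = "<html>...</html>"  # page HTML fetched earlier
soup = BeautifulSoup(html, "html.parser")

# Exclude Selectors: drop navigation chrome before extracting text.
for element in soup.select(".navbar, #header, footer"):
    element.decompose()

# Include Selectors: keep only the text of the targeted elements.
text = "\n".join(el.get_text(" ", strip=True) for el in soup.select("h1, p"))
```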

Remove Common Elements
boolean

Select this option if you want to automatically remove common elements such as headers and footers that often appear on multiple pages.

Advanced Options

Exclude Pages with a lastmod Date Prior to
date

Exclude any pages with a lastmod date older than the specified date. This helps to avoid scraping outdated content.
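
A sketch of what this filter amounts to, assuming the pages come from a sitemap whose entries carry <lastmod> values (the cutoff date is hypothetical; this is not the platform's internal code):

```python
from datetime import date
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fresh_urls(sitemap_xml: bytes, cutoff: date) -> list[str]:
    """Keep only sitemap URLs whose <lastmod> is on or after the cutoff."""
    keep = []
    for entry in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=NS)
        # lastmod uses W3C datetime format; the leading YYYY-MM-DD is enough
        # to compare. Entries with no lastmod are kept in this sketch.
        if not lastmod or date.fromisoformat(lastmod[:10]) >= cutoff:
            keep.append(loc)
    return keep

pages = fresh_urls(sitemap_bytes, cutoff=date(2024, 1, 1))  # hypothetical cutoff
```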

Max Depth
int

Sets the maximum depth for the crawl. This limits how deep the scraper will follow links from the original page.

Strip
string

Element selectors that you want stripped from the document.

Max Page Count
int

Defines the maximum number of pages to scrape. Use this to prevent overly large scrapes.
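
To illustrate how Max Depth and Max Page Count bound a crawl together, here is a simplified breadth-first sketch; the parameter names and defaults are hypothetical, and it omits details such as robots.txt and error handling.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_depth: int = 2, max_page_count: int = 100) -> list[str]:
    """Breadth-first crawl bounded by link depth and total page count."""
    seen, scraped = {start_url}, []
    queue = deque([(start_url, 0)])  # (url, depth from the starting page)
    while queue and len(scraped) < max_page_count:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=30).text
        scraped.append(url)
        if depth >= max_depth:
            continue  # stop following links beyond the maximum depth
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if next_url not in seen:
                seen.add(next_url)
                queue.append((next_url, depth + 1))
    return scraped
```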

Allow
string

A list of allowed paths or patterns; only URLs matching one of these will be included in the crawl.

Allowed Domains
string

Specify which domains the crawler is allowed to visit. Links to any domain not listed here will be excluded from the crawl.

Deny
string

A list of paths or patterns that the scraper will exclude from the crawl.
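
A sketch of how Allow, Allowed Domains, and Deny might combine to decide whether a discovered link is crawled. The matching semantics used here (regular expressions against the URL path) are an assumption for illustration.

```python
import re
from urllib.parse import urlparse

allowed_domains = {"example.com", "docs.example.com"}  # hypothetical values
allow_patterns = [r"^/blog/", r"^/docs/"]   # crawl only these sections
deny_patterns = [r"^/blog/drafts/"]         # ...except drafts

def should_crawl(url: str) -> bool:
    """Apply Allowed Domains, Allow, and Deny rules to a discovered link."""
    parsed = urlparse(url)
    if parsed.hostname not in allowed_domains:
        return False  # links to other domains are excluded
    if any(re.search(p, parsed.path) for p in deny_patterns):
        return False  # denied paths always lose
    # If allow patterns are set, the path must match at least one of them.
    return not allow_patterns or any(re.search(p, parsed.path) for p in allow_patterns)

print(should_crawl("https://example.com/blog/post-1"))    # True
print(should_crawl("https://example.com/blog/drafts/x"))  # False
print(should_crawl("https://other.com/blog/post-1"))      # False
```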

Monitoring Web Scrapes

You can monitor the progress of your web scrapes in real time from the dashboard. Once you click the “Run” button to start a web scrape, you’ll land on a page showing the progress and status of the process.

Results

Once scraping is complete, the content will be stored as documents in a collection, with each web page being a separate document.