Scraping a Website
Comprehensive guide to web scraping options
There are three web scraping options: you can scrape a single web page by entering its URL, scrape multiple pages via a sitemap, or scrape an entire site (all URLs) using our crawler.
Types of Web Scrapes
Single Web Page
Enter the URL of the single web page you wish to scrape. This will target and extract content from the specific page.
Website Crawl (Multiple Pages)
Initiate a full crawl of a website, scraping multiple linked pages. This option scrapes content from all available pages across the domain.
Sitemap
Provide a sitemap URL. The sitemap will be used to locate and scrape specific pages from the website, as indicated by the sitemap structure.
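For reference, a sitemap is an XML file that lists a site's pages, and each <loc> entry (often with a lastmod date) identifies a page that can be scraped. The sketch below is purely illustrative of how page URLs are read from a sitemap; it assumes the standard sitemaps.org format, uses the requests library, and the sitemap URL is a placeholder:

```python
# Illustrative only: fetch a sitemap and list the page URLs it contains.
# Assumes the standard sitemaps.org XML format; the URL is a placeholder.
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

response = requests.get(SITEMAP_URL, timeout=30)
root = ET.fromstring(response.content)

# Each <url> entry has a <loc> (page URL) and often a <lastmod> date.
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    lastmod = url.findtext("sm:lastmod", namespaces=NS)
    print(loc, lastmod)
```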
Customizing Web Scrapes
There are several inputs and settings to give you full control and customization over the scraping process:
The full URL of the page, domain, or sitemap you want to scrape.
Select the scraper technology. Available options are:
- HTTP - Faster, but may not handle heavily client-side rendered sites well.
- Playwright - Heavier and slower, but renders dynamic content, so it works well with heavily client-side rendered sites.
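To illustrate the difference between the two (a general sketch, not Scout's internal implementation), an HTTP scraper simply fetches the HTML the server returns, while a Playwright scraper drives a headless browser and captures the page after JavaScript has run. The URL below is a placeholder:

```python
# Illustrative comparison of the two approaches; not Scout's internal code.
# Requires the requests and playwright packages (plus "playwright install chromium").
import requests
from playwright.sync_api import sync_playwright

url = "https://example.com"  # placeholder URL

# HTTP: fast, but only sees the HTML the server returns.
raw_html = requests.get(url, timeout=30).text

# Playwright: slower, but returns the DOM after client-side rendering.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url)
    rendered_html = page.content()
    browser.close()

print(len(raw_html), len(rendered_html))
```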
Defines how text is extracted from the page. Options include:
- Custom - Your own custom scraping logic.
- Readability - Scout’s smart scraping logic, which picks out the relevant components of the page.
- Trafilatura - A Python package and command-line tool used to gather text on the web.
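As a rough illustration of the Trafilatura option, the package can be used directly from Python to download a page and extract its main text (the URL below is a placeholder; this is not how Scout invokes it internally):

```python
# Minimal Trafilatura usage: download a page and extract its main text.
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com")  # placeholder URL
text = trafilatura.extract(downloaded)
print(text)
```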
CSS selectors for the HTML elements you want to include in the scraping. For example, targeting specific tags like body, h1, p.
CSS selectors for the HTML elements you want to exclude from the scraping process. For example, elements like .navbar, #header, footer.
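Conceptually, include and exclude selectors behave like the sketch below. It uses BeautifulSoup purely for illustration and is not Scout's implementation; the selectors are example values:

```python
# Illustration of include/exclude CSS selectors; not Scout's internal code.
from bs4 import BeautifulSoup

html = "<html><body><div class='navbar'>Menu</div><h1>Title</h1><p>Text</p><footer>(c)</footer></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Exclude: drop matching elements before extraction.
for selector in [".navbar", "#header", "footer"]:
    for node in soup.select(selector):
        node.decompose()

# Include: keep only the text of matching elements.
included = [el.get_text(" ", strip=True) for el in soup.select("h1, p")]
print(included)  # ['Title', 'Text']
```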
Select this option if you want to automatically remove common elements such as headers and footers that often appear on multiple pages.
Advanced Options
Exclude any pages with a lastmod date older than the specified date. This helps to avoid scraping outdated content.
Sets the maximum depth for the crawl. This limits how deep the scraper will follow links from the original page.
Element selectors that you want stripped from the document.
Defines the maximum number of pages to scrape. Use this to prevent overly large scrapes.
A list of allowed paths or patterns that the scraper will include in the crawl.
Specify which domains are allowed in the crawl. Links to other domains will be excluded unless specified here.
A list of paths or patterns that the scraper will exclude from the crawl.
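Taken together, the crawl settings above (maximum depth, maximum pages, allowed domains, and excluded paths) bound how far a crawl expands. The following is a rough breadth-first sketch of how such limits typically apply; it is purely illustrative, not Scout's crawler, and the setting values are hypothetical placeholders:

```python
# Illustrative breadth-first crawl honoring depth, page, domain, and path limits.
# Not Scout's crawler; the settings and values here are hypothetical placeholders.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 2
MAX_PAGES = 50
ALLOWED_DOMAINS = {"example.com"}          # placeholder domain
EXCLUDED_PATHS = ("/login", "/admin")      # placeholder path patterns

def allowed(url: str) -> bool:
    parsed = urlparse(url)
    return (parsed.netloc in ALLOWED_DOMAINS
            and not any(parsed.path.startswith(p) for p in EXCLUDED_PATHS))

def crawl(start_url: str) -> list[str]:
    seen, pages = set(), []
    queue = deque([(start_url, 0)])
    while queue and len(pages) < MAX_PAGES:
        url, depth = queue.popleft()
        if url in seen or not allowed(url):
            continue
        seen.add(url)
        html = requests.get(url, timeout=30).text
        pages.append(url)
        if depth < MAX_DEPTH:
            for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                queue.append((urljoin(url, link["href"]), depth + 1))
    return pages

print(crawl("https://example.com"))
```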
Monitoring Web Scrapes
You can monitor the progress of your web scrapes in real time from the dashboard. Once you click the “Run” button to start a web scrape, you’ll land on a page showing the progress and status of the process.
Results
Once scraping is complete, the content will be stored as documents in a collection, with each web page being a separate document.