Web Scraping

You’ve got a website full of useful content, and you want your AI app to know about it. Scout’s web scraping lets you pull that content directly into a Collection, where it gets indexed for semantic search and RAG.

Overview

Scout lets you set up web scraping sources to populate your Collections with documents from the web. Before getting started, make sure you’ve created a Collection.

Web Scraping is one source type in Scout. For source setup, scheduling and run history across all source types, see Sources.

There are three web scraping options:

  • Single Page: Scrape a single web page by URL
  • Website Crawl: Crawl an entire site, following links across multiple pages
  • Sitemap: Use a sitemap to discover and scrape specific pages

Types of Web Scrapes

Single Web Page

Enter the URL of a single web page you want to scrape. This option pulls content from one specific page, perfect for importing individual articles, documentation pages, or landing pages.

Best for: One-off pages, specific articles, documentation pages

Example use cases:

  • Import a single blog post
  • Extract content from a product page
  • Archive a specific documentation article

Website Crawl (Multiple Pages)

Start a full crawl of a website. Scout automatically follows links to discover and scrape content from multiple pages. The crawler respects robots.txt and stays within the configured depth and domain limits.

Best for: Complete sites, documentation portals, knowledge bases

Example use cases:

  • Import an entire documentation site
  • Crawl a knowledge base
  • Archive a company blog
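The depth and domain limits described above can be pictured as a breadth-first traversal. The sketch below is not Scout's crawler. It walks a hypothetical in-memory link graph (`LINKS`) instead of fetching pages, and the function name `crawl` and its parameters are illustrative assumptions.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages: URL -> links found on it.
LINKS = {
    "https://example.com/": ["https://example.com/docs", "https://other.org/"],
    "https://example.com/docs": ["https://example.com/docs/a"],
    "https://example.com/docs/a": ["https://example.com/docs/a/deep"],
    "https://example.com/docs/a/deep": [],
}


def crawl(start_url, max_depth, max_pages):
    """Breadth-first traversal limited by depth, page count, and start domain."""
    allowed_domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if urlparse(link).netloc != allowed_domain:
                continue  # stay on the starting URL's domain
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("https://example.com/", max_depth=2, max_pages=10)
print(pages)
```

With `max_depth=2`, the page three links away from the start (`/docs/a/deep`) is never visited, and the link to `other.org` is skipped because it is on a different domain.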

Sitemap

Provide a sitemap URL (e.g., https://example.com/sitemap.xml). Scout parses the sitemap and scrapes all listed pages, giving you precise control over which pages are included.

Best for: Large sites with organized sitemaps, targeting specific sections

Example use cases:

  • Import all blog posts via /blog/sitemap.xml
  • Scrape product pages from a sitemap
  • Target specific content categories
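To preview which pages a sitemap would contribute before pointing Scout at it, you can list its URLs yourself. This is a minimal sketch using Python's standard library and the sitemaps.org namespace. `SAMPLE_SITEMAP` and `sitemap_urls` are illustrative names, and the sketch does not handle sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; in practice you would download the real file.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc>https://example.com/blog/second-post</loc></url>
  <url><loc>https://example.com/products/widget</loc></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap <urlset>."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    ]


all_urls = sitemap_urls(SAMPLE_SITEMAP)
blog_urls = [u for u in all_urls if "/blog/" in u]
print(len(all_urls), "pages listed,", len(blog_urls), "under /blog/")
```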

Configuration Options

Basic Settings

| Setting | Required | Default | Description |
|---|---|---|---|
| URL | Yes | | The full URL of the page, domain, or sitemap to scrape |
| Scraper | No | Http | Scraper technology (Http or Playwright) |
| Text Extractor | No | Readability | Text extraction method |

Scraper Options

  • Http: Faster and more efficient, ideal for statically rendered pages. May not work well with heavily client-side rendered sites (SPAs).
  • Playwright: Renders JavaScript and dynamic content. Slower but handles modern web applications, SPAs, and sites requiring JavaScript execution.

Not sure which scraper to use? Start with Http. If pages come back empty or with only partial content, switch to Playwright. The most common cause of empty scrape results is a JavaScript-rendered site being scraped with the Http scraper.
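One rough way to anticipate this before running a scrape is to look at how much visible text the raw HTML contains without JavaScript. The sketch below is a heuristic, not something Scout does. The function names and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html):
    """Number of characters of text a non-JavaScript client would see."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def probably_needs_playwright(html, min_chars=200):
    """Heuristic: very little static text suggests client-side rendering."""
    return visible_text_length(html) < min_chars


# A typical SPA shell: an empty mount point plus a script bundle.
spa_shell = '<html><body><div id="root"></div><script>renderApp();</script></body></html>'
# A statically rendered page with real text in the markup.
static_page = "<html><body><article>" + "Scout docs paragraph. " * 20 + "</article></body></html>"

print(probably_needs_playwright(spa_shell))    # True
print(probably_needs_playwright(static_page))  # False
```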

Text Extractor Options

  • Readability: Scout’s smart extraction logic that identifies and pulls relevant content while removing navigation, ads, and clutter.
  • Trafilatura: A Python-based extractor focused on main content extraction from web pages.

Advanced Settings

| Setting | Default | Description |
|---|---|---|
| Allow | | Comma-separated list of URL patterns to include. Only matching URLs will be crawled. |
| Deny | | Comma-separated list of URL patterns to exclude. Matching URLs will be skipped. |
| Exclude Patterns | | Regex patterns to exclude. Example: /private/, login.html$ skips URLs containing ‘/private/’ or ending with ‘login.html’ |
| Strip | | Comma-separated list of HTML tags to remove from content. |
| Strip URLs | true | Normalizes URLs by removing query parameters, fragments, and trailing slashes. |
| Allowed Domains | Source domain | Limits crawling to specific domains. Defaults to the starting URL’s domain. |
| Max Depth | 5 | Maximum crawl depth from the starting page. Higher values crawl more links. |
| Max Page Count | 3000 | Maximum number of pages to scrape. Prevents runaway crawls. |
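To make the Strip URLs and Exclude Patterns settings concrete, here is a small sketch of the behavior described in the table. It is an approximation for illustration, not Scout's implementation: `strip_url` and `is_excluded` are hypothetical helpers, and matching the regex anywhere in the URL is an assumption.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def strip_url(url):
    """Drop the query string and fragment, then any trailing slash on the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def is_excluded(url, exclude_patterns):
    """True if any regex in exclude_patterns matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in exclude_patterns)


# The example patterns from the Exclude Patterns row.
patterns = ["/private/", "login.html$"]

candidates = [
    "https://example.com/docs/?page=2#intro",
    "https://example.com/docs",
    "https://example.com/private/notes",
    "https://example.com/account/login.html",
]

# Normalizing first collapses the two /docs variants into one URL.
to_scrape = sorted(
    {strip_url(u) for u in candidates if not is_excluded(u, patterns)}
)
print(to_scrape)  # ['https://example.com/docs']
```

Normalizing before deduplicating is what keeps a crawl from counting `/docs`, `/docs/`, and `/docs?page=2` as three separate pages.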

Setting Up a Web Scrape

Step 1: Create a Collection

First, create a Collection and Table to store the scraped content. See Creating Collections for detailed instructions.

Step 2: Add a Web Scraping Source

  1. Navigate to your Collection in the Scout dashboard
  2. Click Sources, then Add Source
  3. Select Web Scrape from the source options

Step 3: Configure the Scrape

Enter your URL and choose your scraping options:

For a single page:

    URL: https://docs.scoutos.com/docs/overview
    Scraper: Http
    Text Extractor: Readability

For a full site crawl:

    URL: https://docs.scoutos.com
    Scraper: Http
    Max Depth: 3
    Max Page Count: 500
    Exclude Patterns: /api/, /changelog/

For a sitemap:

    URL: https://docs.scoutos.com/sitemap.xml
    Scraper: Http
    Text Extractor: Readability

Step 4: Run the Scrape

Click Run to start the scraping process. The dashboard shows real-time progress:

  • Pages discovered
  • Pages scraped
  • Errors encountered
  • Estimated completion

Monitoring Web Scrapes

Track your web scrapes from the Sources tab in your Collection. Each run shows:

  • Status: Running, completed, or failed
  • Progress: Pages scraped vs. total discovered
  • Errors: Any failed pages with error details
  • Duration: Time elapsed

Click on any scrape run to see detailed logs and per-page results. If individual pages failed, the logs show the specific error for each URL.

Results

Once scraping is complete:

  1. Documents created: Each web page becomes a separate document in your table
  2. Text indexed: Page content is automatically embedded and indexed for semantic search
  3. Metadata preserved: URL, title, and other metadata are stored with each document

Sample Document Structure

    {
      "id": "doc_abc123",
      "text": "Extracted content from the web page...",
      "metadata": {
        "url": "https://example.com/page",
        "title": "Page Title",
        "scraped_at": "2025-02-26T10:00:00Z"
      }
    }
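If you process documents of this shape in your own code, the fields can be read with the standard library. The sketch below assumes only the sample structure shown above. How you retrieve documents from Scout is out of scope here, and `summarize` is a hypothetical helper.

```python
import json
from datetime import datetime

# The sample document from above, as a JSON string.
raw = """
{
  "id": "doc_abc123",
  "text": "Extracted content from the web page...",
  "metadata": {
    "url": "https://example.com/page",
    "title": "Page Title",
    "scraped_at": "2025-02-26T10:00:00Z"
  }
}
"""


def summarize(doc):
    """Return (title, url, scrape time) for one scraped document."""
    meta = doc["metadata"]
    # Replace the trailing "Z" so fromisoformat also works on Python < 3.11.
    scraped_at = datetime.fromisoformat(meta["scraped_at"].replace("Z", "+00:00"))
    return meta["title"], meta["url"], scraped_at


doc = json.loads(raw)
title, url, scraped_at = summarize(doc)
print(f"{title} ({url}) scraped on {scraped_at.date()}")
```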

Best Practices

Optimize Crawl Scope

  • Start small: Begin with a low Max Depth (2-3) to test your configuration
  • Use patterns: Exclude /admin/, /login/, and other unnecessary paths
  • Set limits: Always set Max Page Count to prevent unexpected large crawls

Choose the Right Scraper

  • Use Http for static sites, blogs, documentation
  • Use Playwright for SPAs, React/Vue apps, sites with dynamic content

Content Quality

  • Readability usually produces cleaner results for articles and documentation
  • Trafilatura may work better for news articles and blog posts
  • Use Strip to remove unwanted HTML elements like <nav>, <footer>, <aside>
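To get a feel for what stripping tags does to a page, the sketch below drops everything inside a given set of tags and keeps the remaining text. It only illustrates the idea. Scout's actual Strip behavior may differ, and `TagStripper` and `text_without` are hypothetical names.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keeps text that is not nested inside any of the stripped tags."""

    def __init__(self, strip_tags):
        super().__init__()
        self.strip_tags = set(strip_tags)
        self.depth = 0  # how many stripped tags we are currently inside
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if tag in self.strip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.strip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.kept.append(data.strip())


def text_without(html, strip_tags):
    """Return the page text with the contents of strip_tags removed."""
    parser = TagStripper(strip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.kept)


page = (
    "<nav>Home | Docs</nav>"
    "<main><p>Install Scout.</p></main>"
    "<footer>Copyright</footer>"
)
print(text_without(page, ["nav", "footer", "aside"]))  # Install Scout.
```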

Performance Tips

  1. Batch large sites: Use sitemaps to break up large crawls into smaller batches
  2. Exclude media: Add patterns for image and video URLs to Exclude Patterns
  3. Respect rate limits: Scout handles rate limiting automatically, but be mindful of target site resources

Troubleshooting

Content Not Extracted

If pages show empty content:

  • Switch to Playwright (the page likely uses client-side rendering)
  • Check if the page requires authentication
  • Verify the page is publicly accessible

Too Many Pages Discovered

If Max Page Count is hit unexpectedly:

  • Add URL patterns to Deny or Exclude Patterns
  • Lower Max Depth
  • Use a sitemap for precise control

Slow Crawls

If crawling is slower than expected:

  • Http is faster than Playwright; switch to it if the site does not need JavaScript rendering
  • Reduce Max Depth to limit scope
  • Check your target site’s response times
