Web Scraping

You’ve got a website full of useful content, and you want your AI app to know about it. Scout’s web scraping lets you pull that content directly into a Collection, where it gets indexed for semantic search and RAG.

Overview

Scout lets you set up web scraping sources to populate your Collections with documents from the web. Before getting started, make sure you’ve created a Collection.

Web Scraping is one source type in Scout. For source setup, scheduling, and run history across all source types, see Sources.

There are three web scraping options:

Single Page: Scrape a single web page by URL
Website Crawl: Crawl an entire site, following links across multiple pages
Sitemap: Use a sitemap to discover and scrape specific pages

Types of Web Scrapes

Single Web Page

Enter the URL of a single web page you want to scrape. This option pulls content from one specific page, which makes it a good fit for importing individual articles, documentation pages, or landing pages.

Best for: One-off pages, specific articles, documentation pages

Example use cases:

Import a single blog post
Extract content from a product page
Archive a specific documentation article

Website Crawl (Multiple Pages)

Start a full crawl of a website. Scout automatically follows links to discover and scrape content from multiple pages. The crawler respects robots.txt and stays within the configured depth and domain limits.
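
To make the depth and domain limits concrete, here is a minimal sketch of a bounded breadth-first crawl. It is not Scout's crawler: the `crawl_order` function and the in-memory `links` map are hypothetical stand-ins for fetching and parsing real pages, robots.txt handling is left out, and how Scout counts depth internally is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_url, links, max_depth=5, max_pages=3000, allowed_domains=None):
    """Breadth-first walk over a link graph, bounded by depth, page count, and domain.

    `links` maps a URL to the URLs found on that page (hypothetical test data).
    """
    if allowed_domains is None:
        # Mirror the documented default: stay on the starting URL's domain.
        allowed_domains = {urlparse(start_url).netloc}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # page is kept, but its links are not followed
        for nxt in links.get(url, []):
            if nxt in seen or urlparse(nxt).netloc not in allowed_domains:
                continue
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return visited


site = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
}
print(crawl_order("https://example.com/", site, max_depth=2))
```

With `max_depth=2` the walk stops at `/b`: `/c` is three links away from the start, and `https://other.org/x` is skipped because it is on another domain.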

Best for: Complete sites, documentation portals, knowledge bases

Example use cases:

Import an entire documentation site
Crawl a knowledge base
Archive a company blog

Sitemap

Provide a sitemap URL (e.g., https://example.com/sitemap.xml). Scout parses the sitemap and scrapes all listed pages, giving you precise control over which pages are included.
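
If you want to preview which pages a sitemap lists before pointing Scout at it, a short script along these lines can help. This is a sketch using Python's standard library on an inline sample; the `sitemap_urls` helper is hypothetical, and it only handles a plain `<urlset>` sitemap, not a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc>https://example.com/blog/second-post</loc></url>
</urlset>"""

for url in sitemap_urls(sample):
    print(url)
```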

Best for: Large sites with organized sitemaps, targeting specific sections

Example use cases:

Import all blog posts via /blog/sitemap.xml
Scrape product pages from a sitemap
Target specific content categories

Configuration Options

Basic Settings

Setting	Required	Default	Description
URL	Yes		The full URL of the page, domain, or sitemap to scrape
Scraper	No	Http	Scraper technology (Http or Playwright)
Text Extractor	No	Readability	Text extraction method

Scraper Options

Http: Faster and more efficient, ideal for statically rendered pages. May not work well with heavily client-side rendered sites (SPAs).
Playwright: Renders JavaScript and dynamic content. Slower but handles modern web applications, SPAs, and sites requiring JavaScript execution.

Not sure which scraper to use? Start with Http. If pages come back empty or with only partial content, switch to Playwright. The most common cause of empty scrape results is a JavaScript-rendered site being scraped with the Http scraper.
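
The same rule of thumb can be checked offline against a page's raw HTML. The sketch below is a rough heuristic, not something Scout runs: `suggest_scraper` and the 200-character threshold are assumptions chosen for illustration. It counts the visible text in the static HTML and suggests Playwright when there is almost none.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside of script, style, noscript, and template elements."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html):
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def suggest_scraper(html, min_chars=200):
    # min_chars is an arbitrary cutoff; tune it for your own pages.
    return "Http" if visible_text_length(html) >= min_chars else "Playwright"


spa_shell = '<html><body><div id="root"></div><script>renderApp()</script></body></html>'
static_page = "<html><body><article><p>" + "Scout docs. " * 30 + "</p></article></body></html>"

print(suggest_scraper(spa_shell))
print(suggest_scraper(static_page))
```

The first page is an empty application shell, so the heuristic points to Playwright; the second already contains its text in the HTML, so Http is enough.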

Text Extractor Options

Readability: Scout’s smart extraction logic that identifies and pulls relevant content while removing navigation, ads, and clutter.
Trafilatura: A Python-based extractor focused on main content extraction from web pages.

Advanced Settings

Setting	Default	Description
Allow		Comma-separated list of URL patterns to include. Only matching URLs will be crawled.
Deny		Comma-separated list of URL patterns to exclude. Matching URLs will be skipped.
Exclude Patterns		Comma-separated list of regex patterns to exclude. Example: `/private/, login.html$` skips URLs containing ‘/private/’ or ending with ‘login.html’.
Strip		Comma-separated list of HTML tags to remove from content.
Strip URLs	true	Normalizes URLs by removing query parameters, fragments, and trailing slashes.
Allowed Domains	Source domain	Limits crawling to specific domains. Defaults to the starting URL’s domain.
Max Depth	5	Maximum crawl depth from the starting page. Higher values crawl more links.
Max Page Count	3000	Maximum number of pages to scrape. Prevents runaway crawls.
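
To see what Strip URLs and Exclude Patterns do to a given URL, you can approximate both locally. This is a sketch under assumptions: `strip_url` and `is_excluded` are hypothetical helpers, the patterns are interpreted with Python's `re.search`, and Scout's own matching rules may differ in detail.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def strip_url(url):
    """Drop the query string, the fragment, and any trailing slash."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


def is_excluded(url, patterns):
    """Return True if any comma-separated regex in `patterns` matches the URL."""
    compiled = [re.compile(p.strip()) for p in patterns.split(",") if p.strip()]
    return any(rx.search(url) for rx in compiled)


patterns = "/private/, login.html$"

print(strip_url("https://example.com/docs/?page=2#top"))
print(is_excluded("https://example.com/private/notes", patterns))
print(is_excluded("https://example.com/account/login.html", patterns))
print(is_excluded("https://example.com/docs/intro", patterns))
```

Testing patterns this way before a crawl is cheaper than discovering after a long run that a pattern matched nothing.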

Setting Up a Web Scrape

Step 1: Create a Collection

First, create a Collection and Table to store the scraped content. See Creating Collections for detailed instructions.

Step 2: Add a Web Scraping Source

Navigate to your Collection in the Scout dashboard
Click Sources, then Add Source
Select Web Scrape from the source options

Step 3: Configure the Scrape

Enter your URL and choose your scraping options:

For a single page:


URL: https://docs.scoutos.com/docs/overview
Scraper: Http
Text Extractor: Readability

For a full site crawl:


URL: https://docs.scoutos.com
Scraper: Http
Max Depth: 3
Max Page Count: 500
Exclude Patterns: /api/, /changelog/

For a sitemap:


URL: https://docs.scoutos.com/sitemap.xml
Scraper: Http
Text Extractor: Readability
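
If you keep scrape settings in version control or review them before entering them in the dashboard, it can help to write them down as data. The sketch below is not a Scout API: `WebScrapeConfig` is a hypothetical class whose field names are made up for illustration, with defaults copied from the settings tables above.

```python
from dataclasses import dataclass, field


@dataclass
class WebScrapeConfig:
    """Hypothetical record of the dashboard settings, with the documented defaults."""

    url: str
    scraper: str = "Http"
    text_extractor: str = "Readability"
    strip_urls: bool = True
    max_depth: int = 5
    max_page_count: int = 3000
    exclude_patterns: list = field(default_factory=list)

    def problems(self):
        """Return a list of human-readable issues; empty means the config looks sane."""
        issues = []
        if not self.url.startswith(("http://", "https://")):
            issues.append("URL must be a full http(s) URL")
        if self.scraper not in ("Http", "Playwright"):
            issues.append("Scraper must be Http or Playwright")
        if self.text_extractor not in ("Readability", "Trafilatura"):
            issues.append("Text Extractor must be Readability or Trafilatura")
        if self.max_depth < 0:
            issues.append("Max Depth cannot be negative")
        if self.max_page_count < 1:
            issues.append("Max Page Count must be at least 1")
        return issues


# The full-site crawl example from this step.
crawl = WebScrapeConfig(
    url="https://docs.scoutos.com",
    max_depth=3,
    max_page_count=500,
    exclude_patterns=["/api/", "/changelog/"],
)
print(crawl.problems())
```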

Step 4: Run the Scrape

Click Run to start the scraping process. The dashboard shows real-time progress:

Pages discovered
Pages scraped
Errors encountered
Estimated completion

Monitoring Web Scrapes

Track your web scrapes from the Sources tab in your Collection. Each run shows:

Status: Running, completed, or failed
Progress: Pages scraped vs. total discovered
Errors: Any failed pages with error details
Duration: Time elapsed

Click on any scrape run to see detailed logs and per-page results. If individual pages failed, the logs show the specific error for each URL.

Results

Once scraping is complete:

Documents created: Each web page becomes a separate document in your table
Text indexed: Page content is automatically embedded and indexed for semantic search
Metadata preserved: URL, title, and other metadata are stored with each document

Sample Document Structure


{
  "id": "doc_abc123",
  "text": "Extracted content from the web page...",
  "metadata": {
    "url": "https://example.com/page",
    "title": "Page Title",
    "scraped_at": "2025-02-26T10:00:00Z"
  }
}
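
As an illustration of working with this structure, the sketch below parses the sample document and pulls out the fields you would typically show next to a search result. The `citation` and `parse_scraped_at` helpers are hypothetical; only the field names come from the sample above.

```python
import json
from datetime import datetime, timezone

raw = """
{
  "id": "doc_abc123",
  "text": "Extracted content from the web page...",
  "metadata": {
    "url": "https://example.com/page",
    "title": "Page Title",
    "scraped_at": "2025-02-26T10:00:00Z"
  }
}
"""


def parse_scraped_at(doc):
    # Replace the trailing "Z" so fromisoformat also works on Python versions before 3.11.
    return datetime.fromisoformat(doc["metadata"]["scraped_at"].replace("Z", "+00:00"))


def citation(doc):
    """Format a short source label from the document's metadata."""
    meta = doc["metadata"]
    return f'{meta["title"]} ({meta["url"]})'


doc = json.loads(raw)
print(citation(doc))
print(parse_scraped_at(doc).astimezone(timezone.utc).date())
```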

Best Practices

Optimize Crawl Scope

Start small: Begin with a low Max Depth (2-3) to test your configuration
Use patterns: Exclude /admin/, /login/, and other unnecessary paths
Set limits: Set Max Page Count explicitly rather than relying on the default of 3000, to prevent unexpectedly large crawls

Choose the Right Scraper

Use Http for static sites, blogs, documentation
Use Playwright for SPAs, React/Vue apps, sites with dynamic content

Content Quality

Readability usually produces cleaner results for articles and documentation
Trafilatura may work better for news articles and blog posts
Use Strip to remove unwanted HTML elements like <nav>, <footer>, <aside>
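
To get a feel for what the Strip setting removes, the following sketch drops the text inside a comma-separated list of tags. It is an approximation built on Python's standard `html.parser`, not Scout's extractor; `TagStripper` and `text_without` are hypothetical names.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text that is not nested inside any of the given tags."""

    def __init__(self, strip_tags):
        super().__init__()
        self.strip_tags = set(strip_tags)
        self._depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.strip_tags:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.strip_tags and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self.parts.append(data.strip())


def text_without(html, strip="nav, footer, aside"):
    tags = [t.strip().lower() for t in strip.split(",") if t.strip()]
    parser = TagStripper(tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<nav>Home | Docs</nav>"
    "<main><h1>Install</h1><p>Run the installer.</p></main>"
    "<footer>Example Inc.</footer>"
)
print(text_without(page))
```

Only the heading and paragraph inside `<main>` survive; the navigation and footer text are dropped.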

Performance Tips

Batch large sites: Use sitemaps to break up large crawls into smaller batches
Exclude media: Add image/video URLs to exclude patterns
Respect rate limits: Scout handles rate limiting automatically, but be mindful of target site resources
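
For the "Exclude media" tip, one way to avoid listing every file by hand is a single pattern that matches common media extensions. The sketch below builds and tests such a pattern with Python's `re` module; `media_exclude_pattern` is a hypothetical helper, and whether Scout's regex dialect accepts this exact syntax is an assumption worth checking on a small crawl first.

```python
import re


def media_exclude_pattern(extensions):
    """Build one regex that matches URLs ending in any of the given file extensions."""
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    # The optional (\?.*)? part tolerates a query string after the file name.
    return r"\.(" + alternatives + r")(\?.*)?$"


pattern = media_exclude_pattern(["png", "jpg", "mp4"])
print(pattern)

for url in [
    "https://example.com/img/logo.png",
    "https://example.com/video/intro.mp4?autoplay=1",
    "https://example.com/docs/png-guide",
]:
    print(url, bool(re.search(pattern, url)))
```

The pattern contains no commas, so it can sit in a comma-separated pattern list without being split.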

Troubleshooting

Content Not Extracted

If pages show empty content:

Switch to Playwright (the page likely uses client-side rendering)
Check if the page requires authentication
Verify the page is publicly accessible

Too Many Pages Discovered

If Max Page Count is hit unexpectedly:

Add URL patterns to Deny or Exclude Patterns
Lower Max Depth
Use a sitemap for precise control
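
Before adding patterns, it helps to know which parts of the site account for most of the discovered pages. The sketch below assumes you have the discovered URLs as a plain list, for example copied from the run logs, and groups them by first path segment; `top_sections` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_sections(urls, n=5):
    """Count URLs per first path segment and return the n largest groups."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        # The key is a grouping label, e.g. "/tags/"; the site root is counted as "/".
        counts["/" + segments[0] + "/" if segments else "/"] += 1
    return counts.most_common(n)


discovered = [
    "https://example.com/docs/intro",
    "https://example.com/tags/a",
    "https://example.com/tags/b",
    "https://example.com/tags/c",
    "https://example.com/",
]

for section, count in top_sections(discovered):
    print(section, count)
```

Here `/tags/` dominates, which makes it the first candidate for Deny or Exclude Patterns.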

Slow Crawls

If crawling is slower than expected:

Switch to Http if the site does not need JavaScript rendering; it is faster than Playwright
Reduce Max Depth to limit scope
Check your target site’s response times

Next Steps

Creating Collections: Set up tables for your scraped data
Sources: Manage schedules and run history for all source types
Querying Data: Search your scraped content with semantic search