Web Scraping: Sync Public Web Content into Scout Databases

Web scraping lets you ingest public web content directly into a Scout Database so your agents can search and reason over it. Use it to keep agents current with content that lives on the web — competitor sites, news, product pages, and external documentation. Each scraped page becomes a document that is embedded and indexed alongside the rest of your Database data.

Create a Database with at least one Table before setting up a web scraping Source. Scraped pages are written as documents into the Table you select.

Types of Web Scrapes

Scout offers three ways to scrape content, each suited to a different scope.

Single Page

Scrape one page by its URL. Best for one-off pages, specific articles, and individual documentation pages, such as a single blog post or product page.

Website Crawl

Start from a URL and automatically follow internal links across the site, respecting robots.txt. Best for complete sites, documentation portals, and knowledge bases.
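The crawl behavior described above can be pictured as a breadth-first walk over internal links that skips anything robots.txt disallows. The sketch below is illustrative only and is not Scout's implementation: `crawl`, `link_graph` (a stand-in for fetching a page and extracting its links), and the example URLs are all assumptions made for the demonstration.

```python
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def crawl(start_url, link_graph, robots_txt, max_depth=3, max_pages=500):
    """Breadth-first crawl over internal links that robots.txt allows.

    link_graph is a hypothetical stand-in for fetching a page and
    extracting its links: it maps each URL to the URLs it links to.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    site = urlparse(start_url).netloc

    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # disallowed by robots.txt
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for link in link_graph.get(url, []):
            # Only follow links that stay on the starting domain.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


robots_txt = "User-agent: *\nDisallow: /private/"
link_graph = {
    "https://example.com/": [
        "https://example.com/docs",
        "https://example.com/private/admin",
        "https://other.example.org/",
    ],
    "https://example.com/docs": ["https://example.com/docs/intro"],
}
pages = crawl("https://example.com/", link_graph, robots_txt, max_depth=1)
print(pages)
```

With a depth limit of 1, the walk keeps the start page and `/docs`, skips the disallowed `/private/admin` page and the external domain, and never reaches `/docs/intro`.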

Sitemap

Parse a sitemap URL and scrape the pages it lists. Best for large sites with an organized sitemap where you want precise control over which pages are ingested.
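To preview which pages a Sitemap scrape would cover, you can list the `<loc>` entries of a sitemap yourself. This is a minimal sketch using Python's standard library and the standard sitemap XML namespace; the function name `urls_from_sitemap` and the sample document are illustrative, and how Scout parses sitemaps internally is not specified here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the <loc> URL of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/overview</loc></url>
  <url><loc>https://example.com/docs/quickstart</loc></url>
</urlset>"""

sitemap_urls = urls_from_sitemap(sample)
print(sitemap_urls)
```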

Setting Up a Web Scrape

Create a Database and Table

Set up a Database and a Table to hold the scraped content. Make sure the Table has a content column so extracted text is embedded for semantic search.

Add a Web Scraping Source

Open your Database, select the target Table, and go to the Sources tab. Click Add Source and choose Web Scrape.

Configure the scrape

Enter the URL and choose your scrape type and options. A few common configurations:

Single page

URL: https://docs.scoutos.com/docs/overview
Scraper: Http
Text Extractor: Readability

Full site crawl

URL: https://docs.scoutos.com
Scraper: Http
Max Depth: 3
Max Page Count: 500
Exclude Patterns: /api/, /changelog/

Sitemap

URL: https://docs.scoutos.com/sitemap.xml
Scraper: Http
Text Extractor: Readability

Run the scrape

Click Run Now to start ingestion. The dashboard reports pages discovered, pages scraped, errors encountered, and an estimated time to completion.

Configuration Options

Basic Settings

Setting	Default	Description
URL (required)	(none)	The full URL of the page to scrape, the site to crawl, or the sitemap to parse.
Scraper	`Http`	How pages are fetched — `Http` or `Playwright`.
Text Extractor	`Readability`	How the main content is extracted from each page.

Scraper Options

Scraper	Best For	Trade-offs
Http	Static pages and server-rendered HTML	Fast, but struggles with single-page apps (SPAs) and JavaScript-rendered content.
Playwright	SPAs and dynamic, JavaScript-rendered content	Renders the page in a browser before extracting, so it is slower.

Start with the Http scraper. The most common cause of empty results is a JavaScript-rendered site being scraped with Http — if pages come back empty, switch to Playwright.
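One rough way to tell in advance whether a page needs Playwright is to check how much visible text its raw HTML contains before any JavaScript runs. The sketch below is a heuristic under stated assumptions, not a Scout feature: `looks_js_rendered` and the 200-character threshold are hypothetical choices for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def looks_js_rendered(html, min_chars=200):
    """Return True when the static HTML carries very little visible text.

    min_chars is an arbitrary threshold chosen for this example.
    """
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.parts)) < min_chars


spa_shell = '<html><body><div id="root"></div><script>renderApp()</script></body></html>'
static_page = "<html><body><p>" + "Scout documentation text. " * 20 + "</p></body></html>"

print(looks_js_rendered(spa_shell))
print(looks_js_rendered(static_page))
```

A page that is only an empty root element plus a script bundle is flagged, while a page with its content already in the HTML is not.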

Text Extractor Options

Extractor	Description	Best For
Readability	Scout’s smart extraction that removes navigation, ads, and other clutter.	Documentation and most general pages.
Trafilatura	A Python-based extractor focused on the main body content.	News articles and blog posts.

Advanced Settings

Setting	Default	Description
Allow	(none)	Comma-separated URL patterns to include.
Deny	(none)	Comma-separated URL patterns to exclude.
Exclude Patterns	(none)	Regex patterns to exclude matching URLs.
Strip	(none)	Comma-separated HTML tags to remove before extraction.
Strip URLs	`true`	Remove URLs from the extracted text.
Allowed Domains	Source domain	Domains the crawler is permitted to follow links into.
Max Depth	`5`	How many link levels deep a crawl will follow from the starting URL.
Max Page Count	`3000`	The maximum number of pages a single scrape will ingest.
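To see how Exclude Patterns and Max Page Count narrow a scrape, the sketch below applies regex exclusions and a page cap to a list of candidate URLs. It assumes a pattern excludes a URL when it matches anywhere in it; Scout's exact matching rules are not documented here, and `select_urls` and the example URLs are hypothetical.

```python
import re


def select_urls(urls, exclude_patterns=(), max_page_count=3000):
    """Drop URLs matching any exclude regex, then stop at the page cap.

    Assumption: a pattern excludes a URL if it matches anywhere in it.
    """
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    kept = []
    for url in urls:
        if any(pattern.search(url) for pattern in compiled):
            continue  # excluded by a pattern
        kept.append(url)
        if len(kept) >= max_page_count:
            break  # page cap reached
    return kept


candidates = [
    "https://example.com/docs/overview",
    "https://example.com/api/reference",
    "https://example.com/changelog/2025",
    "https://example.com/docs/agents",
]
selected = select_urls(candidates, exclude_patterns=["/api/", "/changelog/"], max_page_count=500)
print(selected)
```

Testing patterns this way against a handful of real URLs before running a crawl helps confirm that an exclusion catches what you intend and nothing more.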

Monitoring a Scrape

Track running and completed scrapes from the Sources tab. For each scrape you can see:

Field	Meaning
Status	Whether the scrape is running, completed, or failed.
Progress	Pages discovered versus pages scraped.
Errors	Pages that could not be fetched or extracted.
Duration	How long the scrape has been running or took to finish.

Open a scrape to view detailed logs and per-page results, including which URLs succeeded and which returned errors.

Results

Each scraped page becomes a separate document in your Table. The extracted text is embedded and indexed for semantic search, and page metadata such as the URL and title is preserved for filtering and attribution.

{
  "id": "doc_abc123",
  "text": "Extracted content from the web page...",
  "metadata": {
    "url": "https://example.com/page",
    "title": "Page Title",
    "scraped_at": "2025-02-26T10:00:00Z"
  }
}

Scout deduplicates scraped pages by URL. Re-running a scrape updates existing documents in place rather than creating duplicates, so you can refresh content on a schedule without bloating your Table.
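The effect of URL-based deduplication can be modeled as an upsert keyed on `metadata.url`. This is a conceptual sketch only: the in-memory `table` dictionary and the `upsert_by_url` helper are assumptions for illustration and do not reflect how Scout stores documents.

```python
def upsert_by_url(table, scraped_docs):
    """Insert or replace documents in `table`, keyed by their page URL.

    `table` is a plain dict standing in for a Table (an assumption).
    """
    for doc in scraped_docs:
        table[doc["metadata"]["url"]] = doc
    return table


table = {}
first_run = [{
    "text": "Original content",
    "metadata": {"url": "https://example.com/page", "title": "Page Title"},
}]
second_run = [{
    "text": "Updated content",
    "metadata": {"url": "https://example.com/page", "title": "Page Title"},
}]

upsert_by_url(table, first_run)
upsert_by_url(table, second_run)
print(len(table), table["https://example.com/page"]["text"])
```

After both runs the table still holds a single document for the page, now carrying the refreshed text.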

Best Practices

Optimize scope. Start with a low Max Depth, exclude paths you don’t need with Deny and Exclude Patterns, and set a Max Page Count so a crawl can’t run away.
Choose the right scraper. Use Http for static sites and Playwright only when content is JavaScript-rendered — Http is significantly faster.
Match the extractor to the content. Readability works best for documentation; Trafilatura is better for news and blogs. Use Strip to remove unwanted elements.
Batch large sites. For big sites, scrape from a sitemap rather than a deep crawl for more predictable coverage, and exclude media-heavy paths. Scout handles rate limiting automatically.

Troubleshooting

Pages come back empty

The site is likely JavaScript-rendered. Switch the Scraper from Http to Playwright. Also confirm the page is publicly accessible and doesn’t require authentication.

The crawl ingests too many pages

Narrow the scope with Deny and Exclude Patterns, lower Max Depth, or switch to a Sitemap scrape for precise control over which pages are included.

The crawl is slow

Use the Http scraper instead of Playwright where possible, reduce Max Depth, and check the target site’s response times — slow origins slow the whole crawl.

Next Steps

Creating Databases

Design your Table schema before configuring a web scrape.

Sources

See all the ways to sync data into Databases, including Notion and Google Sheets.

Querying Data

Search scraped content with semantic, keyword, and hybrid modes.
