Web scraping lets you ingest public web content directly into a Scout Database so your agents can search and reason over it. Use it to keep agents current with content that lives on the web — competitor sites, news, product pages, and external documentation. Each scraped page becomes a document that is embedded and indexed alongside the rest of your Database data.
Create a Database with at least one Table before setting up a web scraping Source. Scraped pages are written as documents into the Table you select.

Types of Web Scrapes

Scout offers three ways to scrape content, each suited to a different scope.

Single Page

Scrape one page by its URL. Best for one-off pages, specific articles, and individual documentation pages, such as a single blog post or product page.

Website Crawl

Start from a URL and automatically follow internal links across the site, respecting robots.txt. Best for complete sites, documentation portals, and knowledge bases.
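The text above says a crawl respects robots.txt but does not describe how. As a rough illustration of what such a check involves, the sketch below uses Python's standard `urllib.robotparser` on an inline rule set. The rules, the `ExampleCrawler` user agent, and the URLs are invented for this example and are not Scout's actual values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this file
# from the site root before following any links.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleCrawler" is a made-up user agent name.
allowed = parser.can_fetch("ExampleCrawler", "https://example.com/docs/intro")
blocked = parser.can_fetch("ExampleCrawler", "https://example.com/private/notes")
print(allowed, blocked)
```

A crawler that honors these rules would skip any link for which `can_fetch` returns `False`.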

Sitemap

Parse a sitemap URL and scrape the pages it lists. Best for large sites with an organized sitemap where you want precise control over which pages are ingested.
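To preview which pages a Sitemap scrape is likely to pick up, you can list a sitemap's `<loc>` entries yourself. The sketch below is a minimal example using Python's standard library and an inline sample document. It assumes a plain `urlset` sitemap in the standard sitemaps.org namespace and does not handle sitemap index files. The sample URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Placeholder sitemap content; in practice this would be the body of
# a file such as https://docs.scoutos.com/sitemap.xml.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/overview</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
  <url><loc>https://example.com/changelog/2025</loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return the page URLs listed in a urlset sitemap, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in sitemap_urls(SAMPLE_SITEMAP):
    print(url)
```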

Setting Up a Web Scrape

1. Create a Database and Table

Set up a Database and a Table to hold the scraped content. Make sure the Table has a content column so extracted text is embedded for semantic search.

2. Add a Web Scraping Source

Open your Database, select the target Table, and go to the Sources tab. Click Add Source and choose Web Scrape.

3. Configure the scrape

Enter the URL and choose your scrape type and options. A few common configurations:

Single page
URL: https://docs.scoutos.com/docs/overview
Scraper: Http
Text Extractor: Readability

Full site crawl
URL: https://docs.scoutos.com
Scraper: Http
Max Depth: 3
Max Page Count: 500
Exclude Patterns: /api/, /changelog/

Sitemap
URL: https://docs.scoutos.com/sitemap.xml
Scraper: Http
Text Extractor: Readability

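If you keep scrape settings in version control before entering them in the UI, a small check can catch typos early. The sketch below writes the three configurations above as Python dictionaries and validates them against the option values this page lists. The key names (`url`, `scraper`, `text_extractor`, and so on) are hypothetical and chosen for this example only. They are not a documented Scout schema or API.

```python
# Option values taken from the settings described on this page.
VALID_SCRAPERS = {"Http", "Playwright"}
VALID_EXTRACTORS = {"Readability", "Trafilatura"}

# Hypothetical key names that mirror the form fields in the UI.
CONFIGS = {
    "single_page": {
        "url": "https://docs.scoutos.com/docs/overview",
        "scraper": "Http",
        "text_extractor": "Readability",
    },
    "full_site_crawl": {
        "url": "https://docs.scoutos.com",
        "scraper": "Http",
        "max_depth": 3,
        "max_page_count": 500,
        "exclude_patterns": ["/api/", "/changelog/"],
    },
    "sitemap": {
        "url": "https://docs.scoutos.com/sitemap.xml",
        "scraper": "Http",
        "text_extractor": "Readability",
    },
}


def validate(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems = []
    if not str(config.get("url", "")).startswith(("http://", "https://")):
        problems.append("url must be a full http(s) URL")
    if config.get("scraper", "Http") not in VALID_SCRAPERS:
        problems.append("scraper must be Http or Playwright")
    if config.get("text_extractor", "Readability") not in VALID_EXTRACTORS:
        problems.append("text_extractor must be Readability or Trafilatura")
    for key in ("max_depth", "max_page_count"):
        if key in config and (not isinstance(config[key], int) or config[key] < 1):
            problems.append(f"{key} must be a positive integer")
    return problems


for name, config in CONFIGS.items():
    print(name, validate(config) or "ok")
```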
4. Run the scrape

Click Run Now to start ingestion. The dashboard reports pages discovered, pages scraped, errors encountered, and an estimated time to completion.

Configuration Options

Basic Settings

| Setting | Default | Description |
| --- | --- | --- |
| URL | (required) | The full URL to scrape or crawl, or the URL of the sitemap to parse. |
| Scraper | Http | How pages are fetched: Http or Playwright. |
| Text Extractor | Readability | How the main content is extracted from each page. |

Scraper Options

| Scraper | Best For | Trade-offs |
| --- | --- | --- |
| Http | Static pages and server-rendered HTML | Fast, but struggles with single-page apps (SPAs) and JavaScript-rendered content. |
| Playwright | SPAs and dynamic, JavaScript-rendered content | Renders the page in a browser before extracting, so it is slower. |

Start with the Http scraper. The most common cause of empty results is a JavaScript-rendered site being scraped with Http — if pages come back empty, switch to Playwright.
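One way to guess in advance whether a page needs Playwright is to look at the raw HTML a plain HTTP fetch returns: a page with script tags but almost no visible text is probably rendered client-side. The sketch below is a rough heuristic built on Python's standard `html.parser`. The 200-character threshold and the sample pages are arbitrary assumptions, and this is not how Scout itself decides anything.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Heuristic: scripts are present but little text is in the static HTML."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.visible_chars < min_chars


# Made-up sample pages for illustration.
SPA_SHELL = (
    '<html><head><title>App</title></head><body><div id="root"></div>'
    '<script src="/bundle.js"></script></body></html>'
)
STATIC_PAGE = (
    "<html><body><article>"
    + "Readable paragraph text. " * 20
    + "</article><script>console.log('analytics')</script></body></html>"
)

print(looks_js_rendered(SPA_SHELL), looks_js_rendered(STATIC_PAGE))
```

A `True` result is only a hint that the Http scraper may return empty content for that page.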

Text Extractor Options

| Extractor | Description | Best For |
| --- | --- | --- |
| Readability | Scout’s smart extraction that removes navigation, ads, and other clutter. | Documentation and most general pages. |
| Trafilatura | A Python-based extractor focused on the main body content. | News articles and blog posts. |

Advanced Settings

| Setting | Default | Description |
| --- | --- | --- |
| Allow | (none) | Comma-separated URL patterns to include. |
| Deny | (none) | Comma-separated URL patterns to exclude. |
| Exclude Patterns | (none) | Regex patterns to exclude matching URLs. |
| Strip | (none) | Comma-separated HTML tags to remove before extraction. |
| Strip URLs | true | Remove URLs from the extracted text. |
| Allowed Domains | Source domain | Domains the crawler is permitted to follow links into. |
| Max Depth | 5 | How many link levels deep a crawl will follow from the starting URL. |
| Max Page Count | 3000 | The maximum number of pages a single scrape will ingest. |
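To see how these limits interact, the sketch below simulates a breadth-first crawl over a small in-memory link graph. It rests on several assumptions that the settings table does not confirm: Allow and Deny are treated as substring matches, Exclude Patterns are applied with `re.search`, the starting URL counts as depth 0, and Max Page Count caps the pages collected. The function names and sample URLs are hypothetical.

```python
import re
from collections import deque
from urllib.parse import urlparse


def is_allowed(url, allow=(), deny=(), exclude_patterns=(), allowed_domains=()):
    """Apply domain, Allow, Deny, and Exclude Patterns filters to one URL."""
    if allowed_domains and urlparse(url).netloc not in allowed_domains:
        return False
    if allow and not any(pattern in url for pattern in allow):
        return False  # assumed: Allow is a substring match
    if any(pattern in url for pattern in deny):
        return False  # assumed: Deny is a substring match
    return not any(re.search(pattern, url) for pattern in exclude_patterns)


def crawl(start_url, links, max_depth=5, max_page_count=3000, **filters):
    """Breadth-first walk over `links`, a dict mapping URL -> linked URLs."""
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth); the start URL is depth 0
    scraped = []
    while queue and len(scraped) < max_page_count:
        url, depth = queue.popleft()
        scraped.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for linked in links.get(url, []):
            if linked not in seen and is_allowed(linked, **filters):
                seen.add(linked)
                queue.append((linked, depth + 1))
    return scraped


# Made-up link graph standing in for a real site.
LINKS = {
    "https://example.com/": [
        "https://example.com/docs/",
        "https://example.com/api/v1",
        "https://other.com/",
    ],
    "https://example.com/docs/": [
        "https://example.com/docs/setup",
        "https://example.com/changelog/2025",
    ],
    "https://example.com/docs/setup": ["https://example.com/docs/setup/advanced"],
}

pages = crawl(
    "https://example.com/",
    LINKS,
    max_depth=2,
    max_page_count=10,
    allowed_domains=("example.com",),
    exclude_patterns=(r"/api/", r"/changelog/"),
)
print(pages)
```

In this run the `/api/` and `/changelog/` links are excluded, the off-domain link is never followed, and the page three levels down is skipped because of the depth limit.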

Monitoring a Scrape

Track running and completed scrapes from the Sources tab. For each scrape you can see:
| Field | Meaning |
| --- | --- |
| Status | Whether the scrape is running, completed, or failed. |
| Progress | Pages discovered versus pages scraped. |
| Errors | Pages that could not be fetched or extracted. |
| Duration | How long the scrape has been running or took to finish. |

Open a scrape to view detailed logs and per-page results, including which URLs succeeded and which returned errors.

Results

Each scraped page becomes a separate document in your Table. The extracted text is embedded and indexed for semantic search, and page metadata such as the URL and title is preserved for filtering and attribution.
{
  "id": "doc_abc123",
  "text": "Extracted content from the web page...",
  "metadata": {
    "url": "https://example.com/page",
    "title": "Page Title",
    "scraped_at": "2025-02-26T10:00:00Z"
  }
}
Scout deduplicates scraped pages by URL. Re-running a scrape updates existing documents in place rather than creating duplicates, so you can refresh content on a schedule without bloating your Table.
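The update-in-place behavior can be pictured as an upsert keyed on the page URL. The sketch below models a Table as a plain dictionary and reuses the document shape shown above. It is only an illustration of the idea, not Scout's storage logic, and the `upsert_by_url` helper is hypothetical.

```python
def upsert_by_url(table: dict, page: dict) -> str:
    """Insert a scraped page, or refresh the existing document with the same URL."""
    url = page["metadata"]["url"]
    existing = table.get(url)
    if existing is None:
        table[url] = page
        return "created"
    # Same URL seen before: refresh content and metadata, keep the original id.
    existing["text"] = page["text"]
    existing["metadata"].update(page["metadata"])
    return "updated"


table = {}  # stand-in for a Table, keyed by page URL

first = upsert_by_url(table, {
    "id": "doc_abc123",
    "text": "Extracted content from the web page...",
    "metadata": {
        "url": "https://example.com/page",
        "title": "Page Title",
        "scraped_at": "2025-02-26T10:00:00Z",
    },
})

# A later run scrapes the same URL with fresh content.
second = upsert_by_url(table, {
    "id": "doc_new",
    "text": "Updated content from the web page...",
    "metadata": {
        "url": "https://example.com/page",
        "title": "Page Title",
        "scraped_at": "2025-03-05T10:00:00Z",
    },
})

print(first, second, len(table))
```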

Best Practices

  • Optimize scope. Start with a low Max Depth, exclude paths you don’t need with Deny and Exclude Patterns, and set a Max Page Count so a crawl can’t run away.
  • Choose the right scraper. Use Http for static sites and Playwright only when content is JavaScript-rendered — Http is significantly faster.
  • Match the extractor to the content. Readability works best for documentation; Trafilatura is better for news and blogs. Use Strip to remove unwanted elements.
  • Batch large sites. For big sites, scrape from a sitemap rather than a deep crawl for more predictable coverage, and exclude media-heavy paths. Scout handles rate limiting automatically.

Troubleshooting

If scraped pages come back empty, the site is likely JavaScript-rendered. Switch the Scraper from Http to Playwright. Also confirm the page is publicly accessible and doesn’t require authentication.
If a scrape ingests more pages than you want, narrow the scope with Deny and Exclude Patterns, lower Max Depth, or switch to a Sitemap scrape for precise control over which pages are included.
If a scrape runs slowly, use the Http scraper instead of Playwright where possible, reduce Max Depth, and check the target site’s response times, because a slow origin slows the whole crawl.

Next Steps

Creating Databases

Design your Table schema before configuring a web scrape.

Sources

See all the ways to sync data into Databases, including Notion and Google Sheets.

Querying Data

Search scraped content with semantic, keyword, and hybrid modes.