Create a Database with at least one Table before setting up a web scraping Source. Scraped pages are written as documents into the Table you select.
Types of Web Scrapes
Scout offers three ways to scrape content, each suited to a different scope.
Single Page
Scrape one page by its URL. Best for one-off pages, specific articles, and individual documentation pages — for example importing a single blog post or product page.
Website Crawl
Start from a URL and automatically follow internal links across the site, respecting robots.txt. Best for complete sites, documentation portals, and knowledge bases.
Sitemap
Parse a sitemap URL and scrape the pages it lists. Best for large sites with an organized sitemap where you want precise control over which pages are ingested.
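To make the Sitemap type concrete, here is a minimal sketch of what "parsing a sitemap" means, using only the Python standard library. It illustrates the sitemaps.org format and is not Scout's implementation; the function name and sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> values listed in a sitemap document, in order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE_SITEMAP))
```

Because the page list comes from the sitemap rather than from link discovery, the set of scraped pages is known before the scrape starts.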
Setting Up a Web Scrape
Create a Database and Table
Set up a Database and a Table to hold the scraped content. Make sure the Table has a content column so extracted text is embedded for semantic search.
Add a Web Scraping Source
Open your Database, select the target Table, and go to the Sources tab. Click Add Source and choose Web Scrape.
Configure the scrape
Enter the URL and choose your scrape type and options. Common configurations are a single page, a full site crawl, and a sitemap.
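The three common configurations can be pictured as plain settings dictionaries. This is an illustrative sketch only: the scrape is configured in the Scout UI, and the key names, the `scrape_type` values, the pattern syntax, and the example URLs below are assumptions that mirror the setting names on this page rather than a documented Scout API.

```python
# Values taken from the Scraper and Text Extractor options on this page.
VALID_SCRAPERS = {"Http", "Playwright"}
VALID_EXTRACTORS = {"Readability", "Trafilatura"}

# Hypothetical key names; they mirror the UI labels, not a real API.
SINGLE_PAGE = {
    "url": "https://example.com/blog/launch-post",
    "scrape_type": "single_page",
    "scraper": "Http",
    "text_extractor": "Trafilatura",
}

FULL_SITE_CRAWL = {
    "url": "https://example.com/docs/",
    "scrape_type": "crawl",
    "scraper": "Http",
    "text_extractor": "Readability",
    "max_depth": 3,
    "max_page_count": 500,
    "deny": "/changelog/,/legal/",  # comma-separated, pattern syntax assumed
}

SITEMAP = {
    "url": "https://example.com/sitemap.xml",
    "scrape_type": "sitemap",
    "scraper": "Playwright",
    "text_extractor": "Readability",
}


def is_valid(config: dict) -> bool:
    """Check the URL scheme and that the scraper and extractor are known options."""
    return (
        config.get("url", "").startswith(("http://", "https://"))
        and config.get("scraper", "Http") in VALID_SCRAPERS
        and config.get("text_extractor", "Readability") in VALID_EXTRACTORS
    )


for cfg in (SINGLE_PAGE, FULL_SITE_CRAWL, SITEMAP):
    print(cfg["scrape_type"], is_valid(cfg))
```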
Configuration Options
Basic Settings
| Setting | Default | Description |
|---|---|---|
| URL (required) | (none) | The full URL of the page to scrape, the site to crawl, or the sitemap to parse. |
| Scraper | Http | How pages are fetched — Http or Playwright. |
| Text Extractor | Readability | How the main content is extracted from each page. |
Scraper Options
| Scraper | Best For | Trade-offs |
|---|---|---|
| Http | Static pages and server-rendered HTML | Fast, but struggles with single-page apps (SPAs) and JavaScript-rendered content. |
| Playwright | SPAs and dynamic, JavaScript-rendered content | Renders the page in a browser before extracting, so it is slower. |
Text Extractor Options
| Extractor | Description | Best For |
|---|---|---|
| Readability | Scout’s smart extraction that removes navigation, ads, and other clutter. | Documentation and most general pages. |
| Trafilatura | A Python-based extractor focused on the main body content. | News articles and blog posts. |
Advanced Settings
| Setting | Default | Description |
|---|---|---|
| Allow | (none) | Comma-separated URL patterns to include. |
| Deny | (none) | Comma-separated URL patterns to exclude. |
| Exclude Patterns | (none) | Regex patterns to exclude matching URLs. |
| Strip | (none) | Comma-separated HTML tags to remove before extraction. |
| Strip URLs | true | Remove URLs from the extracted text. |
| Allowed Domains | Source domain | Domains the crawler is permitted to follow links into. |
| Max Depth | 5 | How many link levels deep a crawl will follow from the starting URL. |
| Max Page Count | 3000 | The maximum number of pages a single scrape will ingest. |
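The sketch below shows one plausible way the Allow, Deny, Exclude Patterns, and Allowed Domains settings could combine into a single include/exclude decision. The matching rules are assumptions made for illustration: Allow and Deny are treated as glob-style patterns, Exclude Patterns as regular expressions, and Deny takes precedence over Allow. Scout's actual matching semantics may differ.

```python
import re
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def split_patterns(value: str) -> list[str]:
    """Split a comma-separated setting into trimmed, non-empty patterns."""
    return [p.strip() for p in value.split(",") if p.strip()]


def should_scrape(url, allow="", deny="", exclude_patterns=(), allowed_domains=()):
    """Return True if the URL passes every filter (assumed semantics)."""
    host = urlparse(url).hostname or ""
    if allowed_domains and host not in allowed_domains:
        return False
    # Deny and Exclude Patterns remove URLs before Allow is considered.
    if any(fnmatchcase(url, p) for p in split_patterns(deny)):
        return False
    if any(re.search(rx, url) for rx in exclude_patterns):
        return False
    # An empty Allow setting means "allow everything not excluded".
    allow_list = split_patterns(allow)
    return not allow_list or any(fnmatchcase(url, p) for p in allow_list)


print(should_scrape("https://example.com/docs/intro", allow="*/docs/*"))
print(should_scrape("https://example.com/docs/file.pdf", exclude_patterns=[r"\.pdf$"]))
```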
Monitoring a Scrape
Track running and completed scrapes from the Sources tab. For each scrape you can see:
| Field | Meaning |
|---|---|
| Status | Whether the scrape is running, completed, or failed. |
| Progress | Pages discovered versus pages scraped. |
| Errors | Pages that could not be fetched or extracted. |
| Duration | How long the scrape has been running or took to finish. |
Results
Each scraped page becomes a separate document in your Table. The extracted text is embedded and indexed for semantic search, and page metadata such as the URL and title is preserved for filtering and attribution.
Scout deduplicates scraped pages by URL. Re-running a scrape updates existing documents in place rather than creating duplicates, so you can refresh content on a schedule without bloating your Table.
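Deduplication by URL behaves like an upsert keyed on the page URL. The following is a minimal sketch of that idea with an in-memory dictionary standing in for the Table; the function and field names are hypothetical and this is not how Scout stores documents internally.

```python
def upsert_pages(table: dict[str, dict], pages: list[dict]) -> dict[str, dict]:
    """Insert new pages and overwrite existing ones, keyed by URL."""
    for page in pages:
        table[page["url"]] = page
    return table


table: dict[str, dict] = {}

# First scrape creates one document per page.
upsert_pages(table, [
    {"url": "https://example.com/docs/intro", "title": "Intro", "content": "v1"},
    {"url": "https://example.com/docs/setup", "title": "Setup", "content": "v1"},
])

# Re-running the scrape updates the existing document instead of adding a copy.
upsert_pages(table, [
    {"url": "https://example.com/docs/intro", "title": "Intro", "content": "v2"},
])

print(len(table), table["https://example.com/docs/intro"]["content"])
```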
Best Practices
- Optimize scope. Start with a low Max Depth, exclude paths you don’t need with Deny and Exclude Patterns, and set a Max Page Count so a crawl can’t run away.
- Choose the right scraper. Use Http for static sites and Playwright only when content is JavaScript-rendered — Http is significantly faster.
- Match the extractor to the content. Readability works best for documentation; Trafilatura is better for news and blogs. Use Strip to remove unwanted elements.
- Batch large sites. For big sites, scrape from a sitemap rather than a deep crawl for more predictable coverage, and exclude media-heavy paths. Scout handles rate limiting automatically.
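To see how Max Depth and Max Page Count bound a crawl, consider this sketch of a breadth-first traversal over an in-memory link graph. It assumes the starting URL is depth 0 and that each followed link adds one level; Scout's exact counting may differ, and the graph and function name are hypothetical.

```python
from collections import deque


def bounded_crawl(links, start, max_depth=5, max_page_count=3000):
    """Visit pages breadth-first, stopping at the depth and page limits."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_page_count:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical site structure: page -> internal links found on it.
SITE_LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
}

print(bounded_crawl(SITE_LINKS, "/", max_depth=1))
print(bounded_crawl(SITE_LINKS, "/", max_page_count=2))
```

Lowering either limit shrinks the result, which is why starting with a small Max Depth and raising it gradually is a safe way to explore an unfamiliar site.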
Troubleshooting
Pages come back empty
The site is likely JavaScript-rendered. Switch the Scraper from Http to Playwright. Also confirm the page is publicly accessible and doesn’t require authentication.
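A rough way to check whether a page is JavaScript-rendered is to look at how much visible text the raw HTML contains before any scripts run. The sketch below uses Python's standard `html.parser` for that; the 200-character threshold and the helper names are arbitrary assumptions, and this is a diagnostic heuristic rather than anything Scout does internally.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Return True when the raw HTML has very little visible text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars


# Hypothetical examples: an SPA shell and a server-rendered article.
SPA_SHELL = (
    "<html><head><title>App</title></head><body>"
    '<div id="root"></div><script>window.__DATA__ = {};</script>'
    "</body></html>"
)
STATIC_PAGE = "<html><body><article>" + "Scout docs " * 40 + "</article></body></html>"

print(looks_js_rendered(SPA_SHELL), looks_js_rendered(STATIC_PAGE))
```

If the raw HTML of a page looks like the SPA shell here, the Http scraper has little to extract, which matches the empty-page symptom.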
The crawl ingests too many pages
Narrow the scope with Deny and Exclude Patterns, lower Max Depth, or switch to a Sitemap scrape for precise control over which pages are included.
The crawl is slow
Use the Http scraper instead of Playwright where possible, reduce Max Depth, and check the target site’s response times — slow origins slow the whole crawl.
Next Steps
Creating Databases
Design your Table schema before configuring a web scrape.
Sources
See all the ways to sync data into Databases, including Notion and Google Sheets.
Querying Data
Search scraped content with semantic, keyword, and hybrid modes.