
# Web Scraping: Sync Public Web Content into Scout Databases

> Pull public website content directly into a Scout Database for semantic search and RAG. Scrape a single page, crawl an entire site, or ingest a sitemap on a schedule.

Web scraping lets you ingest public web content directly into a Scout Database so your agents can search and reason over it. Use it to keep agents current with content that lives on the web — competitor sites, news, product pages, and external documentation. Each scraped page becomes a document that is embedded and indexed alongside the rest of your Database data.

<Note>
  Create a [Database](/databases/creating-databases) with at least one Table before setting up a web scraping Source. Scraped pages are written as documents into the Table you select.
</Note>

## Types of Web Scrapes

Scout offers three ways to scrape content, each suited to a different scope.

<CardGroup cols={3}>
  <Card title="Single Page" icon="file">
    Scrape one page by its URL. Best for one-off pages, specific articles, and individual documentation pages, such as a single blog post or product page.
  </Card>

  <Card title="Website Crawl" icon="sitemap">
    Start from a URL and automatically follow internal links across the site, respecting `robots.txt`. Best for complete sites, documentation portals, and knowledge bases.
  </Card>

  <Card title="Sitemap" icon="map">
    Parse a sitemap URL and scrape the pages it lists. Best for large sites with an organized sitemap where you want precise control over which pages are ingested.
  </Card>
</CardGroup>
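
To preview which pages a Sitemap scrape might cover, you can list the URLs in a sitemap yourself. The sketch below is not Scout's implementation: it assumes a standard sitemaps.org `urlset` document, does not handle sitemap index files, and uses a hypothetical helper name, `list_sitemap_urls`.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Illustrative sitemap content (hypothetical URLs).
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/overview</loc></url>
  <url><loc>https://example.com/docs/quickstart</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in list_sitemap_urls(EXAMPLE_SITEMAP):
        print(url)
```

If the list contains sections you do not want, that is a signal to narrow the scrape with the filtering options described under Advanced Settings.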

## Setting Up a Web Scrape

<Steps>
  <Step title="Create a Database and Table">
    Set up a [Database](/databases/creating-databases) and a Table to hold the scraped content. Make sure the Table has a `content` column so extracted text is embedded for semantic search.
  </Step>

  <Step title="Add a Web Scraping Source">
    Open your Database, select the target Table, and go to the **Sources** tab. Click **Add Source** and choose **Web Scrape**.
  </Step>

  <Step title="Configure the scrape">
    Enter the **URL** and choose your scrape type and options. A few common configurations:

    **Single page**

    ```
    URL: https://docs.scoutos.com/docs/overview
    Scraper: Http
    Text Extractor: Readability
    ```

    **Full site crawl**

    ```
    URL: https://docs.scoutos.com
    Scraper: Http
    Max Depth: 3
    Max Page Count: 500
    Exclude Patterns: /api/, /changelog/
    ```

    **Sitemap**

    ```
    URL: https://docs.scoutos.com/sitemap.xml
    Scraper: Http
    Text Extractor: Readability
    ```
  </Step>

  <Step title="Run the scrape">
    Click **Run Now** to start ingestion. The dashboard reports pages discovered, pages scraped, errors encountered, and an estimated time to completion.
  </Step>
</Steps>

## Configuration Options

### Basic Settings

| Setting              | Default       | Description                                                                |
| -------------------- | ------------- | -------------------------------------------------------------------------- |
| **URL** *(required)* | (none)        | The URL of the page to scrape, the site to crawl, or the sitemap to parse. |
| **Scraper**          | `Http`        | How pages are fetched: `Http` or `Playwright`.                             |
| **Text Extractor**   | `Readability` | How the main content is extracted from each page.                          |

### Scraper Options

| Scraper        | Best For                                      | Trade-offs                                                                        |
| -------------- | --------------------------------------------- | --------------------------------------------------------------------------------- |
| **Http**       | Static pages and server-rendered HTML         | Fast, but struggles with single-page apps (SPAs) and JavaScript-rendered content. |
| **Playwright** | SPAs and dynamic, JavaScript-rendered content | Renders the page in a browser before extracting, so it is slower.                 |

<Tip>
  Start with the **Http** scraper. The most common cause of empty results is a JavaScript-rendered site being scraped with Http — if pages come back empty, switch to **Playwright**.
</Tip>
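
One rough way to guess whether a page needs Playwright is to measure how much visible text its raw HTML contains before any JavaScript runs. The sketch below is a heuristic under stated assumptions, not how Scout decides: the helper names and the 200-character threshold are hypothetical, and the HTML strings stand in for a response body you would fetch yourself.

```python
from html.parser import HTMLParser


class _VisibleTextParser(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript> tags."""

    _SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Count characters of text present in the HTML without running scripts."""
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def looks_javascript_rendered(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little static text suggests the content is rendered client-side."""
    return visible_text_length(html) < min_chars


# Hypothetical responses: an empty SPA shell and a server-rendered article.
SPA_SHELL = (
    "<html><head><script>window.__APP__ = {};</script></head>"
    '<body><div id="root"></div></body></html>'
)
STATIC_PAGE = (
    "<html><body><article>"
    + "<p>Scout ingests public web content.</p>" * 10
    + "</article></body></html>"
)

if __name__ == "__main__":
    print(looks_javascript_rendered(SPA_SHELL))
    print(looks_javascript_rendered(STATIC_PAGE))
```

A page flagged by a check like this is a reasonable candidate for the **Playwright** scraper; a page with plenty of static text can usually stay on **Http**.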

### Text Extractor Options

| Extractor       | Description                                                               | Best For                              |
| --------------- | ------------------------------------------------------------------------- | ------------------------------------- |
| **Readability** | Scout's smart extraction that removes navigation, ads, and other clutter. | Documentation and most general pages. |
| **Trafilatura** | A Python-based extractor focused on the main body content.                | News articles and blog posts.         |

### Advanced Settings

| Setting              | Default       | Description                                                          |
| -------------------- | ------------- | -------------------------------------------------------------------- |
| **Allow**            | (none)        | Comma-separated URL patterns to include.                             |
| **Deny**             | (none)        | Comma-separated URL patterns to exclude.                             |
| **Exclude Patterns** | (none)        | Regex patterns to exclude matching URLs.                             |
| **Strip**            | (none)        | Comma-separated HTML tags to remove before extraction.               |
| **Strip URLs**       | `true`        | Remove URLs from the extracted text.                                 |
| **Allowed Domains**  | Source domain | Domains the crawler is permitted to follow links into.               |
| **Max Depth**        | `5`           | How many link levels deep a crawl will follow from the starting URL. |
| **Max Page Count**   | `3000`        | The maximum number of pages a single scrape will ingest.             |

## Monitoring a Scrape

Track running and completed scrapes from the **Sources** tab. For each scrape you can see:

| Field        | Meaning                                                 |
| ------------ | ------------------------------------------------------- |
| **Status**   | Whether the scrape is running, completed, or failed.    |
| **Progress** | Pages discovered versus pages scraped.                  |
| **Errors**   | Pages that could not be fetched or extracted.           |
| **Duration** | How long the scrape has been running or took to finish. |

Open a scrape to view detailed logs and per-page results, including which URLs succeeded and which returned errors.

## Results

Each scraped page becomes a separate document in your Table. The extracted text is embedded and indexed for semantic search, and page metadata such as the URL and title is preserved for filtering and attribution.

```json
{
  "id": "doc_abc123",
  "text": "Extracted content from the web page...",
  "metadata": {
    "url": "https://example.com/page",
    "title": "Page Title",
    "scraped_at": "2025-02-26T10:00:00Z"
  }
}
```

<Note>
  Scout deduplicates scraped pages by URL. Re-running a scrape updates existing documents in place rather than creating duplicates, so you can refresh content on a schedule without bloating your Table.
</Note>
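
The deduplication behavior can be pictured as an upsert keyed on the page URL. The sketch below uses an in-memory dictionary purely for illustration; it says nothing about how Scout stores documents. The field names follow the example document above, and `upsert_page` is a hypothetical helper.

```python
def upsert_page(
    table: dict[str, dict], url: str, title: str, text: str, scraped_at: str
) -> bool:
    """Insert or overwrite the document for `url`. Returns True if the URL was new."""
    is_new = url not in table
    table[url] = {
        "text": text,
        "metadata": {"url": url, "title": title, "scraped_at": scraped_at},
    }
    return is_new


if __name__ == "__main__":
    table: dict[str, dict] = {}
    url = "https://example.com/page"
    # First scrape creates the document; the second replaces it in place.
    upsert_page(table, url, "Page Title", "First version", "2025-02-26T10:00:00Z")
    upsert_page(table, url, "Page Title", "Updated version", "2025-03-05T10:00:00Z")
    print(len(table), table[url]["text"])
```

Because the URL is the key, a second scrape of the same page replaces the earlier text instead of adding a second entry.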

## Best Practices

* **Optimize scope.** Start with a low **Max Depth**, exclude paths you don't need with **Deny** and **Exclude Patterns**, and set a **Max Page Count** so a crawl can't run away.
* **Choose the right scraper.** Use **Http** for static sites and **Playwright** only when content is JavaScript-rendered — Http is significantly faster.
* **Match the extractor to the content.** **Readability** works best for documentation; **Trafilatura** is better for news and blogs. Use **Strip** to remove unwanted elements.
* **Batch large sites.** For big sites, scrape from a sitemap rather than a deep crawl for more predictable coverage, and exclude media-heavy paths. Scout handles rate limiting automatically.

## Troubleshooting

<AccordionGroup>
  <Accordion title="Pages come back empty">
    The site is likely JavaScript-rendered. Switch the **Scraper** from Http to **Playwright**. Also confirm the page is publicly accessible and doesn't require authentication.
  </Accordion>

  <Accordion title="The crawl ingests too many pages">
    Narrow the scope with **Deny** and **Exclude Patterns**, lower **Max Depth**, or switch to a **Sitemap** scrape for precise control over which pages are included.
  </Accordion>

  <Accordion title="The crawl is slow">
    Use the **Http** scraper instead of Playwright where possible, reduce **Max Depth**, and check the target site's response times — slow origins slow the whole crawl.
  </Accordion>
</AccordionGroup>

## Next Steps

<CardGroup cols={3}>
  <Card title="Creating Databases" icon="table" href="/databases/creating-databases">
    Design your Table schema before configuring a web scrape.
  </Card>

  <Card title="Sources" icon="rotate" href="/databases/sources">
    See all the ways to sync data into Databases, including Notion and Google Sheets.
  </Card>

  <Card title="Querying Data" icon="magnifying-glass" href="/databases/querying-data">
    Search scraped content with semantic, keyword, and hybrid modes.
  </Card>
</CardGroup>
