Web Page Scrape Block

Scrape web pages and extract HTML content using specified selectors

The Web Page Scrape block fetches a web page and returns its HTML content, optionally removing elements that match the selectors you specify. Use it for data extraction and content analysis within Scout workflows.

Configuration (Required)

URL to which the request will be made
string (Required)

The URL specifies the web page to be scraped. Ensure the URL is valid and accessible to successfully retrieve the content. This field supports Jinja templating for dynamic URL construction.
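As a rough illustration of what "valid" can mean here, the following sketch checks a URL's syntax before it is handed to the block. The function name `looks_like_scrapable_url` is hypothetical and is not part of Scout. The check covers only the scheme and host, and it assumes the block expects `http` or `https` URLs. It cannot tell whether the page is actually reachable.

```python
from urllib.parse import urlparse


def looks_like_scrapable_url(url: str) -> bool:
    """Return True if the string has an http(s) scheme and a host.

    Hypothetical helper: this is a syntax check only and makes no
    network request, so it cannot confirm the page is accessible.
    """
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A check like this is most useful when the URL is built from a template, since a missing variable can easily produce a string with no scheme or host.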

Exclude Selectors
string

A comma-separated list of CSS selectors (classes, IDs, or tags) identifying elements to exclude from the scraped content. Use this to remove unwanted elements from the HTML output. This field supports Jinja templating for dynamic content.
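To make the comma-separated format concrete, here is a minimal sketch of how such a list could be split into classes, IDs, and tags. It assumes the usual CSS prefixes (`.` for a class, `#` for an ID, no prefix for a tag), which this page does not spell out. The function `parse_exclude_selectors` is a hypothetical name, not Scout's own parser.

```python
def parse_exclude_selectors(raw: str) -> dict:
    """Group a comma-separated selector string by selector kind.

    Assumption: classes start with ".", ids start with "#", and anything
    else is treated as a tag name. Empty entries are ignored.
    """
    groups = {"classes": set(), "ids": set(), "tags": set()}
    for item in raw.split(","):
        selector = item.strip()
        if not selector:
            continue  # tolerate trailing commas and doubled separators
        if selector.startswith("."):
            groups["classes"].add(selector[1:])
        elif selector.startswith("#"):
            groups["ids"].add(selector[1:])
        else:
            groups["tags"].add(selector.lower())
    return groups
```

For example, under these assumptions the string `.ad-banner, #footer, nav` would name one class, one ID, and one tag to exclude.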

See Workflow Logic & State > State Management for details on using dynamic variables in this block.

Outputs

The block outputs the scraped HTML content of the web page, with any elements matching the exclude selectors removed. Downstream blocks can then process this HTML within your workflow.
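The sketch below shows one way the exclusion step could behave, using only Python's standard library. It is not Scout's implementation: `ExcludingParser` and `strip_excluded` are hypothetical names, and the matching rules (exact tag name, exact id, any shared class) are assumptions. It also drops comments and doctype declarations, and assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the skip depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ExcludingParser(HTMLParser):
    """Re-emit HTML while dropping excluded elements and their children."""

    def __init__(self, tags=(), classes=(), ids=()):
        # convert_charrefs=False keeps entities such as &amp; unchanged.
        super().__init__(convert_charrefs=False)
        self.tags = {t.lower() for t in tags}
        self.classes, self.ids = set(classes), set(ids)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside an excluded element

    def _excluded(self, tag, attrs):
        attr = dict(attrs)
        element_classes = set((attr.get("class") or "").split())
        return (tag in self.tags
                or attr.get("id") in self.ids
                or bool(element_classes & self.classes))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._excluded(tag, attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<img ... />) never open a nested scope.
        if not self.skip_depth and not self._excluded(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_excluded(html, tags=(), classes=(), ids=()):
    """Return html with every matching element (and its subtree) removed."""
    parser = ExcludingParser(tags, classes, ids)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because an excluded element takes its whole subtree with it, a broad selector such as a layout `div` class can remove far more than intended, so it is worth inspecting the output when you add a new selector.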

Usage Context

Use this block when a workflow needs to fetch a web page dynamically and process its HTML in later steps. Specify exclude selectors to strip unwanted elements from the HTML before it is passed on.

Best Practices

  • Ensure the URL is valid and accessible: Verify that the URL is correct and reachable to avoid errors during the scraping process.
  • Use Jinja templating to dynamically set the URL and exclude selectors based on workflow state: This allows for flexible and adaptable scraping operations.
  • Carefully choose exclude selectors to avoid removing necessary content from the page: Ensure that only unwanted elements are excluded to maintain the integrity of the extracted data.