Web Page Scrape Block
Scrape web pages within Scout workflows using Browserless API
The Web Page Scrape block retrieves the content of a web page by sending an HTTP request through the Browserless API. It returns the page's HTML with any specified elements removed, so later steps in the workflow can work with only the content they need.
Configuration (Required)
The URL specifies the web page to scrape. It must be valid and reachable, or the block cannot retrieve the content. This input supports Jinja templating for dynamic content.
The selectors to exclude are given as a comma-separated list of classes, ids, or tags. Elements matching them are removed from the scraped content, which keeps unwanted elements out of the HTML output. This input supports Jinja templating for dynamic content.
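As an illustration of the comma-separated format, the following minimal Python sketch splits such a string into individual selectors. The function name `parse_excluded_selectors` and the sample selectors are hypothetical; how the block itself parses this input is not documented here, so treat this only as a way to sanity-check a value before pasting it into the configuration.

```python
def parse_excluded_selectors(raw: str) -> list[str]:
    """Split a comma-separated selector string into a clean list."""
    # Trim whitespace around each entry and drop empty entries
    # left behind by trailing or doubled commas.
    return [part.strip() for part in raw.split(",") if part.strip()]


# Hypothetical input: one class, one id, and one tag.
selectors = parse_excluded_selectors(".ad-banner, #cookie-notice, footer, ")
print(selectors)  # ['.ad-banner', '#cookie-notice', 'footer']
```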
Outputs
The block outputs the HTML content of the web page with the elements matching the excluded selectors removed. Later steps in the workflow can process or integrate this output.
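To make "HTML with excluded selectors removed" concrete, here is a rough Python sketch that mimics the effect locally using only the standard library. It is not the block's implementation: the names `ExcludingParser` and `exclude_selectors` are hypothetical, it only understands bare tags, `.class`, and `#id`, it assumes balanced markup, and it drops comments and doctype declarations.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ExcludingParser(HTMLParser):
    """Rebuilds HTML while skipping elements that match simple selectors."""

    def __init__(self, selectors):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.tags = {s for s in selectors if s[0] not in ".#"}
        self.classes = {s[1:] for s in selectors if s.startswith(".")}
        self.ids = {s[1:] for s in selectors if s.startswith("#")}
        self.out = []
        self.skip_depth = 0  # > 0 while inside an excluded element

    def _matches(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        return (tag in self.tags or attr.get("id") in self.ids
                or bool(classes & self.classes))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1  # nested element inside an excluded one
        elif self._matches(tag, attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1   # start skipping until the matching close
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nesting level.
        if not self.skip_depth and not self._matches(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def exclude_selectors(html, selectors):
    """Return html with elements matching the simple selectors removed."""
    parser = ExcludingParser([s for s in selectors if s])
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = ('<div><p id="intro">Hello</p><div class="ad">Buy now</div>'
        '<footer>Contact</footer></div>')
print(exclude_selectors(page, [".ad", "footer"]))
# <div><p id="intro">Hello</p></div>
```

Comparing a sketch like this against the block's real output on a sample page can help confirm that a selector list removes what is intended before it is used in a workflow.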
Usage Context
Use this block when a workflow needs the HTML content of a web page. Check that the URL is correct and that every selector to exclude is listed accurately.
Best Practices
- Verify the URL is accessible and correct: An invalid or unreachable URL causes errors during the scraping process.
- Use Jinja templating to dynamically construct URLs and excluded selectors: This allows for flexible and dynamic scraping operations.
- Ensure the BROWSERLESS_API_KEY is configured in the environment: This is necessary for the block to function correctly.
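The last point can be checked before a workflow runs. The sketch below is a minimal, hypothetical pre-flight check in Python; only the variable name `BROWSERLESS_API_KEY` comes from this page, while the helper `require_browserless_key` and its error message are illustrative assumptions rather than part of Scout.

```python
import os


def require_browserless_key(env=None):
    """Return the Browserless API key, or raise if it is missing or blank."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get("BROWSERLESS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "BROWSERLESS_API_KEY is not set; the Web Page Scrape block needs it."
        )
    return key


try:
    require_browserless_key()
    print("BROWSERLESS_API_KEY is configured.")
except RuntimeError as exc:
    print(f"Configuration problem: {exc}")
```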