How to crawl the web with an LLM using Relevance AI and Browserless

Web pages are a key data source for LLMs. Agents can crawl them to browse the internet, and chains can use them to power features like your own T&Cs reviewer. Whatever the use case, it’s worth knowing how to set crawling up. Let’s walk through simple step-by-step instructions for adding crawling to your Relevance AI chain using Browserless within 5 minutes.

To get started, sign up for a free account on Relevance AI and on Browserless (https://browserless.io). The free tier of Browserless provides 1000 free web crawls, which is plenty to thoroughly test the service.

Steps

If you'd like to clone the template directly, you can follow the link here: https://chain.relevanceai.com/notebook/bcbe5a/4dc088f2dcfc-4e60-807e-353c334d4a5b/8372c05c-bcb2-4dff-9986-0c3ee537d17e/split

  1. Add an “API request” step to your chain.
  2. Set the URL to https://chrome.browserless.io/scrape?token=YOUR-TOKEN-HERE, replacing “YOUR-TOKEN-HERE” with your API key from Browserless.
  3. Set the method to POST.
  4. Set the headers to { "Content-Type": "application/json" }
  5. Set the body to { "url": "{{params.url}}", "elements": [{ "selector": "body" }] } . This example crawls the URL passed in params.url and extracts the content matched by the body selector.
  6. Once you run the step, the output.response_body object will contain the response from Browserless with the crawled content.
  7. Add a “Generate text using LLMs” step and set the prompt to:
    CONTEXT:
    """
    {{steps.api_call.output.response_body.data[0].results[0].text}}
    """

    Summarise the above context. Format as markdown.
  8. This injects the text returned by the Browserless API request step into the prompt and asks the LLM to summarise it.
  9. Run the chain and see the results! It’ll now crawl the URL using Browserless and summarise it using an LLM.
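
For readers who want to see what the chain is doing under the hood, the steps above can be sketched in plain Python. This is a minimal sketch, not how Relevance AI implements the steps: the function names (build_scrape_request, extract_text, build_prompt, scrape) are hypothetical, and the response shape is assumed from the template path data[0].results[0].text used in step 7.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from step 2 of the article.
BROWSERLESS_SCRAPE_URL = "https://chrome.browserless.io/scrape"


def build_scrape_request(token: str, page_url: str) -> urllib.request.Request:
    """Build the POST request described in steps 2 to 5 (hypothetical helper)."""
    # Same body as step 5: scrape everything matched by the "body" selector.
    body = {"url": page_url, "elements": [{"selector": "body"}]}
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{BROWSERLESS_SCRAPE_URL}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: dict) -> str:
    """Follow the same path as the prompt template in step 7.

    Assumes the response looks like
    {"data": [{"results": [{"text": "..."}]}]}.
    """
    return response_body["data"][0]["results"][0]["text"]


def build_prompt(context: str) -> str:
    """Reproduce the summarisation prompt from step 7."""
    return (
        'CONTEXT:\n"""\n'
        f"{context}\n"
        '"""\n\n'
        "Summarise the above context. Format as markdown."
    )


def scrape(token: str, page_url: str, timeout: float = 30.0) -> dict:
    """Send the request and parse the JSON response (performs a network call)."""
    request = build_scrape_request(token, page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling build_prompt(extract_text(scrape("YOUR-TOKEN-HERE", "https://example.com"))) would produce the text that the “Generate text using LLMs” step receives; sending that prompt to a model is left to whichever LLM client you use.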

Best Practices

  • Customising the selector to extract the specific piece of data you want from the crawled page ensures the data fed into the LLM more closely matches your intent. See the documentation for Browserless’ /scrape endpoint to learn how to define the elements to extract: https://www.browserless.io/docs/scrape#basic-usage.
  • Add a step after the API request to handle failed crawls. Some websites try to block crawling, so the response may not include the data you want.
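
Both practices can be illustrated with a short Python sketch. It assumes the /scrape response has the shape {"data": [{"selector": ..., "results": [{"text": ...}]}]}, which is inferred from the prompt template earlier in this article. The helper names and the status_code and min_chars parameters are hypothetical; in a chain, the equivalent check would live in a step placed after the API request.

```python
from typing import Dict, List, Optional


def build_elements(selectors: List[str]) -> List[Dict[str, str]]:
    """Turn CSS selectors into the "elements" array of the request body."""
    return [{"selector": selector} for selector in selectors]


def extract_text_or_none(
    status_code: int, response_body: object, min_chars: int = 1
) -> Optional[str]:
    """Return the scraped text, or None when the crawl looks like a failure.

    A crawl is treated as failed when the HTTP status is not 2xx, when the
    body does not have the assumed shape, or when too little text came back
    (for example because the site blocked the request).
    """
    if not 200 <= status_code < 300:
        return None
    if not isinstance(response_body, dict):
        return None
    data = response_body.get("data")
    if not isinstance(data, list):
        return None

    texts = []
    for element in data:
        if not isinstance(element, dict):
            continue
        for result in element.get("results") or []:
            text = result.get("text") if isinstance(result, dict) else None
            if isinstance(text, str) and text.strip():
                texts.append(text.strip())

    combined = "\n\n".join(texts)
    return combined if len(combined) >= min_chars else None
```

For example, build_elements(["article", "h1"]) narrows the crawl to the article body and its main heading, and a None result from extract_text_or_none is the signal to skip the LLM step or return a fallback message instead of summarising an empty page.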

The Scrape API is handy for interacting with a live web-browser and getting back textual data after JavaScript has been parsed and executed -- all wrapped up in a nice REST API. This becomes more important for sites that lazily-load content, require some kind of geographic awareness, or utilize contextual data like cookies and local-storage. A main feature shown here is the ability to remove all the HTML markup, giving back only textual data that your training requires.

Joel Griffith, CEO of Browserless
May 8, 2023
Daniel Vassilev