A large amount of text data is available on the web, and it can be a valuable source for tasks such as Q/A, research, and context generation. Relevance provides an easy-to-use component for scraping website content.

How to use the Extract website content step

Add the component

Add the “Extract website content” step to your Tool (check how to get started with creating a tool).

Website URL

Enter a URL directly in the box, or type {{ to activate variable mode. For instance, if the URL is in an input component called my_url, use {{my_url}}. If the URL is the output of a Google-Search step, use {{google.organic[0].link}} to access the URL of the first search result (FAQ).
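To make the variable syntax concrete, here is a minimal Python sketch of how a path such as google.organic[0].link can be walked through nested data. It is only an illustration, not Relevance's actual template engine; the helper names resolve_path and render and the sample context are hypothetical.

```python
import re

# Matches one path segment: a name optionally followed by [index] parts,
# e.g. "organic[0]".
_SEGMENT = re.compile(r"^([A-Za-z_]\w*)((?:\[\d+\])*)$")


def resolve_path(path, context):
    """Walk a dotted path such as 'google.organic[0].link' through nested data."""
    value = context
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"Unsupported path segment: {segment!r}")
        name, indexes = match.groups()
        value = value[name]
        # Apply each [n] index in order, e.g. organic[0] -> first result.
        for index in re.findall(r"\[(\d+)\]", indexes):
            value = value[int(index)]
    return value


def render(template, context):
    """Replace every {{path}} in the template with its resolved value."""
    return re.sub(
        r"\{\{\s*(.+?)\s*\}\}",
        lambda m: str(resolve_path(m.group(1), context)),
        template,
    )


# Hypothetical data standing in for an input component and a search step.
context = {
    "my_url": "https://example.com",
    "google": {"organic": [{"link": "https://example.com/first"}]},
}

input_url = render("{{my_url}}", context)
first_link = render("{{google.organic[0].link}}", context)
```

The point of the sketch is that organic[0] selects the first search result and .link selects a single string field from it, which is what the Website URL box expects.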

Method

You can choose to scrape the data as Text or HTML. By default, Relevance scrapes the Text.

Element Selector

You can specify which HTML elements to scrape. The default is body. Using + Add new, you can specify a list of elements to be scraped.
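As a rough illustration of what selecting elements and extracting Text means, the following Python sketch collects the text inside given tags using only the standard library. It is a simplification under stated assumptions: it matches plain tag names only (not full CSS selectors), and the class TagTextExtractor and function extract_text are hypothetical names, not part of Relevance.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while inside the requested tag.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, tags=("body",)):
    """Return {tag: text} for each requested tag name (default: body)."""
    result = {}
    for tag in tags:
        parser = TagTextExtractor(tag)
        parser.feed(html)
        parser.close()
        result[tag] = " ".join(parser.chunks)
    return result


html = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>"
extracted = extract_text(html, tags=["body", "h1"])
```

With body as the only element, everything visible on the page is returned; adding narrower elements such as h1 limits the result to the parts you care about.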

Extra headers

If a website requires extra information before it can be scraped, provide the data as a JSON object. The object below shows an example where an authentication token called auth-token and a user-id are required.

{
    "auth-token":"AUTHENTICATION-TOKEN",
    "user-id":"USER-ID"
}
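For readers who want to see what such headers amount to in an ordinary HTTP request, here is a small Python sketch using the standard library. It only builds the request object and does not contact any server; the function build_request and the URL https://example.com are placeholders, and this is not how Relevance sends the request internally.

```python
import json
import urllib.request

# The same JSON object that would be pasted into the Extra headers field.
extra_headers_json = """
{
    "auth-token": "AUTHENTICATION-TOKEN",
    "user-id": "USER-ID"
}
"""


def build_request(url, headers_json):
    """Parse the JSON object and attach each key/value pair as a header."""
    headers = json.loads(headers_json)
    if not isinstance(headers, dict):
        raise ValueError("Extra headers must be a JSON object")
    return urllib.request.Request(url, headers=headers)


request = build_request("https://example.com", extra_headers_json)

# urllib stores header names capitalized, e.g. "Auth-token".
header_items = dict(request.header_items())
```

Each key in the JSON object becomes a header name and each value becomes the header value, so the object must be valid JSON with string values.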

Access the step output

The output is a dictionary with two keys, page and selectors, containing the extracted text and, if any were used, the selectors. The samples below use scrape, the default name assigned to the step. Note that a step name is different from the step title: the title is shown at the top left of a step, while the name is shown at the bottom left, in a smaller font and highlighted in green.

scrape.output.page
scrape.output.selectors

Common errors

Wrong URL formatting

This error occurs when the URL field is set to a value that is not of type string. When using the output of another step, make sure you access the URL field. Read more at URL must be a string.
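The typical cause is passing a whole result object instead of its link field. The Python sketch below reproduces that mistake with hypothetical data; the google dictionary and the pick_url helper are illustrative assumptions, not Relevance code.

```python
def pick_url(value):
    """Return value if it is a string, otherwise raise a descriptive error."""
    if not isinstance(value, str):
        raise TypeError(f"URL must be a string, got {type(value).__name__}")
    return value


# Hypothetical output of a search step.
google = {"organic": [{"title": "Example", "link": "https://example.com"}]}

# Correct: access the link field, which is a string.
url = pick_url(google["organic"][0]["link"])

# Mistake: passing the whole first result, which is an object.
try:
    pick_url(google["organic"][0])
except TypeError as error:
    message = str(error)
```

In variable mode the equivalent fix is to use {{google.organic[0].link}} rather than {{google.organic[0]}}.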

Non-array Elements

When setting up specific elements to be scraped, use + Add new to add more than one element. If you click the button, do not leave the new row empty; use the x icon to the right of the row to remove the extra line.

Studio transformation browserless_scrape input validation error: must be array {"type":"array"} /element_selector
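If you prepare the selector list outside the UI, a check along these lines can catch the problem early. This is a hypothetical Python sketch: normalize_selectors is not a Relevance function, and it simply assumes the step wants a non-empty list of selector strings.

```python
def normalize_selectors(selectors):
    """Return a non-empty list of selector strings, or raise a descriptive error."""
    if not isinstance(selectors, list):
        raise TypeError("element_selector must be an array (a list)")
    # Drop empty rows and surrounding whitespace.
    cleaned = [s.strip() for s in selectors if isinstance(s, str) and s.strip()]
    if not cleaned:
        raise ValueError("element_selector must contain at least one selector")
    return cleaned


# An empty row is dropped, and whitespace is trimmed.
selectors = normalize_selectors(["body", "  h1 ", ""])

# A bare string instead of a list is rejected.
try:
    normalize_selectors("body")
except TypeError as error:
    message = str(error)
```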

Invalid URL

This error occurs when the provided URL is not valid.

Protocol error (Page.navigate): Cannot navigate to invalid URL
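A quick way to spot a malformed URL before running the step is to parse it. The Python sketch below assumes the step expects an absolute http or https URL, which is an assumption rather than something the error message states; is_navigable_url is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_navigable_url(url):
    """Return True for absolute http(s) URLs that include a host."""
    if not isinstance(url, str):
        return False
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


candidates = [
    "https://example.com/page",  # absolute URL with scheme and host
    "example.com/page",          # missing scheme
    "htp://example.com",         # misspelled scheme
    "",                          # empty value
]
checks = {candidate: is_navigable_url(candidate) for candidate in candidates}
```

A missing or misspelled scheme is the most common reason a URL that looks fine at a glance fails this check.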

Network issue

This error normally occurs when there are network issues. Make sure your connection is stable, refresh the page, and try again.

Navigation failed because browser has disconnected!

Timeout

This happens when the page does not finish loading within the time limit.

Navigation timeout of 30000 ms exceeded