PDF-to-Text extracts all the text from a PDF file so that you can analyze it further or use it in applications such as question answering. You can save the extracted text into a knowledge-set to avoid repeating the PDF-to-Text step.

This page introduces the Relevance AI Tool step that converts PDF to text.

How to use the Convert PDF to text step

Add the component

Add the PDF to text converter step to your Tool (check how to get started with creating a tool).

File URL

The PDF-to-text converter requires a file as input. If your file is publicly accessible on the web (i.e. it requires no authentication or sign-up), provide the URL directly or through a text input. Otherwise, add a File-to-URL input. In either case, use the {{variable name}} syntax to pass the data to the converter.
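
As a rough illustration of how a {{variable name}} placeholder stands in for an input value, the Python sketch below fills placeholders from a dictionary of inputs. The `fill_placeholders` helper, the `file_url` variable name, and the example URL are all hypothetical; this is not how Relevance AI implements variables, only a way to picture the substitution.

```python
import re

def fill_placeholders(template, values):
    """Replace each {{name}} in template with values[name].

    Placeholders whose name is not in values are left untouched.
    """
    def substitute(match):
        name = match.group(1).strip()
        return str(values.get(name, match.group(0)))

    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

# Hypothetical input: a text input named file_url holding a public PDF link.
inputs = {"file_url": "https://example.com/report.pdf"}
print(fill_placeholders("{{file_url}}", inputs))
```

In this sketch the File URL field would contain only the placeholder, and the converter would receive the URL stored in the input.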

Use OCR

OCR (optical character recognition) is needed for image-based PDFs (e.g. scanned documents). This option uses more credits, so activate it only for image-based PDFs.

Available converters

  • Fast converter: Relevance AI’s default converter, which is fast and reasonably accurate
  • Quality converter: Slower but more accurate than the fast converter

Access the step output

The output is a dictionary with two keys, text and number_of_pages, which contain the extracted text and the number of pages in the file, respectively. The samples below use pdf_to_text, the default name assigned to the step.

Note that a step name is different from a step title. The step title is shown at the top left of the step. The step name is shown at the bottom left, in a smaller font and highlighted in green.

pdf_to_text.text
pdf_to_text.number_of_pages
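
To make the shape of the output concrete, the following Python sketch models the step output as a plain dictionary with the two keys described above. The sample values and the `summarize_output` helper are hypothetical and only for illustration; they are not part of Relevance AI.

```python
# Hypothetical step output; the values are made up for illustration.
pdf_to_text = {
    "text": "Invoice 001\nTotal due: 42.00",
    "number_of_pages": 1,
}

def summarize_output(step_output):
    """Return a one-line summary of a PDF-to-text step output."""
    text = step_output["text"]
    pages = step_output["number_of_pages"]
    return f"{pages} page(s), {len(text)} characters"

print(summarize_output(pdf_to_text))
```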

Common errors

Unsupported protocol

An error similar to the one noted below indicates that the provided input is not a valid URL.

Error:
Only HTTP(S) protocols are supported
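
One way to catch this before running the Tool is to check the URL scheme yourself. The sketch below is a minimal local pre-check in Python using the standard library; the `is_supported_url` helper and the example URLs are hypothetical, and the platform's own validation may differ.

```python
from urllib.parse import urlparse

def is_supported_url(url):
    """Return True if the URL uses http or https and names a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_supported_url("https://example.com/report.pdf"))  # True
print(is_supported_url("file:///tmp/report.pdf"))          # False
print(is_supported_url("report.pdf"))                      # False
```

A local file path or a bare file name fails this check, which matches the cases where a File-to-URL input is needed instead.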