One of the frequently used templates at Relevance is “Extract categories in data”. This Tool analyses a subset of a CSV file and suggests categories (i.e. themes/topics) that are present in the data. These categories can later be used for text categorization.

How to use the Tool

Locate the Tool in the template page and click on Use template. You can use the Tool as is or clone it.

Tool inputs and output

The Tool requires three main and four optional inputs:

  1. A CSV file (CSV)
  2. The name of the target column (Target column: Column containing the text for categorization)
  3. Row numbers of the target subset (Rows to look at)

Provide the main inputs and hit Run. Within a few seconds you will see the LLM response, similar to what is shown in the image below. Use the copy button, or select the text, to copy it for the next step.

  • Use the “Extract categories in data” Tool on different subsets of your file (i.e. multiple runs for different ranges of rows). Copy the suggestions to a text file and finalize the category list by applying your domain knowledge and considering the requirements of the next step.
  • LLMs can only accept a limited amount of input data. Instead of passing your whole file (i.e. first row to last row), analyze subsets of the file for category suggestions.

The optional inputs are:

  1. Maximum preferred word count per categories/themes/topics: An optional input, set to 3 by default, indicating how wordy the suggested categories may be. Note that this is the number of words in each suggested category/topic, not the total number of suggestions.
  2. Maximum number of categories/themes/topic to extract: An optional input, set to 10 by default, indicating the expected number of category/topic suggestions.
  3. Objective: An optional input, set to General by default, indicating the objective for extracting categories. In other words, the lens through which the data is analyzed.
  4. Example (Example(s) of category/theme/topic extraction done by you): LLMs tend to work better when they are shown samples. Provide sample(s) of your text data and the categories you would assign to them.
  • Using a comma (,) as the separator, you can provide multiple categories per sample
  • Keep the writing style uniform (e.g. Capitalize Each Word)

The output is a list of suggested categories.

We highly recommend

  • running this Tool on different subsets of your file
  • reviewing the suggestions from all runs together
  • finalizing the category list using your domain knowledge and the goals for text categorization
  • using the finalized list in the Text categorizer/Classifier Tool.
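
The recommendations above can be partly scripted outside the Tool. The following is a minimal sketch, assuming you have pasted the suggestions from each run into Python lists; the function name `merge_suggestions` and the sample categories are hypothetical and not part of the template. It merges the runs, treats labels that differ only in letter case or spacing as the same category, and ranks them by how many times they were suggested.

```python
from collections import Counter


def merge_suggestions(runs):
    """Combine category lists from several runs into (label, count) pairs.

    Labels that differ only in letter case or extra spaces are counted
    as the same category. The first spelling seen is kept for display.
    """
    counts = Counter()
    display = {}
    for suggestions in runs:
        for raw in suggestions:
            label = " ".join(raw.split())  # collapse repeated whitespace
            if not label:
                continue
            key = label.lower()
            counts[key] += 1
            display.setdefault(key, label)
    # Most frequent first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return [(display[k], counts[k]) for k in ordered]


# Hypothetical suggestions copied from three runs on different row ranges.
runs = [
    ["Delivery Delay", "Product Quality", "Refund Request"],
    ["delivery delay", "Customer Support", "Product Quality"],
    ["Refund  Request", "Delivery Delay"],
]
merged = merge_suggestions(runs)
for label, count in merged:
    print(f"{label}: suggested {count} time(s)")
```

The ranked list is only a starting point: the final selection should still be made with your domain knowledge and the goals of the categorization step in mind.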

Tool components

If you clone a template, or make a Tool from scratch, you will have access to the Build tab. Build is where you put together different components to build a Tool that suits your needs.

User inputs

  1. File to URL: An easy-to-use, one-step component that takes care of everything needed when uploading a file for further analysis.

  2. Text input: An input text component suitable for short text pieces, such as name, topic, a question.

    This component is used twice in this Tool. Target column and Objective are both Text inputs.

  3. Table: A component for entering structured data as input, for instance, rows of samples, each containing fields such as name, last name and age.

    This component is used twice in this Tool. Row range (from - to) as well as Examples (Text - Categories/topics) are both structured inputs.

  4. Numeric input: An input component suitable for providing numeric values, such as scores, age, maximum or minimum required values.

    This component is used twice in this Tool. Maximum word count per category and maximum number of suggested categories are both numeric inputs.

Tool steps

There are four components under Tool steps in this analysis flow. These components take care of three tasks: loading the specified subset of the file, properly formatting the provided samples, and the LLM step.

Loading the specified subset of the file

  1. Loading the file into a readable JSON format

A spreadsheet-to-JSON component receives a CSV file and extracts the data into JSON format, which can then be used for further processing.
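
To illustrate what such a conversion produces, here is a minimal sketch using Python's standard `csv` and `json` modules. It is not the component's actual implementation; the function name `csv_to_records` and the sample data are assumptions made for the example. Each CSV row becomes one JSON object keyed by the column names in the header row.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Hypothetical file content with a header row and two data rows.
sample_csv = 'id,review\n1,Arrived late\n2,"Great quality, fair price"\n'
records = csv_to_records(sample_csv)
print(json.dumps(records, indent=2))
```

Note that every value is read as a string, and that a quoted field may contain commas without being split into extra columns.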

  2. Selecting the specified subset of the data

A Python code component is available for running Python code when necessary.

In this case, the Python code filters out any rows that are not in the specified range.
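
A filter of this kind could look like the sketch below. It is an illustration only, not the code shipped with the template: the function name `select_rows` is hypothetical, and it assumes that row numbers are 1-based and that both ends of the range are included.

```python
def select_rows(records, start, end):
    """Return the rows numbered start..end (assumed 1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Slicing tolerates an end value beyond the last row.
    return records[start - 1:end]


# Hypothetical data: ten rows with a single text column.
records = [{"review": f"text {i}"} for i in range(1, 11)]
subset = select_rows(records, 3, 5)
print([row["review"] for row in subset])
```

Running the Tool several times with different `start` and `end` values is what gives you suggestions from different parts of the file.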

Properly formatting the provided samples

A Python code component is available for running Python code when necessary.

In this case, the Python code converts the entered samples into a format that is suitable for, and compatible with, the prompt.
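
As a rough sketch of this formatting step, the snippet below turns example rows into a text block that can be placed in a prompt. The field names `text` and `categories`, the function name `format_examples`, and the output layout are all assumptions made for illustration; the template's own code may format the samples differently.

```python
def format_examples(examples):
    """Render example rows as 'Text / Categories' blocks for a prompt."""
    blocks = []
    for example in examples:
        text = example["text"].strip()
        # Categories are assumed to be comma-separated in the table input.
        categories = [c.strip() for c in example["categories"].split(",") if c.strip()]
        joined = ", ".join(categories)
        blocks.append(f"Text: {text}\nCategories: {joined}")
    return "\n\n".join(blocks)


# Hypothetical rows, as they might be entered in the Examples table.
examples = [
    {"text": "The parcel came a week late.", "categories": "Delivery Delay"},
    {"text": "Broken on arrival, I want my money back.",
     "categories": "Product Quality,Refund Request"},
]
formatted = format_examples(examples)
print(formatted)
```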

Large Language Model (LLM)

A large language model component is set up to give you access to GPT (and many other LLMs). In the prompt section, you provide the required information as well as instructions on what you expect the LLM to do.

A Good Prompt

  1. Be short and precise in your instructions to the LLM
  2. Explicitly state constraints and goals
  3. Include formatting instructions when necessary
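
The three guidelines can be seen together in the sketch below, which assembles a prompt from the Tool's inputs. This is not the prompt used by the template: the wording and the function name `build_prompt` are hypothetical, and only the default values (General, 3 and 10) are taken from the input descriptions above.

```python
def build_prompt(texts, objective="General", max_words=3, max_categories=10, examples=""):
    """Assemble a short prompt with explicit constraints and output format."""
    parts = [
        # 1. A short, precise instruction.
        f"Suggest at most {max_categories} categories for the texts below.",
        # 2. Explicit goals and constraints.
        f"Objective: {objective}.",
        f"Each category must be {max_words} words or fewer.",
        # 3. Formatting instructions.
        "Return one category per line, with no numbering or extra commentary.",
    ]
    if examples:
        parts.append("Examples:\n" + examples)
    parts.append("Texts:\n" + "\n".join(f"- {t}" for t in texts))
    return "\n\n".join(parts)


prompt = build_prompt(["Arrived late", "Great quality"])
print(prompt)
```

Keeping each constraint on its own line makes it easier to check which inputs ended up in the prompt, and to adjust one constraint without rewriting the rest.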