Extract categories in data
One of the frequently used templates at Relevance is “Extract categories in data”. This Tool analyses a subset of a CSV file and suggests categories (i.e. themes/topics) that are present in the data. These categories can later be used for text categorization.
How to use the Tool
Locate the Tool in the template page and click on Use template.
You can use the Tool as is or clone it.
Tool inputs and output
The Tool requires three main and four optional inputs:
- A CSV file (CSV)
- The name of the target column (Target column: Column containing the text for categorization)
- Row numbers of the target subset (Rows to look at)
Provide the main inputs and hit Run once. In a few seconds you will see the LLM response, similar to what is shown in the image below. You can use the copy button or select the text to copy it for the next step.
- Use the “Extract categories in data” Tool on different subsets of your file (i.e. multiple runs for different ranges of rows). Copy the suggestions to a text file and finalize the list of category suggestions by applying your domain knowledge and considering the requirements of the next step.
- LLMs have limited capacity for receiving input data. Instead of using your whole file (i.e. first row to the last row) use subsets of the file to analyze the data for category suggestions.
- Maximum preferred word count per categories/themes/topics: An optional input, set to 3 by default, indicating how wordy suggested categories could be. Note that it is about the number of words in each suggested category/topic and not the total number of suggestions.
- Maximum number of categories/themes/topic to extract: An optional input, set to 10 by default, indicating the expected number of category/topic suggestions.
- Objective: An optional input, set to General by default, indicating the objective for extracting categories, in other words the lens through which the data is analyzed.
- Example (Example(s) of category/theme/topic extraction done by you): LLMs generally work better when they see samples. Provide sample(s) of your text data and the categories you would annotate for the samples.
- Using a comma (,), you can provide multiple categories per sample
- Keep the writing style uniform (e.g. Capitalize Each Word)
The output is a list of suggested categories.
We highly recommend:
- running this Tool on different subsets of your file
- checking the suggestions received from multiple runs all together
- finalizing the category list using your domain knowledge and the goals for text categorization
- using the finalized list in the Text categorizer/Classifier Tool.
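Collecting the suggestions from several runs into one list can be sketched in a few lines of Python. This is only an illustration of the manual copy-and-merge step described above, not part of the Tool: the function name `merge_suggestions` and the sample categories are hypothetical, and the final review with your domain knowledge still has to be done by hand.

```python
def merge_suggestions(runs):
    """Combine category lists from several runs, dropping duplicates.

    Duplicates are detected case-insensitively and with whitespace
    collapsed; the first spelling seen is kept, in Capitalized Each Word style.
    """
    seen = {}
    for suggestions in runs:
        for category in suggestions:
            cleaned = " ".join(category.split())  # collapse repeated spaces
            key = cleaned.lower()
            if key and key not in seen:
                seen[key] = cleaned.title()
    return list(seen.values())


# Hypothetical suggestions copied from two runs on different row ranges
run_1 = ["Delivery time", "Product Quality"]
run_2 = ["product quality", "Customer Support", "delivery  time"]

merged = merge_suggestions([run_1, run_2])
print(merged)  # ['Delivery Time', 'Product Quality', 'Customer Support']
```

Normalizing the writing style here also keeps the list uniform before it is passed to the next Tool.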
Tool components
If you clone a template, or make a Tool from scratch, you will have access to the Build tab. Build is where you put together different components to build a Tool suited to your needs.
User inputs
- File to URL: An easy-to-use, one-step component that takes care of everything you need when uploading a file for further analysis.
- Text input: A text input component suitable for short pieces of text, such as a name, a topic or a question.
This component is used twice in this Tool. The target column and the objective are both Text inputs.
- Table: A component for entering structured data as input, for instance rows of samples, each containing fields such as name, last name and age.
This component is used twice in this Tool. The row range (from - to) as well as the examples (Text - Categories/topics) are both structured input data.
- Numeric input: An input component suitable for providing numeric values, such as scores, age, or maximum or minimum required values.
This component is used twice in this Tool. Both the maximum word count per category and the maximum number of suggested categories are Numeric inputs.
Tool steps
There are 4 components under the Tool steps in this analysis flow. These components take care of three tasks: loading the specified subset of the file, properly formatting the provided samples, and the LLM step.
Loading the specified subset of the file
- Loading the file into a readable JSON format
A spreadsheet to JSON component is available, which receives a CSV file and extracts the data in JSON format so that it can later be used for further processing.
- Selecting the specified subset of the data
A Python code component is available to run Python code when necessary.
In this case, the Python code filters out any rows that are not in the specified range.
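The code inside the Tool is not reproduced on this page. As a rough sketch of what these two steps amount to, the following uses only the Python standard library to turn CSV text into JSON-like records and keep a row range. The function names, the sample data and the choice of 1-based, inclusive row numbers are all assumptions made for illustration.

```python
import csv
import io


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, one per data row (header excluded)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def select_rows(records, start, end):
    """Keep rows whose position is within [start, end].

    Assumption: rows are counted from 1 and both ends are inclusive.
    """
    return [row for i, row in enumerate(records, start=1) if start <= i <= end]


# Hypothetical file content with a target column called "review"
sample_csv = (
    "id,review\n"
    "1,Great battery\n"
    "2,Slow delivery\n"
    "3,Friendly support\n"
    "4,Too expensive\n"
)

records = csv_to_records(sample_csv)
subset = select_rows(records, 2, 3)
texts = [row["review"] for row in subset]  # text from the target column only
print(texts)  # ['Slow delivery', 'Friendly support']
```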
Properly formatting the provided samples
A Python code component is available to run Python code when necessary.
In this case, the Python code puts the entered samples into a format that is suitable for and compatible with the prompt.
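The exact format the Tool produces is not shown here. One plausible way to format the samples, assuming each table row arrives as a dict with hypothetical `text` and `categories` fields and that multiple categories are separated by commas, is:

```python
def format_examples(examples):
    """Turn example rows into a plain-text block that can be pasted into a prompt."""
    blocks = []
    for ex in examples:
        # Split comma-separated categories, trim whitespace, drop empty entries
        cats = [c.strip() for c in ex["categories"].split(",") if c.strip()]
        blocks.append(
            f'Text: {ex["text"].strip()}\nCategories: {", ".join(cats)}'
        )
    return "\n\n".join(blocks)


# Hypothetical samples, as they might be entered in the Example table
examples = [
    {"text": "The app crashes on login", "categories": "App Stability,Login"},
    {"text": "Delivery took three weeks ", "categories": "Delivery Time"},
]

formatted = format_examples(examples)
print(formatted)
```

Keeping every sample in the same "Text / Categories" layout gives the LLM a consistent pattern to follow.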
Large Language Model (LLM)
A large language model component is set up to give you access to GPT (and many other LLMs). In the prompt section, you provide the required information as well as instructions on what is expected to be done.
- Be short and precise with your instruction/request to the LLM
- Explicitly note constraints and goals
- Include formatting instructions when necessary
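As an illustration of these three tips, the sketch below assembles a prompt from the Tool's inputs. The wording of the prompt and the function `build_prompt` are assumptions, not the template's actual prompt; the default values (General, 3 words, 10 categories) are the defaults listed in the inputs section.

```python
def build_prompt(texts, objective="General", max_words=3,
                 max_categories=10, examples=""):
    """Assemble a short prompt with explicit constraints and output format."""
    lines = [
        # Short and precise request
        "Suggest categories (themes/topics) that describe the texts below.",
        # Explicit goal and constraints
        f"Objective: {objective}",
        f"Constraints: at most {max_categories} categories, "
        f"each at most {max_words} words.",
        # Formatting instruction
        "Format: return one category per line, with no numbering or commentary.",
    ]
    if examples:
        lines += ["", "Examples:", examples]
    lines += ["", "Texts:"] + [f"- {t}" for t in texts]
    return "\n".join(lines)


prompt = build_prompt(
    ["Slow delivery", "Friendly support"],
    objective="Customer complaints",
)
print(prompt)
```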