Welcome to the documentation for the “Extract Categories in Data” Tool! This Tool is designed to analyze text data and
suggest a list of main categories (i.e. themes and topics) seen in the data. Whether you are a data analyst, marketer,
or researcher, this Tool will assist you in organizing and understanding your data more effectively.
With its advanced AI capabilities and user-friendly interface, this Tool is a game-changer in fast data understanding.
The “Extract Categories in Data” Tool leverages cutting-edge large langue models to analyze text data and extract meaningful categories.
It goes beyond simple keyword matching and provides a comprehensive analysis of the data to identify the main themes and topics.
By extracting categories, this Tool enables you to gain a deeper understanding of your data and uncover hidden patterns and trends.
With its powerful capabilities and intuitive design, it is the perfect solution for understanding unstructured data.
Accurate Categorization:
The “Extract Categories in Data” Tool utilizes advanced AI to accurately categorize text data. It analyzes the content and context
to identify the main categories and themes. This feature ensures that you receive reliable and accurate categorization results, allowing
you to make informed decisions based on the insights gained.
Customizable Word Count:
You have the flexibility to customize the word count for each category or theme. Whether you want a more detailed analysis or a broader overview,
you can specify the wordiness of the suggested categories.
Flexible Taxonomy Count:
The Tool allows you to specify the total number of categories or themes to be extracted from the data. Whether you need a few key categories or
a more comprehensive taxonomy, you can customize the count accordingly. This flexibility ensures that the extracted categories provide a meaningful
representation of your data.
Objective-Focused Analysis:
You can define the objective or focus for categorization, such as “Customer service” or “Product feedback.” By specifying the objective, the Tool
directs the LLM model to a certain perspective. This objective-focused analysis enables you to extract categories that are relevant and valuable
for your particular use case.
Locate the Tool in the template page and click on Use template.
You can use the Tool as is or
clone it.Follow these steps to extract categories from your text data:
Upload Data: Upload your CSV file containing the text you wish to understand (e.g. a CSV file of survey responses).
Specify Field Name: Enter the header of the target column, exactly as it appears in your CSV file. This field contains the text data
that you want to analyze.
Specify what subset to look at: You can specify a subset of the file to be looked at. This is to replicate the data understanding process.
No matter how large the number of survey responses are, the discussed topic are more or less the same over the file.
To have a uniform coverage of the topics, avoid using alphabetically sorted CSV files.
Use the “Extract categories in data” Tool on different subsets of your file (i.e. multiple runs
for different ranges of rows). Copy the suggestions to a text file and finalize the category
suggestion list by applying your domain knowledge considering the
next step requirements.
LLMs have limited capacity for receiving input data. Instead of using your whole file
(i.e. first row to the last row) use subsets of the file to analyze the data for category
suggestions.
Customize Word Count (Optional): If desired, you can customize the word count for each category or theme. This allows you to control the
wordiness of suggested categories. If left unspecified, the default word count will be used.
Specify Taxonomy Count (Optional): You can specify the total number of categories or themes to be extracted from the data. This customization
allows you to control the breadth of the extracted categories. If left unspecified, the default taxonomy count will be used.
Define Objective (Optional): If you have a specific objective or focus for categorization, you can define it here. This objective-focused
analysis ensures that the extracted categories align with your specific goals. If left unspecified, a general objective will be used.
Example (Optional): You can provide sample(s) of category extraction done by you. LLMs are proven to work better when they see samples.
Provide sample(s) of your text data and the categories you would annotate for the samples.
Using , you can provide multiple categories per sample
Keep the writing style uniform (e.g. Capital each word)
Run the Tool: Once you have uploaded the CSV file and filled the form, click the “Run Tool” button (on the App page) or use
the run options on your data table (bulk/single run) to initiate the categorization process. The Tool will analyze the data and
generate a list of main categories, themes, and topics based on the content and context of the text.
View Results: The Tool will provide you with the extracted categories in a clear and organized format. You can explore the main categories, themes,
and topics identified in your data. The results will help you gain a deeper understanding of your text data and uncover valuable insights.
We Highly recommend
running this Tool on different subsets of your file
checking the received suggestions from multiple runs all together
finalizing the category list using your domain knowledge and the goals for text categorization
If you clone
a template, or make a Tool from scratch, you will have access to the
Build tab. Build is where one put together different components to build a Tool suitable for
their needs.
File to URL: An easy-to-use, one step component,
which takes care of all you need when uploading a file for further analysis.
Text input:
An input text component suitable for short text pieces, such as name, topic, a question.This component is used twice in this Tool. Target column and the objective are both of Text
inputs.
Table:
A component for entering structured data as input,
for instance, rows of samples, each containing fields such as name, last name and age.This component is used twice in this Tool. Row range (from - to) and well as Examples
(Text - Categories/topics) are both samples of structured input data.
Numeric input:
An input component suitable for providing numeric values, such as scores,
age, maximum or minimum required values.This component is used twice in this Tool. Both maximum word count per category and maximum
number of suggested categories are of numeric inputs.
There are 4 components under the Tool steps in this analysis flow. These components take care of three
tasks: loading the specified subset of the file, properly formatting the provided samples,
and the LLM step.
A spreadsheet to JSON component
is available which receives a CSV file and extract the data under JSON format which can be later used for further
processing.
Selecting the specified subset of the data
A Python code component is available to Run
Python codes when necessary.In this case, the Python code, filters out any rows that is not in the specified range.
A Python code component is available to Run
Python codes when necessary.In this case, the Python code, forms the entered samples in the format that is suitable
abd compatible to the prompt.
A large language model component is all set up to provide
you access to GPT (and many other LLMs).
In the prompt section, you will provide the required information as well as instructions to what
is expected to be done.A Good Prompt
Be short and precise with your instruction/request from the LLM