Automatic label training
Automatic label training enables you to train automatic classification models for resources and paragraphs. You can train these models and run inference with them on your own data using either our web application or our REST API. Below, you will find instructions for both options.
The images used in this guide are meant to provide a visual reference for the user. However, they may become outdated due to changes in the user interface.
Upload the information
A Knowledge Box (KB) provides everything we need to start our training.
First, we must navigate to the Resource List tab. In the left-hand sidebar of our KB, we select the third option, Resource List, which takes us to the resource listing page.
Our resource list is currently empty. Before we can run any training, we must upload resources and label them with one or more Label sets.
Our next step involves uploading resources from a dataset of movie summaries. To initiate this process we must click on the pink button labeled Upload. This option offers the flexibility to upload various resources in any data format. For our purposes, we will select the first option, Upload files, since we will mainly upload text files in txt, pdf and docx format. In other training examples we will explore more types of data.
If we click on the first option, we will see an option called Upload your files where we can leave our files for preprocessing to begin. Nuclia performs many background tasks while the files are being uploaded, such as vectorizing, summarizing information, extracting entities and relationships, and automatically classifying. Since this is a new KB we do not have any pre-trained classification models, so we will not see any automatic classification. The purpose of this guide is to train a paragraph classifier and a resource classifier to perform this functionality.
We are going to use summaries of movies, specifically from the genres of action, horror, and romance. Uploading these resources is as simple as dragging and dropping them within the Upload your files option and clicking on Add. This option allows us to upload multiple files simultaneously. It is important to mention that we could label the files by groups using this option, but we will do this on the resource view to also explain how to create a Label set.
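Before dragging the files into the upload dialog, it can help to check locally which files belong to each genre. The following is a minimal sketch, assuming a hypothetical local layout with one folder per genre (for example `movies/action/`, `movies/horror/`, `movies/romance/`). The folder names and the `group_files_by_genre` helper are illustrative and are not part of Nuclia.

```python
from collections import defaultdict
from pathlib import Path

# File extensions mentioned in this guide for the "Upload files" option.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx"}


def group_files_by_genre(root):
    """Return {genre: [paths]} for supported files found in root/<genre>/.

    Assumes a hypothetical layout with one sub-folder per genre.
    Files with other extensions are ignored.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).glob("*/*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            groups[path.parent.name].append(path)
    return dict(groups)


if __name__ == "__main__":
    # "movies" is a placeholder folder name; adjust it to your own dataset.
    for genre, files in group_files_by_genre("movies").items():
        print(genre, len(files))
```

Grouping the files this way also makes it easier to assign the right label to each document later in the guide.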
Once we have clicked on Add, we will see how all of our files are being loaded.
If we now go back to the Resource list, we will find that a clock icon has appeared in the upper right corner, indicating how many of our files are being processed.
Clicking on this icon will take us to a view where we can see the resources that are about to be processed, and how they will be processed.
Create a Label set
While our files are being processed, we will create a Label set, which is a set of labels that we will use to label our resources. To create a Label set, we need to go to the left sidebar and click on the Classification tab.
At the moment this page is empty, but if we click on the Add new option we can start building a Label set.
At this stage we can configure the Label set based on our specific needs by adjusting the following options:
Labelset name: The name we want to use for our Label set.
Color: The color that will represent our Label set.
Classification type: Labels can be associated with resources or paragraphs. By labeling a resource we take into account all of its content. However, we can also label only specific paragraphs within a file.
Exclusive label: Our classifiers are multilabel by default. This means a resource or paragraph can have multiple labels associated with it from a Label set. Checking this option restricts each resource or paragraph to a single label from the Label set.
Labels: A list of labels within our Label set.
We will name our Label set "resources_movies", represent it with the color blue, and set its classification type to resource. Our three labels are horror, action, and romance, and we want each resource to belong to only one of these categories, so we will also check the Add only one label by label set option to make the labels exclusive. Finally, we will define the three labels mentioned above.
We will also create a Label set called "paragraphs_movies" and configure it in the same way as the "resources_movies" Label set, but this time we will label individual paragraphs. This will allow us to demonstrate how our paragraph classifier works.
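The two Label sets configured above can be summarized as plain data. The sketch below is only a local illustration of the options in the form: the `LabelSet` class and its field names are hypothetical and do not reflect the schema used by the Nuclia API.

```python
from dataclasses import dataclass, field


@dataclass
class LabelSet:
    """Hypothetical local model of the Label set form (not the Nuclia API schema)."""
    name: str
    color: str
    kind: str          # "resource" or "paragraph"
    exclusive: bool    # True when only one label per Label set is allowed
    labels: list = field(default_factory=list)

    def validate(self, assigned):
        """Return True if the assigned labels respect this Label set's rules."""
        unknown = set(assigned) - set(self.labels)
        if unknown:
            raise ValueError(f"Unknown labels for {self.name}: {sorted(unknown)}")
        if self.exclusive and len(set(assigned)) > 1:
            raise ValueError(f"{self.name} allows only one label per item")
        return True


GENRES = ["horror", "action", "romance"]

# The two Label sets created in this guide, with the same options.
resources_movies = LabelSet("resources_movies", "blue", "resource", True, list(GENRES))
paragraphs_movies = LabelSet("paragraphs_movies", "blue", "paragraph", True, list(GENRES))

if __name__ == "__main__":
    print(resources_movies.validate(["horror"]))
```

Because both Label sets are exclusive, `validate(["horror", "action"])` raises a `ValueError`, which mirrors the effect of the Add only one label by label set option.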
Label your data
We can now return to our Resource list, where the documents we uploaded will have finished processing. As we can see, they all appear in the list. While we could browse through them and explore different Nuclia features, let's focus on the classification task for now.
Remember that we need to create two types of annotations: for paragraphs and for resources. Let's begin with the latter. If we click on our resources, we will see an option called Add Labels at the top of our list. By clicking on it, we can see all the labels we have added for the resource type.
Now we can simply select each document and the label to which it belongs and all our documents will be labeled.
We will now tag three examples for each category. Our classifiers are designed to work with very few examples of each label.
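To keep track of how many examples each category has, a small local tally is enough. This is a minimal sketch, assuming we keep our own hypothetical mapping from document name to label. The `labels_below_minimum` helper is illustrative, and the default of three simply matches the number of examples used in this guide, not a documented Nuclia requirement.

```python
from collections import Counter


def labels_below_minimum(annotations, labels, minimum=3):
    """Return {label: count} for every label with fewer than `minimum` examples.

    `annotations` maps a document name to the single label assigned to it.
    """
    counts = Counter(annotations.values())
    return {label: counts[label] for label in labels if counts[label] < minimum}


if __name__ == "__main__":
    # Hypothetical document names, used only for illustration.
    annotations = {
        "action_1.txt": "action",
        "action_2.txt": "action",
        "action_3.txt": "action",
        "horror_1.txt": "horror",
        "horror_2.txt": "horror",
    }
    print(labels_below_minimum(annotations, ["horror", "action", "romance"]))
```

An empty result means every category has reached the chosen number of examples.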
Once we have labeled the documents we can label their paragraphs. To do this, we will go to the Actions column and click on the option represented by three vertical dots. Here there are a variety of options that will help us manage our resources. In this case we will select Classify, which will take us to the classification view.
In the classification view we can easily label the content of our documents. Note that it is also possible to label our resources using this view. To do this we must click on the Resource option next to the left sidebar.
Since we have already labeled our resources we can now proceed to label some paragraphs. To do this, we need to go back to the previous view and select the option Select the Labels from the dropdown menu on the left. Here we will see all our paragraph Label sets. In our case, they will be identical to the resource Label sets. The process is similar to what we have done before, where we select the paragraphs and the labels we want to associate with them.
We can label paragraphs from each category in this manner.
Train your own model
Once we have all our labels we can begin training. To do this, we will navigate to the left sidebar and click on the Training option.
Here we have various options based on the type of training we wish to utilize. In our case, we will select the options for Automatic Resource Label Training and Automatic Paragraph Label Training. If you would like to explore other options, you can refer to the other tutorials in this section. These tutorials teach you how to train label suggesters and entity extractors.
If we click on the Choose One Label Set option, we will be presented with various Label sets to choose from. Note that each training supports only a single Label set, although that Label set may be of the multi-label type. We will select the two Label sets we have created and begin training for each type.
In addition to the web application, Nuclia lets users perform various tasks, including training and prediction, through our API. To use this functionality, refer to the official documentation available under API References.
Once each of our trainings has finished we will be able to see information about it, such as the execution date and time.
Predict examples
At this point, our KB has automatic classification models that can label any resources we upload using the labels we have used during training, as long as they can be appropriately categorized. To test the functionality of our models, we will upload three different movie summaries and observe how they are classified by the system.
Once our files have been processed we can see how they have been automatically labeled. If we go into them we can see that their paragraphs are labeled as well.
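If we already know the genre of each test summary, we can compare it with the label shown in the resource list. The sketch below assumes we have copied the predicted labels by hand into a dictionary. The file names, the example labels, and the `classification_accuracy` helper are hypothetical and are not results produced by Nuclia.

```python
def classification_accuracy(expected, predicted):
    """Return the fraction of resources whose predicted label matches the expected one.

    Both arguments map a resource name to a label. Resources missing from
    `predicted` count as incorrect.
    """
    if not expected:
        raise ValueError("expected must contain at least one resource")
    correct = sum(1 for name, label in expected.items() if predicted.get(name) == label)
    return correct / len(expected)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    expected = {"test_1.txt": "action", "test_2.txt": "horror", "test_3.txt": "romance"}
    predicted = {"test_1.txt": "action", "test_2.txt": "horror", "test_3.txt": "romance"}
    print(classification_accuracy(expected, predicted))
```

With only three test summaries this is a quick sanity check rather than a meaningful evaluation; a larger labeled set would be needed to estimate how well the classifiers generalize.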
In this way, we have trained two classification models easily and quickly. Nuclia enables users to build all kinds of machine learning models with very little data needed.