NucliaDB SDK
Nucliadb-sdk is an open source Python library designed to make it easy to work with your NucliaDB. It lets you connect to both a local NucliaDB and Nuclia.cloud.
With it you can:
- Upload text, files, vectors, labels and annotations to your NucliaDB.
- Access and modify your resources.
- Annotate your resources.
- Perform text searches.
- Perform semantic searches.
- Filter your data by label.
Installation
If you do not have NucliaDB installed, you can either:
- Run the NucliaDB Docker image:
docker run -it \
-e LOG=INFO \
-p 8080:8080 \
-p 8060:8060 \
-p 8040:8040 \
-v nucliadb-standalone:/data \
nuclia/nucliadb:latest
- Install with pip and run:
pip install nucliadb
nucliadb
To get started with Nuclia, follow this link.
Once you are set up, you can install the library via pip:
pip install nucliadb-sdk
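As a quick sanity check (assuming a local NucliaDB listening on the default http://localhost:8080), you can create a Knowledge Box with get_or_create, a helper covered later in this document:
from nucliadb_sdk import get_or_create

# Creates the KB "sdk-smoke-test" on the local NucliaDB if it does not exist yet
my_kb = get_or_create("sdk-smoke-test")
print(my_kb)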
Basic concepts
Before you look through our docs, here are some useful concepts:
KnowledgeBox
: Our concept of a data container, usually referred to as a KB.
Vectorset
: A set of vectors associated with your text/file data. You can define as many vectorsets as you want for each KB.
Labelset
: A group of labels associated with your text/file data. You can define as many labelsets as you want for each KB.
Search
: You can perform a search over text fields, but also over any defined vectorset. The text search looks for exact matches, while the vector search returns the results with the highest cosine similarity.
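To make these concepts concrete, here is a minimal sketch using the upload and search methods detailed later in this document; the KB slug, labelset, vectorset name and vector values are purely illustrative:
from nucliadb_sdk import get_or_create

my_kb = get_or_create("concepts-demo")      # KnowledgeBox: the data container
my_kb.upload(
    text="hello world",
    labels=["topic/greetings"],             # labelset "topic" with label "greetings"
    vectors={"my-model": [0.1, 0.2, 0.3]},  # vectorset "my-model"
)
results = my_kb.search(text="hello")        # text search; semantic search is also available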
Main usage
Create a local NucliaDB Knowledge Box
create_knowledge_box()
To create a new Knowledge Box you can use the create_knowledge_box method from utils. It creates a new KB and returns a KnowledgeBox object.
If you provide a slug and a KB with that name already exists in your NucliaDB, it will return an error.
Parameters:
slug
: Name of the KB. If none is provided, a new KB is created with a random slug.
nucliadb_base_url
: URL and port where your NucliaDB is located. If none is provided, it assumes you have a local NucliaDB with the default setup, http://localhost:8080.
Output: nucliadb_sdk.knowledgebox.KnowledgeBox
Example:
from nucliadb_sdk import create_knowledge_box
my_kb = create_knowledge_box("my_new_kb")
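If your NucliaDB is not running on the default address, you can pass nucliadb_base_url explicitly (the URL below is only an example):
from nucliadb_sdk import create_knowledge_box

my_kb = create_knowledge_box(
    "my_new_kb",
    nucliadb_base_url="http://localhost:8080",  # point this at your NucliaDB instance
)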
get_or_create(slug)
You can also use the get_or_create method from utils. It returns an existing KB if there is already one with the provided name in your DB, and creates a new one if there isn't.
Parameters:
slug
: Name of the KB.
nucliadb_base_url
: URL and port where your NucliaDB is located. If none is provided, it assumes you have a local NucliaDB with the default setup, http://localhost:8080.
Output: nucliadb_sdk.knowledgebox.KnowledgeBox
Example:
from nucliadb_sdk import get_or_create
my_kb = get_or_create("my_new_kb")
get_kb(slug)
If you only want to retrieve a KB that has already been created, you can use get_kb from utils. It returns the existing KB if there is one with the provided name in your DB, and None if there isn't.
Parameters:
slug
: Name of the KB.
nucliadb_base_url
: URL and port where your NucliaDB is located. If none is provided, it assumes you have a local NucliaDB with the default setup, http://localhost:8080.
Output: nucliadb_sdk.knowledgebox.KnowledgeBox
Example:
from nucliadb_sdk import get_kb
my_kb = get_kb("my_new_kb")
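Since get_kb returns None when no KB with that slug exists, a common pattern is to check for it explicitly:
from nucliadb_sdk import get_kb

my_kb = get_kb("my_new_kb")
if my_kb is None:
    # The KB does not exist yet; create it or handle the error here
    raise ValueError("Knowledge Box 'my_new_kb' was not found")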
Delete local NucliaDB Knowledge Box
delete_kb(slug)
Deletes a local NucliaDB KB.
Parameters:
slug
: Name of the KB you want to delete.
Example:
from nucliadb_sdk import delete_kb
delete_kb(my_kb_name)
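Combining get_kb and delete_kb, you can delete a KB only if it actually exists; a small sketch:
from nucliadb_sdk import delete_kb, get_kb

if get_kb("my_new_kb") is not None:
    delete_kb("my_new_kb")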
Access a Nuclia.cloud Knowledge Box
The previous sections cover working with a local NucliaDB. To work with a KB from Nuclia.cloud, you first need to create a NucliaDBClient.
NucliaDBClient
Helps you to connect to Nuclia. It takes the following parameters:
url
: URL of the target KB, usually in the format "https://europe-1.nuclia.cloud/api/v1/kb/{kb_id}".
api_key
: Key to access the KB through the API.
writer/reader/search/train_host
: (optional) URLs and ports of those endpoints.
Once you have it, you can instantiate a KnowledgeBox object with the client as a parameter.
Example:
from nucliadb_sdk.client import Environment, NucliaDBClient
from nucliadb_sdk import KnowledgeBox
url_kb = "https://europe-1.nuclia.cloud/api/v1/kb/{my_kb_id}"  # replace {my_kb_id} with your KB id
nucliadb_client = NucliaDBClient(
    api_key=my_api_key,  # your Nuclia API key
    url=url_kb,
)
my_kb = KnowledgeBox(nucliadb_client)
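Once created, this cloud-backed KnowledgeBox object is used in the same way as a local one throughout the rest of this document; for instance, assuming the KB already contains some data:
# Works exactly like a local KnowledgeBox
results = my_kb.search(text="dog")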
Upload data
Once you have a Knowledge Box object you can upload data to your KB. Your uploads can be text, binary files, vectors or a combination of all of these, and they can have associated labels and entities.
Parameters:
key
: (optional) Upload id; it allows you to modify/update an existing resource.
binary
: (optional) Binary file to upload.
text
: (optional) Text to upload.
labels
: (optional) List of labels for this resource. They can be defined as a list of strings or of Label objects.
entities
: (optional) Needs to be defined as a list of Entity objects.
vectors
: (optional) Can be defined as a list of vectors or as a dictionary with the name(s) of the vectorset(s) as keys and the corresponding vectors (arrays, numpy arrays or tensors) as values.
Output: resource_id, a unique string that identifies the uploaded resource.
Note: Once uploaded, you can use both resource_id and key to access the uploaded resource.
Example:
from nucliadb_sdk import KnowledgeBox, Entity, File
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
resource_id = my_kb.upload(
    key="mykey1",
    binary=File(data=b"asd", filename="data"),
    text="I'm Sierra, a very happy dog",
    labels=["emotion/positive"],
    entities=[Entity(type="NAME", value="Sierra", positions=[(4, 9)])],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["I'm Sierra, a very happy dog"])[0]},
)
my_kb[resource_id] == my_kb["mykey1"]
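Because key identifies the resource, calling upload again with the same key updates it rather than creating a new one. A minimal sketch, reusing the encoder and key from the example above (the new text is illustrative):
# Re-uploading with an existing key modifies that resource in place
my_kb.upload(
    key="mykey1",
    text="I'm Sierra, a very happy and energetic dog",
    vectors={"all-MiniLM-L6-v2": encoder.encode(["I'm Sierra, a very happy and energetic dog"])[0]},
)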
Labelset and Label
A labelset is how you define a group of labels that make sense together. You can define labels in two ways:
Create a labelset manually on a KB with the set_labels function.
Parameters:
labelset
: String with the name of the labelset.
labels
: List of strings with the names of the labels.
labelset_type
: The granularity of the resource that the labels will refer to (sentences, paragraphs or resources). At the moment only LabelType.RESOURCES is supported.
Example:
from nucliadb_sdk import get_or_create
from nucliadb_sdk.labels import LabelType  # import path may vary between SDK versions

my_kb = get_or_create("my_new_kb")
my_kb.set_labels("emotion", ["positive", "negative"], LabelType.RESOURCES)
You can also upload labels with the upload function, defining them as a string made up of the labelset and label names: labelset_name/label_name.
For example, if you upload this resource:
from nucliadb_sdk import KnowledgeBox
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
resource_id = my_kb.upload(
    text="She's having a terrible day",
    labels=["emotion/negative"],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["She's having a terrible day"])[0]},
)
Your KB will automatically add the labelset emotion if it is not already created.
You can also define labels with the Label class:
from nucliadb_sdk import Label
Label(labelset="emotion", label="neutral")
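Label objects can be passed anywhere string labels are accepted, for instance in upload; a short sketch assuming the my_kb Knowledge Box from the earlier examples:
from nucliadb_sdk import Label

my_kb.upload(
    text="Nothing special happened today",
    labels=[Label(labelset="emotion", label="neutral")],
)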
To list all of the labelsets and labels associated with uploaded resources, you can use the get_uploaded_labels() function on your Knowledge Box.
It will return a dictionary with the labelset names as keys and labelset structures as values. These structures contain:
count
: Total number of labeled resources.
labels
: Dictionary with all label names as keys and their number of occurrences as values.
Example:
from nucliadb_sdk import KnowledgeBox
my_kb.get_uploaded_labels()
Output:
{'emotion': LabelSet(count=10, labels={'positive': 3, 'neutral': 2, 'negative': 5})}
Vectorset and Vector
Vectorsets and vectors work in a very similar way to labels and labelsets. A vectorset is a set of vectors that you can define for your KB. They usually represent encodings of your text/file with a common model. You can define as many vectorsets per KB as you want.
You can do it manually with the new_vectorset function of a KB object.
Parameters:
key
: String with the name of the vectorset.
dimensions
: Dimensions of your vectors.
similarity
: Strategy to calculate the distance between vectors. By default it is cosine (cosine similarity), but it can also be dot (dot product).
Example:
from nucliadb_sdk import KnowledgeBox
my_kb.new_vectorset("roberta", 1024)
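The similarity strategy can also be set when creating a vectorset. The sketch below assumes similarity accepts the string "dot"; the exact accepted values may differ between SDK versions:
# Hypothetical vectorset using dot product instead of the default cosine similarity
my_kb.new_vectorset("my-dot-vectors", 768, similarity="dot")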
Even if you do not create a vectorset manually, it will be created automatically when you upload resources with vectors.
Example:
resource_id = my_kb.upload(
text="She's having a terrible day",
labels=["emotion/negative"],
vectors={"all-MiniLM-L6-v2": encoder.encode(["She's having a terrible day"])[0]},
)
To see all the vectorsets of a KB you will use list_vectorset.
Example:
from nucliadb_sdk import KnowledgeBox
my_kb.list_vectorset()
It will return a dictionary with all the defined vectorsets and their dimensions:
{'roberta': VectorSet(dimension=1024), 'all-MiniLM-L6-v2': VectorSet(dimension=384)}
To delete a vectorset you will use del_vectorset.
If you run this:
my_kb.del_vectorset("roberta")
And list your vectorsets again:
my_kb.list_vectorset()
The result will be this:
VectorSets(vectorsets={'all-MiniLM-L6-v2': VectorSet(dimension=384)})
Search
The search function will return resources from your KB that match one or more conditions:
- Filter by label: Returns resources where the labels match those provided.
- Full text/keyword search: Given a word or set of words it will return resources that have those words in their text field.
- Semantic search: Given some vectors and a vectorset it will return results where the vectors are similar to those provided, sorted by their cosine similarity.
Each matched result has the following fields:
key
: Id of the matched resource.
text
: Text of the matched result.
labels
: Labels associated with the result.
score
: Score of the result (0 for searches by label).
score_type
: BM25 for keyword search, cosine similarity for semantic search.
Now we will detail each kind of search.
Search by label
To filter by label you need to use the filter parameter. filter takes in an array of label objects or strings.
The simplest way is to filter with a string that is a combination of the labelset and label names, labelset_name/label_name.
results = my_kb.search(
    filter=["emotion/positive"]
)
If you want to iterate over the results, you can use:
for result in results:
    print(f"Resource key: {result.key}")
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
And it will output something like this:
Resource key: f1de1c1e3fac43aaa53dcdc54ffd07fc
Text: I'm Sierra, a very happy dog
Labels: ['positive']
Resource key: b445359d434b47dfb6a37ca45c14c2b3
Text: what a delightful day
Labels: ['positive']
Full text search
For this search you will use the text parameter, which is a string.
You can use it to look for one word, like this:
from nucliadb_sdk import KnowledgeBox
results = my_kb.search(
    text="dog"
)
If you want to iterate over the results, you can just use:
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
And the results will look like this:
Text: Dog in catalan is gos
Labels: ['neutral']
Score: 0.8871671557426453
Key: 4f1f570398c543e0b8c3b86e87ee2fbd
Score Type: BM25

Text: I'm Sierra, a very happy dog
Labels: ['positive']
Score: 0.7739118337631226
Key: 665e85f0fb2e4b2fbde8b4957b7462c1
Score Type: BM25
You can also look for multiple words:
results = my_kb.search(
    text="he is"
)
And the results will look like this:
Text: he is heartbroken
Labels: ['negative']
Score: 2.501499891281128
Key: d22d0d8acba040a2afd7a26ea0517769
Score Type: BM25

Text: He said that the race is quite tough
Labels: ['neutral']
Score: 1.7510499954223633
Key: 808c1557027e4109b4be8cbe995be8b1
Score Type: BM25
Or combine it with the label filter:
results = my_kb.search(
    filter=["emotion/neutral"],
    text="dog",
)
In this case you will only have one result:
Text: Dog in catalan is gos
Labels: ['neutral']
Score: 0.8871671557426453
Key: 4f1f570398c543e0b8c3b86e87ee2fbd
Score Type: BM25
Semantic search
For the vector search you need to input the vectors with the vector parameter and indicate the vectorset you want to search with the vectorset parameter.
You can define a threshold for the minimum cosine similarity you want in your results with min_score.
The first step will usually be to encode your query with the same model you used for that vectorset in your resources:
from nucliadb_sdk import KnowledgeBox
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
query_vectors = encoder.encode(["To be in love"])[0]
Then you pass them to the search. Note that the vector parameter takes arrays, numpy arrays or tensors, so there is no need for any conversion.
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.25)
You can iterate over the results just like in the previous searches:
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")
As you can see, your results are ordered by score:
Text: love is tough
Labels: ['negative']
Score: 0.4688602387905121
Key: a027ee34f3a7489d9a264b9f3d08d3a5
Score Type: COSINE
------
Text: he is heartbroken
Labels: ['negative']
Score: 0.27540814876556396
Key: 25bc7b22b4fb4f64848a1b7394fb69b1
Score Type: COSINE
Note: You can combine semantic or keyword search with a label filter, and the results will be filtered by both constraints. If you use both text and vector search, the results will match one constraint OR the other.
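For example, a single query that combines a label filter with both keyword and semantic constraints might look like this (a sketch reusing the encoder and vectorset from earlier; the query text is illustrative):
query_vectors = encoder.encode(["a joyful pet"])[0]

results = my_kb.search(
    text="dog",                      # keyword constraint
    vector=query_vectors,            # semantic constraint (OR-ed with the keyword matches)
    vectorset="all-MiniLM-L6-v2",
    filter=["emotion/positive"],     # label filter applied to all results
    min_score=0.25,
)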