
NucliaDB SDK

Nucliadb-sdk is an open-source Python library designed to make accessing your NucliaDB easy. It lets you work with both a local NucliaDB and Nuclia.cloud.

With it you can:

  • Upload text, files, vectors, labels and annotations to your NucliaDB
  • Access and modify your resources
  • Annotate your resources
  • Perform text searches
  • Perform semantic searches
  • Filter your data by label

Installation

In case you do not have NucliaDB installed, you can either:

  • Run NucliaDB docker image:
docker run -it \
-e LOG=INFO \
-p 8080:8080 \
-p 8060:8060 \
-p 8040:8040 \
-v nucliadb-standalone:/data \
nuclia/nucliadb:latest
  • Or install with pip and run:
pip install nucliadb
nucliadb

To get started with Nuclia.cloud, see the Nuclia documentation.

Once you are all set up, you can install the library via pip:

pip install nucliadb-sdk

Basic concepts

Before you dive into our docs, some useful concepts:

  • KnowledgeBox: our concept of a data container, usually referred to as KB.
  • Vectorset: a set of vectors associated with our text/file data. We can define as many vectorsets as we want for each KB.
  • Labelset: a group of labels associated with our text/file data. We can define as many labelsets as we want for each KB.
  • Search: we can search over our text fields, but also over any of our defined vectorsets. The text search looks for exact matches, while the vector search returns the results with the highest cosine similarity.
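To make the distinction concrete, here is a small illustrative computation in plain Python (not part of the SDK) showing how cosine similarity scores a pair of vectors:

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score close to 1,
# orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # close to 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

A keyword search would only match documents containing the exact query terms; semantic search ranks by this similarity score instead.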

Main usage

Create a local NucliaDB Knowledgebox

create_knowledge_box()

To create a new knowledgebox we can use the function create_knowledge_box from utils. It creates a new KB and returns a KnowledgeBox object. If we provide a slug and a KB with that name already exists in our NucliaDB, it will return an error.

Parameters:

  • slug: name of the KB; if none is provided, a new one is created with a random slug
  • nucliadb_base_url: URL and port where our NucliaDB is located; if none is provided, a local NucliaDB with the default setup (http://localhost:8080) is assumed

Output: nucliadb_sdk.knowledgebox.KnowledgeBox

Example:

from nucliadb_sdk import create_knowledge_box

my_kb = create_knowledge_box("my_new_kb")

get_or_create(slug)

We can also use the function get_or_create from utils. It returns the existing KB if there is already one with the provided name in our DB, and creates a new one if there isn't.

Parameters:

  • slug: name of the kb
  • nucliadb_base_url: URL and port where our NucliaDB is located; if none is provided, a local NucliaDB with the default setup (http://localhost:8080) is assumed

Output: nucliadb_sdk.knowledgebox.KnowledgeBox

Example:

from nucliadb_sdk import get_or_create

my_kb = get_or_create("my_new_kb")

get_kb(slug)

In case we only want to retrieve a KB if it has already been created, we can use get_kb from utils. It returns the existing KB if there is one with the provided name in our DB, and None if there isn't.

Parameters:

  • slug: name of the kb
  • nucliadb_base_url: URL and port where our NucliaDB is located; if none is provided, a local NucliaDB with the default setup (http://localhost:8080) is assumed

Output: nucliadb_sdk.knowledgebox.KnowledgeBox or None

Example:

from nucliadb_sdk import get_kb

my_kb = get_kb("my_new_kb")

Delete local NucliaDB Knowledgebox

delete_kb deletes a local NucliaDB KB.

Parameters:

  • slug: name of the kb we want to delete

Example:


from nucliadb_sdk import delete_kb

delete_kb(my_kb_name)

Access a Nuclia.cloud Knowledgebox

If you are working with a local NucliaDB, the previous section covers everything you need. To work with a KB from Nuclia.cloud, we first need to create a NucliaDBClient.

NucliaDBClient helps us connect to Nuclia. It takes the following parameters:

  • url: URL of the target KB, usually in this format: "https://europe-1.nuclia.cloud/api/v1/kb/{kb_id}"
  • api_key: key to access the KB through the API (see the Nuclia documentation on how to obtain one)
  • writer/reader/search/train_host: optional, URLs and ports of those endpoints

Once we have it, we can just instantiate a KnowledgeBox object with the client as parameter. Example:

from nucliadb_sdk.client import NucliaDBClient
from nucliadb_sdk import KnowledgeBox

url_kb = "https://europe-1.nuclia.cloud/api/v1/kb/{my_kb_id}"

nucliadb_client = NucliaDBClient(
    api_key=my_api_key,
    url=url_kb,
)

my_kb = KnowledgeBox(nucliadb_client)

Upload data

Once we have a KnowledgeBox object we can upload data to our KB. Uploads can contain text, binary files, vectors, or all of them together, and they can have labels and entities associated with them.

Parameters:

  • key: upload id, optional; it allows us to modify/update an existing resource
  • binary: optional, binary file to upload
  • text: optional, text to upload
  • labels: optional, list of labels for this resource. They can be defined as a list of strings or of Label objects
  • entities: optional, needs to be defined as a list of Entity objects
  • vectors: optional, can be defined as a list of Vectors or as a dictionary with the name(s) of the vectorset(s) as keys and the corresponding vectors (lists, numpy arrays or tensors) as values

Output: resource_id, a unique string that identifies the uploaded resource

Note: Once uploaded, we can use both resource_id and key to access the uploaded resource

Example:

from nucliadb_sdk import KnowledgeBox, Entity, File
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
resource_id = my_kb.upload(
    key="mykey1",
    binary=File(data=b"asd", filename="data"),
    text="I'm Sierra, a very happy dog",
    labels=["emotion/positive"],
    entities=[Entity(type="NAME", value="Sierra", positions=[(4, 9)])],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["I'm Sierra, a very happy dog"])[0]},
)
my_kb[resource_id] == my_kb["mykey1"]

Labelset and Label

A Labelset is how we define a group of Labels that make sense together. We can define Labels in two ways:

Create a Labelset manually on a KB with the function set_labels

Parameters:

  • labelset: String with the name of the labelset
  • labels: list of Strings with the names of the labels
  • labelset_type: the granularity of the resource that the labels will refer to (sentences, paragraphs or resources). At the moment only LabelType.RESOURCES is supported

Example:

from nucliadb_sdk import get_or_create, LabelType

my_kb = get_or_create("my_new_kb")
my_kb.set_labels("emotion", ["positive", "negative"], LabelType.RESOURCES)

Or we can simply upload labels with our upload function, defining each of them as a string made up of the labelset and label names: labelset_name/label_name.

For example if we upload this resource:

from nucliadb_sdk import KnowledgeBox
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

resource_id = my_kb.upload(
    text="She's having a terrible day",
    labels=["emotion/negative"],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["She's having a terrible day"])[0]},
)

Our KB will automatically add the labelset emotion if it has not already been created.

We can also define Labels with the class Label:

from nucliadb_sdk import Label

Label(labelset="emotion", label="neutral")
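For illustration, the labelset_name/label_name string form maps onto the Label class like this. Both parse_label and the local Label dataclass below are illustrative stand-ins, not part of the SDK:

```python
from dataclasses import dataclass

# Illustrative stand-in for nucliadb_sdk's Label; not the SDK class itself.
@dataclass
class Label:
    labelset: str
    label: str

def parse_label(label_str: str) -> Label:
    # "emotion/neutral" -> Label(labelset="emotion", label="neutral")
    labelset, label = label_str.split("/", 1)
    return Label(labelset=labelset, label=label)

print(parse_label("emotion/neutral"))  # Label(labelset='emotion', label='neutral')
```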

To list the Labelsets and Labels associated with uploaded resources, we can use the function get_uploaded_labels() on our KnowledgeBox.

It will return a dictionary with the labelset names as keys and LabelSet structures as values. These structures contain:

  • count: total number of labeled resources
  • labels: dictionary with all label names as keys and their number of occurrences as values

Example:

from nucliadb_sdk import KnowledgeBox

my_kb.get_uploaded_labels()

Output:

{'emotion': LabelSet(count=10, labels={'positive': 3, 'neutral': 2, 'negative': 5})}
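The shape of that output can be reproduced with a plain-Python sketch of how the counts aggregate (illustrative only; the SDK computes this server-side):

```python
from collections import Counter

# Labels as they were passed to upload(), in "labelset/label" form.
uploaded = [
    "emotion/positive", "emotion/positive", "emotion/positive",
    "emotion/neutral", "emotion/neutral",
    "emotion/negative", "emotion/negative", "emotion/negative",
    "emotion/negative", "emotion/negative",
]

# Group occurrences per labelset, counting each label.
per_labelset = {}
for item in uploaded:
    labelset, label = item.split("/", 1)
    per_labelset.setdefault(labelset, Counter())[label] += 1

# Mirrors: {'emotion': LabelSet(count=10, labels={'positive': 3, 'neutral': 2, 'negative': 5})}
for labelset, labels in per_labelset.items():
    print(labelset, sum(labels.values()), dict(labels))
```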

Vectorset and Vector

Vectorsets and Vectors work in a very similar way to Labelsets and Labels. A Vectorset is a set of Vectors that we can define for our KB; they usually represent encodings of our text/file data with a common model. We can define as many Vectorsets per KB as we want.

We can do it manually with the new_vectorset function of a KB object.

Parameters:

  • key: String with the name of the vectorset
  • dimensions: dimensions of our vectors
  • similarity: strategy to calculate the distance between vectors; by default it's cosine (cosine similarity), but it can also be dot (dot product)

Example:

from nucliadb_sdk import KnowledgeBox

my_kb.new_vectorset("roberta", 1024)
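To see how the two similarity strategies differ, here is a small illustrative computation in plain Python (not part of the SDK): the dot product grows with vector magnitude, while cosine similarity only measures direction.

```python
import math

def dot(a, b):
    # Plain dot product: scales with the magnitudes of a and b.
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity: dot product normalized by both magnitudes.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print(dot(a, b))     # 10.0 -- grows with magnitude
print(cosine(a, b))  # close to 1.0 -- direction only
```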

Even if we do not create a Vectorset manually, it will be created automatically when we upload resources with vectors.

Example:

resource_id = my_kb.upload(
text="She's having a terrible day",
labels=["emotion/negative"],
vectors={"all-MiniLM-L6-v2": encoder.encode(["She's having a terrible day"])[0]},
)

To see all the Vectorsets of a KB we use list_vectorset.

Example:

from nucliadb_sdk import KnowledgeBox

my_kb.list_vectorset()

It will return a dictionary with all the defined Vectorsets and their dimensions:

{'roberta': VectorSet(dimension=1024), 'all-MiniLM-L6-v2': VectorSet(dimension=384)}

To delete a Vectorset we use del_vectorset.

If we run this:


my_kb.del_vectorset("roberta")

And list our Vectorsets again:

my_kb.list_vectorset()

The result will be this:

VectorSets(vectorsets={'all-MiniLM-L6-v2': VectorSet(dimension=384)})

Search

Our search function returns resources from our KB that match one or more conditions:

  • Filter by label: returns resources whose labels match the provided ones
  • Full text/keyword search: given a word or set of words, returns resources that contain them in their text field
  • Semantic search: given some vectors and a Vectorset, returns results whose vectors are similar to the ones provided, sorted by their cosine similarity

Each matched result has the following fields:

  • key: id of the matched resource
  • text: text of the matched result
  • labels: labels associated with the result
  • score: score of the result (0 for searches by label)
  • score_type: BM25 for keyword search, cosine similarity for semantic search
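The fields above can be pictured as a simple record. The dataclass below is an illustrative sketch of the shape of each result, not the SDK's actual class:

```python
from dataclasses import dataclass, field

# Illustrative mirror of a search result; field names follow the list above.
@dataclass
class SearchResult:
    key: str                 # id of the matched resource
    text: str                # text of the matched result
    labels: list = field(default_factory=list)  # labels associated with the result
    score: float = 0.0       # 0 for searches by label
    score_type: str = ""     # "BM25" for keyword, "COSINE" for semantic search

r = SearchResult(key="abc123", text="I'm Sierra, a very happy dog",
                 labels=["positive"], score=0.77, score_type="BM25")
print(r.score_type)  # BM25
```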

Now let's go into detail for each kind of search:

Search by label

To filter by label we use the parameter filter, which takes a list of Label objects or strings.

The simplest way is to filter with a string combining the Labelset and Label names: labelset_name/label_name.

results = my_kb.search(
    filter=["emotion/positive"]
)

If we wanted to iterate over the results, we can just do:

for result in results:
    print(f"Resource key: {result.key}")
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")

And it will output something like this:

Resource key: f1de1c1e3fac43aaa53dcdc54ffd07fc
Text: I'm Sierra, a very happy dog
Labels: ['positive']
Resource key: b445359d434b47dfb6a37ca45c14c2b3
Text: what a delighful day
Labels: ['positive']

Full text search

For this search we use the parameter text, which is a string.

We can use it with one word, like this:

from nucliadb_sdk import KnowledgeBox

results = my_kb.search(
    text="dog"
)

If we wanted to iterate over the results, we can just do:

for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")

And the results will look like this:

Resource key: 4f1f570398c543e0b8c3b86e87ee2fbd
Text: Dog in catalan is gos
Score type: BM25
Score: 0.8871671557426453
Labels: ['neutral']
Resource key: 665e85f0fb2e4b2fbde8b4957b7462c1
Text: I'm Sierra, a very happy dog
Score type: BM25
Score: 0.7739118337631226
Labels: ['positive']

We can also look for multiple words:

results = my_kb.search(
    text="he is"
)

And the results will look like this:

Resource key: d22d0d8acba040a2afd7a26ea0517769
Text: he is heartbroken
Score type: BM25
Score: 2.501499891281128
Labels: ['negative']
Resource key: 808c1557027e4109b4be8cbe995be8b1
Text: He said that the race is quite tough
Score type: BM25
Score: 1.7510499954223633
Labels: ['neutral']

Or combine it with the Label filter:

results = my_kb.search(
    filter=["emotion/neutral"],
    text="dog"
)

In this case we'll only have one result:

Resource key: 4f1f570398c543e0b8c3b86e87ee2fbd
Text: Dog in catalan is gos
Score type: BM25
Score: 0.8871671557426453
Labels: ['neutral']

Semantic search

For the vector search we need to input the vectors with the vector parameter and indicate the Vectorset we want to search in with the vectorset parameter. Optionally, we can set a threshold for the minimum cosine similarity of our results with min_score.

The first step will usually be to calculate the vectors for our query with the same model we used for that vectorset in our resources:

from nucliadb_sdk import KnowledgeBox
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")


query_vectors = encoder.encode(["To be in love"])[0]

Then we pass them to the search.
Note that the parameter vector takes lists, numpy arrays or tensors, so there is no need for any conversion.

results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.25)

We iterate over the results just like in the previous searches:

for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")

As you can see, our results are ordered by score:

Text: love is tough
Labels: ['negative']
Score: 0.4688602387905121
Key: a027ee34f3a7489d9a264b9f3d08d3a5
Score Type: COSINE
------
Text: he is heartbroken
Labels: ['negative']
Score: 0.27540814876556396
Key: 25bc7b22b4fb4f64848a1b7394fb69b1
Score Type: COSINE

Note: We can combine semantic and keyword search with a label filter, and the results will be filtered by both constraints. But if we use both text and vector search, the results will be the ones that match one constraint OR the other.