NucliaDB SDK

Nucliadb-sdk is an open-source Python library designed to help you easily access your NucliaDB. It works with both a local NucliaDB and Nuclia.cloud.

With it you can:

  • Upload text, files, vectors, labels and annotations to your NucliaDB.
  • Access and modify your resources.
  • Annotate your resources.
  • Perform text searches.
  • Perform semantic searches.
  • Filter your data by label.

Installation

If you do not have NucliaDB installed, you can either:

  • Run the NucliaDB docker image:
docker run -it \
-e LOG=INFO \
-p 8080:8080 \
-p 8060:8060 \
-p 8040:8040 \
-v nucliadb-standalone:/data \
nuclia/nucliadb:latest
  • Install with pip and run:
pip install nucliadb
nucliadb

To get started with Nuclia follow this link

Once you are set up, you can install the library via pip:

pip install nucliadb-sdk

Basic concepts

Before you look through our docs, here are some useful concepts:

  • KnowledgeBox: Our concept of a data container, usually referred to as KB.
  • Vectorset: Each set of vectors associated with our text/file data. You can define as many vectorsets as you want for each KB.
  • Labelset: Each group of labels associated with our text/file data. You can define as many labelsets as you want for each KB.
  • Search: You can perform a search over text fields, but also over any defined vectorsets. The text search looks for exact matches, while the vector search returns the resources whose vectors have the highest cosine similarity to the query.

Main usage

Create a local NucliaDB Knowledge Box

create_knowledge_box()

To create a new Knowledge Box you can use the function create_knowledge_box from utils. It creates a new KB and returns a KnowledgeBox object. If you provide a slug and a KB with that name already exists in your NucliaDB, it will return an error.

Parameters:

  • slug: Name of the KB. If none is provided, a new KB is created with a random slug.
  • nucliadb_base_url: URL and port where your NucliaDB is located. If none is provided it assumes you have a local NucliaDB with the default setup http://localhost:8080.

Output: nucliadb_sdk.knowledgebox.KnowledgeBox

Example:

from nucliadb_sdk import create_knowledge_box

my_kb = create_knowledge_box("my_new_kb")

get_or_create(slug)

You can also use the function get_or_create from utils. It returns the existing KB if there is already one with the provided name in your DB, and creates a new KB if there isn't.

Parameters:

  • slug: Name of the KB.
  • nucliadb_base_url: URL and port where your NucliaDB is located. If none is provided it assumes you have a local NucliaDB with the default setup http://localhost:8080.

Output: nucliadb_sdk.knowledgebox.KnowledgeBox

Example:

from nucliadb_sdk import get_or_create

my_kb = get_or_create("my_new_kb")

get_kb(slug)

If you only want to retrieve a KB that has already been created, you can use get_kb from utils. It returns the existing KB if there is already one with the provided name in your DB, and None if there isn't.

Parameters:

  • slug: Name of the KB.
  • nucliadb_base_url: URL and port where your NucliaDB is located. If none is provided it assumes you have a local NucliaDB with the default setup http://localhost:8080.

Output: nucliadb_sdk.knowledgebox.KnowledgeBox, or None if no KB with that slug exists

Example:

from nucliadb_sdk import get_kb

my_kb = get_kb("my_new_kb")
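
The three helpers above differ only in how they treat a slug that already exists or is missing. The toy model below makes that contract explicit. It is an in-memory stand-in written for illustration, not the SDK: the names ToyKB, _registry and the toy_* functions are hypothetical, and raising ValueError is an assumption about how the "error" surfaces. The real functions talk to a NucliaDB instance.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyKB:
    """Hypothetical stand-in for a Knowledge Box; it only carries its slug."""
    slug: str


# Hypothetical in-memory "database": slug -> ToyKB
_registry: Dict[str, ToyKB] = {}


def toy_create_knowledge_box(slug: str) -> ToyKB:
    # Mirrors the documented behaviour: creating an existing slug is an error.
    # (ValueError is an assumption; the SDK may use a different exception.)
    if slug in _registry:
        raise ValueError(f"KB {slug!r} already exists")
    _registry[slug] = ToyKB(slug)
    return _registry[slug]


def toy_get_kb(slug: str) -> Optional[ToyKB]:
    # Returns None instead of failing when the slug is unknown.
    return _registry.get(slug)


def toy_get_or_create(slug: str) -> ToyKB:
    # Never fails: reuse the KB if present, create it otherwise.
    existing = toy_get_kb(slug)
    return existing if existing is not None else toy_create_knowledge_box(slug)
```

In practice this means the result of get_kb should be checked for None before it is used, while get_or_create can be called repeatedly with the same slug.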

Delete local NucliaDB Knowledge Box

The function delete_kb deletes a local NucliaDB KB.

Parameters:

  • slug: Name of the KB you want to delete.

Example:

from nucliadb_sdk import delete_kb

delete_kb("my_new_kb")

Access a Nuclia.cloud Knowledge Box

If you are working with a local NucliaDB, the functions from the previous section are all you need. To work with a KB from Nuclia.cloud, you first need to create a NucliaDBClient.

NucliaDBClient helps you connect to Nuclia. It takes the following parameters:

  • url: URL of the target KB, usually in this format: "https://europe-1.nuclia.cloud/api/v1/kb/{kb_id}".
  • api_key: Key to access the KB through the API.
  • writer/reader/search/train_host: (optional) URL and ports of said endpoints.

Once you have it, you can instantiate a KnowledgeBox object with the client as a parameter. Example:

from nucliadb_sdk.client import Environment, NucliaDBClient
from nucliadb_sdk import KnowledgeBox

# Replace {my_kb_id} and my_api_key with your own values
url_kb = (
    "https://europe-1.nuclia.cloud/api/v1/kb/{my_kb_id}"
)

nucliadbclient = NucliaDBClient(
    api_key=my_api_key,
    url=url_kb,
)

my_kb = KnowledgeBox(nucliadbclient)
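
Since the KB URL follows a fixed pattern, it can be assembled from its parts. The helper below is a minimal sketch: build_kb_url is a hypothetical name that is not part of the SDK, and the zone default simply mirrors the europe-1 host shown above. Whether other zones follow the same pattern is an assumption.

```python
def build_kb_url(kb_id: str, zone: str = "europe-1") -> str:
    """Return the KB endpoint in the format shown above (hypothetical helper)."""
    if not kb_id:
        raise ValueError("kb_id must not be empty")
    return f"https://{zone}.nuclia.cloud/api/v1/kb/{kb_id}"


url_kb = build_kb_url("my-kb-id")  # placeholder id, replace with your own
print(url_kb)
```

The resulting string can be passed as the url parameter of NucliaDBClient.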

Upload data

Once you have a Knowledge Box object you can upload data to your KB. Your uploads can be text, binary files, vectors or a combination of all of these, and they can have associated labels and entities.

Parameters:

  • key: (optional) Upload id. It allows you to modify or update an existing resource.
  • binary: (optional) Binary file to upload.
  • text: (optional) Text to upload.
  • labels: (optional) List of labels for this resource. They can be given as a list of strings or of Label objects.
  • entities: (optional) List of Entity objects.
  • vectors: (optional) Can be given as a list of vectors or as a dictionary with the name(s) of the vectorset as keys and the corresponding vector (array, numpy array or tensor) as values.

Output: resource_id, a unique string that identifies the uploaded resource.

Note: Once uploaded, you can use both resource_id and key to access the uploaded resource.

Example:

from nucliadb_sdk import KnowledgeBox, Entity, File
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
resource_id = my_kb.upload(
    key="mykey1",
    binary=File(data=b"asd", filename="data"),
    text="I'm Sierra, a very happy dog",
    labels=["emotion/positive"],
    entities=[Entity(type="NAME", value="Sierra", positions=[(4, 9)])],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["I'm Sierra, a very happy dog"])[0]},
)
# The resource can be accessed by resource_id or by key
my_kb[resource_id] == my_kb["mykey1"]
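
The positions of an entity do not have to be counted by hand. The sketch below computes them from the text, assuming the convention suggested by the example above, where (4, 9) are the indexes of the first and last character of "Sierra". The helper name find_positions is hypothetical, and the inclusive end index is an assumption that should be checked against your SDK version.

```python
from typing import List, Tuple


def find_positions(text: str, value: str) -> List[Tuple[int, int]]:
    """Return (first_char_index, last_char_index) for every occurrence of value."""
    if not value:
        raise ValueError("value must not be empty")
    positions = []
    start = text.find(value)
    while start != -1:
        # Inclusive end index, matching the (4, 9) used in the example above.
        positions.append((start, start + len(value) - 1))
        start = text.find(value, start + 1)
    return positions


text = "I'm Sierra, a very happy dog"
print(find_positions(text, "Sierra"))  # [(4, 9)]
```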

Labelset and Label

A labelset is how you define a group of labels that make sense together. You can define labels in two ways.

Create a labelset manually on a KB with the function set_labels.

Parameters:

  • labelset: String with the name of the labelset.
  • labels: List of Strings with the names of the labels.
  • labelset_type: The granularity of the resource that the labels will refer to (sentences, paragraphs or resources). At the moment only LabelType.RESOURCES is supported.

Example:

from nucliadb_sdk import get_or_create
from nucliadb_sdk.labels import LabelType  # import path assumed, adjust if your version differs

my_kb = get_or_create("my_new_kb")
my_kb.set_labels("emotion", ["positive", "negative"], LabelType.RESOURCES)

You can also upload labels with the upload function, defining each one as a string made up of the labelset name and the label name, in the form labelset_name/label_name.

For example, if you upload this resource:

from nucliadb_sdk import KnowledgeBox
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

resource_id = my_kb.upload(
    text="She's having a terrible day",
    labels=["emotion/negative"],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["She's having a terrible day"])[0]},
)

Your KB will automatically add the labelset emotion if it is not already created.

You can also define labels with the class Label:

from nucliadb_sdk import Label

Label(labelset="emotion", label="neutral")
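
The string form and the Label form carry the same two pieces of information. As an illustration, the sketch below splits a labelset_name/label_name string into its parts. ParsedLabel and parse_label are hypothetical names, not SDK classes, and splitting on the first slash only is an assumption about how extra slashes would be handled.

```python
from typing import NamedTuple


class ParsedLabel(NamedTuple):
    """Hypothetical stand-in mirroring the two fields of a label."""
    labelset: str
    label: str


def parse_label(value: str) -> ParsedLabel:
    # Split on the first "/" only (an assumption; the SDK may differ).
    labelset, sep, label = value.partition("/")
    if not sep or not labelset or not label:
        raise ValueError(f"expected 'labelset_name/label_name', got {value!r}")
    return ParsedLabel(labelset, label)


print(parse_label("emotion/negative"))
```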

To list all of the labelsets and labels associated with uploaded resources, you can use the function get_uploaded_labels() on your Knowledge Box.

It will return a dictionary with the labelset names as keys and labelset structures as values. These structures contain:

  • count: Total number of labeled resources.
  • labels: Dictionary with all label names as keys and their number of occurrences as values.

Example:

from nucliadb_sdk import KnowledgeBox

my_kb.get_uploaded_labels()

Output:

{'emotion': LabelSet(count=10, labels={'positive': 3, 'neutral': 2, 'negative': 5})}
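
To make the shape of this result concrete, the sketch below rebuilds the same structure from a plain list of labelset_name/label_name strings. ToyLabelSet and summarize are hypothetical stand-ins, not SDK objects, and the sketch assumes one label per resource, so that count equals the sum of the per-label occurrences.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class ToyLabelSet:
    """Hypothetical stand-in with the two fields described above."""
    count: int = 0
    labels: Dict[str, int] = field(default_factory=dict)


def summarize(uploaded: Iterable[str]) -> Dict[str, ToyLabelSet]:
    """Group 'labelset/label' strings into per-labelset counts."""
    per_set: Dict[str, Counter] = {}
    for item in uploaded:
        labelset, _, label = item.partition("/")
        per_set.setdefault(labelset, Counter())[label] += 1
    return {
        name: ToyLabelSet(count=sum(c.values()), labels=dict(c))
        for name, c in per_set.items()
    }


# Same distribution as in the output above: 3 positive, 2 neutral, 5 negative
uploaded = (
    ["emotion/positive"] * 3 + ["emotion/neutral"] * 2 + ["emotion/negative"] * 5
)
print(summarize(uploaded))
```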

Vectorset and Vector

Vectorsets and vectors work in a very similar way to labels and labelsets. A vectorset is a set of vectors that you can define for your KB. They usually represent encodings of your text/file with a common model. You can define as many vectorsets per KB as you want.

You can create a vectorset manually with the new_vectorset function of a KB object.

Parameters:

  • key: String with the name of the vectorset.
  • dimensions: Dimensions of your vectors.
  • similarity: Strategy to calculate the distance between vectors. By default it is cosine (cosine similarity), but it can also be dot (dot product).

Example:

from nucliadb_sdk import KnowledgeBox

my_kb.new_vectorset("roberta", 1024)
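
The choice of similarity matters when your vectors are not normalized. The sketch below is plain Python, independent of the SDK, and compares the two strategies on hand-made vectors: cosine similarity only looks at direction, while the dot product also rewards magnitude. The function names and the sample vectors are illustrative assumptions.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensions")
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product divided by both lengths, so only the direction matters.
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / norm


query = [1.0, 0.0]
same_direction = [2.0, 0.0]     # points the same way as the query
large_vector = [10.0, 10.0]     # different direction, much larger magnitude

# Cosine prefers same_direction, the dot product prefers large_vector.
print(cosine(query, same_direction), cosine(query, large_vector))
print(dot(query, same_direction), dot(query, large_vector))
```

For unit-length vectors both strategies produce the same ranking, since the dot product then equals the cosine similarity.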

Even if you do not create a vectorset manually, it will be created automatically when you upload resources with vectors.

Example:

resource_id = my_kb.upload(
    text="She's having a terrible day",
    labels=["emotion/negative"],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["She's having a terrible day"])[0]},
)

To see all the vectorsets of a KB, use list_vectorset.

Example:

from nucliadb_sdk import KnowledgeBox

my_kb.list_vectorset()

It will return a dictionary with all the defined vectorsets and their dimensions:

{'roberta': VectorSet(dimension=1024), 'all-MiniLM-L6-v2': VectorSet(dimension=384)}

To delete a vectorset, use del_vectorset.

If you run this:

my_kb.del_vectorset("roberta")

And list your vectorsets again:

my_kb.list_vectorset()

The result will be this:

VectorSets(vectorsets={'all-MiniLM-L6-v2': VectorSet(dimension=384)})

Search

The search function will return resources from your KB that match one or more conditions:

  • Filter by label: Returns resources where the labels match those provided.
  • Full text/keyword search: Given a word or set of words it will return resources that have those words in their text field.
  • Semantic search: Given some vectors and a vectorset it will return results where the vectors are similar to those provided, sorted by their cosine similarity.

Each matched result has the following fields:

  • key: Id of the matched resource.
  • text: Text of the matched result.
  • labels: Labels associated with the result.
  • score: Score of the result (0 for searches by label).
  • score_type: BM25 for keyword search, cosine similarity for semantic search.
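
For readers unfamiliar with BM25, the following sketch shows the textbook Okapi BM25 formula on a tiny corpus. It is not NucliaDB's implementation: the whitespace tokenization, the parameter defaults k1=1.2 and b=0.75, and the function name bm25_scores are all assumptions, so the numbers will not match the scores the KB returns. It only illustrates why a shorter text containing the query word tends to score higher than a longer one.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with textbook Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so an all-empty corpus does not divide by zero.
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    scores = []
    for tokens in tokenized:
        freqs = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            containing = sum(1 for t in tokenized if term in t)
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - containing + 0.5) / (containing + 0.5) + 1.0)
            tf = freqs[term]
            # Term frequency saturates with k1 and is normalized by length with b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = ["Dog in catalan is gos", "I'm Sierra, a very happy dog", "he is heartbroken"]
for doc, score in zip(docs, bm25_scores("dog", docs)):
    print(f"{score:.4f}  {doc}")
```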

Now we will detail each kind of search.

Search by label

To filter by label you need to use the parameter filter. It takes a list of Label objects or strings.

The simplest way is to filter with a string that is a combination of the labelset and label names, labelset_name/label_name.

results = my_kb.search(
    filter=["emotion/positive"]
)

If you want to iterate over the results, you can use:

for result in results:
    print(f"Resource key: {result.key}")
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")

And it will output something like this:

Resource key: f1de1c1e3fac43aaa53dcdc54ffd07fc
Text: I'm Sierra, a very happy dog
Labels: ['positive']
Resource key: b445359d434b47dfb6a37ca45c14c2b3
Text: what a delighful day
Labels: ['positive']

Full text search

For a full text search you will use the parameter text, which takes a string.

You can use it to look for one word, like this:

from nucliadb_sdk import KnowledgeBox

results = my_kb.search(
    text="dog"
)

If you want to iterate over the results, you can use:

for result in results:
    print(f"Resource key: {result.key}")
    print(f"Text: {result.text}")
    print(f"Score type: {result.score_type}")
    print(f"Score: {result.score}")
    print(f"Labels: {result.labels}")

And the results will look like this:

Resource key: 4f1f570398c543e0b8c3b86e87ee2fbd
Text: Dog in catalan is gos
Score type: BM25
Score: 0.8871671557426453
Labels: ['neutral']
Resource key: 665e85f0fb2e4b2fbde8b4957b7462c1
Text: I'm Sierra, a very happy dog
Score type: BM25
Score: 0.7739118337631226
Labels: ['positive']

You can also look for multiple words:

results = my_kb.search(
    text="he is"
)

And the results will look like this:

Resource key: d22d0d8acba040a2afd7a26ea0517769
Text: he is heartbroken
Score type: BM25
Score: 2.501499891281128
Labels: ['negative']
Resource key: 808c1557027e4109b4be8cbe995be8b1
Text: He said that the race is quite tough
Score type: BM25
Score: 1.7510499954223633
Labels: ['neutral']

Or combine it with the label filter:

results = my_kb.search(
    filter=["emotion/neutral"],
    text="dog",
)

In this case you will only have one result:

Resource key: 4f1f570398c543e0b8c3b86e87ee2fbd
Text: Dog in catalan is gos
Score type: BM25
Score: 0.8871671557426453
Labels: ['neutral']

Vector search

For the vector search you need to pass the query vector with the vector parameter and indicate the vectorset you want to search with the vectorset parameter. You can also set a threshold for the minimum cosine similarity you want in your results with min_score.

The first step will usually be calculating the vectors for a query with the model you used for the same vectorset in your resources:

from nucliadb_sdk import KnowledgeBox
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# encode() takes a list of sentences, so keep the first (and only) vector
query_vectors = encoder.encode(["To be in love"])[0]

Then you pass them to the search. Note that the parameter vector takes arrays, numpy arrays or tensors, so there is no need for any conversion.

results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.25)

You can iterate over the results just like in the previous searches:

for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")

As you can see, your results are ordered by score:

Text: love is tough
Labels: ['negative']
Score: 0.4688602387905121
Key: a027ee34f3a7489d9a264b9f3d08d3a5
Score Type: COSINE
------
Text: he is heartbroken
Labels: ['negative']
Score: 0.27540814876556396
Key: 25bc7b22b4fb4f64848a1b7394fb69b1
Score Type: COSINE

Note: You can combine semantic and keyword search with a label filter, and the results will be filtered by both constraints. If you use both text and vector search, the results will match one constraint OR the other.
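
The way these constraints combine can be sketched with a small in-memory model: the label filter is AND-ed with everything else, while text and vector matches are OR-ed with each other. This is an illustration of the logic only, not the SDK's search. ToyResource and toy_search are hypothetical names, and the sketch assumes that every query word must appear in the text and that every label in the filter must be present.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ToyResource:
    """Hypothetical in-memory resource: text, labels and a single vector."""
    text: str
    labels: List[str]
    vector: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumes non-zero vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def toy_search(
    resources: List[ToyResource],
    label_filter: Optional[List[str]] = None,
    text: Optional[str] = None,
    vector: Optional[Sequence[float]] = None,
    min_score: float = 0.0,
) -> List[ToyResource]:
    hits = []
    for res in resources:
        # Label filter: AND-ed with the rest (assumed: all labels must match).
        if label_filter and not all(label in res.labels for label in label_filter):
            continue
        words = res.text.lower().split()
        text_hit = text is not None and all(w in words for w in text.lower().split())
        vector_hit = vector is not None and cosine(vector, res.vector) >= min_score
        no_query = text is None and vector is None
        # Text and vector constraints are OR-ed with each other.
        if no_query or text_hit or vector_hit:
            hits.append(res)
    return hits


resources = [
    ToyResource("Dog in catalan is gos", ["emotion/neutral"], [1.0, 0.0]),
    ToyResource("I'm Sierra, a very happy dog", ["emotion/positive"], [0.0, 1.0]),
    ToyResource("love is tough", ["emotion/negative"], [0.6, 0.8]),
]

# Filter AND text: only the neutral resource that mentions "dog"
print([r.text for r in toy_search(resources, label_filter=["emotion/neutral"], text="dog")])
# Text OR vector: both "dog" texts plus the resource closest to the query vector
print([r.text for r in toy_search(resources, text="dog", vector=[0.6, 0.8], min_score=0.9)])
```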