# NucliaDB SDK

`nucliadb-sdk` is an open source Python library designed to make accessing your NucliaDB easy. It works both with a local NucliaDB and with Nuclia.cloud.
With it you can:
- Upload text, files, vectors, labels and annotations to your NucliaDB
- Access and modify your resources
- Annotate your resources
- Perform text searches
- Perform semantic searches
- Filter your data by label
## Installation
If you do not have NucliaDB installed yet, you can either:

- Run the NucliaDB docker image:

```bash
docker run -it \
    -e LOG=INFO \
    -p 8080:8080 \
    -p 8060:8060 \
    -p 8040:8040 \
    -v nucliadb-standalone:/data \
    nuclia/nucliadb:latest
```

- Or install it with pip and run it:

```bash
pip install nucliadb
nucliadb
```
To get started with Nuclia.cloud, sign up at nuclia.cloud.

Once you are all set up, you can install the library via pip:

```bash
pip install nucliadb-sdk
```
## Basic concepts

Before you dive into the docs, some useful concepts:

- **KnowledgeBox**: our concept of a data container, usually referred to as KB.
- **Vectorset**: a set of vectors associated with our text/file data. We can define as many vectorsets as we want for each KB.
- **Labelset**: a group of labels associated with our text/file data. We can define as many labelsets as we want for each KB.
- **Search**: we can search over our text fields, but also over any of our defined vectorsets. The text search looks for exact matches, while the vector search returns the results with the highest cosine similarity.
## Main usage

### Create a local NucliaDB Knowledgebox

#### create_knowledge_box()
To create a new knowledgebox we can use the method `create_knowledge_box` from `utils`. It creates a new KB and returns a `KnowledgeBox` object.
If we provide a `slug` and a KB with that name already exists in our NucliaDB, it will return an error.
Parameters:
- `slug`: name of the KB; if none is provided, a new one is created with a random slug
- `nucliadb_base_url`: url and port where our NucliaDB is located; if none is provided, a local NucliaDB with the default setup (`http://localhost:8080`) is assumed

Output: `nucliadb_sdk.knowledgebox.KnowledgeBox`
Example:

```python
from nucliadb_sdk import create_knowledge_box

my_kb = create_knowledge_box("my_new_kb")
```
#### get_or_create(slug)
We can also use the method `get_or_create` from `utils`. It returns an existing KB if there is already one with the provided name in our DB, and creates a new one if there isn't.
Parameters:
- `slug`: name of the KB
- `nucliadb_base_url`: url and port where our NucliaDB is located; if none is provided, a local NucliaDB with the default setup (`http://localhost:8080`) is assumed

Output: `nucliadb_sdk.knowledgebox.KnowledgeBox`
Example:

```python
from nucliadb_sdk import get_or_create

my_kb = get_or_create("my_new_kb")
```
#### get_kb(slug)
In case we only want to retrieve a KB if it has already been created, we can use `get_kb` from `utils`. It returns the existing KB if there is already one with the provided name in our DB, and `None` if there isn't.
Parameters:
- `slug`: name of the KB
- `nucliadb_base_url`: url and port where our NucliaDB is located; if none is provided, a local NucliaDB with the default setup (`http://localhost:8080`) is assumed

Output: `nucliadb_sdk.knowledgebox.KnowledgeBox`
Example:

```python
from nucliadb_sdk import get_kb

my_kb = get_kb("my_new_kb")
```
### Delete a local NucliaDB Knowledgebox

The function `delete_kb` deletes a local NucliaDB KB.
Parameters:
- `slug`: name of the KB we want to delete
Example:

```python
from nucliadb_sdk import delete_kb

delete_kb(my_kb_name)
```
### Access a Nuclia.cloud Knowledgebox

If you are working with a local NucliaDB, the previous sections already cover you. To work with a KB from Nuclia.cloud, first we need to create a `NucliaDBClient`.

`NucliaDBClient` helps us connect to Nuclia. It takes the following parameters:
- `url`: url of the target KB, usually in the format `"https://europe-1.nuclia.cloud/api/v1/kb/{kb_id}"`
- `api_key`: key to access the KB through the API
- `writer/reader/search/train_host`: optional, urls and ports of the said endpoints

Once we have it, we can just instantiate a `KnowledgeBox` object with the client as a parameter.
Example:

```python
from nucliadb_sdk.client import Environment, NucliaDBClient
from nucliadb_sdk import KnowledgeBox

url_kb = "https://europe-1.nuclia.cloud/api/v1/kb/{my_kb_id}"

nucliadb_client = NucliaDBClient(
    api_key=my_api_key,
    url=url_kb,
)
my_kb = KnowledgeBox(nucliadb_client)
```
### Upload data

Once we have a `KnowledgeBox` object we can upload data to our KB. Our uploads can contain text, binary files, vectors, or all of them together, and they can have labels and entities associated.
Parameters:
- `key`: optional upload id; it allows us to modify/update an existing resource
- `binary`: optional, binary file to upload
- `text`: optional, text to upload
- `labels`: optional, list of labels for this resource. They can be defined as a list of strings or of `Label`s
- `entities`: optional, needs to be defined as a list of `Entity` objects
- `vectors`: optional, can be defined as a list of `Vector`s or as a dictionary with the name(s) of the vectorset as keys and the corresponding vectors (numpy arrays, lists or tensors) as values

Output: `resource_id`, a unique string that identifies the uploaded resource.
Note: once uploaded, we can use both `resource_id` and `key` to access the uploaded resource.
Example:

```python
from nucliadb_sdk import KnowledgeBox, Entity, File
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

resource_id = my_kb.upload(
    key="mykey1",
    binary=File(data=b"asd", filename="data"),
    text="I'm Sierra, a very happy dog",
    labels=["emotion/positive"],
    entities=[Entity(type="NAME", value="Sierra", positions=[(4, 9)])],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["I'm Sierra, a very happy dog"])[0]},
)

my_kb[resource_id] == my_kb["mykey1"]
```
### Labelset and Label

A Labelset is how we define a group of Labels that make sense together. We can define Labels in two ways:
Create a Labelset manually on a KB with the function `set_labels`.

Parameters:
- `labelset`: string with the name of the labelset
- `labels`: list of strings with the names of the labels
- `labelset_type`: the granularity of the resources that the labels will refer to (sentences, paragraphs or resources). At the moment only `LabelType.RESOURCES` is supported
Example:

```python
from nucliadb_sdk import get_or_create, LabelType

my_kb = get_or_create("my_new_kb")
my_kb.set_labels("emotion", ["positive", "negative"], LabelType.RESOURCES)
```
Or we can just upload labels with our `upload` function, defining them as a string made up of the labelset and label names: `labelset_name/label_name`.
For example, if we upload this resource:
```python
from nucliadb_sdk import KnowledgeBox
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

resource_id = my_kb.upload(
    text="She's having a terrible day",
    labels=["emotion/negative"],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["She's having a terrible day"])[0]},
)
```
Our KB will automatically add the labelset `emotion` if it is not already created.
We can also define Labels with the class `Label`:

```python
from nucliadb_sdk import Label

Label(labelset="emotion", label="neutral")
```
To list the Labelsets and Labels associated with uploaded resources, we can use the function `get_uploaded_labels()` on our Knowledgebox.
It returns a dictionary with the labelset names as keys and `LabelSet` structures as values. These structures contain:
- `count`: total number of labeled resources
- `labels`: dictionary with all label names as keys and their number of occurrences as values
Example:

```python
from nucliadb_sdk import KnowledgeBox

my_kb.get_uploaded_labels()
```

Output:

```python
{'emotion': LabelSet(count=10, labels={'positive': 3, 'neutral': 2, 'negative': 5})}
```
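As a quick sketch of post-processing this output, the snippet below finds the most frequent label of each labelset. Note this is not SDK code: plain dicts stand in for the `LabelSet` objects the SDK actually returns, so with real output you would read the object's `labels` attribute instead.

```python
# Plain-dict stand-in mimicking the get_uploaded_labels() output shown above.
uploaded = {
    "emotion": {"count": 10, "labels": {"positive": 3, "neutral": 2, "negative": 5}},
}

def most_common_labels(uploaded):
    """Return the most frequent label of each labelset."""
    return {
        name: max(info["labels"], key=info["labels"].get)
        for name, info in uploaded.items()
    }

print(most_common_labels(uploaded))  # {'emotion': 'negative'}
```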
### Vectorset and Vector

Vectorsets and Vectors work in a very similar way to Labelsets and Labels. A Vectorset is a set of Vectors that we can define for our KB; they usually represent encodings of our text/files with a common model. We can define as many Vectorsets per KB as we want.
We can do it manually with the `new_vectorset` function of a KB object.

Parameters:
- `key`: string with the name of the vectorset
- `dimensions`: dimensions of our vectors
- `similarity`: strategy to calculate the distance between vectors; by default it is `cosine` (cosine similarity), but it can also be `dot` (dot product)
Example:

```python
from nucliadb_sdk import KnowledgeBox

my_kb.new_vectorset("roberta", 1024)
```
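To illustrate the difference between the two similarity strategies, here is a small numpy sketch (not SDK code): cosine similarity normalizes for vector length, while the dot product does not.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as `a`, twice the length

# Dot product grows with vector magnitude.
dot = float(np.dot(a, b))
# Cosine similarity only measures direction, so identical directions score 1.
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot)              # 28.0
print(round(cosine, 6)) # 1.0
```

Which strategy fits depends on the embedding model: some models produce normalized vectors (where the two strategies coincide), others rely on magnitude carrying information.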
Even if we do not create a Vectorset manually, it will be created automatically when we upload resources with vectors.
Example:

```python
resource_id = my_kb.upload(
    text="She's having a terrible day",
    labels=["emotion/negative"],
    vectors={"all-MiniLM-L6-v2": encoder.encode(["She's having a terrible day"])[0]},
)
```
To see all the Vectorsets of a KB we use `list_vectorset`.

Example:

```python
from nucliadb_sdk import KnowledgeBox

my_kb.list_vectorset()
```

It returns a dictionary with all the defined Vectorsets and their dimensions:

```python
{'roberta': VectorSet(dimension=1024), 'all-MiniLM-L6-v2': VectorSet(dimension=384)}
```
To delete a Vectorset we use `del_vectorset`.

If we run this:

```python
my_kb.del_vectorset("roberta")
```

And list our Vectorsets again:

```python
my_kb.list_vectorset()
```

The result will be this:

```python
VectorSets(vectorsets={'all-MiniLM-L6-v2': VectorSet(dimension=384)})
```
## Search

Our `search` function returns resources from our KB that match one or more conditions:
- Filter by label: returns resources whose labels match the provided ones
- Full text/keyword search: given a word or set of words, returns resources that contain them in their text fields
- Semantic search: given some vectors and a Vectorset, returns results whose vectors are similar to the ones provided, sorted by cosine similarity
Each matched result has the following fields:
- `key`: id of the matched resource
- `text`: text of the matched result
- `labels`: labels associated with the result
- `score`: score of the result (0 for searches by label)
- `score_type`: `BM25` for keyword search, cosine similarity for semantic search
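As a small sketch (not SDK code), the fields above can be collected into plain dicts, which makes results easy to sort, filter or serialize; it works with any iterable of result objects exposing those attributes.

```python
def results_to_dicts(results):
    """Turn search results into plain dicts using the fields listed above."""
    return [
        {
            "key": r.key,
            "text": r.text,
            "labels": r.labels,
            "score": r.score,
            "score_type": r.score_type,
        }
        for r in results
    ]
```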
Now let's go into detail for each kind of search:
### Search by label

To filter by label we use the parameter `filter`, which takes a list of `Label` objects or strings.
The simplest way is to filter with a string combining the Labelset and Label names: `labelset_name/label_name`.

```python
results = my_kb.search(
    filter=["emotion/positive"]
)
```
If we want to iterate over the results, we can just do:

```python
for result in results:
    print(f"Resource key: {result.key}")
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
```

And it will output something like this:

```
Resource key: f1de1c1e3fac43aaa53dcdc54ffd07fc
Text: I'm Sierra, a very happy dog
Labels: ['positive']
Resource key: b445359d434b47dfb6a37ca45c14c2b3
Text: what a delightful day
Labels: ['positive']
```
### Full text search

For this search we use the parameter `text`, which takes a string.
We can use it with one word, like this:

```python
from nucliadb_sdk import KnowledgeBox

results = my_kb.search(
    text="dog"
)
```
If we want to iterate over the results, we can just do:

```python
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
```

And the results will look like this:

```
Resource key: 4f1f570398c543e0b8c3b86e87ee2fbd
Text: Dog in catalan is gos
Score type: BM25
Score: 0.8871671557426453
Labels: ['neutral']
Resource key: 665e85f0fb2e4b2fbde8b4957b7462c1
Text: I'm Sierra, a very happy dog
Score type: BM25
Score: 0.7739118337631226
Labels: ['positive']
```
We can also look for multiple words:

```python
results = my_kb.search(
    text="he is"
)
```

And the results will look like this:

```
Resource key: d22d0d8acba040a2afd7a26ea0517769
Text: he is heartbroken
Score type: BM25
Score: 2.501499891281128
Labels: ['negative']
Resource key: 808c1557027e4109b4be8cbe995be8b1
Text: He said that the race is quite tough
Score type: BM25
Score: 1.7510499954223633
Labels: ['neutral']
```
Or combine it with the label filter:

```python
results = my_kb.search(
    filter=["emotion/neutral"],
    text="dog"
)
```

In this case we'll only have one result:

```
Resource key: 4f1f570398c543e0b8c3b86e87ee2fbd
Text: Dog in catalan is gos
Score type: BM25
Score: 0.8871671557426453
Labels: ['neutral']
```
### Semantic search

For the vector search we need to input the vectors with the `vector` parameter and indicate the Vectorset we want to search in with the `vectorset` parameter.
Optionally, we can set a threshold on the minimum cosine similarity of our results with `min_score`.
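Conceptually, `min_score` keeps only the candidates whose cosine similarity to the query vector reaches the threshold. The numpy sketch below (not SDK code, the similarity is computed by NucliaDB server-side) illustrates the idea on toy 2-d vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([1.0, 0.0])
candidates = {
    "close": np.array([0.9, 0.1]),  # nearly the same direction as the query
    "far": np.array([0.0, 1.0]),    # orthogonal to the query
}

# Keep only candidates at or above the min_score threshold, as the search does.
kept = {k: cosine(query, v) for k, v in candidates.items() if cosine(query, v) >= 0.25}
print(sorted(kept))  # only "close" passes the 0.25 threshold
```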
The first step will usually be calculating the vectors for our query with the same model we used for that vectorset in our resources:

```python
from nucliadb_sdk import KnowledgeBox
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
query_vectors = encoder.encode(["To be in love"])[0]
```

Then we pass them to the search.
Note that the parameter `vector` takes arrays, numpy arrays or tensors, so there is no need for any conversion.

```python
results = my_kb.search(vector=query_vectors, vectorset="all-MiniLM-L6-v2", min_score=0.25)
```
We iterate over the results just like in the previous searches:

```python
for result in results:
    print(f"Text: {result.text}")
    print(f"Labels: {result.labels}")
    print(f"Score: {result.score}")
    print(f"Key: {result.key}")
    print(f"Score Type: {result.score_type}")
    print("------")
```

As you can see, our results are ordered by score:

```
Text: love is tough
Labels: ['negative']
Score: 0.4688602387905121
Key: a027ee34f3a7489d9a264b9f3d08d3a5
Score Type: COSINE
------
Text: he is heartbroken
Labels: ['negative']
Score: 0.27540814876556396
Key: 25bc7b22b4fb4f64848a1b7394fb69b1
Score Type: COSINE
```
Note: we can combine semantic or keyword search with a label filter, and the results will be filtered by both constraints. But if we use both text and vector search, the results will be the ones that match one constraint OR the other.
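Putting that note into code, a minimal sketch of a semantic search restricted by a label filter might look like this. The helper name is our own, not part of the SDK; it assumes the `my_kb` and `encoder` objects from the earlier examples.

```python
# Hypothetical helper (not part of the SDK): run a semantic search restricted
# to resources carrying the given labels. `kb` needs a search() method as
# shown above; `encoder` needs an encode() method as in sentence-transformers.
def semantic_search_with_filter(kb, encoder, query, labels, min_score=0.25):
    return kb.search(
        vector=encoder.encode([query])[0],
        vectorset="all-MiniLM-L6-v2",
        filter=labels,
        min_score=min_score,
    )
```

Usage would then be, for example, `semantic_search_with_filter(my_kb, encoder, "To be in love", ["emotion/negative"])`.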