NucliaDB Dataset

Nucliadb-dataset is an open-source Python library designed to export your NucliaDB data to Arrow files compatible with most NLP/ML dataset formats. It allows you to export data from either a local NucliaDB or Nuclia.cloud.

Installation

To make good use of this library you should have already uploaded resources to your local NucliaDB or to Nuclia.

In case you do not have NucliaDB installed, you can either:

  • Run the NucliaDB Docker image:
docker run -it \
-e LOG=INFO \
-p 8080:8080 \
-p 8060:8060 \
-p 8040:8040 \
-v nucliadb-standalone:/data \
nuclia/nucliadb:latest
  • Or install with pip and run:
pip install nucliadb
nucliadb

To get started with Nuclia, follow this link

Once you are set up, you can install the library via pip:

pip install nucliadb-dataset

Basic concepts

Before you look through our docs, here are some useful concepts:

  • KnowledgeBox: Our concept of a data container, often referred to as a KB.

  • NucliaDBDataset: Data structure that represents the resources contained in a KB in a ready-to-train format, usually filtered by one or more label sets or entity families.

  • Partitions: For big KBs your data can be stored in different logical units; an arrow file will be generated for each one.

  • Tasks: Describe the granularity of the task for which you are exporting the data in arrow files. All of them export an arrow file with the fields text and labels, but with different content.

    • PARAGRAPH_CLASSIFICATION: Returns text blocks as text and the array of labels that correspond to each text block in labels.

    • FIELD_CLASSIFICATION: Returns resource fields as text and the array of labels that correspond to that field in labels.

    • SENTENCE_CLASSIFICATION: Returns sentences as text and the array of labels that correspond to each sentence in labels. In this case the labels can come from either the field or the text block level.

    • TOKEN_CLASSIFICATION: Returns an array of tokens as text and the NER annotations of those tokens, in BIO format, as labels.

PARAGRAPH_CLASSIFICATION and SENTENCE_CLASSIFICATION are only available for resources processed through the Nuclia platform or with an API key. Labels always appear in Nuclia's format: labelset_name/label_name.
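
download_all_partitions takes the task as one of these strings, while NucliaDBDataset (shown later) takes a member of the Task enum from nucliadb_dataset. As a minimal sketch, assuming each enum member shares the name of the task listed above:

from nucliadb_dataset.dataset import Task

# The four export granularities described above, as enum members
for task in (
    Task.FIELD_CLASSIFICATION,
    Task.PARAGRAPH_CLASSIFICATION,
    Task.SENTENCE_CLASSIFICATION,
    Task.TOKEN_CLASSIFICATION,
):
    print(task)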

Exporting data from a local NucliaDB

The most straightforward way to export arrow files from a KB, particularly if you have a local NucliaDB, is to use the function download_all_partitions from nucliadb_dataset.

Parameters:

  • task: Determines the format and source of the exported data. It has to be one of those defined above.
  • nucliadb_base_url: (optional) The base URL of the NucliaDB from which you will get your data. By default this is http://localhost:8080.
  • path: (optional) The path to the directory where you want your files to go. This will be the current directory by default.
  • kbid: (optional) ID of the KB from which you want to export the data.
  • slug: (optional) Slug corresponding to the KB from which you want to export the data.
  • labels: List of strings with either the label sets or entity families you want to export.
  • sdk: (optional) Instance of the nucliadb_sdk.NucliaDB object you are interacting with. If not set, one will be created pointing to nucliadb_base_url.

Note: you will need either the kbid or the slug parameter to locate your KB.

Returns:

  • List with all the paths of the generated arrow files.
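
As a minimal sketch of the call (the KB slug and label set below are placeholders; complete, runnable examples follow in the next sections):

from nucliadb_dataset.dataset import download_all_partitions

# Hypothetical KB slug and label set, shown only to illustrate the call
arrow_filenames = download_all_partitions(
    task="FIELD_CLASSIFICATION",
    nucliadb_base_url="http://localhost:8080",
    slug="my_kb",
    labels=["my_labelset"],
)
print(arrow_filenames)  # paths of the generated arrow files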

Creating and exporting tokens and entities from a local KB

Here is an example of how to upload resources with entities, generate their arrow files and create a HuggingFace Dataset:

You will create or retrieve the KB from your NucliaDB:

from nucliadb_sdk import NucliaDB, exceptions

def get_or_create_kb(sdk: NucliaDB, slug: str) -> str:
    try:
        kb = sdk.get_knowledge_box_by_slug(slug=slug)
    except exceptions.NotFoundError:
        kb = sdk.create_knowledge_box(slug=slug)
    finally:
        return kb.uuid


sdk = NucliaDB()
kbid = get_or_create_kb(sdk, "entity_test")

Then you will define the entity family you are going to use, and its entities:

from nucliadb_models.entities import Entity, CreateEntitiesGroupPayload

entity_group = CreateEntitiesGroupPayload(
    group="TECH",
    entities={
        "NucliaDB": Entity(value="NucliaDB"),
        "ChatGPT": Entity(value="ChatGPT"),
    },
)
sdk.create_entitygroup(kbid=kbid, content=entity_group)

Upload some resources:

from nucliadb_models.metadata import UserFieldMetadata

sdk.create_resource(
    kbid=kbid,
    texts={"text": {"body": "I'm NucliaDB"}},
    fieldmetadata=[
        UserFieldMetadata(
            field={"field_type": "text", "field": "text"},
            token=[{"token": "NucliaDB", "klass": "TECH", "start": 4, "end": 12}],
        )
    ],
)

sdk.create_resource(
    kbid=kbid,
    texts={"text": {"body": "I'm not ChatGPT"}},
    fieldmetadata=[
        UserFieldMetadata(
            field={"field_type": "text", "field": "text"},
            token=[{"token": "ChatGPT", "klass": "TECH", "start": 8, "end": 15}],
        )
    ],
)

sdk.create_resource(
    kbid=kbid,
    texts={"text": {"body": "Natural language processing is the future"}},
)

Then you can download the files:


from nucliadb_dataset.dataset import download_all_partitions

arrow_filenames = download_all_partitions(
    task="TOKEN_CLASSIFICATION",
    kbid=kbid,
    labels=["TECH"],
    sdk=sdk,
)

And load them into a HF Dataset:

from datasets import Dataset, concatenate_datasets

all_datasets = []
for filename in arrow_filenames:
    all_datasets.append(Dataset.from_file(filename))
ds = concatenate_datasets(all_datasets)

The Dataset will look like this:

Dataset({
    features: ['text', 'labels'],
    num_rows: 2
})

With the tagged sentences and labels:

ds["text"]
## [["I'm", 'not', 'ChatGPT'], ["I'm", 'NucliaDB']]
ds["labels"]
### [['O', 'O', 'B-TECH'], ['O', 'B-TECH']]

Exporting fields and labels from a local KB to a Python list of dictionaries

For instance, if you have a KB called my_new_kb in your NucliaDB with resource labels from a label set called emotion, you can generate an arrow file with the text from each resource as text and its labels in labels. To do so, you need to use the task FIELD_CLASSIFICATION.

First you will generate the arrow files:

from nucliadb_dataset.dataset import download_all_partitions

arrow_filenames = download_all_partitions(
    task="FIELD_CLASSIFICATION",
    slug="my_new_kb",
    labels=["emotion"],
)

Then you can read the arrows and convert them to a python list of dictionaries:

import pyarrow as pa

for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()

my_data_dict = loaded_array.to_pylist()

The contents of my_data_dict would look like this:

[{'text': 'love is tough', 'labels': ['emotion/negative']},
{'text': "Valentine's day is next week", 'labels': ['emotion/neutral']},
{'text': 'he is heartbroken', 'labels': ['emotion/negative']},
{'text': "I'm Sierra, a very happy dog", 'labels': ['emotion/positive']},
{'text': 'what a delighful day', 'labels': ['emotion/positive']},
{'text': 'He said that the race is quite tough',
'labels': ['emotion/neutral']},
{'text': 'Dog in catalan is gos', 'labels': ['emotion/neutral']}]

Exporting data from Nuclia.cloud

Creating a Dataset from a Nuclia KB

Another way of generating your arrow files is from a NucliaDBDataset object. This is the ideal way to export from nuclia.cloud, but you can also use it for your local NucliaDB. For this you first need to create a nucliadb_sdk.NucliaDB.

NucliaDB connects you to Nuclia. It takes the following parameters:

  • url: URL of the target KB, usually in this format: "https://europe-1.nuclia.cloud/api".
  • api_key: key to access the KB through the API, obtained here.
  • region: Region.ON_PREM for local NucliaDB, Region.EUROPE1 for Nuclia.cloud.

Example:

from nucliadb_sdk import Region, NucliaDB

sdk = NucliaDB(
    api_key=my_api_key,
    region=Region.EUROPE1,
    url="https://europe-1.nuclia.cloud/api",
)

Once you have your NucliaDB, you can create a NucliaDBDataset. You will need the following parameters:

  • sdk: Your previously created NucliaDB.
  • labels: Array with the target label sets or entity families.
  • task: Selected from those defined in the Task class.
  • base_path: (optional) Path to the directory where you want your files to go. This will be the current directory by default.

Example for a SENTENCE_CLASSIFICATION task on a label set called movie_type:

import tempfile

from nucliadb_dataset.dataset import NucliaDBDataset, Task

# tmpdirname is assumed to be a temporary directory for the generated arrow files
tmpdirname = tempfile.TemporaryDirectory()

dataset_reader = NucliaDBDataset(
    sdk=sdk,
    kbid=kbid,
    labels=["movie_type"],
    task=Task.SENTENCE_CLASSIFICATION,
    base_path=tmpdirname.name,
)

Exporting sentences and labels to a Pandas Dataframe

Once you have a NucliaDBDataset created like the one below:

import tempfile

from nucliadb_dataset.dataset import NucliaDBDataset, Task

# tmpdirname is assumed to be a temporary directory for the generated arrow files
tmpdirname = tempfile.TemporaryDirectory()

dataset_reader = NucliaDBDataset(
    sdk=sdk,
    kbid=kbid,
    task=Task.SENTENCE_CLASSIFICATION,
    labels=["movie_type"],
    base_path=tmpdirname.name,
)

You can use the read_all_partitions method, which returns a list of all the generated arrow files.

It has the following parameters:

  • force: (optional) By default it does not overwrite arrow files you have already downloaded; if set to True, it does.
  • path: (optional) The path where the arrow files will be stored. By default it takes the path specified when instantiating the NucliaDBDataset object, or the current path if none was provided.

This updates arrow_filenames with the list of arrow files:

arrow_filenames = dataset_reader.read_all_partitions()
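
If the files are already on disk and you want to regenerate them, a minimal sketch using the force parameter described above:

# Overwrite any previously downloaded arrow files
arrow_filenames = dataset_reader.read_all_partitions(force=True)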

Then you can read the arrows and load them into a pandas dataframe:

import pyarrow as pa

for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_pandas_data = loaded_array.to_pandas()

Where my_pandas_data looks like this:

      text    labels
0     After a virus turns most people into zombies, the world's surviving humans remain locked in an ongoing battle against the hungry undead.    ['movie_type/horror']
1     Four survivors -- Tallahassee (Woody Harrelson) and his cohorts Columbus (Jesse Eisenberg), Wichita (Emma Stone) and Little Rock (Abigail Breslin) -- abide by a list of survival rules and zombie-killing strategies as they make their way toward a rumored safe haven in Los Angeles.    ['movie_type/horror']
...   ...    ...
1100  All Quiet on the Western Front tells the gripping story of a young German soldier on the Western Front of World War I. Paul and his comrades experience first-hand how the initial euphoria of war turns into desperation and fear as they fight for their lives, and each other, in the trenches.    ['movie_type/action']
1101  The film from director Edward Berger is based on the world renowned bestseller of the same name by Erich Maria Remarque.    ['movie_type/action']

Exporting fields and labels to a HF Dataset

You can create a nucliadb_sdk.NucliaDB and a NucliaDBDataset as previously explained, but using the task Task.FIELD_CLASSIFICATION:

from nucliadb_dataset.dataset import NucliaDBDataset, Task

dataset_reader = NucliaDBDataset(
    sdk=sdk,
    kbid=kbid,
    task=Task.FIELD_CLASSIFICATION,
    labels=["sentiment_resources"],
)

Then you can download the arrows with read_all_partitions and convert them to a HF Dataset:

from datasets import Dataset, concatenate_datasets

all_datasets = []
for filename in dataset_reader.read_all_partitions():
    all_datasets.append(Dataset.from_file(filename))
my_hf_dataset = concatenate_datasets(all_datasets)

The content of my_hf_dataset will look like this:

{'text': [" \n okay i'm sorry but TAYLOR SWIFT LOOKS NOTHING \n LIKE ..... Luck and Hilton puts you in a good place \n going into NFL Sunday. \n \n \n ", 'new_sentiment_export.pdf\n'],
'labels': [['sentiment_resources/positive'], ['sentiment_resources/positive']]}

Note that the title and content of a file are exported separately. This is because they are different fields.

Exporting text blocks and labels to a Polars Dataframe

You can create a nucliadb_sdk.NucliaDB and a NucliaDBDataset as previously explained, but using the task Task.PARAGRAPH_CLASSIFICATION:

from nucliadb_dataset.dataset import NucliaDBDataset, Task

dataset_reader = NucliaDBDataset(
    sdk=sdk,
    kbid=kbid,
    task=Task.PARAGRAPH_CLASSIFICATION,
    labels=["p_football_teams"],
)

Then you can download the arrows with read_all_partitions and convert them to a Polars Dataframe:

import polars
import pyarrow as pa

for file in dataset_reader.read_all_partitions():
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_polars_data = polars.from_arrow(loaded_array)

The content of my_polars_data will look like this:

text	labels
str list[str]
" El Real Bet... ["p_football_teams/Real Betis"]
" La constitu... ["p_football_teams/Real Madrid"]
" El fútbol e... ["p_football_teams/Real Madrid"]
" Fue él quie... ["p_football_teams/Real Madrid"]
" Procedente ... ["p_football_teams/Real Madrid"]
" El 29 de no... ["p_football_teams/FC Barcelona"]
" EL FC BARCE... ["p_football_teams/FC Barcelona"]
"11-fcb-resourc... ["p_football_teams/Real Madrid"]
" Francesc Tito... ["p_football_teams/FC Barcelona"]
" Vilanova ta... ["p_football_teams/FC Barcelona"]
" LA MUERTE D... ["p_football_teams/Real Madrid"]
" Josep Suñol, ... ["p_football_teams/FC Barcelona"]
" EL CAMPO DE... ["p_football_teams/FC Barcelona"]
" LOS AÑOS 30... ["p_football_teams/FC Barcelona"]
" Lo que a pr... ["p_football_teams/Real Madrid"]
" A comienzos... ["p_football_teams/Real Madrid"]
" Primero, al V... ["p_football_teams/Real Madrid"]
" Tras quince... ["p_football_teams/Real Betis"]
" En cambio, el... ["p_football_teams/Real Betis"]
" LA COPA LAT... ["p_football_teams/FC Barcelona"]

Converting your arrow files to different formats

Once you have a list of generated arrow files, obtained either via download_all_partitions or read_all_partitions, you can easily convert them to many different formats.

Here is the data in case you want to reproduce these examples (these examples use a local NucliaDB, but the same conversions work with any KB):

from nucliadb_sdk import NucliaDB
from nucliadb_models.metadata import UserMetadata

sdk = NucliaDB()

my_kb = sdk.create_knowledge_box(slug="my_new_kb").uuid
sentences = ["I'm Sierra, a very happy dog", "She's having a terrible day", "what a delighful day", "Dog in catalan is gos", "he is heartbroken", "He said that the race is quite tough", "love is tough"]
labels = [("emotion", "positive"), ("emotion", "negative"), ("emotion", "positive"), ("emotion", "neutral"), ("emotion", "negative"), ("emotion", "neutral"), ("emotion", "negative")]
for i in range(len(sentences)):
    sdk.create_resource(
        kbid=my_kb,
        texts={"text": {"body": sentences[i]}},
        usermetadata=UserMetadata(
            classifications=[{"labelset": labels[i][0], "label": labels[i][1]}]
        ),
    )

Wait for the data to be processed by Nuclia, and then:

from nucliadb_dataset.dataset import download_all_partitions

arrow_filenames = download_all_partitions(
    task="FIELD_CLASSIFICATION",
    slug="my_new_kb",
    labels=["emotion"],
)

Convert to a HF Dataset

From the list of generated arrow files arrow_filenames you can convert to a HuggingFace Dataset with only these few lines of code:

from datasets import Dataset, concatenate_datasets

all_datasets = []
for filename in arrow_filenames:
    all_datasets.append(Dataset.from_file(filename))
ds = concatenate_datasets(all_datasets)

The Dataset will look like this:

Dataset({
    features: ['text', 'labels'],
    num_rows: 7
})

With this content:

{'text': ["I'm Sierra, a very happy dog",
'love is tough',
'he is heartbroken',
'what a delighful day',
'Dog in catalan is gos',
'He said that the race is quite tough'],
'labels': [['emotion/positive'],
['emotion/negative'],
['emotion/negative'],
['emotion/positive'],
['emotion/neutral'],
['emotion/neutral']]}

Convert to a Pandas Dataframe

From the list of generated arrow files arrow_filenames you can convert to a Pandas Dataframe with these few lines of code:

import pyarrow as pa

for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_pandas_data = loaded_array.to_pandas()

You will get something like this:

   text                                    labels
0  I'm Sierra, a very happy dog            [emotion/positive]
1  love is tough                           [emotion/negative]
2  he is heartbroken                       [emotion/negative]
3  what a delighful day                    [emotion/positive]
4  Dog in catalan is gos                   [emotion/neutral]
5  He said that the race is quite tough    [emotion/neutral]
6  She's having a terrible day             [emotion/negative]

Convert to a Python list of dictionaries

From the list of generated arrow files arrow_filenames you can convert to a list of Python dictionaries with these few lines of code:

import pyarrow as pa

for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()

my_data_dict = loaded_array.to_pylist()

You will get something like this:

[{'text': "I'm Sierra, a very happy dog", 'labels': ['emotion/positive']},
{'text': 'love is tough', 'labels': ['emotion/negative']},
{'text': 'he is heartbroken', 'labels': ['emotion/negative']},
{'text': 'what a delighful day', 'labels': ['emotion/positive']},
{'text': 'Dog in catalan is gos', 'labels': ['emotion/neutral']},
{'text': 'He said that the race is quite tough',
'labels': ['emotion/neutral']},
{'text': "She's having a terrible day", 'labels': ['emotion/negative']}]

Convert to a Polars Dataframe

From the list of generated arrow files arrow_filenames you can convert to a Polars Dataframe with these few lines of code:

import pyarrow as pa
import polars

for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_polars_data = polars.from_arrow(loaded_array)

You will get something like this:

text	labels
str list[str]
"I'm Sierra, a ... ["emotion/positive"]
"love is tough" ["emotion/negative"]
"he is heartbro... ["emotion/negative"]
"what a delighf... ["emotion/positive"]
"Dog in catalan... ["emotion/neutral"]
"He said that t... ["emotion/neutral"]
"She's having a... ["emotion/negative"]