NucliaDB Dataset
nucliadb-dataset is an open-source Python library designed to export your NucliaDB data to arrow files compatible with most NLP/ML dataset formats. It allows you to export data from either a local NucliaDB or Nuclia.cloud.
Installation
To make good use of this library you should have already uploaded resources into your local NucliaDB or into Nuclia.
In case you do not have NucliaDB installed, you can either:
- Run NucliaDB docker image:
docker run -it \
  -e LOG=INFO \
  -p 8080:8080 \
  -p 8060:8060 \
  -p 8040:8040 \
  -v nucliadb-standalone:/data \
  nuclia/nucliadb:latest
- Or install with pip and run:
pip install nucliadb
nucliadb
To get started with Nuclia, follow this link.
Once you are set up, you can install the library via pip:
pip install nucliadb-dataset
Basic concepts
Before you look through our docs, here are some useful concepts:
- `KnowledgeBox`: Our concept of a data container, often referred to as KB.
- `NucliaDBDataset`: Data structure that represents the resources contained in a KB in a ready-to-train format, usually filtered by one or more label sets/sets of entities.
- Partitions: For big KBs your data can be stored in different logic units; an arrow file will be generated for each one.
- TASKS: Describes the granularity of the task for which you are exporting the data in arrow files. All of them export an arrow with the fields `text` and `labels`, but with different content:
  - `PARAGRAPH_CLASSIFICATION`: Returns text blocks as `text` and the array of labels that correspond to each text block in `labels`.
  - `FIELD_CLASSIFICATION`: Returns resource fields as `text` and the array of labels that correspond to each field in `labels`.
  - `SENTENCE_CLASSIFICATION`: Returns strings as `text` and the array of labels that correspond to each sentence in `labels`. In this case the labels can come from either the field or the text block level.
  - `TOKEN_CLASSIFICATION`: Returns an array of tokens as `text` and the NER annotations of those tokens in BIO format in `labels`.

Note that `PARAGRAPH_CLASSIFICATION` and `SENTENCE_CLASSIFICATION` are only available for resources processed through the Nuclia platform or with an API_KEY.
Labels always appear in Nuclia's format `labelset_name/label_name`.
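As a quick illustration of that label format, here is a tiny, library-free sketch of splitting an exported label into its label set and label name:

# Labels are exported as "labelset_name/label_name"; plain string handling is enough to split them
label = "emotion/positive"
labelset_name, label_name = label.split("/", 1)
print(labelset_name)  # emotion
print(label_name)     # positive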
Exporting data from a local NucliaDB
The most straightforward way to export arrow files from a KB, particularly if you have a local NucliaDB, is to use the function `download_all_partitions` from `nucliadb_dataset`.
Parameters:
- `task`: Determines the format and source of the exported data. It has to be one of the tasks defined above.
- `nucliadb_base_url`: (optional) The base URL of the NucliaDB from which you will get your data. By default this is `http://localhost:8080`.
- `path`: (optional) The path to the directory where you want your files to go. This is the current directory by default.
- `kbid`: (optional) ID of the KB from which you want to export the data.
- `slug`: (optional) Slug corresponding to the KB from which you want to export the data.
- `labels`: List of strings with either the label sets or entity families you want to export.
- `sdk`: (optional) Instance of the `nucliadb_sdk.NucliaDB` object you are interacting with. If not set, one will be created pointing to `nucliadb_base_url`.

Note: you will need either the `kbid` or the `slug` parameter to locate your KB.
Returns:
- List with all the paths of the generated arrow files.
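For example, a minimal sketch based on the parameters above (the `my_kb` slug, the `emotion` label set and the output path are placeholders for your own values):

from nucliadb_dataset.dataset import download_all_partitions

# Export FIELD_CLASSIFICATION data from a local NucliaDB into ./exports
arrow_filenames = download_all_partitions(
    task="FIELD_CLASSIFICATION",
    nucliadb_base_url="http://localhost:8080",  # default value, shown for clarity
    slug="my_kb",                               # placeholder: your KB slug
    labels=["emotion"],                         # placeholder: your label set
    path="./exports",                           # placeholder: output directory
)
print(arrow_filenames)  # paths of the generated arrow files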
Creating and exporting tokens and entities from a local KB
Here is an example of how to upload resources with entities, generate its arrow file and create a HuggingFace Dataset:
You will create or retrieve the KB from your NucliaDB:
from nucliadb_sdk import NucliaDB, exceptions
def get_or_create_kb(sdk: NucliaDB, slug: str) -> str:
    try:
        kb = sdk.get_knowledge_box_by_slug(slug=slug)
    except exceptions.NotFoundError:
        kb = sdk.create_knowledge_box(slug=slug)
    return kb.uuid
sdk = NucliaDB()
kbid = get_or_create_kb(sdk, "entity_test")
Then you will define the entity family you are going to use, and its entities:
from nucliadb_models.entities import Entity, CreateEntitiesGroupPayload
entity_group = CreateEntitiesGroupPayload(
    group="TECH",
    entities={
        "NucliaDB": Entity(value="NucliaDB"),
        "ChatGPT": Entity(value="ChatGPT"),
    }
)
sdk.create_entitygroup(kbid=kbid, content=entity_group)
Upload some resources:
from nucliadb_models.metadata import UserFieldMetadata
sdk.create_resource(
    kbid=kbid,
    texts={"text": {"body": "I'm NucliaDB"}},
    fieldmetadata=[UserFieldMetadata(
        field={"field_type": "text", "field": "text"},
        token=[{"token": "NucliaDB", "klass": "TECH", "start": 4, "end": 12}]
    )],
)
sdk.create_resource(
    kbid=kbid,
    texts={"text": {"body": "I'm not ChatGPT"}},
    fieldmetadata=[UserFieldMetadata(
        field={"field_type": "text", "field": "text"},
        token=[{"token": "ChatGPT", "klass": "TECH", "start": 8, "end": 15}]
    )],
)
sdk.create_resource(
    kbid=kbid,
    texts={"text": {"body": "Natural language processing is the future"}}
)
Then you can download the files:
from nucliadb_dataset.dataset import download_all_partitions
arrow_filenames = download_all_partitions(
    task="TOKEN_CLASSIFICATION",
    kbid=kbid,
    labels=["TECH"],
    sdk=sdk
)
And load them in a HF Dataset:
from datasets import Dataset, concatenate_datasets
all_datasets = []
for filename in arrow_filenames:
    all_datasets.append(Dataset.from_file(filename))
ds = concatenate_datasets(all_datasets)
The Dataset will look like this:
Dataset({
    features: ['text', 'labels'],
    num_rows: 2
})
With the tokenized sentences and their BIO labels:
ds["text"]
# [["I'm", 'not', 'ChatGPT'], ["I'm", 'NucliaDB']]
ds["labels"]
# [['O', 'O', 'B-TECH'], ['O', 'B-TECH']]
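If you plan to fine-tune a token-classification model on this dataset, most frameworks expect an integer id per BIO tag. Here is a minimal sketch (plain Python, not part of nucliadb-dataset) of building that mapping from the exported labels:

# Collect the BIO tags seen in the dataset and map them to integer ids
unique_tags = sorted({tag for tags in ds["labels"] for tag in tags})
tag2id = {tag: i for i, tag in enumerate(unique_tags)}
print(tag2id)  # e.g. {'B-TECH': 0, 'O': 1}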
Exporting fields and labels from a local KB to a Python dictionary list
For instance, if you have a KB called `my_new_kb` in your NucliaDB with resource labels from a label set called `emotion`, you can generate an arrow with the text from each resource as `text` and its labels in `labels`.
To do so, we need to use the task `FIELD_CLASSIFICATION`.
First you will generate the arrow files:
from nucliadb_dataset.dataset import download_all_partitions
import pyarrow as pa
arrow_filenames = download_all_partitions(
    task="FIELD_CLASSIFICATION",
    slug="my_new_kb",
    labels=["emotion"]
)
Then you can read the arrows and convert them to a python list of dictionaries:
import pyarrow as pa
for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_data_dict = loaded_array.to_pylist()
The contents of `my_data_dict` would look like this:
[{'text': 'love is tough', 'labels': ['emotion/negative']},
{'text': "Valentine's day is next week", 'labels': ['emotion/neutral']},
{'text': 'he is heartbroken', 'labels': ['emotion/negative']},
{'text': "I'm Sierra, a very happy dog", 'labels': ['emotion/positive']},
{'text': 'what a delighful day', 'labels': ['emotion/positive']},
{'text': 'He said that the race is quite tough',
'labels': ['emotion/neutral']},
{'text': 'Dog in catalan is gos', 'labels': ['emotion/neutral']}]
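As a quick sanity check on the export (plain Python, no extra dependencies), you can count how many fields carry each label:

from collections import Counter

label_counts = Counter(label for row in my_data_dict for label in row["labels"])
print(label_counts)  # e.g. Counter({'emotion/neutral': 3, 'emotion/negative': 2, 'emotion/positive': 2})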
Exporting data from Nuclia.cloud
Creating a Dataset from a Nuclia KB
Another way of generating your arrow files is from a `NucliaDBDataset` object. This is the ideal way to export from Nuclia.cloud, but you can also use it with your local NucliaDB.
For this we will first need to create a `nucliadb_sdk.NucliaDB`.
`NucliaDB` connects you to Nuclia. It takes the following parameters:
- `url`: URL of the target KB, usually in this format: "https://europe-1.nuclia.cloud/api".
- `api_key`: key to access the KB through the API, obtained here.
- `region`: `Region.ON_PREM` for a local NucliaDB, `Region.EUROPE1` for Nuclia.cloud.
Example:
from nucliadb_sdk import Region, NucliaDB
sdk = NucliaDB(
    api_key=my_api_key,
    region=Region.EUROPE1,
    url="https://europe-1.nuclia.cloud/api",
)
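A common variation (not required by the SDK) is to keep the API key out of your code and read it from an environment variable instead:

import os

from nucliadb_sdk import Region, NucliaDB

sdk = NucliaDB(
    api_key=os.environ["NUCLIA_API_KEY"],  # assumes you exported this variable yourself
    region=Region.EUROPE1,
    url="https://europe-1.nuclia.cloud/api",
)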
Once you have your `NucliaDB`, you can create a `NucliaDBDataset`.
You will need the following parameters:
- `sdk`: Your previously created `NucliaDB`.
- `labels`: Array with the target label sets or entity families.
- `task`: Selected from those defined in the class `Task`.
- `base_path`: (optional) Path to the directory where you want your files to go. This will be the current directory by default.

Example for a `SENTENCE_CLASSIFICATION` task on a label set called `movie_type`:
import tempfile

from nucliadb_dataset.dataset import NucliaDBDataset, Task

# Temporary directory to hold the generated arrow files
tmpdirname = tempfile.TemporaryDirectory()

dataset_reader = NucliaDBDataset(
    sdk=sdk,
    kbid=kbid,
    labels=["movie_type"],
    task=Task.SENTENCE_CLASSIFICATION,
    base_path=tmpdirname.name,
)
Exporting sentences and labels to a Pandas Dataframe
Once you have a `NucliaDBDataset` created like the one below:
import tempfile

from nucliadb_dataset.dataset import NucliaDBDataset, Task

# Temporary directory to hold the generated arrow files
tmpdirname = tempfile.TemporaryDirectory()

dataset_reader = NucliaDBDataset(
    sdk=sdk,
    kbid=kbid,
    task=Task.SENTENCE_CLASSIFICATION,
    labels=["movie_type"],
    base_path=tmpdirname.name,
)
You can use the method `read_all_partitions`, which will return a list of all the generated arrow files.
It has the following parameters:
- `force`: (optional) By default it does not overwrite the arrow files if you have already downloaded them; if set to `True` it does.
- `path`: (optional) The path where the arrow files will be stored. By default it takes the path specified when instantiating the `NucliaDBDataset` object, or the current path if none was provided.

This fills `arrow_filenames` with the list of arrow files:
arrow_filenames = dataset_reader.read_all_partitions()
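If the arrow files are already on disk from a previous run, a small sketch based on the parameters described above (the output directory is a placeholder) would force a fresh export:

# force=True regenerates the arrow files even if they were downloaded before
arrow_filenames = dataset_reader.read_all_partitions(force=True, path="/tmp/exports")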
Then you can read the arrows and load them into a pandas dataframe:
import pyarrow as pa
for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_pandas_data = loaded_array.to_pandas()
Where `my_pandas_data` looks like this:

| | text | labels |
|---|---|---|
| 0 | After a virus turns most people into zombies, the world's surviving humans remain locked in an ongoing battle against the hungry undead. | ['movie_type/horror'] |
| 1 | Four survivors -- Tallahassee (Woody Harrelson) and his cohorts Columbus (Jesse Eisenberg), Wichita (Emma Stone) and Little Rock (Abigail Breslin) -- abide by a list of survival rules and zombie-killing strategies as they make their way toward a rumored safe haven in Los Angeles. | ['movie_type/horror'] |
| ... | ... | ... |
| 1100 | All Quiet on the Western Front tells the gripping story of a young German soldier on the Western Front of World War I. Paul and his comrades experience first-hand how the initial euphoria of war turns into desperation and fear as they fight for their lives, and each other, in the trenches. | ['movie_type/action'] |
| 1101 | The film from director Edward Berger is based on the world renowned bestseller of the same name by Erich Maria Remarque. | ['movie_type/action'] |
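Note that the loop above keeps only the last partition in `my_pandas_data`. If your KB has several partitions, a minimal sketch of concatenating them all (assuming pandas is installed) could look like this:

import pandas as pd
import pyarrow as pa

frames = []
for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        frames.append(pa.ipc.open_stream(source).read_all().to_pandas())
# One dataframe with the rows from every partition
my_pandas_data = pd.concat(frames, ignore_index=True)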
Exporting fields and labels to a HF Dataset
You can create a `nucliadb_sdk.NucliaDB` and a `NucliaDBDataset` as previously explained, but using the task `Task.FIELD_CLASSIFICATION`:
from nucliadb_dataset.dataset import NucliaDBDataset, Task

dataset_reader = NucliaDBDataset(
    sdk=sdk,
    kbid=kbid,
    task=Task.FIELD_CLASSIFICATION,
    labels=["sentiment_resources"],
)
Then you can download the arrows with `read_all_partitions` and convert them to a HF Dataset:
from datasets import Dataset, concatenate_datasets
all_datasets = []
for filename in dataset_reader.read_all_partitions():
    all_datasets.append(Dataset.from_file(filename))
my_hf_dataset = concatenate_datasets(all_datasets)
The content of `my_hf_dataset` will look like this:
{'text': [" \n okay i'm sorry but TAYLOR SWIFT LOOKS NOTHING \n LIKE ..... Luck and Hilton puts you in a good place \n going into NFL Sunday. \n \n \n ", 'new_sentiment_export.pdf\n'],
'labels': [['sentiment_resources/positive'], ['sentiment_resources/positive']]}
Note that the title and the content of a file are exported separately; this is because they are different fields.
Exporting text blocks and labels to a Polars Dataframe
You can create a `nucliadb_sdk.NucliaDB` and a `NucliaDBDataset` as previously explained, but using the task `Task.PARAGRAPH_CLASSIFICATION`:
from nucliadb_dataset.dataset import NucliaDBDataset, Task

dataset_reader = NucliaDBDataset(
    sdk=sdk,
    kbid=kbid,
    task=Task.PARAGRAPH_CLASSIFICATION,
    labels=["p_football_teams"],
)
Then you will download the arrows with `read_all_partitions` and convert them to a Polars Dataframe:
import polars
import pyarrow as pa
for file in dataset_reader.read_all_partitions():
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_polars_data = polars.from_arrow(loaded_array)
The content of `my_polars_data` will look like this:
text labels
str list[str]
" El Real Bet... ["p_football_teams/Real Betis"]
" La constitu... ["p_football_teams/Real Madrid"]
" El fútbol e... ["p_football_teams/Real Madrid"]
" Fue él quie... ["p_football_teams/Real Madrid"]
" Procedente ... ["p_football_teams/Real Madrid"]
" El 29 de no... ["p_football_teams/FC Barcelona"]
" EL FC BARCE... ["p_football_teams/FC Barcelona"]
"11-fcb-resourc... ["p_football_teams/Real Madrid"]
" Francesc Tito... ["p_football_teams/FC Barcelona"]
" Vilanova ta... ["p_football_teams/FC Barcelona"]
" LA MUERTE D... ["p_football_teams/Real Madrid"]
" Josep Suñol, ... ["p_football_teams/FC Barcelona"]
" EL CAMPO DE... ["p_football_teams/FC Barcelona"]
" LOS AÑOS 30... ["p_football_teams/FC Barcelona"]
" Lo que a pr... ["p_football_teams/Real Madrid"]
" A comienzos... ["p_football_teams/Real Madrid"]
" Primero, al V... ["p_football_teams/Real Madrid"]
" Tras quince... ["p_football_teams/Real Betis"]
" En cambio, el... ["p_football_teams/Real Betis"]
" LA COPA LAT... ["p_football_teams/FC Barcelona"]
Converting your arrow files to different formats
Once you have a list of generated arrow files, obtained either via `download_all_partitions` or `read_all_partitions`, you can easily convert them to many different formats.
Here is the data in case you want to reproduce these examples (these examples use a local NucliaDB, but the same conversions work with any KB):
from nucliadb_sdk import NucliaDB
from nucliadb_models.metadata import UserMetadata
sdk = NucliaDB()
my_kb = sdk.create_knowledge_box(slug="my_new_kb").uuid
sentences = ["I'm Sierra, a very happy dog", "She's having a terrible day", "what a delighful day", "Dog in catalan is gos", "he is heartbroken", "He said that the race is quite tough", "love is tough"]
labels = [("emotion", "positive"), ("emotion", "negative"), ("emotion", "positive"), ("emotion", "neutral"), ("emotion", "negative"), ("emotion", "neutral"), ("emotion", "negative")]
for sentence, (labelset, label) in zip(sentences, labels):
    sdk.create_resource(
        kbid=my_kb,
        texts={"text": {"body": sentence}},
        usermetadata=UserMetadata(
            classifications=[{"labelset": labelset, "label": label}]
        ),
    )
Wait for the data to be processed by Nuclia, and then:
from nucliadb_dataset.dataset import download_all_partitions

arrow_filenames = download_all_partitions(
    task="FIELD_CLASSIFICATION",
    slug="my_new_kb",
    labels=["emotion"]
)
Convert to a HF Dataset
From the list of generated arrow files `arrow_filenames` you can convert to a HuggingFace Dataset with only these few lines of code:
from datasets import Dataset, concatenate_datasets
all_datasets = []
for filename in arrow_filenames:
    all_datasets.append(Dataset.from_file(filename))
ds = concatenate_datasets(all_datasets)
The Dataset will look like this:
Dataset({
    features: ['text', 'labels'],
    num_rows: 7
})
With this content:
{'text': ["I'm Sierra, a very happy dog",
'love is tough',
'he is heartbroken',
'what a delighful day',
'Dog in catalan is gos',
'He said that the race is quite tough'],
'labels': [['emotion/positive'],
['emotion/negative'],
['emotion/negative'],
['emotion/positive'],
['emotion/neutral'],
['emotion/neutral']]}
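From here, the standard datasets utilities apply; for example (not part of nucliadb-dataset), a quick train/test split before fine-tuning a classifier:

# train_test_split is a standard method of datasets.Dataset
splits = ds.train_test_split(test_size=0.2, seed=42)
print(splits["train"].num_rows, splits["test"].num_rows)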
Convert to a Pandas Dataframe
From the list of generated arrow files `arrow_filenames` you can convert to a Pandas Dataframe with these few lines of code:
import pyarrow as pa
for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_pandas_data = loaded_array.to_pandas()
You will get something like this:
text labels
0 I'm Sierra, a very happy dog [emotion/positive]
1 love is tough [emotion/negative]
2 he is heartbroken [emotion/negative]
3 what a delighful day [emotion/positive]
4 Dog in catalan is gos [emotion/neutral]
5 He said that the race is quite tough [emotion/neutral]
6 She's having a terrible day [emotion/negative]
Convert to a Python list of dictionaries
From the list of generated arrow files `arrow_filenames` you can convert to a list of Python dictionaries with these few lines of code:
import pyarrow as pa
for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_data_dict = loaded_array.to_pylist()
You will get something like this:
[{'text': "I'm Sierra, a very happy dog", 'labels': ['emotion/positive']},
{'text': 'love is tough', 'labels': ['emotion/negative']},
{'text': 'he is heartbroken', 'labels': ['emotion/negative']},
{'text': 'what a delighful day', 'labels': ['emotion/positive']},
{'text': 'Dog in catalan is gos', 'labels': ['emotion/neutral']},
{'text': 'He said that the race is quite tough',
'labels': ['emotion/neutral']},
{'text': "She's having a terrible day", 'labels': ['emotion/negative']}]
Convert to a Polars Dataframe
From the list of generated arrow files `arrow_filenames` you can convert to a Polars Dataframe with these few lines of code:
import pyarrow as pa
import polars
for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        loaded_array = pa.ipc.open_stream(source).read_all()
        my_polars_data = polars.from_arrow(loaded_array)
You will get something like this:
text labels
str list[str]
"I'm Sierra, a ... ["emotion/positive"]
"love is tough" ["emotion/negative"]
"he is heartbro... ["emotion/negative"]
"what a delighf... ["emotion/positive"]
"Dog in catalan... ["emotion/neutral"]
"He said that t... ["emotion/neutral"]
"She's having a... ["emotion/negative"]