NucliaDB Dataset
nucliadb-dataset is an open-source Python library designed to export your NucliaDB data to arrow files compatible with most NLP/ML dataset formats. It allows you to export data from either a local NucliaDB or Nuclia.cloud.
Installation
To make good use of this library you should have already uploaded resources into your local NucliaDB or to Nuclia.
In case you do not have NucliaDB installed, you can either:
- Run NucliaDB docker image:
docker run -it \
-e LOG=INFO \
-p 8080:8080 \
-p 8060:8060 \
-p 8040:8040 \
-v nucliadb-standalone:/data \
nuclia/nucliadb:latest
- Or install with pip and run:
pip install nucliadb
nucliadb
To get started with Nuclia, follow this link.
Once you are set up, you can install the library via pip:
pip install nucliadb-dataset
Basic concepts
Before you look through our docs, here are some useful concepts:
- KnowledgeBox: Our concept of a data container, often referred to as KB.
- NucliaDBDataset: Data structure that represents resources contained in a KB in a ready-to-train format, usually filtered by one or more label sets or sets of entities.
- Partitions: For big KBs your data can be stored in different logic units; an arrow file will be generated for each one.
- TASKS: Describes the granularity of the task for which you are exporting the data in arrow files. All of them export an arrow file with the fields text and labels, but with different content.
  - PARAGRAPH_CLASSIFICATION: Returns paragraphs as text and the array of labels that correspond to that paragraph in labels.
  - FIELD_CLASSIFICATION: Returns resource fields as text and the array of labels that correspond to that field in labels.
  - SENTENCE_CLASSIFICATION: Returns sentences as text and the array of labels that correspond to that sentence in labels. In this case the labels can come from either the field or the paragraph level.
  - TOKEN_CLASSIFICATION: Returns an array of tokens as text and the NER annotations of those tokens in BIO format in labels.
PARAGRAPH_CLASSIFICATION and SENTENCE_CLASSIFICATION are only available for resources processed through the Nuclia platform or with an API_KEY.
Labels always appear in Nuclia's format labelset_name/label_name.
Exporting data from a local NucliaDB
The most straightforward way to export arrow files from a KB, particularly if you have a local NucliaDB, is to use the function download_all_partitions from nucliadb_dataset.
Parameters:
- task: Determines the format and source of the exported data. It has to be one of the tasks defined above.
- nucliadb_base_url: (optional) The base URL of the NucliaDB from which you will get your data. By default this is http://localhost:8080.
- path: (optional) The path to the directory where you want your files to go. This will be the current directory by default.
- knowledgebox: (optional) KB object corresponding to the KB from which you want to export the data.
- slug: (optional) Slug corresponding to the KB from which you want to export the data.
- labels: List of strings with either the label sets or entity families you want to export.
Note: you will need either the knowledgebox or the slug parameter to locate your KB.
Returns:
- List with all the paths of the generated arrow files.
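For example, here is a minimal sketch of a call that sets the optional parameters explicitly; the base URL, output path, slug and label set used here are placeholders to adapt to your own KB:
from nucliadb_dataset.dataset import download_all_partitions
# Placeholder values: adjust the URL, output directory, slug and label set to your KB.
arrow_filenames = download_all_partitions(
    task="FIELD_CLASSIFICATION",
    nucliadb_base_url="http://localhost:8080",
    path="./exports",
    slug="my_new_kb",
    labels=["emotion"],
)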
Creating and exporting tokens and entities from a local KB
Here is an example of how to upload resources with entities, generate their arrow files and create a HuggingFace Dataset:
You will create or retrieve the KB from your NucliaDB:
from nucliadb_sdk import get_or_create
my_kb = get_or_create("entity_test")
Then you will define the entity family you are going to use, and its entities:
from nucliadb_sdk import KnowledgeBox
my_kb.set_entities("TECH", ["NucliaDB", "ChatGPT"])
Upload some resources:
from nucliadb_sdk import Entity, KnowledgeBox
my_kb.upload(
text="I'm NucliaDB",
entities=[Entity(type="TECH", value="NucliaDB", positions=[(4, 12)])],
)
my_kb.upload(
text="I'm not ChatGPT",
entities=[Entity(type="TECH", value="ChatGPT", positions=[(8, 15)])],
)
my_kb.upload(
text="Natural language processing is the future"
)
Then you can download the files:
from nucliadb_dataset.dataset import download_all_partitions
arrow_filenames = download_all_partitions(
task="TOKEN_CLASSIFICATION",
knowledgebox=my_kb,
labels=["TECH"]
)
And load them in a HF Dataset:
from datasets import Dataset, concatenate_datasets
all_datasets = []
for filename in arrow_filenames:
all_datasets.append(Dataset.from_file(filename))
ds = concatenate_datasets(all_datasets)
The Dataset will look like this:
Dataset({
features: ['text', 'labels'],
num_rows: 2
})
With the tagged sentences and labels:
ds["text"]
## [["I'm", 'not', 'ChatGPT'], ["I'm", 'NucliaDB']]
ds["labels"]
## [['O', 'O', 'B-TECH'], ['O', 'B-TECH']]
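If you plan to fine-tune a HuggingFace token-classification model, the string tags usually have to be mapped to integer ids first. A minimal sketch on top of the ds object built above; the label_ids column name is just an example, not part of nucliadb-dataset:
# Collect the BIO tags present in the dataset and map them to integer ids.
unique_tags = sorted({tag for tags in ds["labels"] for tag in tags})
tag2id = {tag: i for i, tag in enumerate(unique_tags)}
# Add a numeric column alongside the original string tags.
ds = ds.map(lambda row: {"label_ids": [tag2id[tag] for tag in row["labels"]]})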
Exporting fields and labels from a local KB to a Python dictionary list
For instance, if you have a KB called my_new_kb in your NucliaDB with resource labels from a label set called emotion, you can generate an arrow file with the text from each resource as text and its labels in labels. To do so, we need to use the task FIELD_CLASSIFICATION.
First you will generate the arrow files:
from nucliadb_dataset.dataset import download_all_partitions
import pyarrow as pa
arrow_filenames = download_all_partitions(
task="FIELD_CLASSIFICATION",
slug="my_new_kb",
labels=["emotion"]
)
Then you can read the arrows and convert them to a Python list of dictionaries:
import pyarrow as pa
for file in arrow_filenames:
with pa.memory_map(file, "rb") as source:
loaded_array = pa.ipc.open_stream(source).read_all()
my_data_dict = loaded_array.to_pylist()
The contents of my_data_dict would look like this:
[{'text': 'love is tough', 'labels': ['emotion/negative']},
{'text': "Valentine's day is next week", 'labels': ['emotion/neutral']},
{'text': 'he is heartbroken', 'labels': ['emotion/negative']},
{'text': "I'm Sierra, a very happy dog", 'labels': ['emotion/positive']},
{'text': 'what a delighful day', 'labels': ['emotion/positive']},
{'text': 'He said that the race is quite tough',
'labels': ['emotion/neutral']},
{'text': 'Dog in catalan is gos', 'labels': ['emotion/neutral']}]
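Since labels always come in Nuclia's labelset_name/label_name format, you can strip the label set prefix when your training pipeline only needs the label itself. A small sketch over the my_data_dict list from above:
# Keep only the label part of each "labelset/label" string.
for record in my_data_dict:
    record["labels"] = [label.split("/", 1)[1] for label in record["labels"]]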
Exporting data from Nuclia.cloud
Creating a Dataset from a Nuclia KB
Another way of generating your arrow files is from a NucliaDBDataset object. This is the ideal way to export data from Nuclia.cloud, but you can also use it with your local NucliaDB. For this we first need to create a NucliaDBClient.
NucliaDBClient connects you to Nuclia. It takes the following parameters:
- url: URL of the target KB, usually in the format "https://europe-1.nuclia.cloud/api/v1/kb/{kb_id}".
- api_key: key to access the KB through the API, obtained here.
- environment: Environment.OSS for a local NucliaDB, Environment.CLOUD for Nuclia.cloud.
- writer/reader/search/train_host: (optional) URL and ports of the corresponding endpoints.
Example:
from nucliadb_sdk.client import Environment, NucliaDBClient
url_kb = (
"https://europe-1.nuclia.cloud/api/v1/kb/{my_kb_id}"
)
nucliadbclient = NucliaDBClient(
api_key=my_api_key,
environment=Environment.CLOUD,
url=url_kb,
)
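For a local NucliaDB you would use Environment.OSS instead. This is only a hedged sketch: the exact URL and the optional writer/reader/search/train_host parameters depend on your deployment; the values below assume the standalone defaults from the Installation section, with {my_kb_id} left as a placeholder:
from nucliadb_sdk.client import Environment, NucliaDBClient
# Assumes a standalone NucliaDB on localhost:8080; replace {my_kb_id} with your KB id.
local_client = NucliaDBClient(
    environment=Environment.OSS,
    url="http://localhost:8080/api/v1/kb/{my_kb_id}",
)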
Once you have your NucliaDBClient, you can create a NucliaDBDataset. You will need the following parameters:
- client: Your previously created NucliaDBClient.
- labels: Array with the target label sets or entity families.
- task: Selected from those defined in the class Task.
- base_path: (optional) Path to the directory where you want your files to go. This will be the current directory by default.
Example for a SENTENCE_CLASSIFICATION task on a label set called movie_type:
import tempfile
from nucliadb_dataset.dataset import NucliaDBDataset, Task
# A temporary directory to hold the exported arrow files (any path works here).
tmpdirname = tempfile.TemporaryDirectory()
dataset_reader = NucliaDBDataset(
    client=nucliadbclient,
    labels=["movie_type"],
    task=Task.SENTENCE_CLASSIFICATION,
    base_path=tmpdirname.name,
)
Exporting sentences and labels to a Pandas Dataframe
Once you have a NucliaDBDataset created like the one above:
import tempfile
from nucliadb_dataset.dataset import NucliaDBDataset, Task
from nucliadb_sdk.client import Environment, NucliaDBClient
url_kb = (
    "https://europe-1.nuclia.cloud/api/v1/kb/{my_kb_id}"
)
nucliadbclient = NucliaDBClient(
    api_key=my_api_key,
    environment=Environment.CLOUD,
    url=url_kb,
)
# A temporary directory to hold the exported arrow files (any path works here).
tmpdirname = tempfile.TemporaryDirectory()
dataset_reader = NucliaDBDataset(
    client=nucliadbclient,
    labels=["movie_type"],
    task=Task.SENTENCE_CLASSIFICATION,
    base_path=tmpdirname.name,
)
You can use the method read_all_partitions, which returns a list of all the generated arrow files. It has the following parameters:
- force: (optional) By default existing arrow files are not overwritten if you have already downloaded them; set it to True to re-download them.
- path: (optional) The path where the arrow files will be stored. By default it takes the path specified when instantiating the NucliaDBDataset object, or the current path if none was provided.
This updates arrow_filenames with the list of arrow files:
arrow_filenames = dataset_reader.read_all_partitions()
Then you can read the arrows and load them into a pandas dataframe:
import pyarrow as pa
for file in arrow_filenames:
with pa.memory_map(file, "rb") as source:
loaded_array = pa.ipc.open_stream(source).read_all()
my_pandas_data = loaded_array.to_pandas()
Where my_pandas_data looks like this:
|   | text | labels |
|---|------|--------|
| 0 | After a virus turns most people into zombies, the world's surviving humans remain locked in an ongoing battle against the hungry undead. | ['movie_type/horror'] |
| 1 | Four survivors -- Tallahassee (Woody Harrelson) and his cohorts Columbus (Jesse Eisenberg), Wichita (Emma Stone) and Little Rock (Abigail Breslin) -- abide by a list of survival rules and zombie-killing strategies as they make their way toward a rumored safe haven in Los Angeles. | ['movie_type/horror'] |
| ... | ........ | .... |
| 1100 | All Quiet on the Western Front tells the gripping story of a young German soldier on the Western Front of World War I. Paul and his comrades experience first-hand how the initial euphoria of war turns into desperation and fear as they fight for their lives, and each other, in the trenches. | ['movie_type/action'] |
| 1101 | The film from director Edward Berger is based on the world renowned bestseller of the same name by Erich Maria Remarque. | ['movie_type/action'] |
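Note that the loop above keeps only the last partition in my_pandas_data. If your KB is split into several partitions, here is a sketch that reads every partition and concatenates them into a single dataframe (assuming pandas is available):
import pandas as pd
import pyarrow as pa
# Read every partition and stack them into one dataframe.
frames = []
for file in arrow_filenames:
    with pa.memory_map(file, "rb") as source:
        frames.append(pa.ipc.open_stream(source).read_all().to_pandas())
my_pandas_data = pd.concat(frames, ignore_index=True)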
Exporting fields and labels to a HF Dataset
You can create a NucliaDBClient and a NucliaDBDataset as previously explained, but using the task Task.FIELD_CLASSIFICATION:
from nucliadb_dataset.dataset import NucliaDBDataset, Task
from nucliadb_sdk.client import Environment, NucliaDBClient
apikey = my_api_key
url_kb = (
    "https://europe-1.nuclia.cloud/api/v1/kb/{my_kb_id}"
)
nucliadbclient = NucliaDBClient(
    api_key=apikey,
    environment=Environment.CLOUD,
    url=url_kb,
)
dataset_reader = NucliaDBDataset(
    client=nucliadbclient,
    task=Task.FIELD_CLASSIFICATION,
    labels=["sentiment_resources"],
)
Then you can download the arrows with read_all_partitions and convert them to a HF Dataset:
from datasets import Dataset, concatenate_datasets
all_datasets = []
for filename in dataset_reader.read_all_partitions():
all_datasets.append(Dataset.from_file(filename))
my_hf_dataset = concatenate_datasets(all_datasets)
The content of my_hf_dataset will look like this:
{'text': [" \n okay i'm sorry but TAYLOR SWIFT LOOKS NOTHING \n LIKE ..... Luck and Hilton puts you in a good place \n going into NFL Sunday. \n \n \n ", 'new_sentiment_export.pdf\n'],
'labels': [['sentiment_resources/positive'], ['sentiment_resources/positive']]}
Note that the title and content of a file are exported separately. This is because they are different fields.
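As a small follow-up sketch, you could split the exported fields into train and test sets before fine-tuning; the 0.2 test size is only an example value:
# Random 80/20 split of the exported fields.
splits = my_hf_dataset.train_test_split(test_size=0.2)
train_ds, test_ds = splits["train"], splits["test"]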
Exporting paragraphs and labels to a Polars Dataframe
You can create a NucliaDBClient and a NucliaDBDataset as previously explained, but using the task Task.PARAGRAPH_CLASSIFICATION:
from nucliadb_dataset.dataset import NucliaDBDataset, Task
from nucliadb_sdk.client import Environment, NucliaDBClient
apikey = my_api_key
url_kb = (
    "https://europe-1.nuclia.cloud/api/v1/kb/{my_kb_id}"
)
nucliadbclient = NucliaDBClient(
    api_key=apikey,
    environment=Environment.CLOUD,
    url=url_kb,
)
dataset_reader = NucliaDBDataset(
    client=nucliadbclient,
    task=Task.PARAGRAPH_CLASSIFICATION,
    labels=["p_football_teams"],
)
Then you will download the arrows with read_all_partitions and convert them to a Polars Dataframe:
import polars
import pyarrow as pa
for file in dataset_reader.read_all_partitions():
with pa.memory_map(file, "rb") as source:
loaded_array = pa.ipc.open_stream(source).read_all()
my_polars_data = polars.from_arrow(loaded_array)
The content of my_polars_data will look like this:
text labels
str list[str]
" El Real Bet... ["p_football_teams/Real Betis"]
" La constitu... ["p_football_teams/Real Madrid"]
" El fútbol e... ["p_football_teams/Real Madrid"]
" Fue él quie... ["p_football_teams/Real Madrid"]
" Procedente ... ["p_football_teams/Real Madrid"]
" El 29 de no... ["p_football_teams/FC Barcelona"]
" EL FC BARCE... ["p_football_teams/FC Barcelona"]
"11-fcb-resourc... ["p_football_teams/Real Madrid"]
" Francesc Tito... ["p_football_teams/FC Barcelona"]
" Vilanova ta... ["p_football_teams/FC Barcelona"]
" LA MUERTE D... ["p_football_teams/Real Madrid"]
" Josep Suñol, ... ["p_football_teams/FC Barcelona"]
" EL CAMPO DE... ["p_football_teams/FC Barcelona"]
" LOS AÑOS 30... ["p_football_teams/FC Barcelona"]
" Lo que a pr... ["p_football_teams/Real Madrid"]
" A comienzos... ["p_football_teams/Real Madrid"]
" Primero, al V... ["p_football_teams/Real Madrid"]
" Tras quince... ["p_football_teams/Real Betis"]
" En cambio, el... ["p_football_teams/Real Betis"]
" LA COPA LAT... ["p_football_teams/FC Barcelona"]
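As a quick sanity check you can, for instance, count how many paragraphs were exported per label. A sketch using the my_polars_data frame from above:
# Explode the label lists and count occurrences of each label.
label_counts = my_polars_data.explode("labels")["labels"].value_counts()
print(label_counts)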
Converting your arrow files to different formats
Once you have a list of generated arrow files, obtained either via download_all_partitions or read_all_partitions, you can easily convert them to many different formats.
Here is the data in case you want to reproduce these examples (these examples use a local NucliaDB, but the same conversions work with any KB):
from nucliadb_dataset.dataset import download_all_partitions
from nucliadb_sdk import create_knowledge_box
from sentence_transformers import SentenceTransformer
my_kb = create_knowledge_box("my_new_kb")
encoder = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "I'm Sierra, a very happy dog", "She's having a terrible day",
    "what a delighful day", "Dog in catalan is gos", "he is heartbroken",
    "He said that the race is quite tough", "love is tough",
]
labels = [
    "emotion/positive", "emotion/negative", "emotion/positive", "emotion/neutral",
    "emotion/negative", "emotion/neutral", "emotion/negative",
]
# Upload each sentence with its label and a precomputed vector.
for i in range(len(sentences)):
    resource_id = my_kb.upload(
        text=sentences[i],
        labels=[labels[i]],
        vectors={"all-MiniLM-L6-v2": encoder.encode([sentences[i]])[0]},
    )
arrow_filenames = download_all_partitions(
    task="FIELD_CLASSIFICATION",
    slug="my_new_kb",
    labels=["emotion"],
)
Convert to a HF Dataset
From the list of generated arrow files arrow_filenames you can convert to a HuggingFace Dataset with only these few lines of code:
from datasets import Dataset, concatenate_datasets
all_datasets = []
for filename in arrow_filenames:
all_datasets.append(Dataset.from_file(filename))
ds = concatenate_datasets(all_datasets)
The Dataset will look like this:
Dataset({
features: ['text', 'labels'],
num_rows: 7
})
With this content:
{'text': ["I'm Sierra, a very happy dog",
'love is tough',
'he is heartbroken',
'what a delighful day',
'Dog in catalan is gos',
'He said that the race is quite tough'],
'labels': [['emotion/positive'],
['emotion/negative'],
['emotion/negative'],
['emotion/positive'],
['emotion/neutral'],
['emotion/neutral']]}
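If you want to reuse the dataset without re-exporting from NucliaDB, you can persist it with the standard datasets API; the directory name below is just an example:
# Save the concatenated dataset and reload it later without touching NucliaDB.
ds.save_to_disk("emotion_dataset")
from datasets import load_from_disk
ds_reloaded = load_from_disk("emotion_dataset")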
Convert to a Pandas Dataframe
From the list of generated arrow files arrow_filenames you can convert to a Pandas Dataframe with these few lines of code:
import pyarrow as pa
for file in arrow_filenames:
with pa.memory_map(file, "rb") as source:
loaded_array = pa.ipc.open_stream(source).read_all()
my_pandas_data = loaded_array.to_pandas()
You will get something like this:
text labels
0 I'm Sierra, a very happy dog [emotion/positive]
1 love is tough [emotion/negative]
2 he is heartbroken [emotion/negative]
3 what a delighful day [emotion/positive]
4 Dog in catalan is gos [emotion/neutral]
5 He said that the race is quite tough [emotion/neutral]
6 She's having a terrible day [emotion/negative]
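Since every field carries exactly one label in this example, you may want a plain string column for single-label classifiers. A small sketch on top of my_pandas_data; the label column name is just an example:
# Unpack the one-element label lists into a plain string column.
my_pandas_data["label"] = my_pandas_data["labels"].apply(lambda labels: labels[0])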
Convert to a Python list of dictionaries
From the list of generated arrow files arrow_filenames you can convert to a list of Python dictionaries with these few lines of code:
import pyarrow as pa
for file in arrow_filenames:
with pa.memory_map(file, "rb") as source:
loaded_array = pa.ipc.open_stream(source).read_all()
my_data_dict = loaded_array.to_pylist()
You will get something like this:
[{'text': "I'm Sierra, a very happy dog", 'labels': ['emotion/positive']},
{'text': 'love is tough', 'labels': ['emotion/negative']},
{'text': 'he is heartbroken', 'labels': ['emotion/negative']},
{'text': 'what a delighful day', 'labels': ['emotion/positive']},
{'text': 'Dog in catalan is gos', 'labels': ['emotion/neutral']},
{'text': 'He said that the race is quite tough',
'labels': ['emotion/neutral']},
{'text': "She's having a terrible day", 'labels': ['emotion/negative']}]
Convert to a Polars Dataframe
From the list of generated arrow files arrow_filenames you can convert to a Polars Dataframe with these few lines of code:
import pyarrow as pa
import polars
for file in arrow_filenames:
with pa.memory_map(file, "rb") as source:
loaded_array = pa.ipc.open_stream(source).read_all()
my_polars_data = polars.from_arrow(loaded_array)
You will get something like this:
text labels
str list[str]
"I'm Sierra, a ... ["emotion/positive"]
"love is tough" ["emotion/negative"]
"he is heartbro... ["emotion/negative"]
"what a delighf... ["emotion/positive"]
"Dog in catalan... ["emotion/neutral"]
"He said that t... ["emotion/neutral"]
"She's having a... ["emotion/negative"]