Ingestion best practices

There are several ways to ingest data in Nuclia:

  • Programmatically, using the REST API, the Python SDK, or the JavaScript SDK.
  • From a terminal, using the CLI.
  • From the Nuclia Dashboard web application, either by uploading files manually or by syncing your 3rd-party storage services (Google Drive, OneDrive, Dropbox, etc.) with the Sync Agent.

And there are different kinds of data you can ingest (files, text, web pages), possibly with metadata (tags, labels, etc.).

Consequently, there is not a single best practice for ingestion, but rather a set of best practices depending on your use case.

Repeatable vs one-shot ingestions

If you need to ingest data regularly (e.g. every day, every hour, etc.), you should use the REST API or the SDKs. This way, you can automate the ingestion process. You can also use the Sync Agent to watch your 3rd-party storage services for new files and sync them automatically.

If you need to ingest data only once, you can use the Nuclia Dashboard web application or the CLI.
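For instance, a repeatable ingestion can be a small script run on a schedule (a cron job, a CI pipeline, etc.). The following minimal sketch pushes local text files through the REST API with Python's requests library; the /resources payload shape and the X-NUCLIA-SERVICEACCOUNT header used here are assumptions to verify against the API reference.

import os
import requests

KB_URL = "https://europe-1.nuclia.cloud/api/v1/kb/{kbid}"   # assumption: adjust zone and kbid
HEADERS = {
    # assumption: service-account API key header, check the API reference for the exact name
    "X-NUCLIA-SERVICEACCOUNT": "Bearer <your-api-key>",
    "Content-Type": "application/json",
}

def ingest_file(path: str) -> None:
    # A deterministic slug makes the ingestion repeatable (see the next section).
    slug = os.path.basename(path).replace(".", "-")
    with open(path, "r", encoding="utf-8") as f:
        body = f.read()
    payload = {
        "slug": slug,
        "title": os.path.basename(path),
        # assumption: text-field payload shape, check the resource endpoints reference
        "texts": {"text": {"body": body, "format": "PLAIN"}},
    }
    resp = requests.post(f"{KB_URL}/resources", json=payload, headers=HEADERS)
    resp.raise_for_status()

for name in os.listdir("./to_ingest"):
    if name.endswith(".txt"):
        ingest_file(os.path.join("./to_ingest", name))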

Ability to update data

If you need to update the data you ingest (e.g. you ingest a file, then you modify it and want to ingest it again), you should define a unique identifier for each piece of data you ingest.

This identifier must be stored in the slug attribute of your resource.

It will allow you to get/update/delete the resource by querying the API with this identifier (rather than the Nuclia auto-generated ID):

GET /api/v1/kb/{kbid}/slug/{rslug}
PATCH /api/v1/kb/{kbid}/slug/{rslug}
DELETE /api/v1/kb/{kbid}/slug/{rslug}

See the resource endpoints for more information.
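As a minimal sketch (assuming the same REST API and API-key header as above), an "upsert by slug" can first try to update the resource and fall back to creating it when it does not exist yet:

import requests

KB_URL = "https://europe-1.nuclia.cloud/api/v1/kb/{kbid}"   # assumption: adjust zone and kbid
HEADERS = {"X-NUCLIA-SERVICEACCOUNT": "Bearer <your-api-key>"}  # assumption: auth header name

def upsert(slug: str, payload: dict) -> None:
    # Try to update the existing resource identified by its slug...
    resp = requests.patch(f"{KB_URL}/slug/{slug}", json=payload, headers=HEADERS)
    if resp.status_code == 404:
        # ...and create it (with the same slug) if it does not exist yet.
        resp = requests.post(f"{KB_URL}/resources", json={"slug": slug, **payload}, headers=HEADERS)
    resp.raise_for_status()

upsert("quarterly-report-2024", {"title": "Quarterly report 2024"})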

Allow efficient filtering

Filtering is important for several reasons:

  • It can be useful for your end-users to narrow down the data they are looking for (example: ask a question that should be answered by a specific piece of data).
  • It can be used to control access to the data.
  • It can support data lifecycle management (for example, archiving data that is no longer used).

Filtering is based on the metadata attached to your data (see Filtering).

Some of this metadata is automatically extracted from the data you ingest (like the file type or the language), but you can also add your own metadata.

The most important ones are:

  • origin.path: For any hierarchical source (typically a file tree), it stores the path of the original file. This attribute supports left-prefix filtering on the value: filtering on /a/b matches resources indexed under /a/b, but also /a/b/c or /a/b/c/d.
  • usermetadata.classifications: Lets you set your own labelset/label pairs on the resources.
  • security.access_groups: Stores the user groups that have access to the resource. IMPORTANT: it does not enforce security; it is just a handy way to filter by groups, as this filter works on intersection (if a resource has security.access_groups set to ["group1", "group2"], it will be returned if you filter on ["group1"], ["group2", "group3"], ["group1", "group2", "group3"], etc.). A sketch showing how to set these attributes follows this list.
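As an illustration, these attributes are set directly in the resource payload when creating (or updating) it. The sketch below is hypothetical: the exact field nesting should be checked against the resource endpoints reference, and the labelset, label, and group values are made up.

import requests

KB_URL = "https://europe-1.nuclia.cloud/api/v1/kb/{kbid}"   # assumption: adjust zone and kbid
HEADERS = {"X-NUCLIA-SERVICEACCOUNT": "Bearer <your-api-key>"}  # assumption: auth header name

payload = {
    "slug": "employee-handbook",
    "title": "Employee handbook",
    # Path of the original file, usable for left-prefix filtering (/hr also matches /hr/policies).
    "origin": {"path": "/hr/policies/employee-handbook.pdf"},
    # Your own labelset/label pairs (hypothetical values).
    "usermetadata": {"classifications": [{"labelset": "department", "label": "hr"}]},
    # Groups used for intersection-based filtering (not a security enforcement).
    "security": {"access_groups": ["hr-team", "managers"]},
    "texts": {"text": {"body": "Handbook content...", "format": "PLAIN"}},
}
requests.post(f"{KB_URL}/resources", json=payload, headers=HEADERS).raise_for_status()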

Avoid noise

When you ingest data, avoid pushing to your Nuclia Knowledge Box information that is not needed, as it will pollute the indexing.

Examples:

  • When indexing a web page, avoid indexing the header, footer, and other parts of the page that are not relevant to the content. The LinkField offers a css_selector parameter that lets you select only the relevant part of the page (see the sketch after this list).
  • When indexing a video, do not index the transcript, as Nuclia already extracts the transcript from the video (it would be duplicated information).
  • When indexing a document, it is usually not recommended to also index translations of the same document in several languages: even if the words are different, the semantic content is identical. Note: in some cases you may want to index translations anyway, but then it is recommended to use the language attribute to filter the results when querying your Knowledge Box.
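For example, a link field with a css_selector can restrict indexing to the main content of a page. The following sketch reuses the same hypothetical REST API setup as above; the link-field payload nesting and the selector value are assumptions to adapt to your page structure.

import requests

KB_URL = "https://europe-1.nuclia.cloud/api/v1/kb/{kbid}"   # assumption: adjust zone and kbid
HEADERS = {"X-NUCLIA-SERVICEACCOUNT": "Bearer <your-api-key>"}  # assumption: auth header name

payload = {
    "slug": "pricing-page",
    "title": "Pricing",
    "links": {
        "link": {
            "uri": "https://www.example.com/pricing",
            # Only index the main content, skipping header, footer and navigation
            # (hypothetical selector; adapt it to the page being indexed).
            "css_selector": "main#content",
        }
    },
}
requests.post(f"{KB_URL}/resources", json=payload, headers=HEADERS).raise_for_status()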