How to query a Knowledge Box

Nuclia supports 4 different search endpoints:

/search: returns several result sets, one for each search technique (full-text, fuzzy, semantic).
/find: returns a single result set where all different results are merged into a hierarchical structure.
/ask: streams a generative answer and a single result set where all different results are merged into a hierarchical structure.
/graph*: returns a set of paths, nodes or relations from the knowledge graph.

Except for the graph endpoints (explained here), all endpoints support the same query parameters.

Search parameters

Query

Simple query

A simple text search can be performed using a plain text value.

Example:

This query would return results containing the words little and prince:

Little Prince

By putting words into quotes you can search for an exact match:

"Little Prince"

In this case, you will only get results containing the word sequence little prince.

Using the minus sign - in front of a word you can exclude a word from the search:

Little Prince -sheep

This query would return results containing the words little and prince but not sheep.
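
The three syntax rules above can be combined programmatically. The following is a minimal sketch, not part of any Nuclia SDK: `build_query` is a hypothetical helper that only assembles the query string described in this section.

```python
def build_query(words=(), phrases=(), excluded=()):
    """Compose a simple query string (hypothetical helper, not a Nuclia API)."""
    parts = list(words)
    # Exact-match phrases are wrapped in double quotes.
    parts += [f'"{phrase}"' for phrase in phrases]
    # Excluded words are prefixed with a minus sign.
    parts += [f"-{word}" for word in excluded]
    return " ".join(parts)


print(build_query(words=["Little", "Prince"]))
print(build_query(phrases=["Little Prince"]))
print(build_query(words=["Little", "Prince"], excluded=["sheep"]))
```

The resulting string is what you would pass as the query parameter.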

Filters

Filters can be specified with the filter_expression parameter and are discussed in depth in the filtering documentation.

Minimum score

All search endpoints provide parameters to filter out low-scoring results. For instance, the POST versions of the ask, find and search endpoints accept a min_score parameter in the payload, with which you can control:

Semantic score: a measure of the semantic similarity between the search vector and the results. When not provided, NucliaDB uses the minimum score associated with the semantic model configured for the Knowledge Box. See the Semantic scores subsection below.
BM25 score: BM25 is the ranking function used by NucliaDB's text search index to rank matching documents according to their relevance to a given search query. NucliaDB does not filter by BM25 score by default.

An example payload would be:

{
  "query": "Hakuna Matata",
  "min_score": {
    "bm25": 4.0,
    "semantic": 1.2
  }
}

On the GET version of the search and find endpoints, these can be specified as query params:

/api/v1/kb/kbid/search?query=Hakuna%20Matata&min_score_bm25=4.0&min_score_semantic=1.2
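
Both forms can be generated from the same values. The sketch below uses only the Python standard library; the function names `min_score_payload` and `min_score_query_string` are hypothetical, while the parameter names (`min_score`, `min_score_bm25`, `min_score_semantic`) come from the examples above.

```python
from urllib.parse import quote, urlencode


def min_score_payload(query, bm25=None, semantic=None):
    """Build the JSON body for the POST endpoints, omitting unset thresholds."""
    min_score = {}
    if bm25 is not None:
        min_score["bm25"] = bm25
    if semantic is not None:
        min_score["semantic"] = semantic
    payload = {"query": query}
    if min_score:
        payload["min_score"] = min_score
    return payload


def min_score_query_string(query, bm25=None, semantic=None):
    """Build the query string for the GET endpoints."""
    params = [("query", query)]
    if bm25 is not None:
        params.append(("min_score_bm25", bm25))
    if semantic is not None:
        params.append(("min_score_semantic", semantic))
    # quote_via=quote encodes spaces as %20 rather than "+".
    return urlencode(params, quote_via=quote)


print(min_score_payload("Hakuna Matata", bm25=4.0, semantic=1.2))
print(min_score_query_string("Hakuna Matata", bm25=4.0, semantic=1.2))
```

Leaving a threshold unset keeps the default behaviour described above for that score.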

BM25 scores

Theoretically, the range of the BM25 score can be anything from 0 to infinity. In practice, however, scores are typically within a certain range (e.g., 0 to 10). The score depends on several factors, including the frequency of the term in the document, the length of the document, the average length of documents in the collection, and the frequency of the term in the entire document collection.

A higher BM25 score indicates a higher relevance of the document to the search query. However, because the score is not normalized, the absolute value of the score is not directly interpretable and is not comparable across different queries or document collections.

A score of 0 indicates that the document has no relevance to the query (i.e., none of the query terms appear in the document).
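
To make these factors concrete, here is a minimal sketch of the textbook Okapi BM25 formula. It is an illustration only: the exact variant and the parameters used by NucliaDB's index are not stated here, and `k1=1.2` and `b=0.75` are simply commonly used defaults chosen as assumptions.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document against a query with the Okapi BM25 formula."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)  # frequency of the term in this document
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freqs.get(term, 0)  # number of documents containing the term
        # Terms that are rare in the collection get a larger weight.
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        # Documents longer than average are penalised through b.
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score


doc_freqs = {"little": 1, "prince": 1}
short_doc = ["the", "little", "prince"]
long_doc = ["the", "little", "prince", "and", "the", "rose"]
print(bm25_score(["little", "prince"], short_doc, doc_freqs, num_docs=3, avg_doc_len=3))
print(bm25_score(["little", "prince"], long_doc, doc_freqs, num_docs=3, avg_doc_len=3))
print(bm25_score(["sheep"], short_doc, doc_freqs, num_docs=3, avg_doc_len=3))
```

With the same term frequencies, the longer document scores lower, and a document containing none of the query terms scores 0.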

Semantic scores

The range of the semantic similarity score depends on the semantic model in use and its associated similarity function. At the time of writing, there are two different similarity functions used by NucliaDB's vectors index:

Dot product: also known as scalar product. The range is any real number. The dot product of two vectors will be higher the more similar the vectors are.
Cosine: the range of the cosine similarity function is between -1 and 1. However, in practice it is typically between 0 and 1 because negative scores indicate that vectors are not similar. A score of 1 means that the two vectors point in exactly the same direction.
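
The two similarity functions can be sketched in a few lines of plain Python. This only illustrates the mathematical definitions; it is not how NucliaDB's vectors index is implemented, and the vectors are made-up examples.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; unbounded, grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the normalised vectors; always between -1 and 1.

    Raises ZeroDivisionError if either vector has zero length.
    """
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


print(dot_product([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Note that the first two vectors have a cosine similarity of 1 even though their dot product is 10: cosine ignores magnitude, which is why the two functions call for different minimum scores.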

As mentioned above, each semantic model uses a different similarity function. Moreover, Nuclia has pre-defined a min score for each model to provide a good generic search experience. Below is a list with the semantic models supported at the time of writing with the associated similarity function and the default min score used by NucliaDB's search engine:

Semantic Model	Similarity function	Score range	Default min score
en (English)	Cosine	real numbers from -1 to 1	0.7
multilingual-2023-02-21	Dot product	any real number	1.5
multilingual-2023-08-16	Dot product	any real number	0.7

Note

For more specific search use cases, we recommend experimenting with different min scores on your own dataset and on the types of queries you expect. You can use the parameters explained above to do so.

Result options

Features

A search query can be executed against different targets. The target is defined by the features parameter which supports 4 values:

fulltext: the query is executed as full-text search against all resource texts (including attributes like title or summary, and all content fields).
keyword: the query is executed as fuzzy search against all text block texts.
semantic: the query is executed as semantic search against all resource texts.
relations: the query is executed as graph search against all resource entities.

These features can be combined by repeating the features parameter:

features=keyword&features=semantic
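
When building the query string in code, the repeated parameter can be produced from a list. This is a minimal sketch using the Python standard library; `search_query_string` is a hypothetical helper, and the allowed values are the four listed above.

```python
from urllib.parse import urlencode

ALLOWED_FEATURES = ("fulltext", "keyword", "semantic", "relations")


def search_query_string(query, features):
    """Build a query string with one features=... pair per requested feature."""
    unknown = [f for f in features if f not in ALLOWED_FEATURES]
    if unknown:
        raise ValueError(f"unsupported features: {unknown}")
    # doseq=True expands the list into repeated key=value pairs.
    return urlencode({"query": query, "features": list(features)}, doseq=True)


print(search_query_string("Batman", ["semantic", "relations"]))
```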

Facets

By using the faceted parameter, you will get a facets attribute in the paragraphs, sentences and fulltext result sets. Facets can be requested on:

/origin.tags: tags defined in the resource's origin property
/classification.labels: resource labels, in the form /classification.labels/{labelset}/{label}
/icon: mime type of resource
/metadata.status: processing status
/entities: resource entities, in the form /entities/{entity-type}/{entity-id}
/metadata.language: primary language of the document
/metadata.languages: all other detected languages
/origin.metadata: metadata provided by the user
/origin.path: path of the resource in the source system. It will match any path starting with the provided value.

Examples:

To get the total number of matches for each image file type (like jpg, png, gif, etc.), use:
```
faceted=/icon/image
```
To get the total number of matches for each language (like en, it, fr, etc.), use:
```
faceted=/metadata.language
```
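
Conceptually, a facet request counts matches per direct child of the given facet path. The sketch below reproduces that idea on the client side with made-up data; it does not call the API, and both `facet_counts` and the sample paths are hypothetical.

```python
from collections import Counter


def facet_counts(match_facets, prefix):
    """Count matches for each direct child of `prefix` (illustration only)."""
    counts = Counter()
    for path in match_facets:
        if path.startswith(prefix + "/"):
            # Keep only the first path segment below the prefix.
            child = path[len(prefix) + 1:].split("/", 1)[0]
            counts[f"{prefix}/{child}"] += 1
    return dict(counts)


# Hypothetical facet paths attached to four matching resources.
matches = ["/icon/image/jpeg", "/icon/image/png", "/icon/image/jpeg", "/icon/application/pdf"]
print(facet_counts(matches, "/icon/image"))
```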

Highlight matching words

By setting the split parameter to true, you will get the start and end positions of each matching word in text blocks and fulltext results.

If you additionally set the highlight parameter to true, the matching words are enclosed in <mark> tags.
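
If you prefer to apply the highlighting yourself from the returned positions, the logic is roughly as follows. This is a sketch under assumptions: positions are taken to be non-overlapping (start, end) character offsets with an exclusive end, which may differ from the actual response format, and `mark_matches` is a hypothetical helper.

```python
def mark_matches(text, positions):
    """Wrap each (start, end) span of `text` in <mark> tags.

    Assumes non-overlapping spans with an exclusive end offset.
    """
    out = []
    cursor = 0
    for start, end in sorted(positions):
        out.append(text[cursor:start])  # unmatched text before the span
        out.append(f"<mark>{text[start:end]}</mark>")
        cursor = end
    out.append(text[cursor:])  # remaining unmatched text
    return "".join(out)


print(mark_matches("The Little Prince", [(4, 10), (11, 17)]))
```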

How to call the search endpoint

To search in all resources, the search endpoints are:

https://<zone>.nuclia.cloud/api/v1/kb/<your-knowledge-box-id>/search
https://<zone>.nuclia.cloud/api/v1/kb/<your-knowledge-box-id>/find

Search endpoints can be called with a GET or a POST request.

A typical curl command to call the search endpoint is:

curl "https://<zone>.nuclia.cloud/api/v1/kb/<your-knowledge-box-id>/find?query=Batman&features=semantic&features=relations"

If your Knowledge Box is not public, you must provide the X-NUCLIA-SERVICEACCOUNT header with an API token or an Authorization header.

To search in a specific resource, the search endpoint path is:

https://<zone>.nuclia.cloud/api/v1/kb/<your-knowledge-box-id>/resource/<resource-id>/find
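
The URL patterns and headers above can be assembled in code as follows. This is a minimal sketch using the Python standard library that builds the request without sending it. The helper name `build_find_request`, the sample zone, Knowledge Box id and token are placeholders, and the `Bearer ` prefix on the token is an assumption to check against the API reference.

```python
import json
from urllib.request import Request


def build_find_request(zone, kbid, payload, token=None, resource_id=None):
    """Build (but do not send) a POST request to the find endpoint."""
    base = f"https://{zone}.nuclia.cloud/api/v1/kb/{kbid}"
    if resource_id is not None:
        # Restrict the search to a single resource.
        base += f"/resource/{resource_id}"
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumption: the API token is sent with a "Bearer " prefix.
        headers["X-NUCLIA-SERVICEACCOUNT"] = f"Bearer {token}"
    body = json.dumps(payload).encode("utf-8")
    return Request(f"{base}/find", data=body, headers=headers, method="POST")


req = build_find_request(
    "europe-1",  # placeholder zone
    "my-kb-id",  # placeholder Knowledge Box id
    {"query": "Batman", "features": ["semantic", "relations"]},
    token="my-api-token",  # placeholder token
)
print(req.get_method(), req.full_url)
```

To actually run the query, pass the request to `urllib.request.urlopen(req)` and decode the JSON response.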

Reference documentation

The Nuclia API documentation is available here.
