Tune the Search Strategy
To make a basic search query, simply pass a question in the query parameter of the /find or /ask endpoint. This will return the most relevant data for your query based on Nuclia's semantic and keyword search capabilities.
To improve the accuracy of your search results, consider using the following parameters and strategies.
Search Modes
Nuclia offers several search modes: semantic search, keyword search, fulltext search, and graph search.
- Semantic search is based on the meaning of the query and the content of the paragraphs. It retrieves the paragraphs that are semantically close to the query. It is the most powerful search mode.
- Keyword search is based on the keywords of the query and the content of the paragraphs. It retrieves the paragraphs that contain the keywords of the query. It is a more naive search mode, but it can be very relevant when searching for a specific term (typically a brand name, a product name, etc.).
- Fulltext search is based on the keywords of the query and the content of the resources. It retrieves the resources that contain the query's terms. It only applies to the /find endpoint.
- Graph search is based on the entities of the query and the content of the resources. It retrieves the resources that contain the entities of the query. It is a very powerful search mode when the query is about a specific entity (like a person, a location, an organization, etc.).
You can apply multiple search modes to a single query. The /ask endpoint uses semantic search, keyword search, and graph search by default. The /find endpoint uses semantic search and fulltext search by default.
To change the default behavior, use the features parameter.
Typically, when the query is written in a different language than the resources, keyword search might be inconsistent, because it can match words that are spelled the same but have different meanings in the two languages. In this case, you can use semantic search only by passing:
features: ['semantic']
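As a minimal sketch of the override above, the following Python helper builds a request body that restricts a query to semantic search. The helper name build_search_payload is hypothetical, and placing features in a JSON body alongside query is an assumption; only the 'semantic' value comes from this page.

```python
import json


def build_search_payload(query, features=None):
    """Build a JSON-serializable request body (hypothetical helper, not a Nuclia API)."""
    payload = {"query": query}
    if features is not None:
        # Only send the parameter when overriding the endpoint defaults.
        payload["features"] = list(features)
    return payload


# A query written in French against resources assumed to be in English.
payload = build_search_payload("Quand tailler un pommier", features=["semantic"])
print(json.dumps(payload))
```

Leaving features out keeps the endpoint defaults described above.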
Rephrase the query
The rephrase parameter allows you to optimize the query for semantic search.
Typically, when the user's query is a set of keywords (like "prune apple tree period"), it might perform badly on semantic search. To avoid this, you can pass rephrase=true so the query is rephrased as a more natural language question (like "When is the best time to prune an apple tree?").
Auto-filter
When passing autofilter=true, the user's query is scanned for terms that are likely to be named entities (detected through named entity recognition, or NER), and these are automatically used as filters. This is useful when the user's query is a question like "How to replace the brakes on the XMB55 bike?", because this query might have very good semantic matches on any resource explaining how to replace brakes on a bike, but the user is only interested in the XMB55 bike.
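The rephrase and autofilter flags can be sketched as query-string parameters. The helper below is hypothetical and only builds the string; it uses the standard library and assumes both flags are passed as the literal value true, as written above.

```python
from urllib.parse import urlencode


def build_query_string(query, rephrase=False, autofilter=False):
    """Encode a query plus the optional flags (hypothetical helper)."""
    params = [("query", query)]
    if rephrase:
        params.append(("rephrase", "true"))
    if autofilter:
        params.append(("autofilter", "true"))
    return urlencode(params)


keywords_qs = build_query_string("prune apple tree period", rephrase=True)
entity_qs = build_query_string(
    "How to replace the brakes on the XMB55 bike?", autofilter=True
)
print(keywords_qs)  # query=prune+apple+tree+period&rephrase=true
print(entity_qs)
```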
Filters
The filters parameter allows you to filter the results depending on the value of different properties provided on the resource.
Filtering happens before the query is executed, so it reduces the scope of the search to a given set of resources.
The following attributes are supported:
- /origin.tags: tags defined in the resource's origin property. Example: /origin.tags/blue, /origin.tags/green
- /classification.labels: labels, in the form /classification.labels/{labelset}/{label}. Example: /classification.labels/movie-genre/science-fiction
- /icon: MIME type of the resource. Example: /icon/application/pdf or /icon/movie/mp4
- /metadata.status: processing status. Example: /metadata.status/PROCESSED, /metadata.status/PENDING or /metadata.status/ERROR
- /entities: resource entities, in the form /entities/{entity-type}/{entity-id}. Example: /entities/CITY/Barcelona
- /metadata.language: primary language of the document. Example: /metadata.language/ca for the Catalan language
- /metadata.languages: all other detected languages. Example: /metadata.languages/tr for the Turkish language
- /origin.metadata: metadata provided by the user. Example: /origin.metadata/fieldname/value
- /origin.path: path of the resource in the source system. It matches any path starting with the provided value. Example: /origin.path/Users/JohnDoe/Documents matches files in the Documents folder of the JohnDoe user, but also the ones in Documents/Work, Documents/Personal, etc.
Examples:
- To retrieve PNG images only, use:
filters=/icon/image/png
- To retrieve results in which the principal language is Italian, use:
filters=/metadata.language/it
- To retrieve results referring to the UNESCO organization, use:
filters=/entities/ORG/UNESCO
Filters can be combined by repeating the filters parameter. This example retrieves results that are PDFs and that refer to the UNESCO organization:
filters=/icon/application/pdf&filters=/entities/ORG/UNESCO
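A repeated parameter like the one above can be produced with Python's standard library. This is a small sketch; the helper name filters_query is hypothetical, and keeping "/" unescaped is a readability choice rather than a documented requirement.

```python
from urllib.parse import urlencode


def filters_query(*filters):
    """Repeat the filters key once per value, leaving "/" unescaped."""
    return urlencode([("filters", value) for value in filters], safe="/")


combined = filters_query("/icon/application/pdf", "/entities/ORG/UNESCO")
print(combined)  # filters=/icon/application/pdf&filters=/entities/ORG/UNESCO
```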
Advanced filtering
As shown above, combining multiple filters implicitly returns the intersection (i.e., the AND operator) of the specified filters.
If your use case needs more complex filtering expressions, you can use the POST versions of the search endpoints to provide a filtering expression.
Filtering expressions accept the following keys: all, any, none and not_all. Here are some examples:
all
{
"filters": [
{"all": ["/icon/application/pdf", "/entities/ORG/UNESCO"]}
]
}
This is equivalent to the last example of the previous section: it returns resources that are PDFs and have the UNESCO entity associated with them.
any
{
"filters": [
{"any": ["/icon/application/pdf", "/icon/movie/mp4"]}
]
}
This returns resources that are either PDFs or mp4 videos. It is equivalent to the OR logical operation.
none
{
"filters": [
{"none": ["/icon/application/pdf", "/icon/movie/mp4"]}
]
}
This returns results from documents that are neither PDFs nor mp4 videos. It is equivalent to the NOT(a OR b) logical expression.
not_all
{
"filters": [
{"not_all": ["/icon/application/pdf", "/entities/ORG/UNESCO"]}
]
}
Essentially, this returns the complement of the results of the all example: all documents except those that are PDFs and also have the UNESCO entity associated with them. It is equivalent to the NOT(a AND b) logical expression.
Combining
If you need even more complex filtering expressions, you can combine multiple expression terms as more elements of the filters list:
{
"filters": [
{"all": ["/icon/application/pdf"]},
{"any": ["/entities/ORG/UNESCO", "/entities/GPE/US"]}
]
}
The returned result is the implicit intersection (i.e., AND) of all the combined expressions. In this example, it returns all documents that are PDFs and that have either UNESCO or US as a related entity.
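To make the semantics of the four keys concrete, here is a small local evaluator in Python. It is only an illustration of the boolean logic described above, not how NucliaDB implements filtering; the function names and the idea of representing a resource as a set of label strings are assumptions.

```python
def matches(term, labels):
    """Return True if a resource carrying `labels` satisfies one expression term."""
    (operator, values), = term.items()
    present = [value in labels for value in values]
    if operator == "all":
        return all(present)
    if operator == "any":
        return any(present)
    if operator == "none":
        return not any(present)
    if operator == "not_all":
        return not all(present)
    raise ValueError(f"unknown operator: {operator}")


def matches_filters(filters, labels):
    """Terms in the filters list are combined with an implicit AND."""
    return all(matches(term, labels) for term in filters)


filters = [
    {"all": ["/icon/application/pdf"]},
    {"any": ["/entities/ORG/UNESCO", "/entities/GPE/US"]},
]
pdf_about_unesco = {"/icon/application/pdf", "/entities/ORG/UNESCO"}
pdf_without_entity = {"/icon/application/pdf"}

print(matches_filters(filters, pdf_about_unesco))    # True
print(matches_filters(filters, pdf_without_entity))  # False
```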
Date filtering
You can filter on the creation date using:
range_creation_start
range_creation_end
Examples:
- To get all resources created between 2023-01-01 and 2023-12-31:
range_creation_start=2023-01-01T00:00:00.000Z&range_creation_end=2023-12-31T23:59:59.000Z
- To get all resources created after 2023-01-01:
range_creation_start=2023-01-01T00:00:00.000Z
Filtering is based on the origin.created value if provided in the resource; otherwise it defaults to the resource creation date (created).
Please note: all resources created before 2023-11-02 will have to be reprocessed for origin.created to be filterable.
Similarly, you can filter on the modification date using:
range_modification_start
range_modification_end
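The timestamps in the creation date examples use UTC with millisecond precision and a Z suffix. A minimal Python sketch for producing them from timezone-aware datetimes follows; the helper names are hypothetical, and the same approach would presumably apply to the modification parameters.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def to_timestamp(moment):
    """Format an aware datetime as UTC with milliseconds and a Z suffix."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def creation_range(start=None, end=None):
    """Build the range_creation_* query string (hypothetical helper)."""
    params = []
    if start is not None:
        params.append(("range_creation_start", to_timestamp(start)))
    if end is not None:
        params.append(("range_creation_end", to_timestamp(end)))
    return urlencode(params, safe=":")


year_2023 = creation_range(
    start=datetime(2023, 1, 1, tzinfo=timezone.utc),
    end=datetime(2023, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
print(year_2023)
```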
Keyword filters
The keyword_filters parameter allows you to filter resources based on a list of keywords, so the paragraph search is restricted to the resources that contain the keywords.
For example, if your Knowledge Box contains recipes and you ask "How to cook a pizza?", you can use the keyword_filters parameter to restrict the search to the resources that contain the keywords ["vegan", "zucchini"].
keyword_filters allows you to define hard criteria on the resources that will be used to search for the answer.
If you were just asking "How to cook a vegan pizza with zucchini?", the semantic results might extend to paragraphs that are semantically close without necessarily containing the exact words "vegan" and "zucchini".
keyword_filters is also a way to decouple the query from the keyword search: the keyword matching selects whole resources (not just the paragraphs containing the keywords), and the query then searches among the paragraphs of those resources. By contrast, asking "How to cook a vegan pizza with zucchini?" without keyword_filters only makes the search results prioritize paragraphs that contain both "vegan" and "zucchini".
The keyword_filters parameter also supports the advanced filtering expressions described in the previous section:
{
"keyword_filters": [
{"all": ["vegan", "zucchini"]},
{"none": ["pineapple"]}
]
}
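The decoupling described in this section can be illustrated with a toy two-stage search in Python. This is a sketch under loud assumptions: resources are modelled as lists of paragraph strings, keywords are matched as single lowercase words, and all names are hypothetical. It is not how NucliaDB matches keywords.

```python
import re


def words(text):
    """Lowercased set of the words in a text (naive tokenizer, assumption)."""
    return set(re.findall(r"\w+", text.lower()))


def resource_passes(paragraphs, keyword_filters):
    """Check every expression against the words of the WHOLE resource."""
    vocabulary = set().union(*(words(p) for p in paragraphs))
    for term in keyword_filters:
        (operator, keywords), = term.items()
        hits = [keyword.lower() in vocabulary for keyword in keywords]
        passed = {
            "all": all(hits),
            "any": any(hits),
            "none": not any(hits),
            "not_all": not all(hits),
        }[operator]
        if not passed:
            return False
    return True


def candidate_paragraphs(resources, keyword_filters):
    """Stage 1: keep whole resources. Stage 2 (not shown) would rank these paragraphs."""
    return [
        paragraph
        for paragraphs in resources.values()
        if resource_passes(paragraphs, keyword_filters)
        for paragraph in paragraphs
    ]


resources = {
    "r1": ["A vegan pizza base.", "Top it with grilled zucchini."],
    "r2": ["Vegan pizza with zucchini and pineapple."],
    "r3": ["Classic pizza with ham."],
}
keyword_filters = [{"all": ["vegan", "zucchini"]}, {"none": ["pineapple"]}]
candidates = candidate_paragraphs(resources, keyword_filters)
print(candidates)
```

Note that both paragraphs of r1 remain candidates even though neither contains both keywords on its own: the filter applies to the resource as a whole.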
Filter on specific resources
The resource_filters parameter allows you to restrict the search to a set of predefined resources.
Filter by security groups
If you need to restrict the search depending on the user's security groups, you can use the security.groups parameter.
Search in a specific field
To restrict the search to a specific field, you can use the fields parameter. It supports different field types:
- a: generic fields (basic attributes, like title or summary)
- t: text fields
- f: file fields
- u: link fields
Example:
fields=a/title
To search in several fields, the parameter can be repeated:
fields=a/title&fields=a/summary
Regarding content fields: when used through the resource /search endpoint, the parameter restricts the search to one piece of content only; when used through the main /search endpoint, it restricts the search to all content having a given id across all resources.
Minimum score
All search endpoints provide parameters to filter out results that are not good enough. For instance, the POST versions of ask, find and search accept a min_score parameter in the payload, with which you can control:
- Semantic score: the measure of meaning or semantic similarity between the search vector and the results. When not provided, NucliaDB uses the minimum score associated with the semantic model configured for the Knowledge Box. Read more in the Semantic scores subsection below.
- BM25 score: BM25 is the ranking function used by the NucliaDB text search index to rank matching documents according to their relevance to a given search query. NucliaDB does not filter by BM25 score by default.
An example payload would be:
{
"query": "Hakuna Matata",
"min_score": {
"bm25": 4.0,
"semantic": 1.2
}
}
On the GET version of the search and find endpoints, these can be specified as query params:
/api/v1/kb/kbid/search?query=Hakuna%20Matata&min_score_bm25=4.0&min_score_semantic=1.2
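The mapping between the POST payload and the GET parameters can be sketched in Python as follows. The helper name is hypothetical; the parameter names min_score_bm25 and min_score_semantic are the ones shown above.

```python
from urllib.parse import quote, urlencode


def min_score_query(payload):
    """Flatten a POST-style payload into GET query params (hypothetical helper)."""
    params = [("query", payload["query"])]
    min_score = payload.get("min_score", {})
    for kind in ("bm25", "semantic"):
        if kind in min_score:
            params.append((f"min_score_{kind}", min_score[kind]))
    # quote (rather than the default quote_plus) encodes spaces as %20.
    return urlencode(params, quote_via=quote)


payload = {"query": "Hakuna Matata", "min_score": {"bm25": 4.0, "semantic": 1.2}}
query_string = min_score_query(payload)
print(query_string)  # query=Hakuna%20Matata&min_score_bm25=4.0&min_score_semantic=1.2
```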
BM25 scores
Theoretically, the range of the BM25 score can be anything from 0 to infinity. In practice, however, scores are typically within a certain range (e.g., 0 to 10). The score depends on several factors, including the frequency of the term in the document, the length of the document, the average length of documents in the collection, and the frequency of the term in the entire document collection.
A higher BM25 score indicates a higher relevance of the document to the search query. However, because the score is not normalized, the absolute value of the score is not directly interpretable and is not comparable across different queries or document collections.
A score of 0 indicates that the document has no relevance to the query (i.e., none of the query terms appear in the document).
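The factors listed above can be seen in a textbook Okapi BM25 implementation. The Python sketch below uses the common defaults k1=1.2 and b=0.75 and a non-negative IDF variant; these are assumptions for illustration, and NucliaDB's exact formula and parameters may differ.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms (textbook BM25)."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents in the collection that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # absent terms contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


documents = [
    ["hakuna", "matata"],
    ["hakuna", "matata", "means", "no", "worries"],
    ["the", "lion", "sleeps"],
]
scores = bm25_scores(["hakuna", "matata"], documents)
print(scores)
```

The third document contains none of the query terms and scores 0, while the shorter of the two matching documents scores higher because of length normalization.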
Semantic scores
The range of the semantic similarity score depends on the semantic model in use and its associated similarity function. At the time of writing, there are two different similarity functions used by NucliaDB's vectors index:
- Dot product: also known as scalar product. The range is any real number. The more similar two vectors are, the higher their dot product.
- Cosine: the range of the cosine similarity function is between -1 and 1. However, in practice it is typically between 0 and 1 because negative scores indicate that vectors are not similar. A score of 1 means that the two vectors point in exactly the same direction.
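Both similarity functions are easy to compute by hand, which helps when choosing a min score. The following Python sketch uses plain lists as vectors; it illustrates the math only and is not NucliaDB code.

```python
import math


def dot(a, b):
    """Dot (scalar) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: the dot product of the two vectors after normalization."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


same_direction = ([1.0, 2.0], [2.0, 4.0])
print(dot(*same_direction))     # 10.0, grows with vector magnitude
print(cosine(*same_direction))  # approximately 1.0, independent of magnitude
print(cosine([1.0, 0.0], [0.0, 1.0]))   # 0.0 for orthogonal vectors
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # -1.0 for opposite vectors
```

This is why a dot product threshold such as 1.5 can be meaningful for one model, while cosine thresholds always stay within -1 and 1.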
As mentioned above, each semantic model uses a different similarity function. Moreover, Nuclia has pre-defined a min score for each model to provide a good generic search experience. Below is a list with the semantic models supported at the time of writing with the associated similarity function and the default min score used by NucliaDB's search engine:
Semantic Model | Similarity function | Score range | Default min score |
---|---|---|---|
en (English) | Cosine | real numbers from -1 to 1 | 0.7 |
multilingual-2023-02-21 | Dot product | any real number | 1.5 |
multilingual-2023-08-16 | Dot product | any real number | 0.7 |
For more specific search use cases, it is recommended that you experiment with different min scores on your dataset and the types of queries that are expected. To that effect, you can use the parameters explained above.
Facets
By using the faceted parameter, you will get a facets attribute in paragraphs, sentences and fulltext.
This parameter takes the same values as the filters parameter.
Examples:
- To get the total number of matches for each image file type (like jpg, png, gif, etc.), use: faceted=/icon/image
- To get the total number of matches for each language (like en, it, fr, etc.), use: faceted=/metadata.language
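Conceptually, a facet is a count of matches grouped by the values found under a given prefix. The Python sketch below reproduces that counting locally on a list of label strings. It is only an illustration: the shape of the facets attribute in the actual response is not described here, and grouping by the path segment directly below the facet is an assumption.

```python
from collections import Counter


def facet_counts(labels, facet):
    """Count labels grouped by the path segment directly below `facet`."""
    prefix = facet.rstrip("/") + "/"
    counts = Counter()
    for label in labels:
        if label.startswith(prefix):
            child = label[len(prefix):].split("/", 1)[0]
            counts[prefix + child] += 1
    return dict(counts)


matched_labels = [
    "/icon/image/png",
    "/icon/image/png",
    "/icon/image/jpg",
    "/icon/application/pdf",
]
image_facets = facet_counts(matched_labels, "/icon/image")
print(image_facets)  # {'/icon/image/png': 2, '/icon/image/jpg': 1}
```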
Highlight matching words
By setting the split parameter to true, you will get the start and end positions of each matching word in text blocks and fulltext results.
If you additionally set the highlight parameter to true, the matching words are enclosed in <mark> tags.
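If you only request positions, you can apply the markup yourself on the client side. The Python sketch below wraps each matched span in <mark> tags. It assumes positions are given as non-overlapping (start, end) character offsets with an exclusive end; the actual response format is not specified here, so treat the input shape as hypothetical.

```python
def mark_matches(text, positions):
    """Wrap each (start, end) span of `text` in <mark> tags; `end` is exclusive."""
    pieces, cursor = [], 0
    for start, end in sorted(positions):
        pieces.append(text[cursor:start])
        pieces.append("<mark>" + text[start:end] + "</mark>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


marked = mark_matches("Hakuna Matata means no worries", [(0, 6), (7, 13)])
print(marked)  # <mark>Hakuna</mark> <mark>Matata</mark> means no worries
```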