How to filter search results
All Nuclia search endpoints support filtering through the same API, using the filter_expression parameter. For POST endpoints, the expression is passed as a JSON object, like all other parameters. For GET endpoints, it is passed as a JSON object serialized to a string, with the appropriate URL-encoding for special characters. For this reason, we recommend using the POST endpoints.
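The difference between the two transports can be sketched in Python. This is only an illustration using the standard library: the query parameter and its value are assumptions made for the example, and no request is actually sent.

```python
import json
from urllib.parse import urlencode, parse_qs

# Example expression, using the same shape as the samples in this guide.
filter_expression = {"field": {"prop": "resource", "slug": "my-cool-resource"}}

# POST: the expression is a regular JSON object inside the request body.
# The "query" key is an assumption used only to show a sibling parameter.
post_body = json.dumps({"query": "umbrella", "filter_expression": filter_expression})

# GET: the expression is serialized to a string first, then URL-encoded
# together with the other query-string parameters.
get_query = urlencode(
    {"query": "umbrella", "filter_expression": json.dumps(filter_expression)}
)

print(post_body)
print(get_query)
```

The GET form is harder to read and to debug by hand, which is the practical reason behind the recommendation to use POST.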
A filter expression is composed of the following parts:
{
"field": <expr>,
"paragraph": <expr>,
"operator": <and/or>
}
- field: an expression to filter resource fields. This is where most of the filtering takes place: filters by resource id, slug, field type, resource labels or language are all defined here.
- paragraph: an expression to filter paragraphs. This applies filters to individual paragraphs based on paragraph labels or the kind of paragraph.
- operator: if both expressions are provided, how to combine them, either "and" or "or".
Examples:
Search in a specified resource
{
"field": {"prop": "resource", "slug": "my-cool-resource"}
}
Search for English texts, excluding OCR paragraphs
{
"field": {"prop": "language", "language": "en"},
"paragraph": {"not": {"prop": "kind", "kind": "OCR"}},
"operator": "and"
}
Filter expression
Each filter expression is a set of filters combined by operators (AND, OR, NOT). The allowed filters differ between field and paragraph expressions, but the operators are common.
Boolean operators
And
All filters must match for the expression to match
{
"and": [<expr>, <expr>]
}
Or
At least one of the filters must match for the expression to match
{
"or": [<expr>, <expr>]
}
Not
The filter must not match for the expression to match
{
"not": <expr>
}
Nesting
Operators can be nested, producing complex expressions.
For example, to search for movies or books in English that mention neither Barcelona nor Paris, you could write:
{
"and": [
{ "prop": "language", "language": "en" },
{
"or": [
{ "prop": "label", "labelset": "media_type", "label": "movies" },
{ "prop": "label", "labelset": "media_type", "label": "books" }
]
},
{
"not": {
"or": [
{ "prop": "entity", "subtype": "CITY", "value": "Barcelona" },
{ "prop": "entity", "subtype": "CITY", "value": "Paris" }
]
}
}
]
}
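Writing deeply nested expressions by hand is error-prone, so it can help to build them programmatically. The following Python sketch rebuilds the example above; the helper functions (and_, or_, not_, label, city) are hypothetical names introduced for this illustration and are not part of any Nuclia SDK.

```python
import json

def and_(*exprs):
    """All sub-expressions must match."""
    return {"and": list(exprs)}

def or_(*exprs):
    """At least one sub-expression must match."""
    return {"or": list(exprs)}

def not_(expr):
    """The sub-expression must not match."""
    return {"not": expr}

def label(labelset, name):
    """Classification label filter."""
    return {"prop": "label", "labelset": labelset, "label": name}

def city(value):
    """Entity filter for the CITY subtype."""
    return {"prop": "entity", "subtype": "CITY", "value": value}

expression = and_(
    {"prop": "language", "language": "en"},
    or_(label("media_type", "movies"), label("media_type", "books")),
    not_(or_(city("Barcelona"), city("Paris"))),
)

print(json.dumps(expression, indent=2))
```

The resulting dictionary has the same structure as the JSON shown above and can be used as a field expression.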
Resource filters
Resource ID or slug (resource)
Filters by a given resource id or slug (only one can be specified at a time).
{
"prop": "resource",
"id": "2e601fd990790691813d1380c104ab98"
}
{
"prop": "resource",
"slug": "my-slug"
}
Field type or specific field id (field)
Filters by a given field type or a specific field.
Type is one of text, file, link, conversation or generic.
{
"prop": "field",
"type": "text"
}
{
"prop": "field",
"type": "generic",
"name": "summary"
}
Documents containing a word (keyword)
Matches fields that contain a specific word.
{
"prop": "keyword",
"word": "umbrella"
}
Creation date (created)
Matches documents created inside the date range.
{
"prop": "created",
"since": "2021-03-05T02:00:00",
"until": "2021-05-15T02:00:00"
}
since or until can be omitted to search for documents older than or newer than a single date, respectively.
{
"prop": "created",
"since": "2021-03-05T02:00:00"
}
Modification date (modified)
Matches documents modified inside the date range.
{
"prop": "modified",
"since": "2021-03-05T02:00:00",
"until": "2021-05-15T02:00:00"
}
since or until can be omitted to search for documents older than or newer than a single date, respectively.
{
"prop": "modified",
"since": "2021-03-05T02:00:00"
}
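Both date filters share the same shape, so they can be produced by one small helper. This Python sketch is an illustration only: date_filter is a hypothetical function, the timestamp format simply mirrors the examples above, and rejecting a filter with neither bound is a choice made by this helper rather than a documented API rule.

```python
from datetime import datetime
from typing import Optional

def date_filter(prop: str,
                since: Optional[datetime] = None,
                until: Optional[datetime] = None) -> dict:
    """Build a "created" or "modified" filter, omitting any missing bound."""
    if prop not in ("created", "modified"):
        raise ValueError("prop must be 'created' or 'modified'")
    if since is None and until is None:
        # Assumption of this sketch: an unbounded range is treated as a mistake.
        raise ValueError("provide at least one of since/until")
    expr = {"prop": prop}
    if since is not None:
        expr["since"] = since.isoformat(timespec="seconds")
    if until is not None:
        expr["until"] = until.isoformat(timespec="seconds")
    return expr

# Documents modified on or after a single date (no upper bound).
print(date_filter("modified", since=datetime(2021, 3, 5, 2, 0, 0)))
```

Omitting the key entirely, rather than sending an empty value, keeps the generated JSON in line with the examples in this section.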
Origin tags (origin_tag)
Matches documents with a given origin tag (as specified at resource creation).
{
"prop": "origin_tag",
"tag": "word"
}
Origin metadata (origin_metadata)
Matches documents with the given origin metadata (as specified at resource creation).
{
"prop": "origin_metadata",
"field": "agent",
"value": "crawler"
}
Can also be used to match documents that have the specified metadata field, regardless of its value:
{
"prop": "origin_metadata",
"field": "agent"
}
Origin path (origin_path)
Matches path of the resource in the source system. It will match any path starting with the provided value.
Example: Users/JohnDoe/Documents will match files in the Documents folder of the JohnDoe user, but also the ones in Documents/Work or Documents/Personal, etc.
{
"prop": "origin_path",
"prefix": "Users/JohnDoe/Documents"
}
Can also be used to match when any path is set by not specifying any prefix:
{
"prop": "origin_path"
}
Origin source ID (origin_source)
Matches documents with a given origin source id (as specified at resource creation).
{
"prop": "origin_source",
"id": "internet"
}
Can also be used to match when any source is set by not specifying any id:
{
"prop": "origin_source"
}
Origin collaborators (origin_collaborator)
Matches documents with a given origin collaborator (as specified at resource creation).
{
"prop": "origin_collaborator",
"collaborator": "someone"
}
Classification labels (label)
Matches documents with a given label.
{
"prop": "label",
"labelset": "topic",
"label": "boats"
}
The label field can be omitted to match any resource with any label on that labelset.
{
"prop": "label",
"labelset": "topic"
}
Icon / Resource mimetype (resource_mimetype)
Matches the mimetype of the resource (also known as icon). You can also filter by the mimetype of a specific field (see next filter).
{
"prop": "resource_mimetype",
"type": "application",
"subtype": "pdf"
}
Can also filter by category by not passing the subtype field.
{
"prop": "resource_mimetype",
"type": "image"
}
Field mimetype (field_mimetype)
Matches the mimetype of the field. You can also filter by the mimetype of the resource/icon (see above).
{
"prop": "field_mimetype",
"type": "application",
"subtype": "pdf"
}
Can also filter by category by not passing the subtype field.
{
"prop": "field_mimetype",
"type": "image"
}
Entities / NERs (entity)
Matches fields containing the specified NER entity.
{
"prop": "entity",
"subtype": "CITY",
"value": "Paris"
}
Can also match any entity on a category:
{
"prop": "entity",
"subtype": "CITY"
}
Text language (language)
Matches documents containing text in the given language (even if they have other languages):
{
"prop": "language",
"language": "en"
}
Matches documents written primarily in the given language:
{
"prop": "language",
"language": "en",
"only_primary": true
}
Field generated by (generated)
Matches if the field was generated by the given source. Currently, this can only be used for fields generated by Data Augmentation.
{
"prop": "generated",
"by": "data-augmentation"
}
Can also be used to match fields generated by a specific DA task (given the field prefix).
{
"prop": "generated",
"by": "data-augmentation",
"da_task": "summarizer"
}
Paragraph filters
Classification labels (label)
Matches paragraphs with a given label.
{
"prop": "label",
"labelset": "topic",
"label": "boats"
}
The label field can be omitted to match any paragraph with any label on that labelset.
{
"prop": "label",
"labelset": "topic"
}
Paragraph kind (kind)
Matches paragraphs of that kind. Kind can be TEXT, OCR, INCEPTION, DESCRIPTION, TRANSCRIPT, TITLE or TABLE.
{
"prop": "kind",
"kind": "TEXT"
}
Catalog filters
The catalog can use most of the resource filters (except for field, field_mimetype, keyword and entity).
Additionally, it can also use the following filters:
Resource status (status)
Matches resources in a given processing status. Status can be PROCESSED, PENDING or ERROR.
{
"prop": "status",
"status": "PROCESSED"
}
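Before sending an expression to the catalog, it can be useful to check that it only uses filters the catalog accepts. The Python sketch below walks a nested expression and reports the excluded props listed above; collect_props and unsupported_in_catalog are hypothetical helpers for this illustration, and the real API may report such errors differently.

```python
# Props that, according to this section, the catalog does not support.
CATALOG_EXCLUDED_PROPS = {"field", "field_mimetype", "keyword", "entity"}

def collect_props(expr):
    """Yield every "prop" value found in a possibly nested expression."""
    if "and" in expr:
        for sub in expr["and"]:
            yield from collect_props(sub)
    elif "or" in expr:
        for sub in expr["or"]:
            yield from collect_props(sub)
    elif "not" in expr:
        yield from collect_props(expr["not"])
    else:
        yield expr["prop"]

def unsupported_in_catalog(expr):
    """Return the sorted list of props in expr that the catalog excludes."""
    return sorted(set(collect_props(expr)) & CATALOG_EXCLUDED_PROPS)

candidate = {
    "and": [
        {"prop": "status", "status": "PROCESSED"},
        {"not": {"prop": "keyword", "word": "umbrella"}},
    ]
}
print(unsupported_in_catalog(candidate))
```

An empty list means the expression only uses props that this section describes as available to the catalog.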
Reference documentation
The Nuclia API documentation is available here.
Legacy filter parameters
The parameters described below also apply filters and represent an older version of the API.
We recommend using filter_expression instead, but the documentation for the older parameters is still retained here.
Filters
The filters parameter allows you to filter the results depending on the value of different properties provided on the resource.
The following attributes are supported:
- /origin.tags: tags defined in the resource's origin property. Example: /origin.tags/blue, /origin.tags/green
- /classification.labels: labels, in the form /classification.labels/{labelset}/{label}. Example: /classification.labels/movie-genre/science-fiction
- /icon: mime type of the resource. Example: /icon/application/pdf or /icon/movie/mp4
- /metadata.status: processing status. Example: /metadata.status/PROCESSED, /metadata.status/PENDING or /metadata.status/ERROR
- /entities: resource entities, in the form /entities/{entity-type}/{entity-id}. Example: /entities/CITY/Barcelona
- /metadata.language: primary language of the document. Example: /metadata.language/ca for Catalan
- /metadata.languages: all other detected languages. Example: /metadata.languages/tr for Turkish
- /origin.metadata: metadata provided by the user. Example: /origin.metadata/fieldname/value
- /origin.path: path of the resource in the source system. It will match any path starting with the provided value. Example: /origin.path/Users/JohnDoe/Documents will match files in the Documents folder of the JohnDoe user, but also the ones in Documents/Work or Documents/Personal, etc.
Examples:
- To retrieve PNG images only, use:
filters=/icon/image/png
- To retrieve results in which the principal language is Italian, use:
filters=/metadata.language/it
- To retrieve results referring to the UNESCO organization, use:
filters=/entities/ORG/UNESCO
Filters can be combined by repeating the filters parameter. This example will retrieve results which are PDFs and which refer to the UNESCO organization:
filters=/icon/application/pdf&filters=/entities/ORG/UNESCO
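A repeated query-string parameter like this one can be generated with Python's standard library. This is a minimal sketch: it only builds the query string and makes no assumption about the endpoint URL it would be appended to.

```python
from urllib.parse import urlencode

filters = ["/icon/application/pdf", "/entities/ORG/UNESCO"]

# doseq=True repeats the "filters" key once per list element;
# safe="/" keeps the slashes readable instead of encoding them as %2F.
query_string = urlencode({"filters": filters}, doseq=True, safe="/")

print(query_string)  # filters=/icon/application/pdf&filters=/entities/ORG/UNESCO
```

Leaving out safe="/" produces an equivalent, fully percent-encoded query string.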
Advanced filtering
As shown above, combining multiple filters will implicitly return the intersection (i.e. the AND operator) of the specified filters.
If your use case needs more complex filtering expressions, you can use the POST versions of the search endpoints to provide a filtering expression.
Filtering expressions accept the following keys: all, any, none and not_all. Here are some examples:
all
{
"filters": [
{"all": ["/icon/application/pdf", "/entities/ORG/UNESCO"]}
]
}
This is equivalent to the last example of the previous section: it will return resources that are PDFs and have the UNESCO entity associated with them.
any
{
"filters": [
{"any": ["/icon/application/pdf", "/icon/movie/mp4"]}
]
}
Will return resources that are either PDFs or mp4 videos. This is equivalent to the OR logical operation.
none
{
"filters": [
{"none": ["/icon/application/pdf", "/icon/movie/mp4"]}
]
}
Will return results from documents that are neither PDFs nor mp4 videos. This is equivalent to the NOT(a OR b) logical expression.
not_all
{
"filters": [
{"not_all": ["/icon/application/pdf", "/entities/ORG/UNESCO"]}
]
}
Essentially, it will return the complementary set of results to the all example: all documents except those that are PDFs and also have the UNESCO entity associated with them. This is equivalent to the NOT(a AND b) logical expression.
Combining
If you need even more complex filtering expressions, you can combine multiple expression terms as more elements of the filters list:
{
"filters": [
{"all": ["/icon/application/pdf"]},
{"any": ["/entities/ORG/UNESCO", "/entities/GPE/US"]}
]
}
The returned result will be the implicit intersection (i.e. AND) of all expressions combined. In this example, it will return all documents that are PDFs and that have either UNESCO or US as a related entity.
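The semantics of the four keys, and of the implicit AND between list elements, can be modelled locally. The Python sketch below is not how the search backend is implemented; it is a hypothetical model where each document is reduced to the set of filter paths it carries, and it only handles terms written as single-key objects.

```python
def term_matches(term: dict, tags: set) -> bool:
    """Evaluate one {"all"|"any"|"none"|"not_all": [...]} term against a tag set."""
    if len(term) != 1:
        raise ValueError("each term must have exactly one key")
    key, values = next(iter(term.items()))
    hits = [value in tags for value in values]
    if key == "all":
        return all(hits)
    if key == "any":
        return any(hits)
    if key == "none":
        return not any(hits)      # NOT(a OR b)
    if key == "not_all":
        return not all(hits)      # NOT(a AND b)
    raise ValueError(f"unknown key: {key}")

def matches(filters: list, tags: set) -> bool:
    """Terms in the filters list are combined with an implicit AND."""
    return all(term_matches(term, tags) for term in filters)

filters = [
    {"all": ["/icon/application/pdf"]},
    {"any": ["/entities/ORG/UNESCO", "/entities/GPE/US"]},
]
document_tags = {"/icon/application/pdf", "/entities/GPE/US"}
print(matches(filters, document_tags))
```

With the sample tags, the document is a PDF and mentions US, so both terms hold and the combined filter matches.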
Date filtering
You can filter on the creation date using:
range_creation_start
range_creation_end
Examples:
- To get all resources created between 2023-01-01 and 2023-12-31:
range_creation_start=2023-01-01T00:00:00.000Z&range_creation_end=2023-12-31T23:59:59.000Z
- To get all resources created after 2023-01-01:
range_creation_start=2023-01-01T00:00:00.000Z
Filtering will be based on the origin.created value if provided in the resource; otherwise it will default to the resource creation date (created).
Please note: all resources created before 2023-11-02 will have to be reprocessed for origin.created to be filterable.
Similarly, you can filter on the modification date using:
range_modification_start
range_modification_end
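The timestamps in the creation-date examples above can be produced from timezone-aware datetime objects. This Python sketch assumes UTC values and the millisecond format used in those examples; to_param is a hypothetical helper name.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def to_param(moment: datetime) -> str:
    """Format a UTC datetime like 2023-01-01T00:00:00.000Z."""
    if moment.utcoffset() != timezone.utc.utcoffset(None):
        raise ValueError("expected a timezone-aware UTC datetime")
    return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")

params = {
    "range_creation_start": to_param(datetime(2023, 1, 1, tzinfo=timezone.utc)),
    "range_creation_end": to_param(datetime(2023, 12, 31, 23, 59, 59, tzinfo=timezone.utc)),
}

# safe=":" keeps the colons readable in the resulting query string.
print(urlencode(params, safe=":"))
```

The same helper can be used for range_modification_start and range_modification_end.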
Search in a specific field
To restrict the search to a specific field, you can use the fields parameter. It supports different field types:
- a: generic fields (basic attributes, like title or summary)
- t: text fields
- f: file fields
- u: link fields
Example:
fields=a/title
To search in several fields, the parameter can be repeated:
fields=a/title&fields=a/summary
For content fields, using this parameter on the resource /search endpoint restricts the search to a single piece of content, while using it on the main /search endpoint restricts the search to all content with the given id across all resources.