How to filter search results

All Nuclia search endpoints support filtering through the same API, using the filter_expression parameter. For POST endpoints, the expression is passed as a JSON object, like all other parameters. For GET endpoints, it is passed as a JSON object serialized to a string, with special characters URL-encoded. Because of this extra encoding step, we recommend using the POST endpoints.
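
For GET requests, the serialization and URL-encoding step can be sketched with Python's standard library. This is a minimal sketch: the helper name encode_filter_expression is hypothetical, and only the filter_expression parameter name comes from this page.

```python
import json
from urllib.parse import quote, unquote


def encode_filter_expression(expression: dict) -> str:
    """Serialize a filter expression for use in a GET query string."""
    # Compact JSON keeps the URL short; quote() escapes braces, quotes, colons, etc.
    compact = json.dumps(expression, separators=(",", ":"))
    return "filter_expression=" + quote(compact, safe="")


expression = {"field": {"prop": "resource", "slug": "my-cool-resource"}}
query = encode_filter_expression(expression)
print(query)

# Decoding the value gives back the original expression.
decoded = json.loads(unquote(query.split("=", 1)[1]))
print(decoded == expression)
```

With a POST endpoint, the same dictionary can be placed in the JSON body as-is, which is why POST is the simpler option.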

A filter expression is composed of the following parts:

{
"field": <expr>,
"paragraph": <expr>,
"operator": <and/or>
}
  • field: an expression to filter resource fields. This is where most of the filtering takes place, e.g. filters by resource id, slug, field type, resource labels or language are all defined here.
  • paragraph: an expression to filter paragraphs. It applies filters to individual paragraphs, based on paragraph labels or the kind of paragraph.
  • operator: if both expressions are provided, how to combine them, either and or or.

Examples:

Search in a specified resource

{
"field": {"prop": "resource", "slug": "my-cool-resource"}
}

Search for English texts, excluding OCR paragraphs

{
"field": {"prop": "language", "language": "en"},
"paragraph": {"not": {"prop": "kind", "kind": "OCR"}},
"operator": "and"
}
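
The three-part structure above can be assembled programmatically. The following is a minimal sketch: build_filter_expression is a hypothetical helper, not part of any Nuclia SDK, and it assumes that operator is only meaningful when both field and paragraph are present.

```python
import json
from typing import Optional


def build_filter_expression(
    field: Optional[dict] = None,
    paragraph: Optional[dict] = None,
    operator: str = "and",
) -> dict:
    """Assemble the top-level filter expression from its optional parts."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    expression = {}
    if field is not None:
        expression["field"] = field
    if paragraph is not None:
        expression["paragraph"] = paragraph
    # The operator only matters when there are two expressions to combine.
    if field is not None and paragraph is not None:
        expression["operator"] = operator
    return expression


english_without_ocr = build_filter_expression(
    field={"prop": "language", "language": "en"},
    paragraph={"not": {"prop": "kind", "kind": "OCR"}},
)

# For a POST endpoint the expression goes in the JSON body under filter_expression.
body = json.dumps({"filter_expression": english_without_ocr}, indent=2)
print(body)
```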

Filter expression

Each filter expression is a set of filters combined by operators (AND, OR, NOT). The allowed filters differ between field and paragraph expressions, but the operators are common.

Boolean operators

And

All filters must match for the expression to match

{
"and": [<expr>, <expr>]
}

Or

At least one of the filters must match for the expression to match

{
"or": [<expr>, <expr>]
}

Not

The filter must not match for the expression to match

{
"not": <expr>
}

Nesting

Operators can be nested, producing complex expressions.

For example, to search for movies or books in English that mention neither Barcelona nor Paris, you could write:

{
"and": [
{ "prop": "language", "language": "en" },
{
"or": [
{ "prop": "label", "labelset": "media_type", "label": "movies" },
{ "prop": "label", "labelset": "media_type", "label": "books" }
]
},
{
"not": {
"or": [
{ "prop": "entity", "subtype": "CITY", "value": "Barcelona" },
{ "prop": "entity", "subtype": "CITY", "value": "Paris" }
]
}
}
]
}
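
Deeply nested expressions are easier to read and maintain when built from small helpers. The sketch below rebuilds the expression above in Python; the helper names (and_, or_, not_, label, entity) are hypothetical and only produce plain dictionaries.

```python
import json


def and_(*expressions: dict) -> dict:
    return {"and": list(expressions)}


def or_(*expressions: dict) -> dict:
    return {"or": list(expressions)}


def not_(expression: dict) -> dict:
    return {"not": expression}


def label(labelset: str, value: str) -> dict:
    return {"prop": "label", "labelset": labelset, "label": value}


def entity(subtype: str, value: str) -> dict:
    return {"prop": "entity", "subtype": subtype, "value": value}


# Movies or books in English that mention neither Barcelona nor Paris.
expression = and_(
    {"prop": "language", "language": "en"},
    or_(label("media_type", "movies"), label("media_type", "books")),
    not_(or_(entity("CITY", "Barcelona"), entity("CITY", "Paris"))),
)
print(json.dumps(expression, indent=2))
```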

Resource filters

Resource ID or slug (resource)

Filters by a given resource id or slug (only one can be specified at a time).

{
"prop": "resource",
"id": "2e601fd990790691813d1380c104ab98"
}
{
"prop": "resource",
"slug": "my-slug"
}

Field type or specific field id (field)

Filters by a given field type or a specific field.

Type is one of text, file, link, conversation or generic.

{
"prop": "field",
"type": "text"
}
{
"prop": "field",
"type": "generic",
"name": "summary"
}

Documents containing a word (keyword)

Matches fields that contain a specific word.

{
"prop": "keyword",
"word": "umbrella"
}

Creation date (created)

Matches documents created inside the date range.

{
"prop": "created",
"since": "2021-03-05T02:00:00",
"until": "2021-05-15T02:00:00"
}

Either since or until can be omitted: with only since, documents created after that date match; with only until, documents created before that date match.

{
"prop": "created",
"since": "2021-03-05T02:00:00"
}
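
When the dates come from application code, the filter can be generated from datetime objects. This is a minimal sketch: created_filter is a hypothetical helper, and rejecting a call with neither bound is a choice made for this example rather than documented API behavior.

```python
from datetime import datetime
from typing import Optional


def created_filter(
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> dict:
    """Build a created filter, leaving out whichever bound is not given."""
    if since is None and until is None:
        # Assumption of this sketch: a range with no bounds is a mistake.
        raise ValueError("provide at least one of since or until")
    expression = {"prop": "created"}
    if since is not None:
        expression["since"] = since.isoformat(timespec="seconds")
    if until is not None:
        expression["until"] = until.isoformat(timespec="seconds")
    return expression


both_bounds = created_filter(
    since=datetime(2021, 3, 5, 2, 0, 0),
    until=datetime(2021, 5, 15, 2, 0, 0),
)
only_since = created_filter(since=datetime(2021, 3, 5, 2, 0, 0))
print(both_bounds)
print(only_since)
```

The same shape applies to the modified filter by changing the prop value.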

Modification date (modified)

Matches documents modified inside the date range.

{
"prop": "modified",
"since": "2021-03-05T02:00:00",
"until": "2021-05-15T02:00:00"
}

Either since or until can be omitted: with only since, documents modified after that date match; with only until, documents modified before that date match.

{
"prop": "modified",
"since": "2021-03-05T02:00:00"
}

Origin tags (origin_tag)

Matches documents with a given origin tag (as specified at resource creation).

{
"prop": "origin_tag",
"tag": "word"
}

Origin metadata (origin_metadata)

Matches documents with the given origin metadata (as specified at resource creation).

{
"prop": "origin_metadata",
"field": "agent",
"value": "crawler"
}

Can also be used to match documents that have the specified metadata field, regardless of its value:

{
"prop": "origin_metadata",
"field": "agent"
}

Origin path (origin_path)

Matches path of the resource in the source system. It will match any path starting with the provided value. Example: Users/JohnDoe/Documents will match files in the Documents folder of the JohnDoe user, but also the ones in Documents/Work or Documents/Personal, etc.

{
"prop": "origin_path",
"prefix": "Users/JohnDoe/Documents"
}

Can also be used to match when any path is set by not specifying any prefix:

{
"prop": "origin_path"
}

Origin source ID (origin_source)

Matches documents with a given origin source id (as specified at resource creation).

{
"prop": "origin_source",
"id": "internet"
}

Can also be used to match when any source is set by not specifying any id:

{
"prop": "origin_source"
}

Origin collaborators (origin_collaborator)

Matches documents with a given origin collaborator (as specified at resource creation).

{
"prop": "origin_collaborator",
"collaborator": "someone"
}

Classification labels (label)

Matches documents with a given label.

{
"prop": "label",
"labelset": "topic",
"label": "boats"
}

The label field can be skipped to match any resources with any label on that labelset.

{
"prop": "label",
"labelset": "topic"
}

Icon / Resource mimetype (resource_mimetype)

Matches the mimetype of the resource (also known as icon). You can also filter by the mimetype of a specific field (see next filter).

{
"prop": "resource_mimetype",
"type": "application",
"subtype": "pdf"
}

You can also filter by category by omitting the subtype field.

{
"prop": "resource_mimetype",
"type": "image"
}

Field mimetype (field_mimetype)

Matches the mimetype of the field. You can also filter by the mimetype of the resource, or icon (see above).

{
"prop": "field_mimetype",
"type": "application",
"subtype": "pdf"
}

You can also filter by category by omitting the subtype field.

{
"prop": "field_mimetype",
"type": "image"
}

Entities / NERs (entity)

Matches fields containing the specified NER entity.

{
"prop": "entity",
"subtype": "CITY",
"value": "Paris"
}

Can also match any entity on a category:

{
"prop": "entity",
"subtype": "CITY"
}

Text language (language)

Matches documents containing text in the given language (even if they have other languages):

{
"prop": "language",
"language": "en"
}

Matches documents written primarily in the given language:

{
"prop": "language",
"language": "en",
"only_primary": true
}

Field generated by (generated)

Matches if the field was generated by the given source. Currently, it can only be used for fields generated by Data Augmentation.

{
"prop": "generated",
"by": "data-augmentation"
}

Can also be used to match fields generated by a specific DA task (given the field prefix).

{
"prop": "generated",
"by": "data-augmentation",
"da_task": "summarizer"
}

Paragraph filters

Classification labels (label)

Matches paragraphs with a given label.

{
"prop": "label",
"labelset": "topic",
"label": "boats"
}

The label field can be skipped to match any paragraphs with any label on that labelset.

{
"prop": "label",
"labelset": "topic"
}

Paragraph kind (kind)

Matches paragraphs of that kind. Kind can be TEXT, OCR, INCEPTION, DESCRIPTION, TRANSCRIPT, TITLE or TABLE.

{
"prop": "kind",
"kind": "TEXT"
}

Catalog filters

The catalog can use most of the resource filters (except for field, field_mimetype, keyword and entity). Additionally, it can use the following filters:

Resource status (status)

Matches resources in a given processing status. Status can be PROCESSED, PENDING or ERROR.

{
"prop": "status",
"status": "PROCESSED"
}
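
Because the catalog rejects some resource filters, it can help to check an expression before sending it. The sketch below walks a nested expression and reports the props listed above as unsupported; unsupported_catalog_props is a hypothetical helper, and the check runs entirely on the client side.

```python
CATALOG_UNSUPPORTED = {"field", "field_mimetype", "keyword", "entity"}


def unsupported_catalog_props(expression: dict) -> set:
    """Return the props used in the expression that the catalog cannot use."""
    for operator in ("and", "or"):
        if operator in expression:
            found = set()
            for child in expression[operator]:
                found |= unsupported_catalog_props(child)
            return found
    if "not" in expression:
        return unsupported_catalog_props(expression["not"])
    # Leaf filter: keep its prop only if it is in the unsupported set.
    return {expression.get("prop")} & CATALOG_UNSUPPORTED


expression = {
    "and": [
        {"prop": "status", "status": "PROCESSED"},
        {"not": {"prop": "entity", "subtype": "CITY", "value": "Paris"}},
    ]
}
print(unsupported_catalog_props(expression))  # {'entity'}
```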

Reference documentation

The Nuclia API documentation is available here.

Legacy filter parameters

warning

The parameters described below also apply filters and represent an older version of the API. We recommend using filter_expression instead, but the documentation for the older parameters is still retained here.

Filters

The filters parameter allows you to filter the results depending on the value of different properties provided on the resource.

The following attributes are supported:

  • /origin.tags: tags defined in the resource's origin property Example: /origin.tags/blue, /origin.tags/green
  • /classification.labels: labels: /classification.labels/{labelset}/{label} Example: /classification.labels/movie-genre/science-fiction
  • /icon: mime type of resource Example: /icon/application/pdf or /icon/movie/mp4
  • /metadata.status: processing status Example: /metadata.status/PROCESSED, /metadata.status/PENDING or /metadata.status/ERROR
  • /entities: resource entities: /entities/{entity-type}/{entity-id} Example: /entities/CITY/Barcelona
  • /metadata.language: primary language of the document Example: /metadata.language/ca for catalan language
  • /metadata.languages: all other detected languages Example: /metadata.languages/tr for turkish language
  • /origin.metadata: metadata provided by the user Example: /origin.metadata/fieldname/value
  • /origin.path: path of the resource in the source system. It will match any path starting with the provided value. Example: /origin.path/Users/JohnDoe/Documents will match files in the Documents folder of the JohnDoe user, but also the ones in Documents/Work or Documents/Personal, etc.

Examples:

  • To retrieve PNG images only, use:

    filters=/icon/image/png
  • To retrieve results in which the principal language is Italian, use:

    filters=/metadata.language/it
  • To retrieve results referring to the UNESCO organization, use:

    filters=/entities/ORG/UNESCO

Filters can be combined by repeating the filters parameter. This example will retrieve results which are PDF and which are referring to the UNESCO organization:

filters=/icon/application/pdf&filters=/entities/ORG/UNESCO
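
A query string with a repeated filters parameter can be produced with Python's standard library, as in this minimal sketch. Keeping the slashes unescaped (safe="/") is a readability choice for this example; percent-encoded slashes decode to the same values.

```python
from urllib.parse import parse_qs, urlencode

filters = ["/icon/application/pdf", "/entities/ORG/UNESCO"]

# A list of (name, value) pairs lets the same parameter name appear several times.
query = urlencode([("filters", value) for value in filters], safe="/")
print(query)  # filters=/icon/application/pdf&filters=/entities/ORG/UNESCO

# Parsing the query string recovers both values under the same key.
print(parse_qs(query)["filters"])
```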

Advanced filtering

As shown above, combining multiple filters will implicitly return the intersection (i.e: AND operator) between the specified filters. If your use-case needs more complex filtering expressions, you can use the POST versions of the search endpoints to provide a filtering expression. Filtering expressions accept the following keys: all, any, none and not_all. Here are some examples:

all
{
"filters": [
{"all": ["/icon/application/pdf", "/entities/ORG/UNESCO"]}
]
}

Which would be equivalent to the last example of the previous section: it will return resources that are PDF and have the UNESCO entity associated with them.

any
{
"filters": [
{"any": ["/icon/application/pdf", "/icon/movie/mp4"]}
]
}

Will return resources that are either PDF or mp4 videos. This is equivalent to the OR logical operation.

none
{
"filters": [
{"none": ["/icon/application/pdf", "/icon/movie/mp4"]}
]
}

Will return results from documents that are neither PDF nor mp4 videos. This is equivalent to the NOT(a OR b) logical expression.

not_all
{
"filters": [
{"not_all": ["/icon/application/pdf", "/entities/ORG/UNESCO"]}
]
}

Essentially, it will return the complementary set of results to the all example: all documents except those that are PDFs and also have the UNESCO entity associated with them. This is equivalent to the NOT(a AND b) logical expression.

Combining

If you need even more complex filtering expressions, you can combine multiple expression terms as more elements of the filters list:

{
"filters": [
{"all": ["/icon/application/pdf"]},
{"any": ["/entities/ORG/UNESCO", "/entities/GPE/US"]}
]
}

And the returned result will be the implicit intersection (i.e: AND) of all expressions combined. In this example, it will return all documents that are PDF and that have either UNESCO or US as a related entity.
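
The semantics of the four keys, and of the implicit AND between list elements, can be modelled locally. This is only an illustrative sketch: term_matches and resource_matches are hypothetical names, each resource is reduced to a set of filter strings, and matching is done by exact string comparison, which may differ from how the service matches values.

```python
def term_matches(term: dict, resource_tags: set) -> bool:
    """Evaluate one {"all" | "any" | "none" | "not_all": [...]} term."""
    key, wanted = next(iter(term.items()))
    present = [value in resource_tags for value in wanted]
    if key == "all":
        return all(present)
    if key == "any":
        return any(present)
    if key == "none":
        return not any(present)      # NOT(a OR b)
    if key == "not_all":
        return not all(present)      # NOT(a AND b)
    raise ValueError(f"unknown key: {key}")


def resource_matches(filters: list, resource_tags: set) -> bool:
    """Terms in the filters list are combined with an implicit AND."""
    return all(term_matches(term, resource_tags) for term in filters)


filters = [
    {"all": ["/icon/application/pdf"]},
    {"any": ["/entities/ORG/UNESCO", "/entities/GPE/US"]},
]
pdf_about_unesco = {"/icon/application/pdf", "/entities/ORG/UNESCO"}
pdf_without_entities = {"/icon/application/pdf"}

print(resource_matches(filters, pdf_about_unesco))      # True
print(resource_matches(filters, pdf_without_entities))  # False
```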

Date filtering

You can filter on the creation date using:

  • range_creation_start
  • range_creation_end

Examples:

  • To get all resources created between 2023-01-01 and 2023-12-31:

    range_creation_start=2023-01-01T00:00:00.000Z&range_creation_end=2023-12-31T23:59:59.000Z
  • To get all resources created after 2023-01-01:

    range_creation_start=2023-01-01T00:00:00.000Z

Filtering will be based on the origin.created value if provided in the resource, otherwise it will default to the resource creation date (created).
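
The creation-date parameters can be generated from timezone-aware datetime objects. This is a minimal sketch: to_api_timestamp and creation_range_query are hypothetical helpers, and the timestamp format is simply copied from the examples above.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode


def to_api_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime like 2023-01-01T00:00:00.000Z."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def creation_range_query(
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> str:
    """Build the query string, including only the bounds that are given."""
    params = []
    if start is not None:
        params.append(("range_creation_start", to_api_timestamp(start)))
    if end is not None:
        params.append(("range_creation_end", to_api_timestamp(end)))
    # Colons are left unescaped to match the examples above.
    return urlencode(params, safe=":")


query = creation_range_query(
    start=datetime(2023, 1, 1, tzinfo=timezone.utc),
    end=datetime(2023, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
print(query)
```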

note

Please note: all resources created before 2023-11-02 will have to be reprocessed for origin.created to be filterable.

Similarly, you can filter on the modification date using:

  • range_modification_start
  • range_modification_end

Search in a specific field

To restrict the search to a specific field, you can use the fields parameter. Each value is a field type prefix followed by the field id. The following field types are supported:

  • a: generic fields (= basic attributes, like title or summary)
  • t: text fields
  • f: file fields
  • u: link fields

Example:

fields=a/title

To search in several fields, the parameter can be repeated:

fields=a/title&fields=a/summary
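
The repeated fields parameter can be built from (type, id) pairs, with a check that the type prefix is one of those listed above. This is a minimal sketch: fields_query is a hypothetical helper, and the validation happens on the client side only.

```python
from urllib.parse import urlencode

# Field type prefixes listed above.
FIELD_TYPES = {"a": "generic", "t": "text", "f": "file", "u": "link"}


def fields_query(*fields: tuple) -> str:
    """Turn (type, id) pairs into a query string with repeated fields parameters."""
    values = []
    for field_type, field_id in fields:
        if field_type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {field_type}")
        values.append(f"{field_type}/{field_id}")
    # doseq=True emits one fields=... pair per list element.
    return urlencode({"fields": values}, doseq=True, safe="/")


print(fields_query(("a", "title"), ("a", "summary")))  # fields=a/title&fields=a/summary
```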

For content fields, the effect depends on the endpoint. On the resource /search endpoint, it restricts the search to a single piece of content. On the main /search endpoint, it restricts the search to all content with the given id across all resources.