Indexing

Unstructured text

The extraction process pulls the text out of your data, whatever its format (PDF, Word, audio, video, etc.), running speech-to-text or optical character recognition (OCR) when needed.

This text content is what is used for indexing, searching, and generating answers.

It works well for any unstructured text.

Note: archive files (.zip, .tar, etc.) are supported. The text content of every file in the archive is indexed, but it is all gathered into a single resource. If you expect separate results for each file when searching, extract the files from the archive and index them individually.
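As a minimal sketch of the extraction step described in the note, the following uses Python's standard `zipfile` module to unpack an archive so that each file can then be indexed as its own resource. The helper name `extract_archive` and the demo file names are illustrative assumptions; the upload itself is not shown.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: Path, target_dir: Path) -> list[Path]:
    """Extract a .zip archive and return the paths of the extracted files."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
    # Keep regular files only; each one can then be uploaded as its own resource.
    return sorted(p for p in target_dir.rglob("*") if p.is_file())


# Build a small demo archive so the sketch runs on its own.
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "docs.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("report.txt", "Quarterly report")
    archive.writestr("notes/meeting.txt", "Meeting notes")

extracted = extract_archive(demo_zip, workdir / "extracted")
for path in extracted:
    # Replace this print with one upload per file.
    print(path.name)
```

Only extract archives you trust, since archive members can carry unexpected paths.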

Structured text

A text is considered structured when its structure is essential to understanding its meaning.

For example, a table is a structured text:

Book	Author
The Lord of the Rings	J. R. R. Tolkien
Dune	Frank Herbert

The structure of the table defines the meaning of the text: the column labels give a specific meaning to the content of each cell.

A table is a very simple type of structured text. There are more technical structured formats, like XML, JSON, CSV, etc.

Nuclia can understand some structured texts. For example, you can index a JSON document like:

[
  {
    "book": "The Lord of the Rings",
    "author": "J. R. R. Tolkien"
  },
  {
    "book": "Dune",
    "author": "Frank Herbert"
  }
]

and then ask questions like "Who wrote The Lord of the Rings?". Nuclia will deliver the expected answer.

A CSV file containing a first column with questions and a second column with corresponding answers will also be processed properly.

Nevertheless, at the moment, if your structured text contains a lot of attributes (typically a CSV file with 50 columns), or if the attribute names are not explicit (like ans for answer), the semantic indexing will not be optimal.

In such cases, it is usually better to index the textual attributes as plain text in text fields, creating one resource per record. This can be done using the API or one of the SDKs.
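The sketch below illustrates the "one resource per record" approach: it reads a CSV, keeps only the textual attributes, replaces cryptic column names with explicit labels, and builds one plain-text payload per row. The column names, the `EXPLICIT_NAMES` mapping, the `record_to_payload` helper, and the payload shape (`title`, `texts`, `body`, `format`) are all assumptions made for illustration. Check the API or SDK reference for the exact resource format before sending anything.

```python
import csv
import io

# Hypothetical CSV with cryptic column names and a non-textual attribute.
CSV_DATA = """id,bk,auth,price
1,The Lord of the Rings,J. R. R. Tolkien,25
2,Dune,Frank Herbert,18
"""

# Hypothetical mapping from cryptic column names to explicit labels.
# Columns not listed here (id, price) are left out of the indexed text.
EXPLICIT_NAMES = {"bk": "Book", "auth": "Author"}


def record_to_payload(row: dict[str, str]) -> dict:
    """Turn one CSV record into a plain-text resource payload (assumed shape)."""
    body = "\n".join(f"{label}: {row[col]}" for col, label in EXPLICIT_NAMES.items())
    return {
        "title": row["bk"],
        "texts": {"record": {"body": body, "format": "PLAIN"}},
    }


payloads = [record_to_payload(row) for row in csv.DictReader(io.StringIO(CSV_DATA))]

for payload in payloads:
    # Each payload would be sent as its own resource through the API or an SDK.
    print(payload["texts"]["record"]["body"])
```

Writing each record as labelled sentences or lines gives the semantic index explicit context for every value, which bare column names like `auth` do not provide.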

Structured content in documents

When your files (.docx, .pdf, etc.) contain tables, specific processing is required to extract the structured content. How you enable it depends on the tool you use:

Nuclia Dashboard: select the "Interpret tables" option in the upload dialog.
CLI: append --interpretTables to the upload command.
SDK: use interpretTables=True in the file method of the NucliaUpload class.
API: append +aitable to the mimetype field of the resource, like application/pdf+aitable.
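For the API option, the only change is the suffix on the MIME type. A small sketch of that transformation follows; the helper name `with_table_interpretation` is hypothetical, and only the `+aitable` suffix comes from the list above.

```python
TABLE_SUFFIX = "+aitable"


def with_table_interpretation(mimetype: str) -> str:
    """Append the +aitable suffix to a MIME type, unless it is already there."""
    if mimetype.endswith(TABLE_SUFFIX):
        return mimetype
    return mimetype + TABLE_SUFFIX


print(with_table_interpretation("application/pdf"))
```

Guarding against a suffix that is already present keeps the helper safe to call more than once on the same value.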

Metadata

Nuclia lets you store different types of metadata:

icon: the MIME type of the main field.
origin attributes: provided by the user, they describe where the data comes from (URL, author, tags, etc.).
labels and entities: provided by the user or generated automatically during processing.
other extracted metadata: generated during processing.
extra metadata: freely defined by the user. It is not used for searching but can serve other purposes, such as displaying information in the UI.

The corresponding text values are indexed for full-text search, except for extra metadata.

Some of this metadata can be used to filter search results:

icon
labels
entities
in origin: authors, tags

See Filters.
