Indexing
Unstructured text
The extraction process pulls the text out of your data, whatever its format (PDF, Word, audio, video, etc.), running speech-to-text and optical character recognition (OCR) when needed.
This text content is what is used for indexing, searching, and generating answers.
It works well for any unstructured text.
Note: archive files (like .zip, .tar, etc.) are supported. The text content of each file contained in the archive is indexed, but it is all gathered into a single resource. If you expect separate results when searching, it is better to extract the files from the archive and index them individually.
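As an illustration of the note above, here is a minimal sketch of unpacking an archive locally so that each file can then be indexed as its own resource. It uses only the Python standard library; the `extract_archive` helper name is hypothetical, and the upload step itself is deliberately left out.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, target_dir: str) -> list[Path]:
    """Unpack an archive (.zip, .tar, ...) and return the extracted files."""
    # The archive format is inferred from the file extension.
    shutil.unpack_archive(archive_path, target_dir)
    return sorted(p for p in Path(target_dir).rglob("*") if p.is_file())


if __name__ == "__main__":
    # Build a tiny demo archive so the sketch runs end to end.
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "docs.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("a.txt", "first document")
            zf.writestr("b.txt", "second document")
        for path in extract_archive(str(archive), str(Path(tmp) / "out")):
            # Each extracted file would then be uploaded as a separate resource.
            print(path.name)
```

Only unpack archives you trust, since extraction writes whatever paths the archive declares.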
Structured text
A text is considered structured when its structure is essential to understanding its meaning.
For example, a table is a structured text:
| Book | Author |
|---|---|
| The Lord of the Rings | J. R. R. Tolkien |
| Dune | Frank Herbert |
The meaning of the text is defined by the structure of the table, as the column labels convey a specific meaning to the content of each cell.
A table is a very simple type of structured text. There are more technical structured formats like XML, JSON, CSV, etc.
Nuclia can understand some structured texts. For example, you can index a JSON document like:
[
  {
    "book": "The Lord of the Rings",
    "author": "J. R. R. Tolkien"
  },
  {
    "book": "Dune",
    "author": "Frank Herbert"
  }
]
and then ask questions like "Who wrote The Lord of the Rings?". Nuclia will deliver the expected answer.
A CSV file containing a first column with questions and a second column with corresponding answers will also be processed properly.
Nevertheless, at the moment, if your structured text contains a lot of attributes (typically a CSV file with 50 columns), or if the attribute names are not explicit (like `ans` for `answer`), the semantic indexing will not be optimal.
In such cases, it is usually better to index the textual attributes as plain text in text fields, creating one resource per record. This can be done using the API or one of the SDKs.
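The "one resource per record" approach could be prepared along these lines. This is only a sketch: it builds the payloads without sending them, and the payload shape (a `texts` entry with `body` and `format`), the `content` field name, the `record-N` titles, and the `records_to_resources` helper are all assumptions to check against the API reference. The `labels` mapping replaces cryptic column names with explicit ones and drops the columns that are not listed.

```python
import csv
import io


def records_to_resources(csv_text: str, labels: dict[str, str]) -> list[dict]:
    """Turn each CSV record into one resource payload with a plain-text field.

    `labels` maps the CSV columns to keep onto explicit, readable names.
    """
    resources = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Spell out each attribute so the indexed text is self-explanatory.
        body = "\n".join(
            f"{label}: {row[column]}"
            for column, label in labels.items()
            if row.get(column)
        )
        resources.append({
            "title": f"record-{index}",  # hypothetical naming scheme
            # Assumed payload shape for a plain-text field.
            "texts": {"content": {"body": body, "format": "PLAIN"}},
        })
    return resources


if __name__ == "__main__":
    sample = "q,ans,internal_id\nWho wrote Dune?,Frank Herbert,42\n"
    for resource in records_to_resources(sample, {"q": "question", "ans": "answer"}):
        print(resource)
```

Each resulting dictionary would then be sent as a separate resource through the API or an SDK.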
Structured content in documents
When your documents (.docx, .pdf, etc.) contain tables, specific processing is required to extract the structured content. You can enable it as follows:
- Nuclia Dashboard: select the "Interpret tables" option in the upload dialog.
- CLI: append `--interpretTables` to the `upload` command.
- SDK: use `interpretTables=True` in the `file` method of the `NucliaUpload` class.
- API: append `+aitable` to the `mimetype` field of the resource, like `application/pdf+aitable`.
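For the API option, the mimetype value can be derived from the file name. The sketch below is not part of any Nuclia SDK: `table_aware_mimetype` is a hypothetical helper that relies on Python's standard `mimetypes` module and simply appends the `+aitable` suffix described above.

```python
import mimetypes


def table_aware_mimetype(filename: str) -> str:
    """Return the guessed mimetype of `filename` with "+aitable" appended."""
    mimetype, _encoding = mimetypes.guess_type(filename)
    if mimetype is None:
        # Fail loudly rather than sending a malformed mimetype.
        raise ValueError(f"Cannot guess a mimetype for {filename!r}")
    return f"{mimetype}+aitable"


if __name__ == "__main__":
    print(table_aware_mimetype("report.pdf"))  # application/pdf+aitable
```

The returned string is what would go in the `mimetype` field of the resource.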
Metadata
Nuclia lets you store different types of metadata:
- `icon` (which is the mimetype of the main field).
- `origin` attributes, provided by the user, and containing information about the origin of the data (URL, author, tags, etc.).
- Labels and entities provided by the user or automatically generated by the processing.
- Other extracted metadata generated by the processing.
- `extra` metadata, which can be freely defined by the user. The purpose of this metadata is to store information that is not used for searching, but that can be used for other purposes, like displaying information in the UI.
Their corresponding text values are indexed for full-text search (except `extra`).
Some of this metadata can be used to filter the search results:
- `icon`
- labels
- entities
- in `origin`: authors, tags
See Filters.