Ingest Different Data Types
Nuclia supports a wide variety of data types to ensure comprehensive search capabilities. This section outlines the different data types you can ingest.
Nuclia will automatically apply speech-to-text and optical character recognition (OCR) when needed, depending on the content of the data. The resulting text content is what will be used for indexing, searching, and generating answers.
Text
Text data is the most common type of data that can be ingested. It includes:
- Plain text (PLAIN)
- HTML (HTML)
- Markdown (MARKDOWN): the content is converted to plain text. To keep the Markdown formatting, use KEEP_MARKDOWN instead.
- reStructuredText (RST)
- JSON (JSON)
- JSONL (JSONL)
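As a rough illustration of how the format identifiers above might be used, the sketch below builds a text field and rejects any format that is not in the list. The function name `build_text_field` and the `body`/`format` keys are assumptions made for this example; check the Nuclia API reference for the exact payload shape before relying on them.

```python
# Format identifiers taken from the list above.
TEXT_FORMATS = {"PLAIN", "HTML", "MARKDOWN", "KEEP_MARKDOWN", "RST", "JSON", "JSONL"}


def build_text_field(body: str, fmt: str = "PLAIN") -> dict:
    """Return a dict describing one text field, rejecting unknown formats.

    The "body" and "format" keys are illustrative assumptions, not a
    confirmed API contract.
    """
    normalized = fmt.upper()
    if normalized not in TEXT_FORMATS:
        raise ValueError(f"Unsupported text format: {fmt!r}")
    return {"body": body, "format": normalized}


if __name__ == "__main__":
    # KEEP_MARKDOWN preserves the Markdown formatting instead of flattening it.
    print(build_text_field("# Title\n\nSome *emphasis*.", "keep_markdown"))
```

Validating the format on the client side gives an immediate error for typos such as `MARKDWN`, rather than a rejected request later.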
Files
Files can be uploaded directly as binary data or provided as a URL.
- Text Files: Including plain text (.txt), web page (.html), Microsoft Word (.docx), PDF (.pdf), JSON (.json), and more.
- Spreadsheets: Such as Microsoft Excel (.xlsx) and CSV (.csv) files.
- Presentations: Including Microsoft PowerPoint (.pptx).
- Images: Common formats like JPEG (.jpg), PNG (.png), and TIFF (.tiff).
- Videos: Supported formats include MP4 (.mp4), AVI (.avi), and MPEG (.mpeg).
- Audio: Including MP3 (.mp3) and WAV (.wav) files.
- Web Data: Such as HTML files and data from sitemaps.
- Archive Files: Formats such as .zip, .gzip, and .rar.
For a comprehensive list of all supported file types, refer to the Apache Tika documentation.
Archive Files: The text content of each file contained in the archive will be indexed together as a single resource. If you prefer separate search results for each file, it is better to extract the files before indexing.
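If separate search results per file are preferred, the extraction step can be done locally before uploading. The sketch below handles only .zip archives, using Python's standard `zipfile` module; the helper name `extract_archive` is hypothetical, and uploading each extracted file is left to whichever client you use.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str) -> list[Path]:
    """Extract a .zip archive and return the paths of the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content to index
            # ZipFile.extract strips absolute paths and ".." components,
            # so members cannot be written outside dest.
            extracted.append(Path(archive.extract(info, path=dest)))
    return sorted(extracted)
```

Each returned path can then be uploaded as its own resource, so that every file appears as a separate search result.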
Web pages
Nuclia can ingest a web page directly from its URL. The content of the page is then indexed and searchable. This method accepts HTML content only. To ingest a remote file that is not HTML, upload it as a file instead (using its URL).
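One way to decide between the two routes is to inspect the URL before submitting it. The sketch below is a rough heuristic based only on the file extension in the URL path; the function name `choose_url_field` and the `"link"`/`"file"` labels are assumptions made for this example. Extension-based guessing can be wrong (for instance, a URL without an extension may still serve a PDF), so checking the response's Content-Type header is more reliable where a network request is acceptable.

```python
import mimetypes
from urllib.parse import urlparse


def choose_url_field(url: str) -> str:
    """Return "link" for likely HTML pages and "file" for other remote content."""
    # Use only the path so that query strings do not hide the extension.
    path = urlparse(url).path
    guessed_type, _ = mimetypes.guess_type(path)
    # No recognizable extension (e.g. "/docs/") usually means a web page.
    if guessed_type is None or guessed_type == "text/html":
        return "link"
    return "file"


if __name__ == "__main__":
    for candidate in ("https://example.com/docs/",
                      "https://example.com/report.pdf?download=1"):
        print(candidate, "->", choose_url_field(candidate))
```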
Conversations
Conversation messages can be ingested together as a single piece of content: a list of messages, each with a timestamp, an author, and, optionally, attachments.
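The structure described above can be sketched as a small data model. The class name `Message`, the helper `build_conversation`, and all key names in the resulting dict are illustrative assumptions rather than the actual Nuclia schema; consult the API reference for the real field names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Message:
    author: str
    timestamp: datetime
    text: str
    attachments: list[str] = field(default_factory=list)  # optional


def build_conversation(messages: list[Message]) -> dict:
    """Serialize messages, oldest first, into a single conversation dict.

    Assumes timestamps are either all timezone-aware or all naive,
    since Python cannot compare the two kinds.
    """
    serialized = []
    for message in sorted(messages, key=lambda m: m.timestamp):
        entry = {
            "author": message.author,
            "timestamp": message.timestamp.isoformat(),
            "text": message.text,
        }
        if message.attachments:  # include the key only when present
            entry["attachments"] = list(message.attachments)
        serialized.append(entry)
    return {"messages": serialized}


if __name__ == "__main__":
    conversation = build_conversation([
        Message("bob", datetime(2024, 1, 1, 9, 5, tzinfo=timezone.utc),
                "Here it is.", ["report.pdf"]),
        Message("alice", datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
                "Can you send the report?"),
    ])
    print(conversation)
```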