Ingest Different Data Types

Nuclia can ingest a wide variety of data types, so that content stored in many different formats becomes searchable. This section outlines the data types you can ingest.

Nuclia will automatically apply speech-to-text and optical character recognition (OCR) when needed, depending on the content of the data. The resulting text content is what will be used for indexing, searching, and generating answers.

Supported Data Types

  • Text Files: Including plain text (.txt), Microsoft Word (.docx), PDF (.pdf), JSON (.json), and more.
  • Spreadsheets: Such as Microsoft Excel (.xlsx) and CSV (.csv) files.
  • Presentations: Including Microsoft PowerPoint (.pptx).
  • Images: Common formats like JPEG (.jpg), PNG (.png), and TIFF (.tiff).
  • Videos: Supported formats include MP4 (.mp4), AVI (.avi), and MPEG (.mpeg).
  • Audio: Including MP3 (.mp3) and WAV (.wav) files.
  • Web Data: Such as HTML files and data from sitemaps.
  • Archive Files: Formats such as .zip, .gzip, and .rar.

For a comprehensive list of all supported file types, refer to the Apache Tika documentation.
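As an illustration, the categories listed above can be written as a simple lookup table, for example to sort local files before uploading them. This is only a sketch: the `CATEGORIES` table and the `categorize` function are hypothetical helpers, not part of any Nuclia SDK, and the extensions are just the ones named in this section. A file that falls into "other" may still be supported.

```python
from pathlib import Path

# Extensions taken from the list in this section (illustrative, not exhaustive).
CATEGORIES = {
    "text": {".txt", ".docx", ".pdf", ".json"},
    "spreadsheet": {".xlsx", ".csv"},
    "presentation": {".pptx"},
    "image": {".jpg", ".png", ".tiff"},
    "video": {".mp4", ".avi", ".mpeg"},
    "audio": {".mp3", ".wav"},
    "web": {".html"},
    "archive": {".zip", ".gzip", ".rar"},
}


def categorize(filename: str) -> str:
    """Return the category for a file name, or "other" if it is not listed."""
    ext = Path(filename).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return "other"
```

For example, `categorize("Report.PDF")` returns `"text"` because the extension comparison is case-insensitive.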

Tip

Archive Files: The text content of all files contained in an archive is indexed together as a single resource. If you prefer a separate search result for each file, extract the files and index them individually.
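A minimal sketch of extracting an archive before indexing, using only the Python standard library and assuming a .zip file. The function name `extract_archive` is hypothetical, and the upload step is deliberately left out because it depends on how you send files to Nuclia.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path: str, dest_dir: str) -> list[Path]:
    """Extract a .zip archive and return the paths of the extracted files."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue  # directories are created implicitly on extraction
            target = (dest / member.filename).resolve()
            # Skip entries whose path would escape the destination directory.
            if not target.is_relative_to(dest):
                continue
            extracted.append(Path(archive.extract(member, dest)))
    return extracted
```

Each returned path can then be uploaded as its own resource, so that every file produces a separate search result.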