Model Name | Size | Threshold | Max Tokens | Matryoshka | Multilingual | External |
---|---|---|---|---|---|---|
en-2024-04-24 | 768 | 0.47 | 2048 | No | No | No |
multilingual-2023-08-16 | 1024 | 0.7 | 512 | No | Yes | No |
multilingual-2024-05-06 | 1024 | 0.4 | 2048 | No | Yes | No |
Open AI small | 1536 | 0.5 | 8192 | Yes | No | No |
Open AI large | 3072 | 0.5 | 8192 | Yes | No | Yes |
Google multilingual Gecko | 768 | 0.55 | 3072 | Yes | Yes | Yes |
Hugging Face | N/A | N/A | N/A | N/A | N/A | Yes |
Embedding models
Embeddings are like fingerprints for words or data. They help computers understand the similarities and differences between them, making it easier for machines to perform tasks like understanding language or recognizing patterns.
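As a rough illustration of how similarity between embeddings can be measured, the sketch below computes cosine similarity between small, made-up vectors. The vectors, the `is_match` helper, and the reading of the table's Threshold column as a minimum similarity score are assumptions made for this example and are not taken from Nuclia's implementation.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_match(query_vec, resource_vec, threshold):
    """Hypothetical helper: treat the threshold as a minimum similarity score."""
    return cosine_similarity(query_vec, resource_vec) >= threshold


# Toy 3-dimensional vectors invented for illustration;
# the models in the table produce 768 to 3072 dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(is_match(cat, kitten, threshold=0.47))
print(is_match(cat, invoice, threshold=0.47))
```

In this toy setup, the two vectors that point in nearly the same direction clear the threshold, while the unrelated one does not.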
In a Retrieval-Augmented Generation (RAG) system, the quality of the embeddings you choose directly impacts how effectively the system retrieves and understands information. The embedding model you select determines how well your system can find relevant data, interpret user queries, and generate accurate responses. Choosing the right model ensures that your Knowledge Box can deliver precise and contextually relevant information, which is crucial for maintaining high-quality user interactions and decision-making processes.
The choice of embedding model depends on the languages that will be used in both resources and queries. If you plan to use only English, a monolingual English model would be the ideal choice. For multilingual applications, it’s important to consider the specific languages involved. Most multilingual embeddings support high-resource languages—widespread languages like English, Chinese, Spanish, French, and Japanese. However, if your use case involves low-resource languages—those that are less common and have fewer resources, such as Basque, Welsh, or Irish—you’ll need to choose your embedding model more carefully to ensure adequate support.
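The language-based choice described above could be sketched as a simple filter over the model table. The `EmbeddingModel` class, the `candidate_models` function, and the use of `"en"` style language codes are hypothetical names introduced for this example. The data is copied from the table as published, and rows without known values (Hugging Face) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModel:
    name: str
    size: int
    max_tokens: int
    multilingual: bool
    external: bool


# Values copied from the model table; Hugging Face is omitted because its row is N/A.
MODELS = [
    EmbeddingModel("en-2024-04-24", 768, 2048, False, False),
    EmbeddingModel("multilingual-2023-08-16", 1024, 512, True, False),
    EmbeddingModel("multilingual-2024-05-06", 1024, 2048, True, False),
    EmbeddingModel("Open AI small", 1536, 8192, False, False),
    EmbeddingModel("Open AI large", 3072, 8192, False, True),
    EmbeddingModel("Google multilingual Gecko", 768, 3072, True, True),
]


def candidate_models(languages, allow_external=False):
    """Hypothetical helper: list model names matching the language needs.

    English-only use cases get monolingual models; anything else gets
    multilingual models. External models are included only on request.
    """
    needs_multilingual = any(lang.lower() != "en" for lang in languages)
    return [
        m.name
        for m in MODELS
        if m.multilingual == needs_multilingual
        and (allow_external or not m.external)
    ]


print(candidate_models(["en"]))
print(candidate_models(["en", "eu"], allow_external=True))
```

A filter like this only narrows the list. For low-resource languages such as Basque, the remaining candidates would still need to be evaluated on real queries and resources.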
Nuclia's semantic models
These embedding models run 100% on Nuclia's infrastructure. This option ensures that all processes remain within Nuclia's secure and controlled environment, optimizing efficiency and security.
en-2024-04-24
Our most up-to-date English model. Suitable for use cases in which both your queries and your resources are in English only.
multilingual-2024-05-06
Our most up-to-date multilingual model, providing strong support for both high-resource and many low-resource languages. It is optimized for widely spoken languages.
multilingual-2023-08-16
Our best model for low-resource and Asian languages.
Trusted External Partner Models
These embedding models run on the infrastructure of our trusted partners. This allows you to leverage the expertise and technological capabilities of other leaders in the field of artificial intelligence.
Google's Gecko
Google's Gecko multilingual embeddings. Using them may incur additional costs; contact our sales department to find out more.