Token consumption

Nuclia is a license and consumption-based service: you pay for the computational resources you consume, and consumption is measured in Nuclia tokens. All public 3rd-party LLMs base their pricing on the number of tokens consumed. In the LLM world, a token is around 4-5 characters on average; it may correspond to an entire word or to a fragment of one. The number of tokens is therefore roughly proportional to the length of the text: the longer a sentence is, the more tokens it takes to read or to generate it. As these 3rd-party LLMs all have different pricing, the purpose of Nuclia tokens is to normalize the cost across all of them.
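As a rough rule of thumb (this is only an approximation, not the tokenizer actually used by any given model), you can estimate the number of tokens in a text by dividing its length in characters by 4-5:

def estimate_tokens(text: str, chars_per_token: float = 4.5) -> int:
    """Very rough estimate: an LLM token is about 4-5 characters on average."""
    return round(len(text) / chars_per_token)

question = "What is the refund policy for annual subscriptions?"
print(estimate_tokens(question))  # ~11 tokens for this ~51-character question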

How are tokens consumed in RAG?

When a user asks a question to your Knowledge Box, the first step is to find the most relevant paragraphs to answer the question. These paragraphs will be used as context when calling the LLM model to generate the answer. When calling the LLM, Nuclia will assemble the prompt, the context and the question into a single string, and send it to the LLM. This string will correspond to a certain number of input tokens. The LLM will then generate the answer, which will be a string of a certain number of output tokens. The total number of tokens consumed will be the sum of the input tokens and the output tokens.
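As a simple illustration, reusing the token counts from the metadata example shown further down this page:

# Input tokens cover the prompt, the retrieved context and the question;
# output tokens cover the generated answer. The billed total is their sum.
input_tokens = 1318
output_tokens = 554
total_tokens = input_tokens + output_tokens  # 1872 tokens for this call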

So the token consumption will be impacted by:

  • A large context, which can result from using RAG strategies like "Full resource" or "Neighbouring paragraphs", or from the extra_context parameter.
  • Long questions
  • Long prompts

How to limit and control token consumption?

The first way to limit token consumption is to adjust your parameters (see the request sketch after this list):

  • Make sure your prompt is not needlessly long.
  • When using the "Full resource" strategy, use the count attribute to limit the number of resources that will be returned entirely.
  • When using the "Neighbouring paragraphs" strategy, try to assess what are the optimal values for before and after attributes in order to produce the best results with the least amount of tokens.
  • When using the "Hierarchical" strategy, make sure summaries stored in your resouurces are not too long.
  • Pick an LLM that is more efficient in terms of token consumption (typicall ChatGPT 4o will be more expensive than ChatGPT 4o-mini).
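As an illustration, here is a sketch of an /ask request that keeps the context small. The endpoint URL, authentication header and strategy field names (rag_strategies, count, before, after) are assumptions to verify against the Nuclia API reference for your account:

import requests

KB_URL = "https://europe-1.nuclia.cloud/api/v1/kb/<your-kb-id>"  # assumed zone and KB id
headers = {"X-NUCLIA-SERVICEACCOUNT": "Bearer <service-token>"}   # assumed auth header

body = {
    "query": "What is the refund policy?",
    # Keep the prompt short: every extra word is billed as input tokens.
    "prompt": "Answer briefly.",
    "rag_strategies": [
        # "Full resource" capped to a single resource to bound the context size.
        {"name": "full_resource", "count": 1},
        # Alternatively, a "Neighbouring paragraphs" strategy with small
        # before/after values, e.g.:
        # {"name": "neighbouring_paragraphs", "before": 1, "after": 1},
    ],
}

resp = requests.post(f"{KB_URL}/ask", json=body, headers=headers)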

The second way is to limit the total size of the context and the size of the answer. The one value you cannot control is the length of the user question. If you want to protect yourself from excessively long questions, you can use the max_tokens parameter on the /ask endpoint. It lets you put a hard limit on the size of the context and/or on the size of the answer. Be aware that by putting a limit on the context, you might get a less relevant answer, as the LLM will have less information to work with. Regarding the limit on the size of the answer, the LLM might not be able to generate a complete answer (the sentence might be cut in the middle), so it is recommended to explicitly require a short answer in your prompt (like "Please answer in less than 200 words") to mitigate this problem.
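A minimal sketch of what this could look like, assuming max_tokens accepts separate limits for the context and the answer (check the /ask API reference to confirm the exact shape):

import requests

KB_URL = "https://europe-1.nuclia.cloud/api/v1/kb/<your-kb-id>"  # assumed zone and KB id
headers = {"X-NUCLIA-SERVICEACCOUNT": "Bearer <service-token>"}   # assumed auth header

body = {
    "query": "Summarize our vacation policy",
    # A hard output limit can cut the answer mid-sentence, so also ask for a
    # short answer in the prompt.
    "prompt": "Please answer in less than 200 words.",
    # Assumed shape: hard caps, in tokens, on the context sent to the LLM and
    # on the generated answer.
    "max_tokens": {"context": 2000, "answer": 300},
}

resp = requests.post(f"{KB_URL}/ask", json=body, headers=headers)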

How to determine the number of Nuclia tokens consumed by the ask endpoint's generation component?

In the JSON streaming response of the ask endpoint, you will receive a JSON object of type metadata. This object contains the input and output token counts, which correspond to Nuclia token consumption. Additionally, you can find detailed timings for each phase of the generative inference process within this metadata.

Example:

{
  "item": {
    "type": "metadata",
    "tokens": {
      "input": 1318,
      "output": 554
    },
    "timings": {
      "generative_first_chunk": 0.6886528078466654,
      "generative_total": 5.652589469682425
    }
  }
}
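Here is a sketch of how the token counts could be read from the streaming response in Python, assuming the stream is newline-delimited JSON where each line wraps an item object as in the example above (KB_URL and headers as in the earlier sketches):

import json
import requests

resp = requests.post(f"{KB_URL}/ask",
                     json={"query": "What is the refund policy?"},
                     headers=headers, stream=True)

for line in resp.iter_lines():
    if not line:
        continue
    item = json.loads(line)["item"]
    if item["type"] == "metadata":
        tokens = item["tokens"]
        print("input:", tokens["input"], "output:", tokens["output"])
        print("total:", tokens["input"] + tokens["output"])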

In the Python SDK, both the ask and ask_stream methods of the NucliaSearch module return an AskAnswer object that includes the token consumption.
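A minimal sketch with the Python SDK; the authentication call and the attribute holding the token counts are assumptions to verify against the SDK documentation:

from nuclia import sdk

# Assumed authentication flow: point the SDK at your Knowledge Box.
sdk.NucliaAuth().kb(url="https://europe-1.nuclia.cloud/api/v1/kb/<your-kb-id>",
                    token="<service-token>")

answer = sdk.NucliaSearch().ask(query="What is the refund policy?")
# Assumed attribute: the AskAnswer object carries the same input/output
# counts as the metadata item of the streaming response.
print(answer.tokens)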

Other sources of token consumption

Generating answers is not the only source of token consumption. Any action involving a 3rd-party model will consume tokens:

  • Extracting tables when ingesting documents
  • Generating summaries
  • Rephrasing sentences
  • Using external embeddings models
  • Ingestion and retrieval agents