Monitor the quality of your RAG pipeline with REMi
Why monitor the quality of your RAG pipeline?
Even if you initially tuned your RAG pipeline very carefully to get the best answers, you need to monitor its quality over time.
The content of the Knowledge Box changes as you ingest new resources, and questions that used to be answered correctly can start receiving incorrect answers. The questions asked by users can also evolve: all of a sudden, they might ask about topics that are insufficiently covered by your resources.
REMi
At Nuclia, we’ve developed REMi (RAG Evaluation Metrics), an efficient open-source fine-tuned LLM that simplifies the assessment of RAG pipelines.
The main inputs/outputs in the RAG pipeline are:
- Query: The user’s question, which the model will try to answer.
- Context: The information retrieved by the retrieval step, which aims to be relevant to the user’s query.
- Answer: The response generated by the language model after receiving the query and context pieces.
Hence, REMi defines the following metrics to assess the quality of the RAG pipeline:
- Answer relevance: relevance of the generated answer to the user query.
- Context relevance: relevance of the retrieved context to the user query.
- Groundedness: degree to which the generated answer is grounded in the retrieved context.
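To make these inputs and metrics concrete, here is a minimal sketch of the data involved in scoring a single RAG interaction. The class and field names are illustrative, not the actual API of the REMi library, and the scores are assumed here to be floats between 0 and 1.

```python
from dataclasses import dataclass

@dataclass
class RAGInteraction:
    """One query/context/answer triple, the unit that REMi evaluates."""
    query: str            # the user's question
    contexts: list[str]   # text pieces returned by the retrieval step
    answer: str           # the response generated by the LLM

@dataclass
class REMiScores:
    """The three REMi metrics for one interaction (illustrative 0-1 scale)."""
    answer_relevance: float   # how well the answer addresses the query
    context_relevance: float  # how relevant the retrieved context is to the query
    groundedness: float       # how well the answer is supported by the context
```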
By combining these metrics, REMi provides a comprehensive view of the quality of the RAG pipeline.
For example (see the sketch after this list):
- If the context relevance is high but the answer relevance and groundedness are low, it means that the model is generating evasive answers. The semantic search successfully retrieves relevant context pieces, but the model fails to generate a relevant and grounded answer. You should try to use a different LLM.
- If the answer relevance is high but the context relevance and groundedness are low, it means that the model is generating unverifiable answers. The LLM generates a relevant answer, but not based on the information stored in your Knowledge Box. First, check whether the information is missing from your Knowledge Box. If it is present, try changing your search and RAG strategy parameters.
- If the groundedness is high but the answer relevance and context relevance are low, it means that the model is generating unrelated answers. The LLM generates an answer based on the context, but the answer is not relevant to the user query. This can happen if the wrong context pieces are retrieved, but the LLM still feels compelled to generate an answer based on the available information, disregarding the nuances of the query.
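As a rough illustration of this triage, the sketch below maps the three scores to the failure modes just described. The 0.5 threshold and the wording of the diagnoses are assumptions chosen for the example, not values used by Nuclia.

```python
def diagnose(answer_relevance: float, context_relevance: float,
             groundedness: float, threshold: float = 0.5) -> str:
    """Map a triple of REMi scores (0-1 here) to the failure modes above."""
    high_answer = answer_relevance >= threshold
    high_context = context_relevance >= threshold
    high_grounded = groundedness >= threshold

    if high_context and not high_answer and not high_grounded:
        return "evasive answer: retrieval works, consider trying a different LLM"
    if high_answer and not high_context and not high_grounded:
        return "unverifiable answer: check Knowledge Box coverage, then search/RAG parameters"
    if high_grounded and not high_answer and not high_context:
        return "unrelated answer: the wrong context pieces were retrieved"
    if high_answer and high_context and high_grounded:
        return "healthy: relevant and grounded answer"
    return "mixed signals: inspect the interaction manually"
```

For instance, `diagnose(0.2, 0.9, 0.3)` returns the "evasive answer" diagnosis: the context was relevant, but the answer was neither relevant nor grounded.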
How to use REMi
Nuclia runs REMi on a regular basis to monitor the quality of the RAG pipeline. The results are displayed in the Nuclia dashboard, which shows the evolution of the metrics over time.
On your Knowledge Box home page, you will see a Health status section in the right column. It shows the answer relevance, context relevance, and groundedness metrics for the past 7 days. The dots represent the average of each metric, and the segments represent the minimum and maximum values.
You can click on the More metrics button to access the RAG Evaluation Metrics page.
This page lets you choose the time range for the metrics, from the last 24 hours to the last 30 days. It displays the same Health status section as the home page, and also shows three graphs with the evolution of the three REMi metrics over time (the red line is the average, and the shaded area spans the minimum and maximum values).
It also lists the questions without answers and the questions with low context relevance. It is good practice to review these questions regularly and improve your resources so that they can be answered correctly.
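To mirror what the dashboard computes, here is a small self-contained sketch that aggregates logged context-relevance scores into the average/minimum/maximum shown in the Health status section and flags low-relevance questions for review. The records and the 0.3 cutoff are made-up examples, not Nuclia defaults.

```python
from statistics import mean

# Each record is one logged user question with its REMi context relevance score.
records = [
    {"question": "How do I rotate my API key?", "context_relevance": 0.92},
    {"question": "What is the refund policy?", "context_relevance": 0.18},
    {"question": "Which regions are supported?", "context_relevance": 0.71},
]

# The average/min/max summary shown as dots and segments in the dashboard.
scores = [r["context_relevance"] for r in records]
print(f"context relevance: avg={mean(scores):.2f} min={min(scores):.2f} max={max(scores):.2f}")

# Questions whose context relevance falls below the review cutoff.
needs_review = [r["question"] for r in records if r["context_relevance"] < 0.3]
print("questions to review:", needs_review)
```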