
Evaluate your RAG Experience using REMi

What is REMi?

REMi stands for RAG Evaluation Metrics. It is a model created by Nuclia to evaluate the performance of your RAG experience.

Its evaluation is based on the RAG Triad and provides the following metrics:

  • Answer Relevance: Relevance of the generated answer to the user query.
  • Context Relevance: Relevance of the retrieved context to the user query.
  • Groundedness: Degree to which the generated answer is grounded in the retrieved context.

For more information on REMi, the metrics and how to interpret them, you can check the release blog post here.

REMi is now in its second version and is available through the Nuclia Understanding API (NUA).

Obtaining REMi Metrics for a RAG Interaction

All of the functionality showcased here can also be performed through conventional API requests. For this example, we will use the Nuclia Python SDK, which can be installed via pip:

pip install nuclia

Once installed, we can set up our authentication. As we will use both the Nuclia Knowledge Box API and the Nuclia Understanding API, we will need both API keys as well as the API endpoint of the Knowledge Box we want to interact with.

  1. NucliaDB API endpoint: This is the specific endpoint of the Knowledge Box you want to interact with. For the example shown here, our Knowledge Box contains financial news. You can find your NucliaDB API endpoint in the Home tab of your Knowledge Box's dashboard; it has the following structure (a short assembly sketch in Python follows this list):

https://<ZONE>.nuclia.cloud/api/v1/kb/<KB_ID>

  2. Knowledge Box API Key: This key is used to interact with the Knowledge Box API. To obtain this key, you can follow the instructions here.

  3. NUA API Key: This key is used to interact with the Nuclia Understanding API. To obtain this key, you can follow the instructions here.
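
For illustration, the endpoint from item 1 can be assembled in Python from its two components. The zone and Knowledge Box id below are placeholders, not real values:

# Assemble the NucliaDB API endpoint from its components
# (placeholder values; substitute your own zone and Knowledge Box id)
ZONE = "europe-1"
KB_ID = "your-kb-id"
KB_ENDPOINT = f"https://{ZONE}.nuclia.cloud/api/v1/kb/{KB_ID}"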

With those keys in hand, we can now set up our Python script to interact with Nuclia:

  1. Define the keys and URL as constants:
NUA_API_KEY = 'YOUR_NUA_API_KEY'
KB_API_KEY = 'YOUR_KB_API_KEY'
KB_ENDPOINT = 'YOUR_KB_ENDPOINT'
  2. Authenticate with Nuclia:
from nuclia import sdk
sdk.NucliaAuth().kb(url=KB_ENDPOINT, token=KB_API_KEY)
sdk.NucliaAuth().nua(token=NUA_API_KEY)
  3. Perform an Ask operation, which we will later evaluate with REMi:
from nucliadb_models.search import AskRequest

search = sdk.NucliaSearch()
# Set the debug flag to True to obtain the predict request
query = AskRequest(query="Which Company has been underperforming in Q4 2024?", debug=True)
ask_answer = search.ask(query=query)
print(ask_answer.answer)

"Quantum Computing Inc. has been underperforming in Q4 2024. The company reported a third-quarter GAAP loss per share of $0.06, which, although an improvement from the previous year's loss, was accompanied by sales totaling $101,000, falling short of the $300,000 estimate. Additionally, their stock price declined significantly following the announcement of a securities purchase agreement to sell 16 million shares."

  4. Extract the relevant information from the ask_answer object:
contexts = list(ask_answer.predict_request.query_context.values())
question = ask_answer.predict_request.question
answer = ask_answer.answer.decode()
  5. Evaluate the interaction with REMi:
from nuclia.sdk.predict import NucliaPredict
from nuclia_models.predict.remi import RemiRequest

np = NucliaPredict()
results = np.remi(
    RemiRequest(
        user_id="my_user",
        question=question,
        answer=answer,
        contexts=contexts,
    )
)
  6. Print the results:
print(f"- Answer Relevance score: {results.answer_relevance.score}")
print(f" · Reason: {results.answer_relevance.reason}")
print(f"- Context Relevance score for each context: \n {results.context_relevance}")
print(f"- Groundedness score for each context: \n {results.groundedness}")

Output:

  • Answer Relevance score: 5

    • Reason: The response directly addresses the query by identifying Quantum Computing Inc. as the company underperforming in Q4 2024 and provides specific details to support the claim.
  • Context Relevance score for each context: [3, 3, 0, 1, 3, 0, 3, 0, 1, 0, 0]

  • Groundedness score for each context: [0, 5, 0, 0, 5, 0, 0, 0, 0, 0, 0]
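
The printed output suggests that context_relevance and groundedness come back as lists of integer scores aligned with the order of the contexts we passed in. Under that assumption, a small sketch can pair each retrieved context with its scores to see which passages actually supported the answer:

# Pair each context with its REMi scores.
# Assumes the score lists are plain integers aligned with the order of `contexts`,
# as the printed output above suggests.
for i, (ctx, rel, grd) in enumerate(zip(contexts, results.context_relevance, results.groundedness)):
    print(f"Context {i}: relevance={rel}, groundedness={grd} :: {ctx[:80]}...")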

  7. Interpreting the results:
    • Answer Relevance: The response by the model is highly relevant to the user query, scoring a 5 out of 5.
    • Context Relevance: Several of the retrieved contexts are moderately relevant to the user query, scoring 3 out of 5, while the rest score low. This is to be expected: the question is broad, and although the Knowledge Box contains relevant information, none of the retrieved contexts match the user query exactly.
    • Groundedness: The answer is grounded in two of the retrieved contexts, which score 5 out of 5. This indicates that the generated answer is well supported by the retrieved context and is likely to be accurate.
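
The same scores can also drive a simple quality gate in your own pipeline, for example to flag weakly grounded answers for human review. The thresholds below are illustrative choices, not Nuclia defaults, and the sketch reuses the results object obtained above:

# Flag the interaction for review when the REMi scores fall below
# illustrative thresholds (example values, not Nuclia defaults).
MIN_ANSWER_RELEVANCE = 4
MIN_GROUNDEDNESS = 4

needs_review = (
    results.answer_relevance.score < MIN_ANSWER_RELEVANCE
    or max(results.groundedness, default=0) < MIN_GROUNDEDNESS
)
print("Flag for human review:", needs_review)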