Evaluate a RAG app
Use Humanloop to evaluate more complex workflows like Retrieval Augmented Generation (RAG).
This tutorial demonstrates how to take an existing RAG pipeline and use Humanloop to evaluate it. At the end of the tutorial you’ll understand how to:
- Run an Eval on your RAG pipeline.
- Set up detailed logging with SDK decorators.
- Log to Humanloop manually via the API.
The full code for this tutorial is available in the Humanloop Cookbook.
Example RAG Pipeline
In this tutorial we’ll first implement a simple RAG pipeline to do Q&A over medical documents without Humanloop. Then we’ll add Humanloop and use it for evals. Our RAG system will have three parts:
- Dataset: A version of the MedQA dataset from Hugging Face.
- Retriever: Chroma as a simple local vector DB.
- Prompt: Managed in code, populated with the user’s question and retrieved context.
Set up RAG pipeline
Set up environment variables:
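For example, assuming your OpenAI and Humanloop keys live in a local `.env` file (the variable names below are the conventional ones, adjust to your setup):

```python
import os

from dotenv import load_dotenv

# Load API keys from a local .env file; the variable names here are assumptions --
# use whatever names your project expects.
load_dotenv()

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is needed to call the chat model"
assert os.getenv("HUMANLOOP_API_KEY"), "HUMANLOOP_API_KEY is needed for the Humanloop SDK"
```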
Set up the Vector DB:
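A minimal sketch using Chroma's in-memory client; the parquet path and `contents` column are placeholders for however you load the MedQA documents from the cookbook:

```python
import chromadb
import pandas as pd

# Local, in-memory Chroma collection to act as our retriever.
chroma = chromadb.Client()
collection = chroma.get_or_create_collection(name="medqa")

# Load the MedQA knowledge base; the file path and "contents" column are
# placeholders for wherever your copy of the data lives.
knowledge_base = pd.read_parquet("assets/medqa_textbooks.parquet")

# Index the documents (Chroma embeds them with its default embedding function).
collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=[str(i) for i in range(len(knowledge_base))],
)
```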
Define the Prompt:
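We keep the Prompt in code as model settings plus a message template with slots for the question and the retrieved context (the exact wording is illustrative):

```python
MODEL = "gpt-4o-mini"
TEMPERATURE = 0

# Template messages with {{question}} and {{retrieved_data}} slots to be
# populated at request time.
TEMPLATE = [
    {
        "role": "system",
        "content": (
            "Answer the following question factually.\n\n"
            "Question: {{question}}\n\n"
            "Here is some retrieved information that might be helpful:\n"
            "{{retrieved_data}}\n"
        ),
    }
]
```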
Define the RAG Pipeline:
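A sketch of the three steps, retrieval, template population, and the model call, using the OpenAI client and the `collection` and Prompt settings from the previous steps; the function and helper names are our own:

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def populate_template(template: list, inputs: dict) -> list:
    """Fill the {{variable}} slots in the template messages."""
    messages = []
    for msg in template:
        content = msg["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        messages.append({"role": msg["role"], "content": content})
    return messages


def retrieval_tool(question: str) -> str:
    """Retrieve the most relevant passages from the Chroma collection."""
    response = collection.query(query_texts=[question], n_results=3)
    return "\n\n".join(response["documents"][0])


def call_llm(**inputs) -> str:
    """Populate the Prompt template and call the chat model."""
    messages = populate_template(TEMPLATE, inputs)
    chat_completion = openai_client.chat.completions.create(
        model=MODEL, temperature=TEMPERATURE, messages=messages
    )
    return chat_completion.choices[0].message.content


def ask_question(**inputs) -> str:
    """End-to-end RAG pipeline: retrieve context, then answer the question."""
    retrieved_data = retrieval_tool(inputs["question"])
    return call_llm(question=inputs["question"], retrieved_data=retrieved_data)
```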
Run the pipeline:
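For example, with an illustrative MedQA-style question:

```python
output = ask_question(
    question=(
        "A 70-year-old patient presents with shortness of breath and ankle swelling. "
        "Which class of drug is most appropriate first-line therapy for fluid overload?"
    )
)
print(output)
```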
Run an Evaluation
Now we’ll integrate Humanloop into our RAG pipeline and use the SDK to run an Eval on it.
Initialize the Humanloop SDK:
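Assuming your Humanloop API key is available as an environment variable:

```python
import os

from humanloop import Humanloop

humanloop = Humanloop(api_key=os.getenv("HUMANLOOP_API_KEY"))
```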
Set up Evaluators
Our Dataset has ground truth answers we can compare against. It’s very unlikely that the AI answers will match the ground truth word-for-word, but we can measure how close they are using the “Levenshtein distance” Evaluator. The code for this Evaluator is in the cookbook. We can run the Evaluator locally; however, if we upload it to Humanloop, we get the added benefit that Humanloop can run the Evaluation for us, and this can be integrated into CI/CD.
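As a sketch, a minimal local version is just a standard edit-distance function (not the cookbook’s exact code):

```python
def levenshtein_distance(answer: str, target: str) -> int:
    """Number of single-character edits needed to turn `answer` into `target`."""
    if len(answer) < len(target):
        return levenshtein_distance(target, answer)
    if not target:
        return len(answer)

    previous_row = list(range(len(target) + 1))
    for i, char_a in enumerate(answer):
        current_row = [i + 1]
        for j, char_t in enumerate(target):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (char_a != char_t)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
```

A lower score means the generated answer is closer to the ground truth.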
Create a Dataset
We upload a test dataset to Humanloop:
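A sketch using the SDK’s Dataset upsert; the path and the `inputs`/`target` field shapes are our choices, and `test_df` stands in for a held-out split of MedQA:

```python
# Build question/ground-truth datapoints from a held-out split (test_df is assumed).
datapoints = [
    {
        "inputs": {"question": row["question"]},
        "target": {"answer": row["answer"]},
    }
    for _, row in test_df.iterrows()
]

dataset = humanloop.datasets.upsert(
    path="evals_demo/medqa-test",
    datapoints=datapoints,
)
```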
Run Eval
Now that we have our Flow, our Dataset and our Evaluators we can create and run an Evaluation.
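In outline this is a single `humanloop.evaluations.run(...)` call that points at the pipeline callable, the Dataset, and the Evaluators; treat the exact keyword arguments and paths below as a sketch to check against the SDK reference:

```python
checks = humanloop.evaluations.run(
    name="Initial experiment",
    file={
        "path": "evals_demo/medqa-flow",
        "callable": ask_question,  # the pipeline function defined above
    },
    dataset={"path": "evals_demo/medqa-test"},
    evaluators=[
        {"path": "evals_demo/levenshtein"},
        {"path": "evals_demo/exact-match"},
    ],
)
```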
Add detailed logging
One limitation of our Evaluation so far is that we’ve measured the app end-to-end but we don’t know how the different components contribute to performance. If we really want to improve our app, we’ll need to log the full trace of events, including separate Tool and Prompt steps:
We can do this by adding logging for the Prompt and Tool steps within the Flow using Humanloop’s Python decorators. If you’re using a different language, you can still log to Humanloop via the API. Skip to the “Logging with the API” section below or check out our guide for more details.
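A sketch of the decorated pipeline; the decorator names (`humanloop.tool`, `humanloop.prompt`, `humanloop.flow`) and their `path` argument follow the Python SDK’s decorator pattern, and the paths themselves are our own:

```python
@humanloop.tool(path="evals_demo/medqa-retrieval")
def retrieval_tool(question: str) -> str:
    response = collection.query(query_texts=[question], n_results=3)
    return "\n\n".join(response["documents"][0])


@humanloop.prompt(path="evals_demo/medqa-answer")
def call_llm(**inputs) -> str:
    messages = populate_template(TEMPLATE, inputs)  # helper from the pipeline sketch above
    chat_completion = openai_client.chat.completions.create(
        model=MODEL, temperature=TEMPERATURE, messages=messages
    )
    return chat_completion.choices[0].message.content


@humanloop.flow(path="evals_demo/medqa-flow")
def ask_question(**inputs) -> str:
    retrieved_data = retrieval_tool(inputs["question"])
    return call_llm(question=inputs["question"], retrieved_data=retrieved_data)
```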
You can now run the pipeline as before and the full trace will be logged to Humanloop.
Re-run the Evaluation
These decorated functions can similarly be used to run an Eval on the pipeline, letting you evaluate it end-to-end while seeing the detailed Logs for each step.
Let’s change from `gpt-4o-mini` to `gpt-4o` and re-run the Eval. By passing the same `name` to the `humanloop.evaluations.run(...)` call, we’ll add another Run to the previously-created Evaluation on Humanloop. This will allow us to compare the two Runs side-by-side.
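In code this is just the model setting plus a second run with the same Evaluation name, mirroring the earlier sketch:

```python
MODEL = "gpt-4o"

checks = humanloop.evaluations.run(
    name="Initial experiment",  # same name as before, so this is added as a second Run
    file={"path": "evals_demo/medqa-flow", "callable": ask_question},
    dataset={"path": "evals_demo/medqa-test"},
    evaluators=[
        {"path": "evals_demo/levenshtein"},
        {"path": "evals_demo/exact-match"},
    ],
)
```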
Viewing our Evaluation on Humanloop, we can see that our newly-added Run with `gpt-4o` has been added to the Evaluation. On the Stats tab, we can see that `gpt-4o` scores better for our “Exact match” (and “Levenshtein”) metrics, but has higher latency. Perhaps surprisingly, `gpt-4o` performs worse according to our “Reasoning” Evaluator.
Logging with the API
Above, we’ve let the SDK handle logging and versioning for us. However, you can also log data to Humanloop using the API directly. This can be useful if you want to perform some post-processing on the data before logging it, or if you want to include additional metadata in the logs or versions.
We’ll now demonstrate how to extend your Humanloop logging with more fidelity, creating Tool, Prompt, and Flow Logs to give you full visibility.
We add additional logging steps to our `ask_question` function to represent the full trace of events on Humanloop. (Note that the `run_id` and `source_datapoint_id` arguments are optional, and are included here for use in the Evaluation workflow demonstrated later.)
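A sketch of the shape this takes: open a Flow Log for the request, attach Tool and Prompt Logs to it as trace children, then mark the trace complete. Only `humanloop.flows.log(...)` is named in this guide; the other method and parameter names here (`tools.log`, `prompts.log`, `trace_parent_id`, `flows.update_log`, `trace_status`) follow the same pattern but should be checked against the API reference:

```python
def ask_question(inputs: dict, run_id: str | None = None, source_datapoint_id: str | None = None) -> str:
    # 1. Open a Flow Log for the whole request; it becomes the trace parent.
    flow_log = humanloop.flows.log(
        path="evals_demo/medqa-flow",
        flow={"attributes": {"model": MODEL, "retriever": "chroma"}},
        inputs=inputs,
        run_id=run_id,                            # optional: associate with an Evaluation Run
        source_datapoint_id=source_datapoint_id,  # optional: associate with a Datapoint
    )

    # 2. Retrieval step, logged as a Tool Log inside the trace.
    retrieved_data = retrieval_tool(inputs["question"])
    humanloop.tools.log(
        path="evals_demo/medqa-retrieval",
        inputs={"question": inputs["question"]},
        output=retrieved_data,
        trace_parent_id=flow_log.id,
    )

    # 3. Model call, logged as a Prompt Log inside the trace.
    answer = call_llm(question=inputs["question"], retrieved_data=retrieved_data)
    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={"model": MODEL, "temperature": TEMPERATURE, "template": TEMPLATE},
        inputs=inputs,
        output=answer,
        trace_parent_id=flow_log.id,
    )

    # 4. Close the trace with the final output.
    humanloop.flows.update_log(log_id=flow_log.id, output=answer, trace_status="complete")
    return answer
```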
The logging we’ve added here is similar to the SDK decorators we used earlier.
Run an Evaluation using the API
To orchestrate your own Evaluations, you can pass in `run_id` and `source_datapoint_id` to the `humanloop.flows.log(...)` call to associate Logs with a specific Run and Datapoint.
The following is an example of how you can manually create an Evaluation and Run, and log data to Humanloop using the API, giving you full control over the Evaluation process.
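In outline, and with the caveat that apart from `humanloop.flows.log(...)` the method names below (`evaluations.create`, `evaluations.create_run`) are assumptions to verify against the API reference, it looks something like:

```python
# 1. Create an Evaluation for the Flow, plus a Run against the uploaded Dataset.
#    NOTE: these method names and arguments are illustrative assumptions.
evaluation = humanloop.evaluations.create(
    name="Manual eval demo",
    file={"path": "evals_demo/medqa-flow"},
    evaluators=[{"path": "evals_demo/levenshtein"}, {"path": "evals_demo/exact-match"}],
)
run = humanloop.evaluations.create_run(
    id=evaluation.id,
    dataset={"path": "evals_demo/medqa-test"},
)

# 2. Execute the pipeline for each Datapoint, logging the trace against the Run
#    via the run_id / source_datapoint_id arguments added to ask_question above.
for datapoint in dataset.datapoints:  # `dataset` assumed to be returned with ids by the earlier upsert
    ask_question(
        inputs=dict(datapoint.inputs),
        run_id=run.id,
        source_datapoint_id=datapoint.id,
    )
```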
You can then similarly view results on the Humanloop UI.
This concludes the Humanloop RAG Evaluation walkthrough. You’ve learned how to integrate Humanloop into your RAG pipeline, set up logging, create Datasets, configure Evaluators, run Evaluations, and log the full trace of events including Tool and Prompt steps.