Evaluate a RAG app

Use Humanloop to evaluate more complex workflows like Retrieval Augmented Generation (RAG)

This tutorial demonstrates how to take an existing RAG (Retrieval Augmented Generation) pipeline and use Humanloop to evaluate it. At the end of the tutorial you’ll understand how to:

  1. Set up logging for your AI application using a Flow.
  2. Create a Dataset and run Evaluations to benchmark the performance of your RAG pipeline.
  3. Extend your logging to include individual Prompt and Tool steps within your Flow.
  4. Use evaluation reports to iteratively improve your application.

The full code for this tutorial is available in the Humanloop Cookbook.

Example RAG Pipeline

In this tutorial we’ll first implement a simple RAG pipeline to do Q&A over medical documents without Humanloop. Then we’ll add Humanloop and use it for evals. Our RAG system will have three parts:

  • Dataset: A version of the MedQA dataset from Hugging Face.
  • Retriever: Chroma as a simple local vector DB.
  • Prompt: Managed in code, populated with the user’s question and retrieved context.

We’ll walk through what the code does step-by-step and you can follow along in the cookbook. The first thing we’ll do is create our simple RAG app.

Set-up

Clone the Cookbook repository from GitHub and install the required packages:
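If you haven’t already cloned it, something along these lines works (substitute the repository URL from the Cookbook link above; the directory name here is an assumption):

$ git clone <cookbook-repository-url>
$ cd humanloop-cookbook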

$ poetry install

Set up environment variables:

from dotenv import load_dotenv
import os
import chromadb
from openai import OpenAI
import pandas as pd

load_dotenv()

chroma = chromadb.Client()
openai = OpenAI(api_key=os.getenv("OPENAI_KEY"))
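The code above reads OPENAI_KEY now and, later in the tutorial, HUMANLOOP_KEY, so a .env file in the project root should look something like this (values are placeholders for your own keys):

OPENAI_KEY=<your-openai-api-key>
HUMANLOOP_KEY=<your-humanloop-api-key>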

Set up the Vector DB:

collection = chroma.get_or_create_collection(name="MedQA")

knowledge_base = pd.read_parquet("../../assets/sources/textbooks.parquet")
knowledge_base = knowledge_base.sample(10, random_state=42)

collection.add(
    documents=knowledge_base["contents"].to_list(),
    ids=knowledge_base["id"].to_list(),
)
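Optionally, as a quick sanity check (not part of the cookbook code), you can query the collection directly. Chroma returns the nearest matches as lists of lists keyed by ids, documents and distances:

results = collection.query(query_texts=["What is haemophilia A?"], n_results=1)
print(results["ids"][0][0])
print(results["documents"][0][0][:200])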

Define the Prompt:

model = "gpt-4o"
temperature = 0
template = [
    {
        "role": "system",
        "content": """Answer the following question factually.

Question: {{question}}

Options:
- {{option_A}}
- {{option_B}}
- {{option_C}}
- {{option_D}}
- {{option_E}}

---

Here is some retrieved information that might be helpful.
Retrieved data:
{{retrieved_data}}

---
Give your answer in 3 sections using the following format. Do not include the quotes or the brackets. Do include the "---" separators.

<chosen option verbatim>
---
<clear explanation of why the option is correct and why the other options are incorrect. keep it ELI5.>
---
<quote relevant information snippets from the retrieved data verbatim. every line here should be directly copied from the retrieved data>
""",
    }
]


def populate_template(template: list, inputs: dict[str, str]) -> list:
    """Populate a template with input variables."""
    messages = []
    for i, template_message in enumerate(template):
        content = template_message["content"]
        for key, value in inputs.items():
            content = content.replace("{{" + key + "}}", value)
        message = {**template_message, "content": content}
        messages.append(message)
    return messages
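As a quick illustration (not part of the cookbook code), populate_template simply swaps each {{placeholder}} for its value; placeholders without a matching key are left untouched:

example = populate_template(
    [{"role": "system", "content": "Question: {{question}}\n{{retrieved_data}}"}],
    {"question": "What causes haemophilia A?"},
)
# -> [{"role": "system", "content": "Question: What causes haemophilia A?\n{{retrieved_data}}"}]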

Define the RAG Pipeline:

from openai.types.chat import ChatCompletionMessage


def retrieval_tool(question: str) -> str:
    """Retrieve most relevant document from the vector db (Chroma) for the question."""
    response = collection.query(query_texts=[question], n_results=1)
    retrieved_doc = response["documents"][0][0]
    return retrieved_doc


def ask_question(inputs: dict[str, str]) -> ChatCompletionMessage:
    """Ask a question and get an answer using a simple RAG pipeline."""
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    messages = populate_template(template, inputs)

    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )
    return chat_completion.choices[0].message

Run the pipeline:

chat_completion = ask_question(
    {
        "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
        "option_A": "0%",
        "option_B": "25%",
        "option_C": "50%",
        "option_D": "75%",
        "option_E": "100%",
    }
)
print(chat_completion.content)

Humanloop Integration

Now that we have a simple RAG app, we’ll instrument it with Humanloop and evaluate how well it works.

Initialize the Humanloop SDK:

from humanloop import Humanloop

load_dotenv()
humanloop = Humanloop(api_key=os.getenv("HUMANLOOP_KEY"))

The first thing we’ll do is create a Humanloop Flow to log the questions and answers generated by our pipeline. We’ll then run our RAG system over a dataset and score each of the resulting Flow Logs to understand how well our system works.

import inspect
from datetime import datetime

inputs = {
    "question": "A 34-year-old male suffers from inherited hemophilia A. He and his wife have three unaffected daughters. What is the probability that the second daughter is a carrier of the disease?",
    "option_A": "0%",
    "option_B": "25%",
    "option_C": "50%",
    "option_D": "75%",
    "option_E": "100%",
}

start_time = datetime.now()
chat_completion = ask_question(inputs)

humanloop.flows.log(
    path="evals_demo/medqa-flow",
    flow={
        "attributes": {
            "prompt": {
                "template": template,
                "model": model,
                "temperature": temperature,
            },
            "tool": {
                "name": "retrieval_tool_v3",
                "description": "Retrieval tool for MedQA.",
                "source_code": inspect.getsource(retrieval_tool),
            },
        }
    },
    inputs=inputs,
    output=chat_completion.content,
    start_time=start_time,
    end_time=datetime.now(),
    trace_status="complete",
)

After running this pipeline, you will see your Flow Logs in your Humanloop workspace. If you make changes to the attributes in code and re-run the pipeline, a new version of the Flow will be created in Humanloop.

Flow Screenshot
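The attributes dictionary is what versions the Flow: change any part of it, for example the tool name, and the next log call registers a new version. The evaluation code below does exactly this by bumping the tool name (a sketch of just the changed attributes):

flow_attributes = {
    "prompt": {"template": template, "model": model, "temperature": temperature},
    "tool": {
        "name": "retrieval_tool_v4",  # bumped from "retrieval_tool_v3"
        "description": "Retrieval tool for MedQA.",
        "source_code": inspect.getsource(retrieval_tool),
    },
}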

Create a Dataset

We upload a test dataset to Humanloop:

def upload_dataset_to_humanloop():
    df = pd.read_json("../../assets/datapoints.jsonl", lines=True)

    datapoints = [row.to_dict() for _i, row in df.iterrows()][0:20]
    return humanloop.datasets.upsert(
        path="evals_demo/medqa-test",
        datapoints=datapoints,
        commit_message=f"Added {len(datapoints)} datapoints from MedQA test dataset.",
    )


dataset = upload_dataset_to_humanloop()

Set up Evaluators

Our dataset has human ground-truth answers we can compare against. It’s very unlikely that the AI answers will exactly match the human answers, but we can measure how close they are using the Levenshtein distance. The code for this evaluator is in the cookbook. We could run the evaluator locally, but if we upload it to Humanloop we get the added benefit that Humanloop can run the evaluation for us, and this can be integrated into CI/CD.
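For reference, the core of such an evaluator is a standard edit-distance computation. A minimal sketch of that core (the cookbook’s actual levenshtein.py may differ; in Humanloop the uploaded evaluator compares the Log’s output against the datapoint’s target):

def levenshtein_distance(answer: str, target: str) -> int:
    """Minimum number of single-character edits needed to turn answer into target."""
    if len(answer) < len(target):
        answer, target = target, answer
    previous = list(range(len(target) + 1))
    for i, char_a in enumerate(answer):
        current = [i + 1]
        for j, char_t in enumerate(target):
            insertions = previous[j + 1] + 1
            deletions = current[j] + 1
            substitutions = previous[j] + (char_a != char_t)
            current.append(min(insertions, deletions, substitutions))
        previous = current
    return previous[-1]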

def upload_evaluators():
    for evaluator_name, return_type in [
        ("exact_match", "boolean"),
        ("levenshtein", "number"),
    ]:
        with open(f"../../assets/evaluators/{evaluator_name}.py", "r") as f:
            code = f.read()
        humanloop.evaluators.upsert(
            path=f"evals_demo/{evaluator_name}",
            spec={
                "evaluator_type": "python",
                "arguments_type": "target_required",
                "return_type": return_type,
                "code": code,
            },
            commit_message=f"New version from {evaluator_name}.py",
        )

    humanloop.evaluators.upsert(
        path="evals_demo/reasoning",
        spec={
            "evaluator_type": "llm",
            "arguments_type": "target_free",
            "return_type": "boolean",
            "prompt": {
                "model": "gpt-4o",
                "endpoint": "complete",
                "temperature": 0,
                "template": "An answer is shown below. The answer contains 3 sections, separated by \"---\". The first section is the final answer. The second section is an explanation. The third section is a citation.\n\nEvaluate if the final answer follows from the citation and the reasoning in the explanation section. Give a brief explanation/discussion. Do not make your judgment based on factuality, but purely based on the logic presented.\nOn a new line, give a final verdict of \"True\" or \"False\".\n\nAnswer:\n{{log.output}}",
            },
        },
        commit_message="Initial reasoning evaluator.",
    )


upload_evaluators()

Create and Run Your Evaluation

Now that we have our Flow, our dataset, and our evaluators, we can create and run an evaluation.

from tqdm import tqdm

evaluation = humanloop.evaluations.create(
    name="Demo evals 2",
    file={"path": "evals_demo/medqa-flow"},
    dataset={"path": "evals_demo/medqa-test"},
    evaluators=[
        {"path": "evals_demo/exact_match"},
        {"path": "evals_demo/levenshtein"},
        {"path": "evals_demo/reasoning"},
    ],
)
print(f"Evaluation created: {evaluation.id}")


def populate_evaluation():
    """Run a variation of your Pipeline over the Dataset to populate results."""
    retrieved_dataset = humanloop.datasets.get(
        id=evaluation.dataset.id,
        include_datapoints=True,
    )
    for datapoint in tqdm(retrieved_dataset.datapoints):
        start_time = datetime.now()

        chat_completion = ask_question(datapoint.inputs)

        humanloop.flows.log(
            path="evals_demo/medqa-flow",
            flow={
                "attributes": {
                    "prompt": {
                        "template": template,
                        "model": model,
                        "temperature": temperature,
                    },
                    "tool": {
                        "name": "retrieval_tool_v4",
                        "description": "Retrieval tool for MedQA.",
                        "source_code": inspect.getsource(retrieval_tool),
                    },
                }
            },
            inputs=datapoint.inputs,  # log the inputs of this specific datapoint
            output=chat_completion.content,
            start_time=start_time,
            end_time=datetime.now(),
            trace_status="complete",
            source_datapoint_id=datapoint.id,
            evaluation_id=evaluation.id,
        )


populate_evaluation()

Get results and URL:

evaluation = humanloop.evaluations.get(id=evaluation.id)
print("URL: ", evaluation.url)

Look at the results of the evaluation on Humanloop by following the URL. You should be able to see how accurate this version of our system is, and as you make changes you’ll be able to compare the different versions.

Evaluation Report Screenshot

Logging the full trace

One limitation of our evaluation so far is that we’ve measured the app end-to-end, but we don’t know how the different components contribute to performance. If we really want to improve our app, we’ll need to understand how each step behaves. To do that, we log the full trace of events, including separate Tool and Prompt steps:

def ask_question(inputs: dict[str, str], trace_id: str) -> ChatCompletionMessage:
    """Ask a question and get an answer using a simple RAG pipeline."""

    start_time = datetime.now()
    retrieved_data = retrieval_tool(inputs["question"])
    inputs = {**inputs, "retrieved_data": retrieved_data}

    humanloop.tools.log(
        path="evals_demo/medqa-retrieval",
        tool={
            "function": {
                "name": "retrieval_tool",
                "description": "Retrieval tool for MedQA.",
            },
            "source_code": inspect.getsource(retrieval_tool),
        },
        output=retrieved_data,
        trace_parent_id=trace_id,
        start_time=start_time,
        end_time=datetime.now(),
    )

    start_time = datetime.now()
    messages = populate_template(template, inputs)

    chat_completion = openai.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=messages,
    )

    humanloop.prompts.log(
        path="evals_demo/medqa-answer",
        prompt={
            "model": model,
            "temperature": temperature,
            "template": template,
        },
        inputs=inputs,
        output=chat_completion.choices[0].message.content,
        output_message=chat_completion.choices[0].message,
        trace_parent_id=trace_id,
        start_time=start_time,
        end_time=datetime.now(),
    )

    return chat_completion.choices[0].message


# Update Flow logging to provide trace_id
start_time = datetime.now()

log = humanloop.flows.log(
    path="evals_demo/medqa-flow",
    flow={
        "attributes": {
            "prompt": {
                "template": template,
                "model": model,
                "temperature": temperature,
            },
            "tool": {
                "name": "retrieval_tool_v3",
                "description": "Retrieval tool for MedQA.",
                "source_code": inspect.getsource(retrieval_tool),
            },
        }
    },
    inputs=inputs,
    start_time=start_time,
)

chat_completion = ask_question(inputs, log.id)

humanloop.flows.update_log(
    log_id=log.id,
    output=chat_completion.content,
    trace_status="complete",
)
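If you also want the Logs inside an Evaluation to carry these full traces, you can adapt populate_evaluation so the Flow Log is opened first and its id is passed to ask_question. A sketch combining the two patterns above (not a verbatim copy of the cookbook):

def populate_evaluation_with_traces():
    """Run the pipeline over the Dataset, logging a full trace per datapoint."""
    retrieved_dataset = humanloop.datasets.get(
        id=evaluation.dataset.id,
        include_datapoints=True,
    )
    for datapoint in tqdm(retrieved_dataset.datapoints):
        # Open the Flow Log first so its id can act as the trace id
        log = humanloop.flows.log(
            path="evals_demo/medqa-flow",
            flow={
                "attributes": {
                    "prompt": {
                        "template": template,
                        "model": model,
                        "temperature": temperature,
                    },
                    "tool": {
                        "name": "retrieval_tool_v4",
                        "description": "Retrieval tool for MedQA.",
                        "source_code": inspect.getsource(retrieval_tool),
                    },
                }
            },
            inputs=datapoint.inputs,
            start_time=datetime.now(),
            source_datapoint_id=datapoint.id,
            evaluation_id=evaluation.id,
        )
        chat_completion = ask_question(datapoint.inputs, log.id)
        humanloop.flows.update_log(
            log_id=log.id,
            output=chat_completion.content,
            trace_status="complete",
        )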

This concludes the Humanloop RAG Evaluation walkthrough. You’ve learned how to integrate Humanloop into your RAG pipeline, set up logging, create datasets, configure evaluators, run evaluations, and log the full trace of events including Tool and Prompt steps.